CN105095444A - Information acquisition method and device - Google Patents


Info

Publication number
CN105095444A
CN105095444A (application CN201510441024.5A)
Authority
CN
China
Prior art keywords
answer
word
information
context
question
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510441024.5A
Other languages
Chinese (zh)
Inventor
霍华荣
马艳军
吴华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510441024.5A
Publication of CN105095444A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application discloses an information acquisition method and device. An embodiment of the method includes: acquiring a plurality of question-answer pairs in a data set, and extracting at least one question word and at least one answer word of each question-answer pair; determining contexts of the question words and the answer words; training a preset model with the question words, answer words, and contexts as training samples to obtain a word-vector set; receiving question information to be responded to; and, based on the word-vector set, acquiring answer information matching the question information from the data set. The method and device train word vectors by evaluating the semantic correlation of question-answer pairs, which improves the speed and accuracy of word-vector training. Because matching information is then retrieved with the word vectors, no complex supervised neural-network training is needed, and the speed and accuracy of information acquisition can be improved.

Description

Information acquisition method and device
Technical field
The present application relates to the field of computer technology, specifically to the field of terminal technology, and particularly to an information acquisition method and device.
Background technology
At present, people obtain information on the Internet mainly through search engines; a user has to browse a large number of web pages to find an answer, which is inefficient. Deep question-answering technology makes search more intelligent, provides users with more accurate answers, and reduces the cost of obtaining information. With the development of online question-answering websites such as Baidu Knows, a large amount of user-generated question-and-answer data has accumulated, providing data support for deep question answering.
However, these question-answer pairs are of uneven quality, with two main problems: being off-topic and being verbose. "Off-topic" means that the main content of the answer is unrelated to the question, i.e. the answer is irrelevant. "Verbose" means that the answer is overly long and, besides the answer sentence proper, contains many irrelevant or supplementary sentences that do not directly answer the question.
Summary of the invention
In view of the above defects or deficiencies in the prior art, it is desirable to provide a scheme that improves the speed and accuracy of information acquisition. To achieve one or more of the above objects, the present application provides an information acquisition method and device.
In a first aspect, the present application provides an information acquisition method, the method comprising: acquiring a plurality of question-answer pairs in a data set, and extracting at least one question word and at least one answer word of each question-answer pair; determining contexts of the question words and the answer words; training a preset model with the question words, answer words, and contexts as training samples to obtain a word-vector set; receiving question information to be responded to; and, based on the word-vector set, acquiring answer information matching the question information from the data set.
In a second aspect, the present application provides an information acquisition device, the device comprising: an extraction module for acquiring a plurality of question-answer pairs in a data set and extracting at least one question word and at least one answer word of each question-answer pair; a determination module for determining contexts of the question words and the answer words; a training module for training a preset model with the question words, answer words, and contexts as training samples to obtain a word-vector set; a receiving module for receiving question information to be responded to; and an acquisition module for acquiring, based on the word-vector set, answer information matching the question information from the data set.
With the information acquisition method and device provided by the present application, a plurality of question-answer pairs in a data set may first be acquired and at least one question word and at least one answer word of each pair extracted; the contexts of the question words and answer words are then determined; a preset model is next trained with the question words, answer words, and contexts as training samples to obtain a word-vector set; finally, question information to be responded to is received and, based on the word-vector set, matching answer information is acquired from the data set. The present application trains word vectors by evaluating the semantic correlation of question-answer pairs, which improves the speed and accuracy of word-vector training. Because matching information is then retrieved with the word vectors, no complex supervised neural-network training is needed, improving the speed and accuracy of information acquisition.
Brief description of the drawings
Other features, objects, and advantages of the present application will become more apparent by reading the following detailed description of non-limiting embodiments, made with reference to the accompanying drawings:
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the present application may be applied;
Fig. 2 shows a flowchart of one embodiment of the information acquisition method provided according to the present application;
Fig. 3 shows a flowchart of another embodiment of the information acquisition method provided according to the present application;
Fig. 4 shows a flowchart of a further embodiment of the information acquisition method provided according to the present application;
Fig. 5 shows a flowchart of one embodiment of a method, provided according to the present application, for acquiring answer information matching the question information from the data set;
Fig. 6 shows a schematic diagram of the functional module architecture of one embodiment of an information acquisition device 600 provided according to the present application; and
Fig. 7 shows a schematic structural diagram of a computer system 700 suitable for implementing a terminal device or server of the embodiments of the present application.
Detailed description of the embodiments
The present application is described in further detail below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the related invention and are not limitations on it. It should also be noted that, for convenience of description, only the parts relevant to the invention are shown in the drawings.
It should be noted that, where no conflict arises, the embodiments of the present application and the features within them may be combined with one another. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Please refer to Fig. 1, which illustrates an exemplary system architecture 100 to which embodiments of the present application may be applied.
As shown in Fig. 1, the system architecture 100 may comprise terminal devices 101 and 102, a network 103, and a server 104. The network 103 provides the medium of the communication links between the terminal devices 101, 102 and the server 104, and may include various connection types, such as wired or wireless communication links or fiber-optic cables.
A user 110 may use the terminal devices 101, 102 to interact with the server 104 through the network 103, so as to receive or send messages. For example, the user may obtain from the server 104, through the network 103, answer information matching question information to be responded to. The terminal devices 101, 102 may be installed with various communication client applications, such as instant messaging tools, mail clients, and social platform software.
The terminal devices 101, 102 may be various electronic devices, including but not limited to personal computers, smartphones, tablet computers, and personal digital assistants.
The server 104 may be a server providing various services. For example, the server may store and analyse received data, and feed the processing results back to the terminal devices.
It should be noted that the information acquisition method provided by the embodiments of the present application may be executed by the terminal devices 101, 102 or by the server 104; likewise, the information acquisition device may be arranged in the terminal devices 101, 102 or in the server 104. In certain embodiments, the preset model may be trained in the server 104, and the resulting word-vector set stored in the terminal devices 101, 102 for acquiring answer information matching the question information. For example, if the network 103 is available, the server 104 may acquire the answer information matching the question information to be responded to from the data set and return it; if there is no network or the network 103 is unavailable, the terminal devices 101, 102 may acquire the matching answer information from the data set directly.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative; there may be any number of each, according to implementation needs.
With further reference to Fig. 2, a flow 200 of one embodiment of the information acquisition method provided according to the present application is illustrated.
As shown in Fig. 2, in step 201, a plurality of question-answer pairs in a data set are acquired, and at least one question word and at least one answer word of each question-answer pair are extracted.
To acquire information on the basis of word vectors, samples for training the word vectors must first be collected. In this embodiment, a plurality of question-answer pairs in the data set may first be acquired as the samples for training the word vectors. The data set may be, for example, a pre-built database containing a large number of question-answer pairs; for instance, question-answer pairs may be obtained from the network and saved in the data set. The data set may be kept on a server or on a terminal.
After the plurality of question-answer pairs in the data set are acquired, at least one question word and at least one answer word of each pair may further be extracted. Each question-answer pair consists of one question sentence and one or more answer sentences. In this embodiment, the question sentence and the answer sentences of a pair may each be split into one or more words, so that every word in a question sentence or an answer sentence can be extracted as a question word or an answer word respectively.
In an optional implementation of this embodiment, the question-answer pairs may comprise speech question-answer pairs as well as text question-answer pairs. When a speech question-answer pair is acquired, it may be converted into a text question-answer pair by speech recognition, and at least one question word and at least one answer word then extracted from the converted text pair.
In an optional implementation of this embodiment, in order to distinguish question words from answer words, a first prefix may be added to each question word and a second prefix to each answer word. For example, each question word may be given the prefix "Q" and each answer word the prefix "A".
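The extraction and prefixing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the "Q_"/"A_" prefixes follow the example in the text, and the whitespace split stands in for a real word segmenter (the original Chinese sentences would need a segmenter such as jieba).

```python
def extract_qa_words(question, answer):
    # Split each sentence into words (whitespace split as a stand-in
    # for a real segmenter) and tag them so question words and answer
    # words remain distinguishable during training.
    q_words = ["Q_" + w for w in question.split()]
    a_words = ["A_" + w for w in answer.split()]
    return q_words, a_words

q, a = extract_qa_words("is the Lenovo notebook easy to use",
                        "personally I feel the Lenovo computer is pretty good")
```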
In step 202, the contexts of the question words and the answer words are determined.
After at least one question word and at least one answer word of each question-answer pair have been extracted in step 201, the contexts of the question words and answer words may further be determined. For example, by a predefined rule the question words and answer words of the same question-answer pair may be given an identical context, or each question word and each answer word may be given its context in the original question sentence or answer sentence. In one implementation, a context length may be set for the question words and answer words; the contexts may be set to the same length (e.g. 7) or to different lengths.
In step 203, a preset model is trained with the question words, answer words, and contexts as training samples, and a word-vector set is obtained.
After the question words and answer words in the data set, together with their contexts, have been obtained, the preset training model may be trained with these data as sample data. Since the final purpose of training is to determine the word-vector set, the word-vector set may be regarded as an unknown parameter of the preset model, and the model then trained. When this parameter allows the preset model to meet a specific training objective, the parameter at that point may be taken to be the word-vector set to be determined.
In an optional implementation of this embodiment, each word vector may be a low-dimensional real-valued vector whose dimension is no more than 1000. For example, the finally determined word vector may take a form such as [0.355, -0.687, -0.168, 0.103, -0.231, ...]: a low-dimensional real-valued vector whose dimension is generally an integer not exceeding 1000. If the dimension is too small, the differences between words cannot be represented adequately; if it is too large, the amount of computation grows. Optionally, the dimension of the word vectors may lie between 50 and 1000, balancing accuracy and computational efficiency.
In step 204, question information to be responded to is received.
After the word vectors are obtained in step 203, information may be acquired on their basis. First, question information to be responded to may be received. Specifically, a user may enter the question to be queried in the search box of a browser; the question may be a keyword or a complete sentence, and the content entered by the user serves as the question information to be responded to.
In step 205, based on the word-vector set, answer information matching the question information is acquired from the data set.
In this embodiment, after the search system receives the question information entered by the user, it may first search the data set for the question information to obtain a plurality of answer information items corresponding to it, and then retrieve among them one or more answer information items matching the question information. For example, the matching degree between the question information and each answer information item may be computed on the basis of the word-vector set, and answer information whose matching degree satisfies a preset condition (e.g. is greater than 80%) taken as the matching answer information.
In an optional implementation of this embodiment, the one or more matching answer information items obtained may be presented on the terminal for the user to view.
With the information acquisition method provided by this embodiment, a plurality of question-answer pairs in a data set may first be acquired and at least one question word and at least one answer word of each pair extracted; the contexts of the question words and answer words are then determined; a preset model is next trained with the question words, answer words, and contexts as training samples to obtain a word-vector set; finally, question information to be responded to is received, and answer information matching it is acquired from the data set on the basis of the word-vector set. Training the word vectors by evaluating the semantic correlation of the question-answer pairs improves the speed and accuracy of word-vector training, and because matching information is retrieved with the word vectors, no complex supervised neural-network training is needed, improving the speed and accuracy of information acquisition.
With further reference to Fig. 3, a flow 300 of another embodiment of the information acquisition method provided according to the present application is illustrated.
As shown in Fig. 3, in step 301, a plurality of question-answer pairs in a data set are acquired, and at least one question word and at least one answer word of each question-answer pair are extracted.
In this embodiment, step 301 of flow 300 is substantially identical to step 201 of flow 200 and is not described again here.
In step 302, the context of each question word is determined.
In this embodiment, the context of each question word may first be determined. For example, the original context of the question sentence in which a question word appears may be taken as that word's context. For the following question-answer pair (question: "Is the Lenovo notebook easy to use?"; answer: "Personally I feel the Lenovo computer is pretty good"), the context of the question word "notebook" may be determined to be "Is the Lenovo easy to use".
In step 303, the context of any question word is determined as the context of all answer words of the question-answer pair to which that question word belongs.
In this embodiment, the context of any question word may be determined as the context of all the answer words of the question-answer pair in which that question word appears. Within one question-answer pair, a question sentence may correspond to one or more answer sentences, i.e. each question word may correspond to one or more answer words. In this embodiment, any question word of a pair and all of its corresponding answer words may be given the same context: specifically, the context of the question word is determined as the context of all answer words of the pair. For example, for the question-answer pair above, when the context of the question word "notebook" is "Is the Lenovo easy to use", the context of each answer word "Lenovo", "computer", "personally", "feel", "good" may all be set to that same context, "Is the Lenovo easy to use".
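Steps 302-303 can be sketched as a small hypothetical helper (the function name and the default window of 7 are illustrative, not from the patent): each question word takes the rest of its question sentence as context, and every answer word of the pair is then assigned that same context.

```python
def build_samples(q_words, a_words, window=7):
    # Step 302: a question word's context is the remaining words of
    # its question sentence, clipped to `window` words.
    # Step 303: every answer word of the pair shares that context.
    samples = []
    for i, qw in enumerate(q_words):
        ctx = (q_words[:i] + q_words[i + 1:])[:window]
        samples.append((qw, ctx))
        for aw in a_words:
            samples.append((aw, ctx))
    return samples

samples = build_samples(["Lenovo", "notebook", "handy"], ["feel", "good"])
```

Each question word thus contributes one sample for itself and one per answer word, all with the identical context vector source.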
In step 304, a preset model is trained with the question words, answer words, and contexts as training samples, and a word-vector set is obtained.
In this embodiment, the preset model may be the following function:
p = \sum_{\langle q, a \rangle \in D} \sum_{i=1}^{|q|} \left( \log p(q_i \mid C_{q_i}) + \sum_{j=1}^{|a|} \log p(a_j \mid C_{q_i}) \right)
where ⟨q, a⟩ is a question-answer pair in the data set D; |q| is the number of question words in the pair; q_i is the vector of the i-th question word of the pair, and C_{q_i} is the context vector of the i-th question word; |a| is the number of answer words in the pair; and a_j is the vector of the j-th answer word of the pair. p(q_i | C_{q_i}) and p(a_j | C_{q_i}) are determined by the following formula:
p(w \mid C_w) = \frac{\exp(w \cdot C_w)}{\sum_{u=1}^{V} \exp(w_u \cdot C_{w_u})}
where w is the vector of any word and C_w is the context vector of that word; w_u is the vector of the u-th word in the data set D and C_{w_u} is its context vector; and V is the number of words contained in the data set D.
As can be seen from the above formulas, in this embodiment, when the vector q_i of a question word of a pair is determined, the context vector of every answer word of the pair is identical, namely the context vector C_{q_i} of that question word.
In an optional implementation of this embodiment, the word-vector set may be determined with the maximization of the above function as the training objective.
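The inner probability p(w | C_w) is a softmax over the vocabulary. A minimal numeric sketch, assuming toy two-dimensional vectors and a `vocab` list of (word vector, context vector) pairs standing in for the data set D:

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def word_prob(w_vec, c_vec, vocab):
    # p(w | C_w) = exp(w . C_w) / sum_u exp(w_u . C_{w_u}), where the
    # denominator runs over every (word vector, context vector) pair
    # in the data set.
    denom = sum(math.exp(dot(wu, cu)) for wu, cu in vocab)
    return math.exp(dot(w_vec, c_vec)) / denom

vocab = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
p = word_prob([1.0, 0.0], [1.0, 0.0], vocab)  # both terms equal, so 0.5
```

The training objective then maximizes the sum of log-probabilities of this form over all question and answer words.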
In step 305, question information to be responded to is received.
In step 306, based on the word-vector set, answer information matching the question information is acquired from the data set.
In this embodiment, steps 305-306 of flow 300 are substantially identical to steps 204-205 of flow 200 respectively and are not described again here.
In this embodiment, after the contexts of the question words are determined, the context of any question word may be determined as the context of all the answer words of its question-answer pair, and the preset model then trained with the question words, answer words, and contexts as training samples to obtain the word-vector set, which can improve the accuracy of the word vectors.
With further reference to Fig. 4, a flow 400 of a further embodiment of the information acquisition method provided according to the present application is illustrated.
As shown in Fig. 4, in step 401, a plurality of question-answer pairs in a data set are acquired, and at least one question word and at least one answer word of each question-answer pair are extracted.
In this embodiment, step 401 of flow 400 is substantially identical to step 201 of flow 200 and is not described again here.
In step 402, the context of each answer word is determined.
In this embodiment, the context of each answer word may first be determined. For example, the original context of the answer sentence in which an answer word appears may be taken as that word's context. For the following question-answer pair (question: "Is the Lenovo notebook easy to use?"; answer: "Personally I feel the Lenovo computer is pretty good"), the context of the answer word "computer" may be determined to be "Personally I feel the Lenovo is pretty good". Alternatively or preferably, when a context length (e.g. 7) is set for the answer words, the context of the answer word "computer" may be truncated, for example to "Personally I feel the Lenovo".
In step 403, the context of any answer word is determined as the context of all question words of the question-answer pair to which that answer word belongs.
In this embodiment, the context of any answer word may be determined as the context of all the question words of the question-answer pair in which that answer word appears. Within one question-answer pair, an answer sentence may correspond to one or more question words. In this embodiment, any answer word of a pair and all of its corresponding question words may be given the same context: specifically, the context of the answer word is determined as the context of all question words of the pair. For example, for the question-answer pair above, when the context of the answer word "computer" is "Personally I feel the Lenovo is pretty good", the context of each question word "Lenovo", "notebook", "easy to use" may all be set to that same context. Alternatively or preferably, when a context length (e.g. 7) is set and the context of the answer word "computer" is truncated to "Personally I feel the Lenovo", the context of each question word "Lenovo", "notebook", "easy to use" is likewise set to that truncated context.
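Steps 402-403, including the optional context-length cap, can be sketched as a hypothetical helper (the function name, the dictionary return shape, and the default length of 7 are illustrative):

```python
def share_answer_context(answer_ctx, q_words, max_len=7):
    # Step 402 (optional): clip the answer word's context to the
    # preset length.
    ctx = answer_ctx[:max_len]
    # Step 403: hand that single context to every question word of
    # the question-answer pair.
    return {qw: ctx for qw in q_words}

ctx = ["personally", "I", "feel", "the", "Lenovo", "is", "pretty", "good"]
shared = share_answer_context(ctx, ["Lenovo", "notebook", "handy"])
```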
In step 404, a preset model is trained with the question words, answer words, and contexts as training samples, and a word-vector set is obtained.
In this embodiment, the preset model may be the following function:
p = \sum_{\langle q, a \rangle \in D} \sum_{j=1}^{|a|} \left( \log p(a_j \mid C_{a_j}) + \sum_{i=1}^{|q|} \log p(q_i \mid C_{a_j}) \right)
where ⟨q, a⟩ is a question-answer pair in the data set D; |q| is the number of question words in the pair; q_i is the vector of the i-th question word of the pair; C_{a_j} is the context vector of the j-th answer word of the pair; |a| is the number of answer words in the pair; and a_j is the vector of the j-th answer word of the pair. p(a_j | C_{a_j}) and p(q_i | C_{a_j}) are determined by the following formula:
p(w \mid C_w) = \frac{\exp(w \cdot C_w)}{\sum_{u=1}^{V} \exp(w_u \cdot C_{w_u})}
where w is the vector of any word and C_w is the context vector of that word; w_u is the vector of the u-th word in the data set D and C_{w_u} is its context vector; and V is the number of words contained in the data set D.
As can be seen from the above formulas, in this embodiment, when the vector a_j of an answer word of a pair is determined, the context vector of every question word of the pair is identical, namely the context vector C_{a_j} of that answer word.
In an optional implementation of this embodiment, the word-vector set may be determined with the maximization of the above function as the training objective.
In step 405, question information to be responded to is received.
In step 406, based on the word-vector set, answer information matching the question information is acquired from the data set.
In this embodiment, steps 405-406 of flow 400 are substantially identical to steps 204-205 of flow 200 respectively and are not described again here.
In this embodiment, after the contexts of the answer words are determined, the context of any answer word may be determined as the context of all the question words of its question-answer pair, and the preset model then trained with the question words, answer words, and contexts as training samples to obtain the word-vector set, which can improve the accuracy of the word vectors.
With further reference to Fig. 5, a flow 500 of one embodiment of the method, provided according to the present application, for acquiring answer information matching the question information from the data set is illustrated.
As shown in Fig. 5, in step 501, the question sentence vector of the question information and the answer sentence vector of each answer information item in the data set are constructed according to the word-vector set.
In this embodiment, answer information matching the question information to be responded to may be acquired from the data set on the basis of the trained word vectors. Specifically, the question sentence vector of the question information and the answer sentence vector of each answer information item in the data set may first be constructed according to the word-vector set.
In one implementation, the question sentence vector of the question information and the answer sentence vector of each answer information item in the data set may be constructed according to the following formula:
s = \sum_{i=1}^{m} \frac{1}{\log c_i} \, w_i
where s is the vector of any sentence, m is the length of the sentence, w_i is the word vector of the i-th word in the sentence, and c_i is the number of times the i-th word occurs in the data set.
First, the question information and each answer information item in the data set may be split into words. When splitting, if the question information or answer information is a sentence composed of several words, it may be split into words according to ordinary syntactic rules; if it is a single word, that word is taken as the result of the split. In this way, each question information or answer information item is split into at least one word. Each word is then represented by its trained vector, and the above formula is used to construct, for the question information to be responded to and for each answer information item in the data set, a question sentence vector and answer sentence vectors each composed of one or more word vectors.
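The construction formula can be sketched directly. A minimal Python version, assuming plain list vectors and that every count c_i is greater than 1 so that log c_i is positive (c_i = 1 would divide by zero):

```python
import math

def sentence_vector(word_vecs, counts):
    # s = sum_i (1 / log c_i) * w_i: each word vector is weighted by
    # the reciprocal log of its data-set frequency, so rarer words
    # contribute more to the sentence vector.
    dim = len(word_vecs[0])
    s = [0.0] * dim
    for w, c in zip(word_vecs, counts):
        weight = 1.0 / math.log(c)
        for d in range(dim):
            s[d] += weight * w[d]
    return s

# Two toy word vectors with counts e^2 and e give weights 0.5 and 1.0.
s = sentence_vector([[1.0, 0.0], [0.0, 1.0]], [math.e ** 2, math.e])
```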
In step 502, the correlation between the question sentence vector and each answer sentence vector is computed.
After the question sentence vector and the answer sentence vector of each answer information item in the data set have been obtained in step 501, the correlation between the question sentence vector and each answer sentence vector may be computed. The correlation represents the degree of association between two sentence vectors: the larger the value, the closer the two vectors, with a range of [-1, 1]. A correlation of 1 may be taken to mean the two sentence vectors are identical, and a correlation of -1 to mean they are completely different.
In one implementation, the correlation between the question sentence vector and an answer sentence vector may be computed according to the following formula:
$$ \mathrm{Score}(s_a, s_b) = \frac{s_a \cdot s_b}{\lVert s_a \rVert_2 \, \lVert s_b \rVert_2} + \lambda\, C(s_a, s_b) $$
where s_a and s_b are the vectors of question sentence a and answer sentence b respectively, Score(s_a, s_b) is the correlation between the question sentence vector s_a and the answer sentence vector s_b, λ is a constant between 0.18 and 0.24, and C(s_a, s_b) is the word co-occurrence count between sentence a and sentence b.
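As a sketch, this correlation is cosine similarity plus a co-occurrence bonus (hypothetical helper name; λ = 0.2 is just one value inside the stated 0.18-0.24 range):

```python
import math

def score(s_a, s_b, cooccur, lam=0.2):
    # Correlation = cosine similarity of the two sentence vectors
    # plus lambda times the word co-occurrence count C(s_a, s_b).
    dot = sum(x * y for x, y in zip(s_a, s_b))
    norm_a = math.sqrt(sum(x * x for x in s_a))
    norm_b = math.sqrt(sum(y * y for y in s_b))
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    return cosine + lam * cooccur
```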
In step 503, the answer information matching the question information is determined based on the correlation.
After the correlation between the question sentence vector and each answer sentence vector has been computed in step 502, one or more pieces of answer information matching the question information can be determined according to the concrete correlation values. In one possible implementation, the answer information corresponding to the answer sentence vector with the highest correlation to the question sentence vector is determined to be the answer information matching the question information. In another implementation, a correlation threshold may be preset, and every piece of answer information whose answer sentence vector has a correlation with the question sentence vector greater than or equal to that threshold is determined to be answer information matching the question information.
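The two selection variants can be sketched as follows (hypothetical function; the correlation values are assumed to have been computed already):

```python
def select_answers(correlations, answers, threshold=None):
    # Pair each candidate answer with its correlation to the question vector.
    scored = list(zip(correlations, answers))
    if threshold is None:
        # Variant 1: the single answer with the highest correlation.
        return [max(scored, key=lambda t: t[0])[1]]
    # Variant 2: every answer at or above the preset threshold.
    return [a for s, a in scored if s >= threshold]
```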
With the information acquisition method of this embodiment, the question sentence vector of the question information and the answer sentence vector of each piece of answer information in the dataset are first constructed; the correlation between the question sentence vector and each answer sentence vector is then computed; finally, the answer information matching the question information is determined based on the correlation. Because the present application obtains matching information based on word vectors, no complex supervised neural-network training is required, which improves both the speed and the accuracy of information acquisition.
It should be noted that although the operations of the method of the present invention are described in a particular order in the accompanying drawings, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed to achieve the desired result. On the contrary, the steps depicted in the flowcharts may be executed in a different order. Additionally or alternatively, some steps may be omitted, multiple steps may be combined and executed as one step, and/or one step may be decomposed into multiple steps.
With further reference to Fig. 6, a schematic diagram of the functional module architecture of an embodiment of the information acquisition device 600 provided according to the present application is shown.
As shown in Fig. 6, the information acquisition device 600 provided by this embodiment comprises: an extraction module 610, a determination module 620, a training module 630, a receiving module 640 and an acquisition module 650. The extraction module 610 is configured to acquire a plurality of question-answer pairs in a dataset and to extract at least one question word and at least one answer word of each question-answer pair; the determination module 620 is configured to determine the contexts of the question words and the answer words; the training module 630 is configured to train a preset model using the question words, answer words and contexts as training samples, to obtain a word vector set; the receiving module 640 is configured to receive question information to be answered; and the acquisition module 650 is configured to acquire, based on the word vector set, answer information matching the question information from the dataset.
In an optional implementation of this embodiment, the determination module 620 is configured to determine the contexts of the question words and the answer words according to the following steps: determine the context of each question word; take the context of any question word as the context of all answer words of the question-answer pair in which that question word appears.
In another optional implementation of this embodiment, the preset model is the following function:
$$ p = \sum_{\langle q,a \rangle \in D} \sum_{i=1}^{|q|} \left( \log p(q_i \mid C_{q_i}) + \sum_{j=1}^{|a|} \log p(a_j \mid C_{q_i}) \right) $$
where <q, a> is a question-answer pair in dataset D, |q| is the number of question words in the pair, q_i is the vector of the i-th question word of the pair, C_{q_i} is the context vector of the i-th question word of the pair, |a| is the number of answer words in the pair, and a_j is the vector of the j-th answer word of the pair; p(q_i | C_{q_i}) and p(a_j | C_{q_i}) are determined by the following formula:
$$ p(w \mid C_w) = \frac{\exp(w \cdot C_w)}{\sum_{u=1}^{V} \exp(w_u \cdot C_{w_u})} $$
where w is the vector of any word, C_w is the context vector of that word, w_u is the vector of the u-th word in dataset D, C_{w_u} is the context vector of the u-th word, and V is the number of words contained in dataset D.
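Under these definitions, p(w | C_w) is a softmax-style normalization over all V words of the dataset. A toy sketch (hypothetical names, tiny vocabulary):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def context_prob(w_vec, c_vec, vocab_pairs):
    # p(w | C_w) = exp(w . C_w) / sum_u exp(w_u . C_{w_u})
    # vocab_pairs: list of (word vector, context vector) over all V words of D.
    numerator = math.exp(dot(w_vec, c_vec))
    denominator = sum(math.exp(dot(wu, cu)) for wu, cu in vocab_pairs)
    return numerator / denominator
```

In practice the full sum over V is expensive, which is why trained systems commonly approximate it; this sketch only illustrates the formula as written.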
In another optional implementation of this embodiment, the determination module 620 is configured to determine the contexts of the question words and the answer words according to the following steps: determine the context of each answer word; take the context of any answer word as the context of all question words of the question-answer pair in which that answer word appears.
In another optional implementation of this embodiment, the preset model is the following function:
$$ p = \sum_{\langle q,a \rangle \in D} \sum_{j=1}^{|a|} \left( \log p(a_j \mid C_{a_j}) + \sum_{i=1}^{|q|} \log p(q_i \mid C_{a_j}) \right) $$
where <q, a> is a question-answer pair in dataset D, |q| is the number of question words in the pair, q_i is the vector of the i-th question word of the pair, C_{a_j} is the context vector of the j-th answer word of the pair, |a| is the number of answer words in the pair, and a_j is the vector of the j-th answer word of the pair; p(a_j | C_{a_j}) and p(q_i | C_{a_j}) are determined by the following formula:
$$ p(w \mid C_w) = \frac{\exp(w \cdot C_w)}{\sum_{u=1}^{V} \exp(w_u \cdot C_{w_u})} $$
where w is the vector of any word, C_w is the context vector of that word, w_u is the vector of the u-th word in dataset D, C_{w_u} is the context vector of the u-th word, and V is the number of words contained in dataset D.
In another optional implementation of this embodiment, the training module 630 is configured to train the preset model and obtain the word vector set according to the following step: determine the word vector set with maximization of the preset model as the training objective.
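Taking maximization of the model function as the training objective can be illustrated with a generic gradient-ascent routine (a toy sketch only; the objective and starting point below are hypothetical, and practical word-vector training uses stochastic gradients with sampling approximations rather than finite differences):

```python
def gradient_ascent(objective, params, lr=0.1, steps=200, eps=1e-6):
    # Numerically maximize objective(params) by finite-difference gradient ascent.
    params = list(params)
    for _ in range(steps):
        base = objective(params)
        grad = []
        for i in range(len(params)):
            shifted = params[:]
            shifted[i] += eps
            grad.append((objective(shifted) - base) / eps)
        # Move parameters uphill along the estimated gradient.
        params = [p + lr * g for p, g in zip(params, grad)]
    return params

# Toy objective with a unique maximum at (1, -2).
best = gradient_ascent(lambda p: -(p[0] - 1) ** 2 - (p[1] + 2) ** 2, [0.0, 0.0])
```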
In another optional implementation of this embodiment, the acquisition module 650 comprises: a construction submodule, configured to construct, according to the word vector set, the question sentence vector of the question information and the answer sentence vector of each piece of answer information in the dataset; a calculation submodule, configured to compute the correlation between the question sentence vector and each answer sentence vector; and a determination submodule, configured to determine, based on the correlation, the answer information matching the question information.
In another optional implementation of this embodiment, the construction submodule is configured to construct the question sentence vector of the question information and the answer sentence vector of each piece of answer information in the dataset according to the following formula:
$$ s = \sum_{i=1}^{m} \frac{1}{\log c_i}\, w_i $$
where s is the vector of the sentence, m is the length of the sentence (in words), w_i is the word vector of the i-th word in the sentence, and c_i is the number of times the i-th word of the sentence occurs in the dataset.
In another optional implementation of this embodiment, the calculation submodule is configured to compute the correlation between the question sentence vector and an answer sentence vector according to the following formula:
$$ \mathrm{Score}(s_a, s_b) = \frac{s_a \cdot s_b}{\lVert s_a \rVert_2 \, \lVert s_b \rVert_2} + \lambda\, C(s_a, s_b) $$
where s_a and s_b are the vectors of question sentence a and answer sentence b respectively, Score(s_a, s_b) is the correlation between the question sentence vector s_a and the answer sentence vector s_b, λ is a constant between 0.18 and 0.24, and C(s_a, s_b) is the word co-occurrence count between sentence a and sentence b.
In another optional implementation of this embodiment, the question-answer pairs comprise voice question-answer pairs and text question-answer pairs; the device further comprises: a conversion module, configured to convert the voice question-answer pairs into text question-answer pairs.
It should be appreciated that all of the units or modules recorded in the information acquisition device shown in Fig. 6 correspond to the steps of the methods described with reference to Figs. 2-5. Thus, the operations and features described above for the methods apply equally to the device shown in Fig. 6 and the modules it comprises, and are not repeated here.
With the information acquisition device provided by this embodiment, the extraction module first acquires a plurality of question-answer pairs in the dataset and extracts at least one question word and at least one answer word of each question-answer pair; the determination module then determines the contexts of the question words and the answer words; the training module next trains the preset model using the question words, answer words and contexts as training samples, to obtain a word vector set; finally, the receiving module receives question information to be answered, and the acquisition module acquires, based on the word vector set, the answer information matching the question information from the dataset. Training word vectors by evaluating the relevance of question-answer pairs at the semantic level improves both the speed and the accuracy of word vector training. And because matching information is obtained based on word vectors, no complex supervised neural-network training is required, improving the speed and accuracy of information acquisition.
Referring now to Fig. 7, a schematic structural diagram of a computer system 700 suitable for implementing the terminal device or server of the embodiments of the present application is shown.
As shown in Fig. 7, the computer system 700 comprises a central processing unit (CPU) 701, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the system 700. The CPU 701, the ROM 702 and the RAM 703 are connected to one another via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input section 706 comprising a keyboard, a mouse and the like; an output section 707 comprising a cathode ray tube (CRT), a liquid crystal display (LCD) and the like, and a loudspeaker; a storage section 708 comprising a hard disk and the like; and a communication section 709 comprising a network interface card such as a LAN card or a modem. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read from it can be installed into the storage section 708 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure comprises a computer program product which comprises a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for executing the methods shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 709, and/or installed from the removable medium 711.
The flowcharts and block diagrams in the accompanying drawings illustrate the architectures, functions and operations of possible implementations of the systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a portion of code, which comprises one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented in software or in hardware. The described units or modules may also be arranged in a processor; for example, they may be described as: a processor comprising an extraction module, a determination module, a training module, a receiving module and an acquisition module. The names of these units or modules do not, under certain circumstances, constitute a limitation of the units or modules themselves; for example, the extraction module may also be described as "a module for acquiring a plurality of question-answer pairs in a dataset and extracting at least one question word and at least one answer word of each question-answer pair".
As another aspect, the present application also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the device of the above embodiments, or a computer-readable storage medium that exists separately and is not assembled into a terminal. The computer-readable storage medium stores one or more programs, which are used by one or more processors to perform the information acquisition methods described in the present application.
The above description is only a preferred embodiment of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the particular combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the described inventive concept; for example, technical solutions formed by mutually replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.

Claims (17)

1. An information acquisition method, characterized in that the method comprises:
acquiring a plurality of question-answer pairs in a dataset, and extracting at least one question word and at least one answer word of each question-answer pair;
determining the contexts of the question words and the answer words;
training a preset model using the question words, answer words and contexts as training samples, to obtain a word vector set;
receiving question information to be answered;
acquiring, based on the word vector set, answer information matching the question information from the dataset.
2. The method according to claim 1, characterized in that determining the contexts of the question words and the answer words comprises:
determining the context of each question word;
taking the context of any question word as the context of all answer words of the question-answer pair in which that question word appears.
3. The method according to claim 1, characterized in that determining the contexts of the question words and the answer words comprises:
determining the context of each answer word;
taking the context of any answer word as the context of all question words of the question-answer pair in which that answer word appears.
4. The method according to claim 1, characterized in that training the preset model to obtain the word vector set comprises:
determining the word vector set with maximization of the preset model as the training objective.
5. The method according to claim 1, characterized in that acquiring, based on the word vector set, answer information matching the question information from the dataset comprises:
constructing, according to the word vector set, a question sentence vector of the question information and an answer sentence vector of each piece of answer information in the dataset;
computing the correlation between the question sentence vector and each answer sentence vector;
determining, based on the correlation, the answer information matching the question information.
6. The method according to claim 5, characterized in that constructing the question sentence vector of the question information and the answer sentence vector of each piece of answer information in the dataset comprises:
constructing a sentence vector according to the word vector of each word in the sentence and the number of times each word occurs in the dataset.
7. The method according to claim 5, characterized in that computing the correlation between the question sentence vector and the answer sentence vector comprises:
determining the correlation according to the question sentence vector, the answer sentence vector, and the word co-occurrence count between the question sentence and the answer sentence.
8. The method according to any one of claims 1-7, characterized in that the question-answer pairs comprise voice question-answer pairs and text question-answer pairs;
the method further comprising:
converting the voice question-answer pairs into text question-answer pairs.
9. The method according to claim 8, characterized in further comprising:
adding a first prefix to each question word, and adding a second prefix to each answer word.
10. The method according to claim 9, characterized in further comprising:
presenting one or more pieces of answer information matching the question information.
11. The method according to claim 10, characterized in that each word vector is a low-dimensional real-valued vector whose dimension is not greater than 1000.
12. An information acquisition device, characterized in that the device comprises:
an extraction module, configured to acquire a plurality of question-answer pairs in a dataset, and to extract at least one question word and at least one answer word of each question-answer pair;
a determination module, configured to determine the contexts of the question words and the answer words;
a training module, configured to train a preset model using the question words, answer words and contexts as training samples, to obtain a word vector set;
a receiving module, configured to receive question information to be answered;
an acquisition module, configured to acquire, based on the word vector set, answer information matching the question information from the dataset.
13. The device according to claim 12, characterized in that the determination module is configured to determine the contexts of the question words and the answer words according to the following steps:
determining the context of each question word;
taking the context of any question word as the context of all answer words of the question-answer pair in which that question word appears.
14. The device according to claim 12, characterized in that the determination module is configured to determine the contexts of the question words and the answer words according to the following steps:
determining the context of each answer word;
taking the context of any answer word as the context of all question words of the question-answer pair in which that answer word appears.
15. The device according to claim 12, characterized in that the training module is configured to train the preset model and obtain the word vector set according to the following step:
determining the word vector set with maximization of the preset model as the training objective.
16. The device according to claim 12, characterized in that the acquisition module comprises:
a construction submodule, configured to construct, according to the word vector set, a question sentence vector of the question information and an answer sentence vector of each piece of answer information in the dataset;
a calculation submodule, configured to compute the correlation between the question sentence vector and each answer sentence vector;
a determination submodule, configured to determine, based on the correlation, the answer information matching the question information.
17. The device according to any one of claims 12-16, characterized in that the question-answer pairs comprise voice question-answer pairs and text question-answer pairs;
the device further comprising:
a conversion module, configured to convert the voice question-answer pairs into text question-answer pairs.
CN201510441024.5A 2015-07-24 2015-07-24 Information acquisition method and device Pending CN105095444A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510441024.5A CN105095444A (en) 2015-07-24 2015-07-24 Information acquisition method and device


Publications (1)

Publication Number Publication Date
CN105095444A true CN105095444A (en) 2015-11-25

Family

ID=54575881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510441024.5A Pending CN105095444A (en) 2015-07-24 2015-07-24 Information acquisition method and device

Country Status (1)

Country Link
CN (1) CN105095444A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101086843A (en) * 2006-06-07 2007-12-12 中国科学院自动化研究所 A sentence similarity recognition method for voice answer system
CN101377777A (en) * 2007-09-03 2009-03-04 北京百问百答网络技术有限公司 Automatic inquiring and answering method and system
CN104090890A (en) * 2013-12-12 2014-10-08 深圳市腾讯计算机系统有限公司 Method, device and server for obtaining similarity of key words
CN104699763A (en) * 2015-02-11 2015-06-10 中国科学院新疆理化技术研究所 Text similarity measuring system based on multi-feature fusion
CN104778158A (en) * 2015-03-04 2015-07-15 新浪网技术(中国)有限公司 Method and device for representing text


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
祖永亮: "基于多特征融合的中文自动问答系统研究与设计", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106844368B (en) * 2015-12-03 2020-06-16 华为技术有限公司 Method for man-machine conversation, neural network system and user equipment
US11640515B2 (en) 2015-12-03 2023-05-02 Huawei Technologies Co., Ltd. Method and neural network system for human-computer interaction, and user equipment
WO2017092380A1 (en) * 2015-12-03 2017-06-08 华为技术有限公司 Method for human-computer dialogue, neural network system and user equipment
CN106844368A (en) * 2015-12-03 2017-06-13 华为技术有限公司 For interactive method, nerve network system and user equipment
CN108292305A (en) * 2015-12-04 2018-07-17 三菱电机株式会社 Method for handling sentence
CN107632987A (en) * 2016-07-19 2018-01-26 腾讯科技(深圳)有限公司 One kind dialogue generation method and device
US10740564B2 (en) 2016-07-19 2020-08-11 Tencent Technology (Shenzhen) Company Limited Dialog generation method, apparatus, and device, and storage medium
CN107632987B (en) * 2016-07-19 2018-12-07 腾讯科技(深圳)有限公司 A kind of dialogue generation method and device
CN106469212A (en) * 2016-09-05 2017-03-01 北京百度网讯科技有限公司 Man-machine interaction method based on artificial intelligence and device
US11645547B2 (en) 2016-09-05 2023-05-09 Beijing Baidu Netcom Science And Technology Co., Ltd. Human-machine interactive method and device based on artificial intelligence
CN106484664A (en) * 2016-10-21 2017-03-08 竹间智能科技(上海)有限公司 Similarity calculating method between a kind of short text
CN106484664B (en) * 2016-10-21 2019-03-01 竹间智能科技(上海)有限公司 Similarity calculating method between a kind of short text
CN106572001B (en) * 2016-10-31 2019-10-11 厦门快商通科技股份有限公司 A kind of dialogue method and system of intelligent customer service
CN106572001A (en) * 2016-10-31 2017-04-19 厦门快商通科技股份有限公司 Conversation method and system for intelligent customer service
CN107220296A (en) * 2017-04-28 2017-09-29 北京拓尔思信息技术股份有限公司 The generation method of question and answer knowledge base, the training method of neutral net and equipment
CN107220296B (en) * 2017-04-28 2020-01-17 北京拓尔思信息技术股份有限公司 Method for generating question-answer knowledge base, method and equipment for training neural network
CN107657575A (en) * 2017-09-30 2018-02-02 四川智美高科科技有限公司 A kind of government affairs Intelligent Service terminal device and application method based on artificial intelligence
CN107957989B9 (en) * 2017-10-23 2021-01-12 创新先进技术有限公司 Cluster-based word vector processing method, device and equipment
US10769383B2 (en) 2017-10-23 2020-09-08 Alibaba Group Holding Limited Cluster-based word vector processing method, device, and apparatus
CN107957989A (en) * 2017-10-23 2018-04-24 阿里巴巴集团控股有限公司 Term vector processing method, device and equipment based on cluster
CN107957989B (en) * 2017-10-23 2020-11-17 创新先进技术有限公司 Cluster-based word vector processing method, device and equipment
US10846483B2 (en) 2017-11-14 2020-11-24 Advanced New Technologies Co., Ltd. Method, device, and apparatus for word vector processing based on clusters
CN108170663A (en) * 2017-11-14 2018-06-15 阿里巴巴集团控股有限公司 Term vector processing method, device and equipment based on cluster
CN108376144B (en) * 2018-01-12 2021-10-12 上海大学 Man-machine multi-round conversation method for automatic scene switching based on deep neural network
CN108376144A (en) * 2018-01-12 2018-08-07 上海大学 Man-machine more wheel dialogue methods that scene based on deep neural network automatically switches
CN108536807B (en) * 2018-04-04 2022-03-25 联想(北京)有限公司 Information processing method and device
CN108536807A (en) * 2018-04-04 2018-09-14 联想(北京)有限公司 A kind of information processing method and device
CN108897771A (en) * 2018-05-30 2018-11-27 东软集团股份有限公司 Automatic question-answering method, device, computer readable storage medium and electronic equipment
CN108804611A (en) * 2018-05-30 2018-11-13 浙江大学 A kind of dialogue reply generation method and system based on self comment Sequence Learning
US10984793B2 (en) 2018-06-27 2021-04-20 Baidu Online Network Technology (Beijing) Co., Ltd. Voice interaction method and device
CN108920604A (en) * 2018-06-27 2018-11-30 百度在线网络技术(北京)有限公司 Voice interactive method and equipment
CN110059152A (en) * 2018-12-25 2019-07-26 阿里巴巴集团控股有限公司 Training method, device and equipment for a text information prediction model
CN109858528B (en) * 2019-01-10 2024-05-14 平安科技(深圳)有限公司 Recommendation system training method and device, computer equipment and storage medium
CN109858528A (en) * 2019-01-10 2019-06-07 平安科技(深圳)有限公司 Recommender system training method, device, computer equipment and storage medium
CN111444701A (en) * 2019-01-16 2020-07-24 阿里巴巴集团控股有限公司 Method and device for prompting inquiry
CN109815341B (en) * 2019-01-22 2023-10-10 安徽省泰岳祥升软件有限公司 Text extraction model training method, text extraction method and device
CN109815341A (en) * 2019-01-22 2019-05-28 安徽省泰岳祥升软件有限公司 Text extraction model training method, text extraction method and text extraction device
CN110008322A (en) * 2019-03-25 2019-07-12 阿里巴巴集团控股有限公司 Method and device for recommending dialogues in multi-turn conversation scenarios
CN110008322B (en) * 2019-03-25 2023-04-07 创新先进技术有限公司 Method and device for recommending dialogues in multi-turn conversation scene
CN109977428B (en) * 2019-03-29 2024-04-02 北京金山数字娱乐科技有限公司 Answer obtaining method and device
CN109977428A (en) * 2019-03-29 2019-07-05 北京金山数字娱乐科技有限公司 Answer obtaining method and device
CN110175333A (en) * 2019-06-04 2019-08-27 科大讯飞股份有限公司 Evidence guiding method, device, equipment and storage medium
CN110175333B (en) * 2019-06-04 2023-09-26 科大讯飞股份有限公司 Evidence guiding method, device, equipment and storage medium
CN112199476A (en) * 2019-06-23 2021-01-08 国际商业机器公司 Automated decision making to select a leg after partially correct answers in a conversational intelligent tutoring system
CN111061851B (en) * 2019-12-12 2023-08-08 中国科学院自动化研究所 Question generation method and system based on given facts
CN111061851A (en) * 2019-12-12 2020-04-24 中国科学院自动化研究所 Given fact-based question generation method and system
US11954441B2 (en) 2021-11-16 2024-04-09 Acer Incorporated Device and method for generating article markup information

Similar Documents

Publication Publication Date Title
CN105095444A (en) Information acquisition method and device
CN107491547A (en) Searching method and device based on artificial intelligence
CN108363790A (en) Method, apparatus, device and storage medium for evaluation
CN107220386A (en) Information-pushing method and device
CN110674271B (en) Question and answer processing method and device
CN104598611B (en) Method and system for ranking search entries
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
CN112000791A (en) Motor fault knowledge extraction system and method
US10824816B2 (en) Semantic parsing method and apparatus
US20110258054A1 (en) Automatic Generation of Bid Phrases for Online Advertising
CN104615767A (en) Searching-ranking model training method and device and search processing method
CN104657496A (en) Method and equipment for calculating information hot value
CN104471568A (en) Learning-based processing of natural language questions
CN113535963B (en) Long text event extraction method and device, computer equipment and storage medium
CN104715063A (en) Search ranking method and search ranking device
CN110765254A (en) Multi-document question-answering system model integrating multi-view answer reordering
CN111695338A (en) Interview content refining method, device, equipment and medium based on artificial intelligence
CN113051365A (en) Industrial chain map construction method and related equipment
CN111639247A (en) Method, apparatus, device and computer-readable storage medium for evaluating quality of review
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN113707299A (en) Auxiliary diagnosis method and device based on inquiry session and computer equipment
CN111831810A (en) Intelligent question and answer method, device, equipment and storage medium
CN114266443A (en) Data evaluation method and device, electronic equipment and storage medium
CN110310012B (en) Data analysis method, device, equipment and computer readable storage medium
CN112632377A (en) Recommendation method based on user comment emotion analysis and matrix decomposition

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20151125)