CN110390005A - A kind of data processing method and device

A kind of data processing method and device

Info

Publication number
CN110390005A
CN110390005A (application number CN201910666576.4A)
Authority
CN
China
Prior art keywords
question
answer
word
representation
embedded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910666576.4A
Other languages
Chinese (zh)
Inventor
吴玮 (Wu Wei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shannon Huiyu Technology Co Ltd
Original Assignee
Beijing Shannon Huiyu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shannon Huiyu Technology Co Ltd filed Critical Beijing Shannon Huiyu Technology Co Ltd
Priority to CN201910666576.4A
Publication of CN110390005A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/374 Thesaurus

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides a data processing method and device. The method comprises: processing a question and an answer to the question to obtain a word-embedded representation of the question and a word-embedded representation of the answer; compressing the word-embedded representation of the question to obtain a compressed word-embedded representation of the question; and calculating a matching value between the answer and the question according to the compressed word-embedded representation of the question and the word-embedded representation of the answer, and ranking the answers according to the obtained matching values. With the data processing method and device provided by the embodiments of the present invention, no manual participation is needed in the ranking of answers, which saves time and labor and makes the ranking efficient.

Description

Data processing method and device
Technical Field
The invention relates to the technical field of computers, in particular to a data processing method and device.
Background
Currently, with the development of Web 2.0 technology, internet product models in which content is generated primarily by users are flourishing. In web community forums, people can freely ask all kinds of questions and answer the questions of others.
Because the number of questions and answers keeps growing and the quality of answers varies widely, the quality of the answers must be checked manually and the multiple answers to a question ranked according to their quality.
Manually checking the quality of each answer is time-consuming, labor-intensive and inefficient.
Disclosure of Invention
In order to solve the above problem, embodiments of the present invention provide a data processing method and apparatus.
In a first aspect, an embodiment of the present invention provides a data processing method, including:
processing a question and an answer to the question to obtain a word-embedded representation of the question and a word-embedded representation of the answer;
compressing the word-embedded representation of the question to obtain a compressed word-embedded representation of the question;
and calculating a matching value between the answer and the question according to the compressed word-embedded representation of the question and the word-embedded representation of the answer, and ranking the answers according to the obtained matching values.
In a second aspect, an embodiment of the present invention further provides a data processing apparatus, including:
a first processing module, configured to process a question and an answer to the question to obtain a word-embedded representation of the question and a word-embedded representation of the answer;
a second processing module, configured to compress the word-embedded representation of the question to obtain a compressed word-embedded representation of the question;
and a ranking module, configured to calculate a matching value between the answer and the question according to the compressed word-embedded representation of the question and the word-embedded representation of the answer, and rank the answers according to the obtained matching values.
In the solutions provided in the first and second aspects of the embodiments of the present invention, the word-embedded representation of the question is compressed to obtain a compressed word-embedded representation of the question; the matching value between each answer and the question is then calculated according to the compressed word-embedded representation of the question and the word-embedded representation of the answer, and the answers are ranked according to the obtained matching values. Compared with manually checking answer quality, no manual participation is needed in the ranking of the answers, which saves time and labor and makes the ranking efficient.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart illustrating a data processing method according to embodiment 1 of the present invention;
fig. 2 is a schematic structural diagram of a data processing apparatus according to embodiment 2 of the present invention.
Detailed Description
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like, indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience of description and simplicity of description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the present invention, unless otherwise expressly specified or limited, the terms "mounted", "connected", "secured" and the like are to be construed broadly and may, for example, denote a fixed connection, a detachable connection, or an integral connection; a mechanical or an electrical connection; a direct connection or an indirect connection through an intermediate medium; or internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific situation.
The scheme was originally designed to solve the problem of re-ranking answers in community question answering.
With the development of Web 2.0 technology, internet product models in which users generate the content (for example, well-known community question-and-answer sites) are flourishing. In web community forums, people can freely ask all kinds of questions and answer the questions of others. Because the number of questions and answers keeps growing and the quality of answers is uneven, manually checking the quality of answers and ranking the multiple answers to a question by quality is time-consuming and labor-intensive.
Community question answering has two characteristics that ordinary question answering does not. First, a question includes both a title part, which gives a brief overview of the question, and a body part, which describes the question in detail. Questioners usually convey their primary focus and key information in the title part, and then provide more detailed information about the topic in the question body, seeking help from or expressing their feelings to potential respondents. Second, redundancy and noise are common in community question answering: both the question and the answer may contain auxiliary sentences that provide no meaningful information.
Previous studies have generally treated every word in the question and answer representations equally. However, because questions are redundant and noisy, only part of the text of the question and of the answer is useful for judging the quality of the answer. Worse still, previous studies ignored the differences between the question title and the question body and simply concatenated them into a single question representation. Given the title-body relationship described above, this simple concatenation may aggravate the redundancy in the question.
On this basis, the scheme provides a data processing method and device: the answers can be ranked simply by calculating the matching value between each answer and the question, no manual participation is needed in the whole ranking process, time and labor are saved, and the ranking is efficient.
In the data processing method and device described below, the steps are executed by a deep learning network. In the following embodiments, the parameters of the deep learning network are all stored in the server, and when the parameters are needed, the server reads them from its own storage.
Example 1
This embodiment provides a data processing method whose execution subject is a server.
The server may be any computing device in the prior art that is capable of processing the text of the question and the answers to obtain the matching values between the answers and the question; it is not described in detail here.
Referring to a flow chart of a data processing method shown in fig. 1, the data processing method may include the following specific steps:
Step 100: process the question and the answer to the question to obtain a word-embedded representation of the question and a word-embedded representation of the answer.
In step 100, to obtain the word-embedded representation of the question and the word-embedded representation of the answer, the following steps (1) to (2) may be performed:
(1) inputting the text of the question and the text of the answer to the question into dictionaries to obtain a word vector and a character vector of the question, and a word vector and a character vector of the answer, respectively;
(2) concatenating the word vector and the character vector of the question to obtain the word-embedded representation of the question, and concatenating the word vector and the character vector of the answer to obtain the word-embedded representation of the answer.
In step (1), the dictionaries include, but are not limited to: a GloVe word-vector dictionary trained on an unlabeled corpus, and a character-vector dictionary based on a convolutional neural network.
Because the web text in community question-and-answer forums differs greatly from standardized text in spelling and grammar, specially trained GloVe vectors can model individual word interactions more accurately. Character embeddings have proven very useful for unknown words, which makes them particularly suitable for the noisy web text of community question-and-answer forums.
The text of the question is input into the GloVe word-vector dictionary trained on the unlabeled corpus to obtain the word vectors of the question, and into the character-vector dictionary based on the convolutional neural network to obtain the character vectors of the question.
Similarly, the text of the answer is input into the GloVe word-vector dictionary trained on the unlabeled corpus to obtain the word vectors of the answer, and into the character-vector dictionary based on the convolutional neural network to obtain the character vectors of the answer.
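For illustration only, a minimal sketch of this step, assuming PyTorch (all module names, dimensions and the character-CNN design are assumptions, not the patent's implementation):

import torch
import torch.nn as nn

class WordCharEmbedder(nn.Module):
    """Concatenate pretrained GloVe word vectors with CNN-derived character
    vectors to form the word-embedded representation of a text."""
    def __init__(self, glove_weights, n_chars, char_dim=16, char_out=32):
        super().__init__()
        self.word_emb = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, char_out, kernel_size=3, padding=1)

    def forward(self, word_ids, char_ids):
        # word_ids: (seq_len,); char_ids: (seq_len, word_len)
        w = self.word_emb(word_ids)                         # (seq_len, glove_dim)
        c = self.char_emb(char_ids).transpose(1, 2)         # (seq_len, char_dim, word_len)
        c = torch.relu(self.char_cnn(c)).max(dim=2).values  # (seq_len, char_out)
        return torch.cat([w, c], dim=-1)                    # concatenated representation

Max-pooling over the character convolution is a common choice here; the patent does not specify the pooling.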
Step 102: compress the word-embedded representation of the question to obtain a compressed word-embedded representation of the question.
Step 102 may specifically include the following steps (1) to (2):
(1) performing orthogonal decomposition on the word-embedded representation of the question to obtain a word-embedded parallel component and a word-embedded orthogonal component of the question;
(2) concatenating the word-embedded parallel component and the orthogonal component of the question to obtain the compressed word-embedded representation of the question.
In step (1) above, the word-embedded parallel component of the question is computed by projecting the word-embedded representation of the body part of the question onto the word-embedded representation of the i-th word in the title part of the question.
The word-embedded orthogonal component of the question is the component of the same representation that is orthogonal to that projection.
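For reference, the decomposition just described is the standard vector projection. In our own notation (an assumption, since the original formulas are not reproduced in this text), with b the word-embedded representation of the body part of the question and t_i the word-embedded representation of the i-th word in the title part:

    S_para,i = ((b · t_i) / (t_i · t_i)) t_i
    S_orth,i = b − S_para,i

The parallel component keeps the part of the representation that points along the title word, the orthogonal component keeps the remainder, and together they lose no information, since S_para,i + S_orth,i = b.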
In step (2) above, a fusion gate may be used to combine the word-embedded parallel component and the orthogonal component of the question. Fusion gates are prior art and are not described in detail in this embodiment.
To obtain the compressed word-embedded representation of the question, the following steps (21) to (23) may be performed:
(21) computing alignment scores for the parallel components based on the word-embedded parallel components of the question;
(22) calculating, based on the alignment scores of the parallel components and the word-embedded representation of the body part of the question, a summarized representation of the body part of the question obtained according to the title part of the question;
(23) concatenating the word-embedded parallel component and the orthogonal component of the question according to the summarized representation of the body part of the question obtained according to the title part of the question, to obtain the compressed word-embedded representation of the question.
In step (21) above, the alignment scores of the parallel components are computed from the word-embedded parallel components, wherein c denotes an alignment parameter and W_p1 and b_p1 are parameters of the deep learning network.
The alignment parameter is preset in the server.
In step (22) above, the summarized representation of the body part of the question obtained according to the title part of the question, denoted S_ap below, is calculated from the alignment scores of the parallel components.
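A plausible concrete form of steps (21) and (22), assuming standard additive attention (both the notation and the aggregation target are assumptions, not the patent's verbatim formulas): with S_para,i the i-th word-embedded parallel component and S_body,i the i-th word-embedded representation of the body part,

    a_i = c^T tanh(W_p1 S_para,i + b_p1)
    α_i = exp(a_i) / Σ_k exp(a_k)
    S_ap = Σ_i α_i S_body,i

that is, the alignment scores a_i are normalized by a softmax and used as weights to pool the body representations into the summarized representation S_ap.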
In step (23), the word-embedded parallel component and the orthogonal component of the question are combined by the following formulas to obtain the compressed word-embedded representation of the question:
F_para = σ(W_p2 S_emb + W_p3 S_ap + b_p2)
S_para = F_para ⊙ S_emb + (1 − F_para) ⊙ S_ap
wherein W_p2, W_p3 and b_p2 denote the parameters of the fusion gate in the deep learning network; F_para denotes the fusion gate; S_para denotes the compressed word-embedded representation of the parallel component of the question; S_emb denotes the word-embedded representation of the title part of the question; and S_ap denotes the summarized representation of the body part of the question obtained according to the title part of the question.
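Because the fusion-gate formulas are given explicitly above, they translate directly into code. A minimal sketch, assuming PyTorch (the class and attribute names are ours, not the patent's):

import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Implements F_para = sigmoid(W_p2 S_emb + W_p3 S_ap + b_p2) and
    S_para = F_para * S_emb + (1 - F_para) * S_ap."""
    def __init__(self, dim):
        super().__init__()
        self.w_emb = nn.Linear(dim, dim, bias=True)   # W_p2 and b_p2
        self.w_ap = nn.Linear(dim, dim, bias=False)   # W_p3

    def forward(self, s_emb, s_ap):
        f = torch.sigmoid(self.w_emb(s_emb) + self.w_ap(s_ap))  # F_para
        return f * s_emb + (1 - f) * s_ap                       # S_para

The gate F_para decides, element by element, how much of the title representation S_emb to keep and how much of the summarized representation S_ap to mix in.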
Step 104: calculate the matching value between the answer and the question according to the compressed word-embedded representation of the question and the word-embedded representation of the answer, and rank the answers according to the obtained matching values.
To calculate the matching value between each answer and the question and rank the answers according to the obtained matching values, the following steps (1) to (9) may be performed:
(1) mapping the words in the answer from the word-vector space into an interaction space of the same dimension as the question representation, to obtain a compressed word-embedded representation of the answer;
(2) calculating, according to the compressed word-embedded representation of the question and the compressed word-embedded representation of the answer, the similarity between the question title and the question body within the compressed word-embedded representation of the question;
(3) calculating the similarity between the question and the answer according to the calculated similarities of the question title and the question body;
(4) calculating, from the question side, a first similarity between the question and the answer based on the calculated similarity between the question and the answer and the compressed word-embedded representation of the answer;
(5) calculating, from the answer side, a second similarity between the question and the answer based on the calculated similarity between the question and the answer and the compressed word-embedded representation of the question;
(6) concatenating the first similarity with the word-embedded representation of the question to obtain a summarized representation of the question obtained according to the answer;
(7) concatenating the second similarity with the word-embedded representation of the answer to obtain a summarized representation of the answer obtained according to the question;
(8) calculating the matching value between the answer and the question based on the obtained summarized representation of the question obtained according to the answer and the summarized representation of the answer obtained according to the question;
(9) ranking the answers to the question according to the obtained matching values.
In step (1) above, the compressed word-embedded representation of the answer is obtained by the following formula:
C_rep = σ(W_c1 C_emb + b_c1) ⊙ tanh(W_c2 C_emb + b_c2)
wherein C_rep denotes the compressed word-embedded representation of the answer; W_c1, W_c2, b_c1 and b_c2 are parameters of the deep learning network; and C_emb denotes the word-embedded representation of the answer.
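The formula above is a gated mapping: a sigmoid gate multiplied element-wise by a tanh projection. A minimal sketch, assuming PyTorch (names are ours):

import torch
import torch.nn as nn

class GatedMapping(nn.Module):
    """Implements C_rep = sigmoid(W_c1 C_emb + b_c1) * tanh(W_c2 C_emb + b_c2),
    mapping answer word embeddings into the interaction space."""
    def __init__(self, emb_dim, rep_dim):
        super().__init__()
        self.gate = nn.Linear(emb_dim, rep_dim)   # W_c1, b_c1
        self.proj = nn.Linear(emb_dim, rep_dim)   # W_c2, b_c2

    def forward(self, c_emb):
        return torch.sigmoid(self.gate(c_emb)) * torch.tanh(self.proj(c_emb))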
In step (2) above, the similarity between the question title and the question body within the compressed word-embedded representation of the question is calculated from the compressed word-embedded representation of the question and the mapped representation of the answer, wherein W_a1, W_a2 and b_a are parameters of the deep learning network.
In step (3), the similarity between the question and the answer is calculated from the similarities obtained in step (2), wherein c denotes an alignment parameter.
In step (4) above, the first similarity between the question and the answer is calculated from the similarity between the question and the answer obtained in step (3) and the mapped representation of the answer.
In step (5) above, the second similarity between the question and the answer is calculated from the similarity between the question and the answer obtained in step (3) and the compressed word-embedded representation of the question.
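Steps (2) to (5) together describe a bidirectional attention between the question and the answer. A sketch of one standard formulation, in our own notation and under explicit assumptions (the exact formulas are not reproduced in this text): let s̃_i be the i-th vector of the compressed question representation and c̃_j the j-th vector of the mapped answer representation; a similarity matrix consistent with the parameters named above would be

    M_ij = c^T tanh(W_a1 s̃_i + W_a2 c̃_j + b_a)

with a row-wise softmax over M giving, for each question word, attention weights over the answer words (the first similarity, from the question side), and a column-wise softmax giving, for each answer word, attention weights over the question words (the second similarity, from the answer side).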
In step (8) above, the following steps (81) to (83) may be performed to calculate the matching value between the answer and the question:
(81) calculating a question representation based on the summarized representation of the question obtained according to the answer;
(82) calculating an answer representation based on the summarized representation of the answer obtained according to the question;
(83) calculating the matching value between the answer and the question from the question representation and the answer representation.
In step (81) above, the question representation is calculated via the attention-matching result obtained by the following formula:
A_s1 = W_s2 tanh(W_s1 S_att + b_s1) + b_s2
wherein s_sum denotes the question representation; A_s1 denotes the result of attention matching when computing the question representation; S_att denotes the summarized representation of the question obtained according to the answer; and W_s1, W_s2, b_s1 and b_s2 are parameters of the deep learning network.
In step (82), the answer representation is calculated via the attention-matching result obtained by the following formula:
A_s2 = W_s2 tanh(W_s1 C_att + b_s1) + b_s2
wherein c_sum denotes the answer representation; A_s2 denotes the result of attention matching when computing the answer representation; C_att denotes the summarized representation of the answer obtained according to the question; and W_s1, W_s2, b_s1 and b_s2 are parameters of the deep learning network.
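One standard way to turn the attention-matching results A_s1 and A_s2 into the pooled representations s_sum and c_sum is a softmax over positions; the pooling step below is an assumption, since only the attention-result formulas are given above. A minimal sketch, assuming PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentivePooling(nn.Module):
    """A = W_s2 tanh(W_s1 X + b_s1) + b_s2, followed by an (assumed) softmax
    over positions that pools the sequence X into a single summary vector."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden)   # W_s1, b_s1
        self.w2 = nn.Linear(hidden, 1)     # W_s2, b_s2

    def forward(self, x):                    # x: (seq_len, dim)
        a = self.w2(torch.tanh(self.w1(x)))  # attention-matching result
        weights = F.softmax(a, dim=0)        # assumed normalization over positions
        return (weights * x).sum(dim=0)      # pooled summary vector

Since the same parameter names W_s1, W_s2, b_s1 and b_s2 appear in the formulas for both A_s1 and A_s2, a single AttentivePooling instance would be applied to both S_att and C_att, i.e. the pooling parameters appear to be shared between the question and answer sides.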
In step (83), the matching value between the answer and the question is calculated by the following formula, which yields the probability that the answer is a "good answer", a "medium answer" or a "poor answer":
Pr(y | S, B, C) = softmax(W_2 tanh(W_1 [s_sum; c_sum] + b_1) + b_2)
wherein Pr(y | S, B, C) denotes the matching value between the answer and the question; s_sum denotes the question representation; c_sum denotes the answer representation; and W_1, W_2, b_1 and b_2 are parameters of the deep learning network.
In step (9), the answers to the question may be sorted in descending order of the calculated matching values.
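Combining step (83) with step (9), a minimal sketch of scoring and ranking, assuming PyTorch (the class names and the choice of the "good answer" probability as the matching value are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Matcher(nn.Module):
    """Pr(y | S, B, C) = softmax(W_2 tanh(W_1 [s_sum; c_sum] + b_1) + b_2)
    over the classes good / medium / poor answer."""
    def __init__(self, dim, n_classes=3):
        super().__init__()
        self.hidden = nn.Linear(2 * dim, dim)   # W_1, b_1
        self.out = nn.Linear(dim, n_classes)    # W_2, b_2

    def forward(self, s_sum, c_sum):
        h = torch.tanh(self.hidden(torch.cat([s_sum, c_sum], dim=-1)))
        return F.softmax(self.out(h), dim=-1)

def rank_answers(matcher, s_sum, answer_reps):
    # Use the probability of the "good answer" class (assumed to be index 0)
    # as the matching value, then sort the answers in descending order of it.
    scores = [matcher(s_sum, c_sum)[0].item() for c_sum in answer_reps]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)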
In summary, in the data processing method provided by this embodiment, the word-embedded representation of the question is compressed to obtain a compressed word-embedded representation of the question; the matching value between each answer and the question is then calculated according to the compressed word-embedded representation of the question and the word-embedded representation of the answer, and the answers are ranked according to the obtained matching values. No manual participation is needed in the ranking of the answers, which saves time and labor and makes the ranking efficient.
Example 2
The present embodiment proposes a data processing apparatus for executing the data processing method of embodiment 1 described above.
Referring to the schematic structural diagram of the data processing apparatus shown in fig. 2, the present embodiment provides a data processing apparatus, including:
a first processing module 200, configured to process a question and an answer to the question to obtain a word-embedded representation of the question and a word-embedded representation of the answer;
a second processing module 202, configured to compress the word-embedded representation of the question to obtain a compressed word-embedded representation of the question;
and a ranking module 204, configured to calculate a matching value between the answer and the question according to the compressed word-embedded representation of the question and the word-embedded representation of the answer, and rank the answers according to the obtained matching values.
In summary, the data processing apparatus provided by this embodiment compresses the word-embedded representation of the question to obtain a compressed word-embedded representation of the question, then calculates the matching value between each answer and the question according to the compressed word-embedded representation of the question and the word-embedded representation of the answer, and ranks the answers according to the obtained matching values, so that no manual participation is needed and the ranking is time-saving, labor-saving and efficient.
The above description covers only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions that a person skilled in the art can easily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (20)

1. A data processing method, comprising:
processing a question and an answer to the question to obtain a word-embedded representation of the question and a word-embedded representation of the answer;
compressing the word-embedded representation of the question to obtain a compressed word-embedded representation of the question;
and calculating a matching value between the answer and the question according to the compressed word-embedded representation of the question and the word-embedded representation of the answer, and ranking the answers according to the obtained matching values.
2. The method of claim 1, wherein processing a question and an answer to the question to obtain a word-embedded representation of the question and a word-embedded representation of the answer comprises:
inputting the text of the question and the text of the answer to the question into dictionaries to obtain a word vector and a character vector of the question, and a word vector and a character vector of the answer, respectively;
and concatenating the word vector and the character vector of the question to obtain the word-embedded representation of the question, and concatenating the word vector and the character vector of the answer to obtain the word-embedded representation of the answer.
3. The method of claim 1, wherein compressing the word-embedded representation of the question to obtain a compressed word-embedded representation of the question comprises:
performing orthogonal decomposition on the word-embedded representation of the question to obtain a word-embedded parallel component and a word-embedded orthogonal component of the question;
and concatenating the word-embedded parallel component and the orthogonal component of the question to obtain the compressed word-embedded representation of the question.
4. The method of claim 3, wherein performing orthogonal decomposition on the word-embedded representation of the question to obtain the word-embedded parallel component of the question comprises:
obtaining the word-embedded parallel component of the question by projecting the word-embedded representation of the body part of the question onto the word-embedded representation of the i-th word in the title part of the question.
5. The method of claim 4, wherein performing orthogonal decomposition on the word-embedded representation of the question to obtain the word-embedded orthogonal component of the question comprises:
obtaining the word-embedded orthogonal component of the question as the component of the same representation that is orthogonal to the word-embedded parallel component.
6. The method of claim 4, wherein concatenating the word-embedded parallel component and the orthogonal component of the question to obtain the compressed word-embedded representation of the question comprises:
computing alignment scores for the parallel components based on the word-embedded parallel components of the question;
calculating, based on the alignment scores of the parallel components and the word-embedded representation of the body part of the question, a summarized representation of the body part of the question obtained according to the title part of the question;
and concatenating the word-embedded parallel component and the orthogonal component of the question according to the summarized representation of the body part of the question obtained according to the title part of the question, to obtain the compressed word-embedded representation of the question.
7. The method of claim 6, wherein computing alignment scores for the parallel components based on the word-embedded parallel components of the question comprises:
computing the alignment scores of the parallel components from the word-embedded parallel components, wherein c denotes an alignment parameter and W_p1 and b_p1 are parameters of the deep learning network.
8. The method of claim 7, wherein calculating a summarized representation of the body part of the question based on the alignment scores of the parallel components and the word-embedded representation of the body part of the question comprises:
calculating, from the alignment scores, the summarized representation of the body part of the question obtained according to the title part of the question.
9. The method of claim 7, wherein concatenating the word-embedded parallel component and the orthogonal component of the question according to the summarized representation of the body part of the question obtained according to the title part of the question, to obtain the compressed word-embedded representation of the question, comprises:
combining the word-embedded parallel component and the orthogonal component of the question by the following formulas to obtain the compressed word-embedded representation of the question:
F_para = σ(W_p2 S_emb + W_p3 S_ap + b_p2)
S_para = F_para ⊙ S_emb + (1 − F_para) ⊙ S_ap
wherein W_p2, W_p3 and b_p2 denote the parameters of the fusion gate in the deep learning network; F_para denotes the fusion gate; S_para denotes the compressed word-embedded representation of the parallel component of the question; S_emb denotes the word-embedded representation of the title part of the question; and S_ap denotes the summarized representation of the body part of the question obtained according to the title part of the question.
10. The method of claim 1, wherein calculating a matching value between the answer and the question according to the compressed word-embedded representation of the question and the word-embedded representation of the answer, and ranking the answers according to the obtained matching values, comprises:
mapping the words in the answer from the word-vector space into an interaction space of the same dimension as the question representation, to obtain a compressed word-embedded representation of the answer;
calculating, according to the compressed word-embedded representation of the question and the compressed word-embedded representation of the answer, the similarity between the question title and the question body within the compressed word-embedded representation of the question;
calculating the similarity between the question and the answer according to the calculated similarities of the question title and the question body;
calculating, from the question side, a first similarity between the question and the answer based on the calculated similarity between the question and the answer and the compressed word-embedded representation of the answer;
calculating, from the answer side, a second similarity between the question and the answer based on the calculated similarity between the question and the answer and the compressed word-embedded representation of the question;
concatenating the first similarity with the word-embedded representation of the question to obtain a summarized representation of the question obtained according to the answer;
concatenating the second similarity with the word-embedded representation of the answer to obtain a summarized representation of the answer obtained according to the question;
calculating the matching value between the answer and the question based on the obtained summarized representation of the question obtained according to the answer and the summarized representation of the answer obtained according to the question;
and ranking the answers to the question according to the obtained matching values.
11. The method of claim 10, wherein mapping the words in the answer from the word-vector space into an interaction space of the same dimension as the question representation, to obtain a compressed word-embedded representation of the answer, comprises:
obtaining the compressed word-embedded representation of the answer by the following formula:
C_rep = σ(W_c1 C_emb + b_c1) ⊙ tanh(W_c2 C_emb + b_c2)
wherein C_rep denotes the compressed word-embedded representation of the answer; W_c1, W_c2, b_c1 and b_c2 are parameters of the deep learning network; and C_emb denotes the word-embedded representation of the answer.
12. The method of claim 11, wherein calculating, according to the compressed word-embedded representation of the question and the compressed word-embedded representation of the answer, the similarity between the question title and the question body within the compressed word-embedded representation of the question comprises:
calculating the similarity between the question title and the question body from the compressed word-embedded representation of the question and the mapped representation of the answer, wherein W_a1, W_a2 and b_a are parameters of the deep learning network.
13. The method of claim 12, wherein calculating the similarity between the question and the answer according to the calculated similarities of the question title and the question body comprises:
calculating the similarity between the question and the answer from the calculated similarities, wherein c denotes an alignment parameter.
14. The method of claim 13, wherein calculating, from the question side, a first similarity between the question and the answer based on the calculated similarity between the question and the answer and the compressed word-embedded representation of the answer comprises:
calculating the first similarity between the question and the answer from the calculated similarity and the mapped representation of the answer.
15. The method of claim 13, wherein calculating, from the answer side, a second similarity between the question and the answer based on the calculated similarity between the question and the answer and the compressed word-embedded representation of the question comprises:
calculating the second similarity between the question and the answer from the calculated similarity and the compressed word-embedded representation of the question.
16. The method of claim 13, wherein calculating the matching value between the answer and the question based on the obtained summarized representation of the question obtained according to the answer and the summarized representation of the answer obtained according to the question comprises:
calculating a question representation based on the summarized representation of the question obtained according to the answer;
calculating an answer representation based on the summarized representation of the answer obtained according to the question;
and calculating the matching value between the answer and the question from the question representation and the answer representation.
17. The method of claim 16, wherein calculating a question representation based on the summarized representation of the question obtained according to the answer comprises:
calculating the question representation via the attention-matching result obtained by the following formula:
A_s1 = W_s2 tanh(W_s1 S_att + b_s1) + b_s2
wherein s_sum denotes the question representation; A_s1 denotes the result of attention matching when computing the question representation; S_att denotes the summarized representation of the question obtained according to the answer; and W_s1, W_s2, b_s1 and b_s2 are parameters of the deep learning network.
18. The method of claim 16, wherein calculating an answer representation based on the summarized representation of the answer obtained according to the question comprises:
calculating the answer representation via the attention-matching result obtained by the following formula:
A_s2 = W_s2 tanh(W_s1 C_att + b_s1) + b_s2
wherein c_sum denotes the answer representation; A_s2 denotes the result of attention matching when computing the answer representation; C_att denotes the summarized representation of the answer obtained according to the question; and W_s1, W_s2, b_s1 and b_s2 are parameters of the deep learning network.
19. The method of claim 16, wherein the matching value between the answer and the question is calculated from the question representation and the answer representation by the following formula:
Pr(y | S, B, C) = softmax(W_2 tanh(W_1 [s_sum; c_sum] + b_1) + b_2)
wherein Pr(y | S, B, C) denotes the matching value between the answer and the question; s_sum denotes the question representation; c_sum denotes the answer representation; and W_1, W_2, b_1 and b_2 are parameters of the deep learning network.
20. A data processing apparatus, comprising:
a first processing module, configured to process a question and an answer to the question to obtain a word-embedded representation of the question and a word-embedded representation of the answer;
a second processing module, configured to compress the word-embedded representation of the question to obtain a compressed word-embedded representation of the question;
and a ranking module, configured to calculate a matching value between the answer and the question according to the compressed word-embedded representation of the question and the word-embedded representation of the answer, and rank the answers according to the obtained matching values.
CN201910666576.4A (priority date 2019-07-23, filing date 2019-07-23): A kind of data processing method and device. Status: Pending. Published as CN110390005A (en).

Priority Applications (1)

Application Number: CN201910666576.4A; Priority Date: 2019-07-23; Filing Date: 2019-07-23; Title: A kind of data processing method and device

Applications Claiming Priority (1)

Application Number: CN201910666576.4A; Priority Date: 2019-07-23; Filing Date: 2019-07-23; Title: A kind of data processing method and device

Publications (1)

Publication Number: CN110390005A (en); Publication Date: 2019-10-29

Family

ID=68287149

Family Applications (1)

Application Number: CN201910666576.4A (status: Pending); Publication: CN110390005A (en); Title: A kind of data processing method and device

Country Status (1)

Country Link
CN (1) CN110390005A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190079921A1 (en) * 2015-01-23 2019-03-14 Conversica, Inc. Systems and methods for automated question response
US20190065576A1 (en) * 2017-08-23 2019-02-28 Rsvp Technologies Inc. Single-entity-single-relation question answering systems, and methods
CN108132931A (en) * 2018-01-12 2018-06-08 北京神州泰岳软件股份有限公司 A kind of matched method and device of text semantic
CN108829818A (en) * 2018-06-12 2018-11-16 中国科学院计算技术研究所 A kind of file classification method
CN109656952A (en) * 2018-10-31 2019-04-19 北京百度网讯科技有限公司 Inquiry processing method, device and electronic equipment
CN109271505A (en) * 2018-11-12 2019-01-25 深圳智能思创科技有限公司 A kind of question answering system implementation method based on problem answers pair
CN109726396A (en) * 2018-12-20 2019-05-07 泰康保险集团股份有限公司 Semantic matching method, device, medium and the electronic equipment of question and answer text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
霍欢 (Huo Huan): "一种基于关键词扩展的答案块提取模型" [An answer block extraction model based on keyword expansion], 《小型微型计算机系统》 (Journal of Chinese Computer Systems) *

Similar Documents

Publication Publication Date Title
CN109213999B (en) Subjective question scoring method
CN108363743B (en) Intelligent problem generation method and device and computer readable storage medium
CN107330130B (en) Method for realizing conversation robot recommending reply content to manual customer service
US8818926B2 (en) Method for personalizing chat bots
CN108228576B (en) Text translation method and device
CN104657923B (en) Method and device for double checking and judging of test questions
CN104731777A (en) Translation evaluation method and device
CN110895553A (en) Semantic matching model training method, semantic matching method and answer obtaining method
KR20080021017A (en) Comparing text based documents
CN109614480B (en) Method and device for generating automatic abstract based on generation type countermeasure network
Pramukantoro et al. Comparative analysis of string similarity and corpus-based similarity for automatic essay scoring system on e-learning gamification
Yoshimura et al. SOME: Reference-less sub-metrics optimized for manual evaluations of grammatical error correction
Gruzd et al. Coding and classifying knowledge exchange on social media: A comparative analysis of the# Twitterstorians and AskHistorians communities
CN117972434A (en) Training method, training device, training equipment, training medium and training program product for text processing model
JP2020160159A (en) Scoring device, scoring method, and program
CN111680134B (en) Method for measuring inquiry and answer consultation information by information entropy
CN117034956A (en) Text quality evaluation method and device
CN113011154A (en) Job duplicate checking method based on deep learning
CN110390005A (en) A kind of data processing method and device
CN109657250B (en) Text translation method, device, equipment and readable storage medium
CN108959467B (en) Method for calculating correlation degree of question sentences and answer sentences based on reinforcement learning
KR102330970B1 (en) Assessment system and method for education based on artificial intelligence
CN115934891A (en) Question understanding method and device
CN115795007A (en) Intelligent question-answering method, intelligent question-answering device, electronic equipment and storage medium
Willis et al. Identifying domain reasoning to support computer monitoring in typed-chat problem solving dialogues

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191029)