CN111259115A - Training method and device for content authenticity detection model and computing equipment

Info

Publication number: CN111259115A
Application number: CN202010042646.1A
Authority: CN
Inventors: 杨雷; 雷涛
Original assignee: CHEZHI HULIAN (BEIJING) SCIENCE & TECHNOLOGY CO LTD
Current assignee: CHEZHI HULIAN (BEIJING) SCIENCE & TECHNOLOGY CO LTD
Priority date: 2020-01-15
Filing date: 2020-01-15
Publication date: 2020-06-09
Anticipated expiration: 2040-01-15
Also published as: CN111259115B

Abstract

The invention discloses a training method of a content authenticity detection model, which is suitable for being executed in a computing device, wherein a knowledge base comprising a plurality of knowledge items is stored in the computing device, the model is suitable for outputting authenticity probability of answers, and the method comprises the following steps: acquiring a plurality of training samples with label data, wherein the training samples comprise questions, answers, attribute features related to the questions and the answers, and knowledge point features related to the questions and the attribute features, the label data are whether the answers are real, and the knowledge point features are n knowledge items searched in a knowledge base based on the attribute features and keywords of the questions; and inputting the training sample into a content authenticity detection model to be trained for processing to obtain the prediction probability of the training sample, and performing model training based on the label data of the training sample to obtain the trained content authenticity detection model. The invention also discloses a training device and a computing device of the corresponding content authenticity detection model.

Description

Training method and device for content authenticity detection model and computing equipment

Technical Field

The invention relates to the technical field of computers, in particular to a training method and a training device for a content authenticity detection model and computing equipment.

Background

Online knowledge question answering has gradually become a platform for mutual assistance between people. Through various interactive online question-and-answer platforms, netizens ask questions and answer other users' questions by drawing on their own information, resources and experience. Because the number of answerers is large, the quality of answers is uneven, which confuses and misleads questioners and readers. Therefore, in the face of the rapidly growing demand for online question answering, a corresponding detection technology is required to distinguish true answers from false ones and to improve the data quality of online question-and-answer platforms (forums, post bars, etc.).

Disclosure of Invention

In view of the above, the present invention provides a training method, a training apparatus and a computing device for a content authenticity detection model, which seek to solve, or at least alleviate, the problems described above.

According to an aspect of the present invention, there is provided a method of training a content authenticity detection model, adapted to be executed in a computing device having stored therein a knowledge base comprising a plurality of knowledge items, the model being adapted to output a probability of authenticity of an answer, the method comprising the steps of: acquiring a plurality of training samples with label data, wherein the training samples comprise questions, answers, attribute features related to the questions and the answers, and knowledge point features related to the questions and the attribute features, the label data are whether the answers are real, and the knowledge point features are n knowledge items searched in a knowledge base based on the attribute features and keywords of the questions; and inputting the training sample into a content authenticity detection model to be trained for processing to obtain the prediction probability of the training sample, and performing model training based on the label data of the training sample to obtain the trained content authenticity detection model.

Optionally, in the training method according to the present invention, the step of searching for knowledge point features includes: searching the knowledge base for a plurality of knowledge items related to the attribute features; and ranking the plurality of knowledge items based on the keywords of the question to obtain the top n knowledge items as the knowledge point features.

Optionally, in the training method according to the present invention, the attribute feature includes at least one of a question section, a question type, user information of a questioner and an answerer; the user information includes at least one of a member level, a number of postings, a number of replies, a length of replies, and an authenticity of replies.

Optionally, in the training method according to the present invention, a question-answer library is further stored in the computing device, and the questions and answers in the training samples are obtained from the question-answer library.

Optionally, in the training method according to the present invention, the content authenticity detection model includes: an encoder adapted to generate a word vector for each item of content in the training samples; the semantic extraction module is suitable for generating corresponding semantic vectors based on the word vectors of each item of content; the fusion module is suitable for splicing and fusing all semantic vectors; and the prediction module is suitable for predicting the authenticity probability of the answer from the spliced and fused semantic vectors.

Optionally, in the training method according to the present invention, the encoder includes: a first encoder adapted to generate first to third word vectors corresponding to the question, the answer and the knowledge point features, respectively; and a second encoder adapted to generate a fourth word vector corresponding to the attribute features.

Optionally, in the training method according to the present invention, the semantic extraction module includes: the first semantic extraction module is suitable for generating a first semantic vector and a second semantic vector corresponding to the question and the answer based on the first word vector and the second word vector respectively; the second semantic extraction module is suitable for generating a third semantic vector corresponding to the knowledge point features based on the third word vector; and the first linear conversion module is suitable for generating a fourth semantic vector corresponding to the attribute feature based on the fourth word vector.

Optionally, in the training method according to the present invention, the first semantic extraction module includes: a first recurrent network adapted to extract semantic information of the question and the answer, respectively; and an attention network coupled to the first recurrent network and adapted to extract an association weight between the question and the answer and generate the first and second semantic vectors based on the association weight.

Optionally, in the training method according to the present invention, the second semantic extraction module is a second recurrent network; the prediction module comprises a second linear conversion module and a Sigmoid function coupled to each other.

Optionally, in the training method according to the present invention, the first encoder is a word vector encoder, and the second encoder is a one-hot encoder; the first and second recurrent networks are bidirectional long short-term memory (Bi-LSTM) networks, and the attention network is a bidirectional attention network; the first and second linear conversion modules are fully connected neural network layers.

According to another aspect of the present invention, there is provided a content authenticity detection method, adapted to be executed in a computing device, the method comprising the steps of: acquiring a question to be tested and an answer to be tested, and obtaining attribute features to be tested related to the question to be tested and the answer to be tested; searching for corresponding knowledge point features to be tested based on the attribute features to be tested and the answer to be tested; taking the question to be tested, the answer to be tested, the attribute features to be tested and the knowledge point features to be tested as a sample to be tested, and inputting the sample into each of a plurality of trained content authenticity detection models to obtain a plurality of authenticity probability values; and comparing the plurality of authenticity probability values to determine the authenticity of the answer to be tested; wherein the content authenticity detection models are generated by the training method of the content authenticity detection model described above.

According to another aspect of the present invention, there is provided an apparatus for training a content authenticity detection model, adapted to reside in a computing device having stored therein a knowledge base comprising a plurality of knowledge items, the model adapted to output a probability of authenticity of an answer, the apparatus comprising: the training set generation module is suitable for acquiring a plurality of training samples with label data, wherein the training samples comprise questions, answers, attribute characteristics related to the questions and the answers, and knowledge point characteristics related to the questions and the attribute characteristics, the label data are whether the answers are real, and the knowledge point characteristics are n knowledge items searched in the knowledge base based on the attribute characteristics and keywords of the questions; and the model training module is suitable for inputting the training samples into the content authenticity detection model to be trained for processing to obtain the prediction probability of the training samples, and performing model training based on the label data to obtain the trained content authenticity detection model.

According to another aspect of the present invention, there is provided a content authenticity detection apparatus adapted to reside in a computing device, the apparatus comprising: a to-be-tested set generation module adapted to acquire a question to be tested, an answer to be tested, and attribute features to be tested related to the question to be tested and the answer to be tested, and to search a knowledge base for corresponding knowledge point features to be tested based on the attribute features to be tested and the answer to be tested; an authenticity prediction module adapted to take the question to be tested, the answer to be tested, the attribute features to be tested and the knowledge point features to be tested as a sample to be tested, and to input the sample into each of a plurality of trained content authenticity detection models to obtain a plurality of authenticity probability values; and an authenticity determining module adapted to compare the authenticity probability values to determine the authenticity of the answer to be tested; wherein the content authenticity detection models are generated by the training method of the content authenticity detection model described above.

According to yet another aspect of the present invention, there is provided a computing device comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs when executed by the processors implement the steps of the method as described above.

According to a further aspect of the invention there is provided a readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, implement the steps of the method as described above.

According to the technical scheme of the invention, a question-answering system is provided that detects the authenticity of forum reply content based on a knowledge graph. First, attribute features related to the question and the answer are collated and extracted, such as the question section, the question type, and user information of the questioner and the answerer. Then, related knowledge items are searched for in the knowledge graph using the attribute features and the question, for example by selecting the first n knowledge points as the knowledge point features; these features provide data support for the model when identifying the authenticity of an answer. Finally, the knowledge point features, the attribute features, the question and the answer are input into the detection model, which performs semantic fusion and reasoning over the multi-source information and then judges authenticity. In addition, a plurality of models can be trained and fused, with the output of the fused models taken as the final judgment result, thereby improving judgment accuracy.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.

FIG. 1 shows a block diagram of a computing device 100, according to one embodiment of the invention;

FIG. 2 illustrates a flow diagram of a method 200 of training a content authenticity detection model according to one embodiment of the invention;

FIG. 3 shows a schematic diagram of a content authenticity detection model according to one embodiment of the invention;

FIG. 4 shows a schematic diagram of a content authenticity detection model according to another embodiment of the invention;

FIG. 5 shows a flow diagram of a content authenticity detection method 500 according to one embodiment of the invention;

FIG. 6 illustrates a block diagram of a training apparatus 600 for a content authenticity detection model according to one embodiment of the present invention; and

FIG. 7 illustrates a block diagram of a content authenticity detection apparatus 700 according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

FIG. 1 is a block diagram of a computing device 100 according to one embodiment of the invention. In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.

Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a Digital Signal Processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.

Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more applications 122, and program data 124. In some embodiments, application 122 may be arranged to operate with program data 124 on an operating system. The program data 124 comprises instructions, and in the computing device 100 according to the invention the program data 124 comprises instructions for performing the training method 200 of the content authenticity detection model and/or the content authenticity detection method 500.

Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.

A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, program modules, and may include any information delivery media, such as carrier waves or other transport mechanisms, in a modulated data signal. A "modulated data signal" may be a signal that has one or more of its data set or its changes made in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or private-wired network, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.

Computing device 100 may be implemented as a server, such as a file server, a database server, an application server, a web server, etc., or as part of a small-form-factor portable (or mobile) electronic device, such as a cellular telephone, a Personal Digital Assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. Computing device 100 may also be implemented as a personal computer including both desktop and notebook computer configurations. In some embodiments, the computing device 100 is configured to perform the content authenticity detection model training method 200 and/or the content authenticity detection method 500.

Fig. 2 shows a flow diagram of a method 200 for training a content authenticity detection model according to an embodiment of the invention. The method 200 is performed in a computing device, such as the computing device 100. As shown in fig. 2, the method begins at step S210.

In step S210, a plurality of training samples having label data are obtained, where the training samples include a question, an answer, attribute features related to the question and the answer, and knowledge point features related to the question and the attribute features.

The label data is whether the answer is true or not, which may be represented by 0 or 1, and may be labeled manually. For example, when the answer is true, the tag data value is 1; when the answer is false, the tag data value is 0.

The questions and answers in the training sample may be obtained from a question-answer library stored in the computing device. The question-answer library comprises a plurality of question-answer pairs, where a question and an answer are, for example, the content of a post and of a reply to it, such as the question 'What is the fuel consumption of the Lingdu?' and the answer 'The fuel consumption is 6.9'. The question-answer library may also include the question section and the question type to which each question belongs.

Based on the question and the answer, the relevant attribute features can be looked up. The attribute features include at least one of a question section, a question type, and user information of the questioner and the answerer. The user information includes at least one of a member level, a number of posts, a number of replies, a reply length, and a reply authenticity. The attribute features may be expressed as {question section, question type, questioner information, answerer information}, such as {question section: Lingdu forum, question type: maintenance, questioner information: (certified owner, level-3 member, 12 posts, 23 replies), answerer information: (certified owner, level-3 member, 12 posts, 23 replies)}. Introducing the attribute features increases the amount of reference information available and allows a more accurate model to be trained.

The knowledge point features are n knowledge items found in the knowledge base based on the attribute features and the keywords of the question. The computing device stores a knowledge base comprising a plurality of knowledge items, each of which can be represented as a triple. For example, the triple (2019 Lingdu manual version, fuel consumption, 5.5) states that the fuel consumption of the new 2019 Lingdu manual version is 5.5. A triple can be understood as an entity, an entity relationship (including attributes, categories, etc.), and a related entity or value. If each entity is regarded as a node and each entity relationship as an edge, a knowledge base containing a large number of triples forms a huge knowledge graph. The knowledge base can therefore exist in the form of a knowledge graph, which stores a large number of knowledge points in the field and can provide strong support for data analysis and processing. Applying the knowledge graph as an external data source to the detection of answer authenticity can improve the accuracy of the model.
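As an illustration only, a labeled training sample with the four inputs described above (question, answer, attribute features, knowledge point features) might be represented as follows. The class and field names are hypothetical and do not come from the patent, and the label value in the example is chosen arbitrarily.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A knowledge item stored as a triple, e.g. (entity, relationship, value).
Triple = Tuple[str, str, str]


@dataclass
class UserInfo:
    """Hypothetical subset of the user information listed in the text."""
    certified_owner: bool
    member_level: int
    num_posts: int
    num_replies: int


@dataclass
class AttributeFeatures:
    """{question section, question type, questioner info, answerer info}."""
    question_section: str
    question_type: str
    questioner: UserInfo
    answerer: UserInfo


@dataclass
class TrainingSample:
    question: str
    answer: str
    attributes: AttributeFeatures
    knowledge_points: List[Triple] = field(default_factory=list)
    label: int = 0  # 1 = answer is true, 0 = answer is false


user = UserInfo(certified_owner=True, member_level=3, num_posts=12, num_replies=23)
sample = TrainingSample(
    question="What is the fuel consumption of the Lingdu?",
    answer="The fuel consumption is 6.9",
    attributes=AttributeFeatures("Lingdu forum", "maintenance", user, user),
    knowledge_points=[("2019 Lingdu manual version", "fuel consumption", "5.5")],
    label=1,  # arbitrary illustrative label, normally assigned by a human annotator
)
```

Keeping the label on the sample mirrors the text's statement that each training sample carries manually assigned tag data of 0 or 1.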

According to one embodiment, the step of finding knowledge point features comprises: searching the knowledge base for a plurality of knowledge items related to the attribute features; and ranking the plurality of knowledge items based on the keywords of the question to obtain the top n knowledge items as the knowledge point features. The knowledge items can be ranked by their rate of word overlap, where the overlap may be computed between the full text of the question and each knowledge item, or between the question keywords and each knowledge item; the invention is not limited in this respect.

Taking the attribute features of the Lingdu forum above as an example, it can be inferred from the attribute features that the question concerns maintenance of the Lingdu car. All knowledge items about the Lingdu car are first coarsely retrieved from the knowledge graph, and a fine-grained retrieval is then performed within them according to the keywords of the question. Of course, the knowledge point features in the training samples can also be selected manually.
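The coarse-then-fine lookup described above can be sketched as follows. This is a minimal sketch under stated assumptions: the whitespace tokenizer, the rule that the coarse step matches attribute terms against the head entity of each triple, and the toy knowledge base entries are all illustrative choices, not details given in the patent.

```python
from typing import List, Sequence, Set, Tuple

Triple = Tuple[str, str, str]


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenizer; a stand-in for real keyword extraction."""
    return set(text.lower().replace("?", " ").replace(",", " ").split())


def overlap_score(keywords: Set[str], item: Triple) -> float:
    """Fraction of question keywords that also occur in the knowledge item."""
    if not keywords:
        return 0.0
    return len(keywords & tokenize(" ".join(item))) / len(keywords)


def find_knowledge_points(
    kb: Sequence[Triple],
    attribute_terms: Sequence[str],
    question_keywords: Sequence[str],
    n: int,
) -> List[Triple]:
    # Coarse step: keep items whose head entity shares a token with the attributes.
    attr_tokens: Set[str] = set()
    for term in attribute_terms:
        attr_tokens |= tokenize(term)
    candidates = [item for item in kb if attr_tokens & tokenize(item[0])]
    # Fine step: rank candidates by keyword overlap and keep the top n.
    keywords = {k.lower() for k in question_keywords}
    ranked = sorted(candidates, key=lambda it: overlap_score(keywords, it), reverse=True)
    return ranked[:n]


# Toy knowledge base; all entries except the first are made up for illustration.
KB: List[Triple] = [
    ("2019 Lingdu manual version", "fuel consumption", "5.5"),
    ("2019 Lingdu manual version", "warranty", "3 years"),
    ("Other model", "fuel consumption", "7.0"),
]

top = find_knowledge_points(
    KB, ["Lingdu forum", "maintenance"], ["Lingdu", "fuel", "consumption"], n=1
)
```

Because Python's sort is stable, items with equal overlap keep their knowledge base order, which makes the top-n selection deterministic.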

Subsequently, in step S220, the training sample is input into the content authenticity detection model to be trained for processing, so as to obtain the prediction probability of the training sample, and model training is performed based on the label data thereof, so as to obtain the trained content authenticity detection model.

Fig. 3 shows an exemplary embodiment of a content authenticity detection model, and fig. 4 shows a preferred embodiment of the content authenticity detection model. As shown in the figure, the content authenticity detection model includes an encoder, a semantic extraction module, a fusion module, and a prediction module, corresponding to an embedding layer, a semantic layer, a fusion layer, and an output layer, respectively.

The encoder generates a word vector of each item of content in the training sample, the semantic extraction module generates a corresponding semantic vector based on the word vector of each item of content, the fusion module splices and fuses all the semantic vectors, and the prediction module predicts the authenticity probability of an answer from the spliced and fused semantic vectors.

Further, the encoder may include a first encoder and a second encoder. The first encoder generates first to third word vectors corresponding to the question, the answer and the knowledge point features, respectively. The second encoder generates a fourth word vector corresponding to the attribute features. Preferably, the first encoder is a word-embedding (word vector) encoder, and the second encoder is a one-hot encoder. The main consideration is that the attribute features gather several discrete features from different aspects, with no textual semantic relation among them, so one-hot encoding is the more suitable representation.
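To make the one-hot treatment of attribute features concrete, here is a small sketch. The vocabularies, the choice of which attributes to encode, and the function names are assumptions for illustration; the patent does not fix an encoding layout.

```python
from typing import Dict, List


def build_vocab(values: List[str]) -> Dict[str, int]:
    """Map each distinct category to a fixed index (sorted for determinism)."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


def one_hot(value: str, vocab: Dict[str, int]) -> List[float]:
    """One-hot vector for a category; unknown categories map to all zeros."""
    vector = [0.0] * len(vocab)
    if value in vocab:
        vector[vocab[value]] = 1.0
    return vector


def encode_attributes(
    section: str,
    question_type: str,
    member_level: int,
    section_vocab: Dict[str, int],
    type_vocab: Dict[str, int],
    max_level: int,
) -> List[float]:
    """Concatenate one-hot codes of several discrete attribute features."""
    level_vector = [0.0] * max_level
    if 1 <= member_level <= max_level:
        level_vector[member_level - 1] = 1.0
    return one_hot(section, section_vocab) + one_hot(question_type, type_vocab) + level_vector


# Hypothetical vocabularies for illustration.
section_vocab = build_vocab(["Lingdu forum", "other forum"])
type_vocab = build_vocab(["maintenance", "purchase"])
encoded = encode_attributes(
    "Lingdu forum", "maintenance", 3, section_vocab, type_vocab, max_level=5
)
```

Each discrete attribute occupies its own block of the vector, so no ordering or distance between categories is implied, which matches the stated reason for preferring one-hot encoding here.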

The semantic extraction module comprises a first semantic extraction module (not shown in the figure), a second semantic extraction module and a first linear conversion module. The first semantic extraction module generates first and second semantic vectors corresponding to the question and the answer based on the first and second word vectors respectively. And the second semantic extraction module generates a third semantic vector corresponding to the knowledge point features based on the third word vector. And the first linear conversion module generates a fourth semantic vector corresponding to the attribute feature based on the fourth word vector.

Specifically, the first semantic extraction module includes a first recurrent network and an attention network coupled to each other. The first recurrent network extracts semantic information of the question and of the answer, respectively; the attention network extracts the association weight between the question and the answer and generates the first and second semantic vectors based on the association weight. The second semantic extraction module is a second recurrent network used to extract semantic information of the knowledge point features.

Preferably, the first and second recurrent networks are bidirectional long short-term memory networks, i.e. Bi-LSTM layers. The attention network is a bidirectional attention network, i.e. a Bi-Attention layer, which serves as the interaction layer between the question and the answer and is mainly used to judge the strength of the semantic association between them. For example, if the question is 'What is the fuel consumption of the Lingdu?' and the answer is 'The whole Lingdu range comes with a warranty of three years or 10 km', there is no correlation between the two.
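The patent does not spell out the attention formula, so the following is only one plausible reading: a dot-product attention applied in both directions (question to answer and answer to question), followed by mean pooling. The input lists stand in for per-token Bi-LSTM outputs, and all function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def attend(queries: List[Vector], keys: List[Vector]) -> List[Vector]:
    """For each query state, return the attention-weighted sum of the key states."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])  # association weights
        attended.append([sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)])
    return attended


def mean_pool(states: List[Vector]) -> Vector:
    dim = len(states[0])
    return [sum(s[d] for s in states) / len(states) for d in range(dim)]


def bi_attention(question_states: List[Vector], answer_states: List[Vector]) -> Tuple[Vector, Vector]:
    """Return (H_b, H_c): question and answer vectors, each informed by the other."""
    question_aware = attend(question_states, answer_states)  # question -> answer
    answer_aware = attend(answer_states, question_states)    # answer -> question
    return mean_pool(question_aware), mean_pool(answer_aware)


# Toy hidden states standing in for Bi-LSTM outputs.
h_b, h_c = bi_attention([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```

A question and an answer that share little content yield flat or misaligned attention weights, which is the kind of signal the interaction layer is described as capturing.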

The fusion module concatenates the first, second, third and fourth semantic vectors, for example by matrix concatenation; the concatenation order is not limited by the invention. The prediction module comprises a second linear conversion module and a Sigmoid function coupled to each other. Preferably, the first and second linear conversion modules (Linear) are fully connected neural network layers.

Thus, the question and the answer each pass through the first encoder (word vector encoding) to generate embedded word vectors, and then pass in sequence through the first recurrent network (Bi-LSTM layer) and the attention network (Bi-Attention layer) of the first semantic extraction module to obtain the output sequences H_b and H_c, respectively. The attribute features pass through the second encoder (one-hot encoding) to generate a one-hot code, and then through the first linear conversion module (Linear, a fully connected neural network layer) to generate the sequence H_a. The knowledge point features are input in sequence into the first encoder (word vector layer) and the second recurrent network (Bi-LSTM layer) to obtain the output sequence H_f.

Finally, the output sequences H_a, H_f, H_b and H_c are concatenated and linearly converted again by the second linear conversion module, i.e. a fully connected neural network layer, so that the vector is mapped to the dimension of the output layer; this prepares it for the subsequent Sigmoid function, which outputs a probability value between 0 and 1. For example, if the output is 0.9, the answer is considered very likely to be a true answer; if the output is 0.1, the answer is considered very likely not to be a true answer.
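The fusion and output layers can be sketched in a few lines. This assumes each of H_a, H_f, H_b and H_c has already been reduced to a fixed-length vector, and it uses a single output unit; the weights and bias below are placeholders, not trained values.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fuse(*vectors: Sequence[float]) -> Vector:
    """Fusion layer: concatenate the semantic vectors in a fixed order."""
    fused: Vector = []
    for vector in vectors:
        fused.extend(vector)
    return fused


def predict(h_a: Vector, h_f: Vector, h_b: Vector, h_c: Vector,
            weights: Vector, bias: float) -> float:
    """Output layer: fully connected layer to one unit, then Sigmoid."""
    fused = fuse(h_a, h_f, h_b, h_c)
    if len(weights) != len(fused):
        raise ValueError("weight vector must match the fused dimension")
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return sigmoid(logit)


# Placeholder vectors and weights for illustration only.
probability = predict([0.0], [0.0], [0.0], [0.0], [1.0, 1.0, 1.0, 1.0], 0.0)
```

Any concatenation order works as long as it is the same during training and detection, since the fully connected layer learns one weight per position.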

It should be noted that the word-embedding (word vector) encoding, one-hot encoding, Bi-LSTM and Bi-Attention semantic extraction, semantic fusion, linear conversion and Sigmoid function prediction mentioned above are all mature technologies in the field. Those skilled in the art can set the structure and parameters of each part of the model as needed and train the model; the invention does not specifically limit these details. During model training, the question, the answer, the attribute features and the knowledge point features are each input to the corresponding part of the model, converted into the corresponding semantic sequences, and then concatenated, after which the prediction probability value is output. The parameters of the model are then adjusted based on the actual label value and updated iteratively until the predicted probability value is as close as possible to the actual label value and the model loss function is minimized, yielding the trained model.
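The iterative update can be illustrated on the output layer alone. The patent names neither a loss function nor an optimizer, so binary cross-entropy and plain stochastic gradient descent are assumptions here, and the fused input vectors are toy data; a real implementation would backpropagate through the encoders and recurrent layers as well.

```python
import math
from typing import List, Tuple

Vector = List[float]


def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x: Vector, w: Vector, b: float) -> float:
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def bce(p: float, y: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy between a predicted probability and a 0/1 label."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps))


def train(samples: List[Vector], labels: List[int],
          lr: float = 0.5, epochs: int = 200) -> Tuple[Vector, float, List[float]]:
    """Fit the output layer by SGD; returns weights, bias and per-epoch mean loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in zip(samples, labels):
            p = predict_proba(x, w, b)
            total += bce(p, y)
            grad = p - y  # derivative of BCE with respect to the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
        losses.append(total / len(samples))
    return w, b, losses


# Toy fused vectors with labels (1 = true answer, 0 = false answer).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2], [0.1, 1.0]]
Y = [1, 0, 1, 0]
w, b, losses = train(X, Y)
```

Tracking the per-epoch loss gives a simple stopping criterion and also makes it easy to save intermediate models during the training iterations.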

Fig. 5 shows a flow diagram of a content authenticity detection method 500 according to an embodiment of the invention. The method 500 is performed in a computing device, such as the computing device 100. As shown in fig. 5, the method begins at step S510.

In step S510, a to-be-tested question, a to-be-tested answer, and to-be-tested attribute features related to the to-be-tested question and the to-be-tested answer are obtained. The question section and the question type among the to-be-tested attribute features can be determined by performing keyword analysis or semantic recognition on the question in combination with the question-answer library.

Subsequently, in step S520, the corresponding to-be-tested knowledge point features are searched based on the to-be-tested attribute features and the to-be-tested answer, where the knowledge point features are the n knowledge items found. The knowledge point features are searched as follows: a plurality of knowledge items related to the attribute features are searched in the knowledge base, and the plurality of knowledge items are ranked based on the keywords of the question to obtain the top n knowledge items.
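The two-stage search (filter by attribute feature, then rank by question keywords and keep the top n) might look like the sketch below. The knowledge base layout (`tags` and `text` fields), the keyword-overlap score and the sample entries are assumptions made for illustration; the specification does not prescribe a storage format or a ranking function.

```python
def find_knowledge_points(knowledge_base, attribute, keywords, n):
    """Return the texts of the top-n knowledge items for one attribute."""
    # Stage 1: keep only knowledge items related to the attribute feature.
    candidates = [item for item in knowledge_base if attribute in item["tags"]]
    wanted = {k.lower() for k in keywords}

    def score(item):
        # Assumed relevance measure: number of question keywords in the item.
        return len(wanted & set(item["text"].lower().split()))

    # Stage 2: rank by keyword overlap; sorted() is stable, so ties keep
    # their original knowledge base order.
    ranked = sorted(candidates, key=score, reverse=True)
    return [item["text"] for item in ranked[:n]]


# Hypothetical knowledge base entries.
kb = [
    {"tags": {"engine"}, "text": "turbo engine oil change interval"},
    {"tags": {"engine"}, "text": "engine displacement and power"},
    {"tags": {"tires"}, "text": "tire pressure for winter"},
    {"tags": {"engine"}, "text": "recommended oil grade for turbo engine"},
]
top = find_knowledge_points(kb, "engine", ["oil", "turbo"], n=2)
print(top)
```

A production system would more likely use an inverted index or a retrieval score such as TF-IDF in place of the simple overlap count.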

Subsequently, in step S530, the question to be tested, the answer to be tested, the attribute features to be tested, and the knowledge point features to be tested are used as a sample to be tested and are respectively input into a plurality of trained content authenticity detection models, so as to obtain a plurality of authenticity probability values.

The plurality of content authenticity detection models are trained by the method 200: multiple models can be saved during the training iterations, and each of them predicts the sample to be tested, yielding one prediction probability value per model. Here, the question to be tested and the answer to be tested are word-vector encoded and input into a recurrent network to obtain semantic information, and semantic vector information is obtained after information fusion through bidirectional attention. The attribute features are one-hot encoded and linearly converted to obtain an attribute vector. The knowledge point features are word-vector encoded and then input into a recurrent network to obtain a knowledge point vector. The semantic vector information, the knowledge point vector and the attribute vector are spliced and input into the output layer to judge authenticity.
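As an illustration of the one-hot coding of attribute features, the following sketch encodes two categorical attributes and splices the results. The attribute names, their vocabularies and the alphabetical ordering of attributes are hypothetical choices; numeric user information such as the number of postings would need its own handling, which is not shown.

```python
def one_hot(value, vocabulary):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1  # raises ValueError for unknown values
    return vec


def encode_attributes(attributes, vocabularies):
    """One-hot encode each attribute and splice the results in a fixed order."""
    encoded = []
    for name in sorted(vocabularies):  # fixed order keeps positions stable
        encoded.extend(one_hot(attributes[name], vocabularies[name]))
    return encoded


# Hypothetical vocabularies for two attribute features.
vocabularies = {
    "question_type": ["how_to", "opinion", "fact"],
    "section": ["engine", "tires"],
}
vec = encode_attributes({"section": "tires", "question_type": "fact"}, vocabularies)
print(vec)  # [0, 0, 1, 0, 1]
```

The resulting vector is what the first linear conversion module would receive as input.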

Subsequently, in step S540, the authenticity of the answer to be tested is determined by comparing the plurality of authenticity probability values. Because the same question-answer pair is judged by the votes of a plurality of trained models, the accuracy of the result can be improved.

In one implementation mode, if at least half of the values in the plurality of authenticity probability values are greater than or equal to a preset probability threshold value, determining that the answer to be detected is a real answer; otherwise, the answer is false. The probability threshold may be, for example, 0.7, but is not limited thereto, and a person skilled in the art may set the value thereof by himself or herself. If there are five models, two of the output values of the five models are less than 0.7, and three are greater than 0.7, it represents that the answer to be tested is a real answer.
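The voting rule of this implementation can be written as a short sketch. The function name and the default threshold of 0.7 (the example value given above) are illustrative.

```python
def majority_vote(probabilities, threshold=0.7):
    """Return True if at least half of the values reach the threshold."""
    if not probabilities:
        raise ValueError("at least one probability value is required")
    votes = sum(1 for p in probabilities if p >= threshold)
    # "At least half" is checked with integers to avoid rounding issues.
    return 2 * votes >= len(probabilities)


# Five models, three of which output at least 0.7: judged a real answer.
print(majority_vote([0.9, 0.8, 0.75, 0.6, 0.5]))  # True
# Only one of five models reaches 0.7: judged a false answer.
print(majority_vote([0.9, 0.6, 0.5, 0.4, 0.3]))  # False
```

Note that with an even number of models an exact tie counts as a real answer under the "at least half" wording.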

In another implementation, the average value of all the prediction probability values is calculated; if the average value is greater than or equal to the probability threshold, the answer to be tested is considered to be true; otherwise, it is considered to be false. Of course, there are other comparison methods, and those skilled in the art can set the comparison rule as needed; the present invention is not limited thereto.

In addition, a weight can be set for each of the plurality of trained models: the more training iterations a model has undergone, that is, the closer it is to the final model, the higher its weight. For example, if five models are saved in sequence during training, their weights increase gradually, and the probability value predicted by each model is weighted by the corresponding weight to give that model's actual probability value. The five actual probability values are then compared to determine whether the answer to be tested is a real answer.
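The text leaves open how the weighted values are compared. One possible reading, shown here only as an assumption, is to combine them into a normalized weighted average and compare that with the probability threshold; the function names and the weights 1 to 5 are hypothetical.

```python
def weighted_probability(probabilities, weights):
    """Normalized weighted average of the per-model probability values."""
    if len(probabilities) != len(weights) or not probabilities:
        raise ValueError("need one positive weight per probability value")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probabilities, weights)) / total


def is_real_answer(probabilities, weights, threshold=0.7):
    """Assumed decision rule: weighted average compared with the threshold."""
    return weighted_probability(probabilities, weights) >= threshold


# Models listed from earliest to latest snapshot; later ones weigh more.
probs = [0.9, 0.8, 0.75, 0.6, 0.5]
late_heavy = [1, 2, 3, 4, 5]
print(is_real_answer(probs, late_heavy))       # False: later models are less sure
print(is_real_answer(probs, [1, 1, 1, 1, 1]))  # True: equal weights, mean 0.71
```

With equal weights this reduces to the plain average of the prediction probability values, so the same answer can be judged differently once later snapshots are trusted more.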

Fig. 6 shows a block diagram of a training apparatus 600 for a content authenticity detection model according to an embodiment of the present invention, the model being adapted to output an authenticity probability of an answer. Apparatus 600 may reside in a computing device, such as computing device 100. As shown in FIG. 6, apparatus 600 includes a training set generation module 610 and a model training module 620.

The training set generation module 610 obtains a plurality of training samples with label data, the training samples including a question, an answer, attribute features related to the question and the answer, and knowledge point features related to the question and the attribute features. The knowledge point features are n knowledge items found in a knowledge base based on the attribute features and the keywords of the question. When searching for knowledge points, the training set generation module 610 may search for a plurality of knowledge items related to the attribute features in the knowledge base, and rank the plurality of knowledge items based on the keywords of the question to obtain the top n knowledge items as the knowledge point features. The training set generation module 610 may perform processing corresponding to the processing described above in step S210, and details thereof are not repeated here.

The model training module 620 inputs the training samples into the content authenticity detection model to be trained for processing to obtain the prediction probability of the training samples, and performs model training based on the label data to obtain the trained content authenticity detection model. The model training module 620 may perform processing corresponding to the processing described above in step S220, and the detailed description thereof is omitted.

Fig. 7 shows a block diagram of a content authenticity detection apparatus 700 according to an embodiment of the invention, which apparatus 700 may reside in a computing device, such as computing device 100. As shown in fig. 7, the apparatus includes a to-be-tested set generation module 710, an authenticity prediction module 720, and an authenticity determination module 730.

The to-be-tested set generation module 710 collects the to-be-tested question, the to-be-tested answer, and the to-be-tested attribute features related to the to-be-tested question and the to-be-tested answer, and searches the knowledge base for the corresponding to-be-tested knowledge point features based on the to-be-tested attribute features and the to-be-tested answer. When searching for knowledge points, the to-be-tested set generation module 710 may search for a plurality of knowledge items related to the attribute features in the knowledge base, and rank the plurality of knowledge items based on the keywords of the question to obtain the top n knowledge items as the knowledge point features. The to-be-tested set generation module 710 may perform processing corresponding to the processing described in steps S510 and S520 above, and details thereof are not repeated here.

The authenticity prediction module 720 takes the to-be-detected question, the to-be-detected answer, the to-be-detected attribute feature and the to-be-detected knowledge point feature as to-be-detected samples, and inputs the to-be-detected samples into the trained content authenticity detection models respectively to obtain a plurality of authenticity probability values. The authenticity prediction module 720 may perform processing corresponding to the processing described above in step S530, and a detailed description thereof will not be repeated.

The authenticity determination module 730 determines the authenticity of the answer to be tested by comparing the plurality of authenticity probability values. If at least half of the values in the multiple authenticity probability values are larger than or equal to a preset probability threshold, judging that the answer to be detected is a real answer; otherwise, the answer is false. The authenticity determination module 730 may perform processing corresponding to the processing described above in step S540, and the detailed description thereof will not be repeated here.

According to the technical scheme of the invention, introducing attribute features increases the amount of reference information, and external data (a knowledge graph, a knowledge base and the like) is used to detect the authenticity of forum reply content, which improves the accuracy of model prediction. Adding voting judgment by multiple models improves the reliability of the detection result.

A5. The method of A1, wherein the content authenticity detection model comprises: an encoder adapted to generate a word vector for each item of content in the training samples; a semantic extraction module adapted to generate corresponding semantic vectors based on the word vectors of each item of content; a fusion module adapted to splice and fuse all semantic vectors; and a prediction module adapted to predict the authenticity probability of the answer from the spliced and fused semantic vectors. A6. The method of A5, wherein the encoder comprises: a first encoder adapted to generate first to third word vectors corresponding to the question, the answer and the knowledge point features, respectively; and a second encoder adapted to generate a fourth word vector corresponding to the attribute features.

A7. The method of A6, wherein the semantic extraction module comprises: a first semantic extraction module adapted to generate a first semantic vector and a second semantic vector corresponding to the question and the answer based on the first word vector and the second word vector, respectively; a second semantic extraction module adapted to generate a third semantic vector corresponding to the knowledge point features based on the third word vector; and a first linear conversion module adapted to generate a fourth semantic vector corresponding to the attribute features based on the fourth word vector. A8. The method of A7, wherein the first semantic extraction module comprises: a first recurrent network adapted to extract semantic information of the question and the answer, respectively; and an attention network coupled to the first recurrent network and adapted to extract an association weight between the question and the answer and generate the first and second semantic vectors based on the association weight.

A9. The method of A7 or A8, wherein the second semantic extraction module is a second recurrent network, and the prediction module comprises a second linear conversion module and a Sigmoid function coupled to each other. A10. The method of A9, wherein the first encoder performs word vector encoding and the second encoder performs one-hot encoding; the first and second recurrent networks are bidirectional long short-term memory (Bi-LSTM) networks, and the attention network is a bidirectional attention network; and the first linear conversion module and the second linear conversion module are fully connected layers of a neural network.

The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy disks, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.

In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store program code, and the processor is configured to perform the method of the invention according to instructions in the program code stored in the memory.

By way of example, and not limitation, readable media may comprise readable storage media and communication media. Readable storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.

In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with examples of this invention. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.

As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense with respect to the scope of the invention, as defined in the appended claims.

Claims

1. A method of training a content authenticity detection model adapted to be executed in a computing device having stored therein a knowledge base comprising a plurality of knowledge items, the model being adapted to output a probability of authenticity of an answer, the method comprising the steps of:

acquiring a plurality of training samples with label data, wherein the training samples comprise questions, answers, attribute features related to the questions and the answers, and knowledge point features related to the questions and the attribute features, the label data are whether the answers are real, and the knowledge point features are n knowledge items found in the knowledge base based on the attribute features and keywords of the questions; and

inputting the training sample into a content authenticity detection model to be trained for processing to obtain the prediction probability of the training sample, and performing model training based on the label data to obtain the trained content authenticity detection model.

2. The method of claim 1, wherein the step of finding the knowledge point characteristic comprises:

searching a plurality of knowledge items related to the attribute characteristics in the knowledge base;

and ranking the plurality of knowledge items based on the keywords of the question to obtain the top n knowledge items as the knowledge point characteristics.

3. The method of claim 1, wherein,

the attribute features comprise at least one of question sections, question types, user information of questioners and respondents;

the user information includes at least one of a member level, a number of postings, a number of replies, a length of replies, and an authenticity of replies.

4. The method of claim 1, wherein the computing device further stores a question-answer library from which questions and answers in the training sample are obtained.

5. A content authenticity detection method, adapted to be executed in a computing device, the method comprising the steps of:

obtaining a question to be tested, an answer to be tested and attribute characteristics to be tested related to the question to be tested and the answer to be tested;

searching corresponding to-be-detected knowledge point characteristics based on the to-be-detected attribute characteristics and the to-be-detected answers;

taking the question to be detected, the answer to be detected, the attribute characteristic to be detected and the knowledge point characteristic to be detected as a sample to be detected, and inputting the sample into a plurality of trained content authenticity detection models respectively to obtain a plurality of authenticity probability values; and

determining the authenticity of the answer to be tested by comparing the plurality of authenticity probability values;

wherein the content authenticity detection model is generated by training using the method according to any of claims 1-10.

6. The method of claim 5, wherein said comparing the authenticity probability values to determine the authenticity of the answer to be tested comprises:

if at least half of the values in the plurality of authenticity probability values are larger than or equal to a preset probability threshold value, judging that the answer to be detected is a real answer; otherwise, the answer is false.

7. An apparatus for training a content authenticity detection model adapted to reside in a computing device having stored therein a knowledge base comprising a plurality of knowledge items, the model adapted to output a probability of authenticity of an answer, the apparatus comprising:

the training set generation module is suitable for acquiring a plurality of training samples with label data, wherein the training samples comprise questions, answers, attribute features related to the questions and the answers, and knowledge point features related to the questions and the attribute features, the label data are whether the answers are real, and the knowledge point features are n knowledge items searched in the knowledge base based on the attribute features and keywords of the questions; and

the model training module is suitable for inputting the training sample into a content authenticity detection model to be trained for processing to obtain the prediction probability of the training sample, and performing model training based on the label data to obtain the trained content authenticity detection model.

8. A content authenticity detection apparatus adapted to reside in a computing device, comprising:

the to-be-detected set generation module is suitable for acquiring a to-be-detected question, a to-be-detected answer and to-be-detected attribute characteristics related to the to-be-detected question and the to-be-detected answer, and searching corresponding to-be-detected knowledge point characteristics from a knowledge base on the basis of the to-be-detected attribute characteristics and the to-be-detected answer;

the authenticity prediction module is suitable for inputting the to-be-detected question, the to-be-detected answer, the to-be-detected attribute feature and the to-be-detected knowledge point feature as to-be-detected samples into a plurality of trained content authenticity detection models respectively to obtain a plurality of authenticity probability values; and

the authenticity determining module is suitable for comparing the plurality of authenticity probability values to determine the authenticity of the answer to be detected.

9. A computing device, comprising:

a memory;

one or more processors;

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods of claims 1-6.

10. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-6.