A Behavior Recognition Method and Related Device Based on Natural Semantic Understanding
This application claims priority to Chinese patent application No. 201910529267.2, filed with the China Patent Office on June 18, 2019 and entitled "An Anti-Cheating Method and Related Device Based on Natural Semantic Understanding", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to a behavior recognition method and related device based on natural semantic understanding.
Background
Many recruitment processes currently include a written examination, and non-compliant behavior (such as cheating) in written examinations has long been common. Many enterprises currently identify non-compliant behavior by manual screening and comparison. This is feasible when the number of applicants is small, but not when it is large. With the development of artificial intelligence, some enterprises have tried to identify non-compliant behavior by computer. The current approach directly compares two documents: if their contents are identical, non-compliance is deemed to exist; otherwise it is deemed not to exist. An offender can easily evade this kind of detection, for example by slightly changing keywords in an answer (such as substituting synonyms), or by slightly rearranging the order of the sentences in a document. After such keyword changes and sentence reordering, the computer no longer detects non-compliance, even though non-compliance has in fact occurred. How to identify non-compliant behavior by computer more accurately and efficiently is a technical problem being studied by those skilled in the art.
Summary
The embodiments of this application disclose a behavior recognition method and device based on natural semantic understanding, which can identify cheating behavior more accurately.
In a first aspect, an embodiment of this application provides a behavior recognition method based on natural semantic understanding, the method comprising:
extracting, by a word segmentation algorithm in an autoencoder model, text features from a plurality of sentences in a first document to form a plurality of first vectors, wherein the text features in each sentence form one first vector;
training the plurality of first vectors through an attention network in the autoencoder model to obtain an attention weight of each first vector of the plurality of first vectors;
inputting the plurality of first vectors and the attention weight of each first vector of the plurality of first vectors into a Long Short-Term Memory (LSTM) network in the autoencoder model for training, to generate a first semantic vector;
decoding the first semantic vector through the LSTM to obtain a plurality of first decoded vectors; and
if the plurality of first decoded vectors and the plurality of first vectors satisfy a preset similarity condition, comparing the first semantic vector with a second semantic vector of a second document to determine whether a target behavior exists.
By implementing the above method, the extracted text features better reflect the semantics of the sentences themselves. In addition, the encoding layer uses an LSTM to generate the semantic vector, which better characterizes the semantics of the document.
In a second aspect, an embodiment of this application provides a behavior recognition device based on natural semantic understanding, the device comprising:
a first extraction unit, configured to extract, by a word segmentation algorithm in an autoencoder model, text features from a plurality of sentences in a first document to form a plurality of first vectors, wherein the text features in each sentence form one first vector;
a first training unit, configured to train the plurality of first vectors through an attention network in the autoencoder model to obtain an attention weight of each first vector of the plurality of first vectors;
a first generating unit, configured to input the plurality of first vectors and the attention weight of each first vector of the plurality of first vectors into a Long Short-Term Memory (LSTM) network in the autoencoder model for training, to generate a first semantic vector;
a first decoding unit, configured to decode the first semantic vector through the LSTM to obtain a plurality of first decoded vectors; and
a comparison unit, configured to, if the plurality of first decoded vectors and the plurality of first vectors satisfy a preset similarity condition, compare the first semantic vector with a second semantic vector of a second document to determine whether a target behavior exists.
By running the above units, the extracted text features better reflect the semantics of the sentences themselves. In addition, the encoding layer uses an LSTM to generate the semantic vector, which better characterizes the semantics of the document.
In a third aspect, an embodiment of this application provides a device comprising a processor and a memory, wherein the memory is configured to store instructions which, when run on the processor, implement the method described in the first aspect or in any possible implementation of the first aspect.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium storing instructions which, when run on a processor, implement the method described in the first aspect or in any possible implementation of the first aspect.
In a fifth aspect, an embodiment of this application provides a non-volatile computer-readable storage medium storing instructions which, when run on a processor, implement the method described in the first aspect or in any possible implementation of the first aspect.
In a sixth aspect, an embodiment of this application provides a computer program product which, when run on a processor, implements the method described in the first aspect or in any possible implementation of the first aspect.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings used in the embodiments of this application or in the background are briefly introduced below.
FIG. 1 is a schematic flowchart of a behavior recognition method based on natural semantic understanding provided by an embodiment of this application;
FIG. 2 is a schematic structural diagram of a device provided by an embodiment of this application;
FIG. 3 is a schematic structural diagram of another device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below with reference to the drawings.
The main idea of the embodiments of this application is to obtain the semantic vector of a document through an autoencoder (AE) model and then compare the semantic vectors of two documents. If the two semantic vectors are close, the two documents are similar, and it is thereby determined that the target behavior exists. The autoencoder model includes an encoding layer and a decoding layer. The encoding layer includes a word segmentation algorithm (for example, a Convolutional Neural Network (CNN)), an attention network, and a Long Short-Term Memory (LSTM) network; the decoding layer includes an LSTM.
The word segmentation algorithm extracts text features from a document sentence by sentence to form text vectors. The attention network trains the plurality of text vectors to obtain an attention weight for each text vector; generally, if the words represented by the text features are more important, the text vector receives a higher attention weight. The LSTM is used in the encoding layer to train a semantic vector from the text features and their attention weights, and is also used in the decoding layer to decode that semantic vector; the vectors obtained by decoding may be called decoded vectors. The goal of the autoencoder model is to make the decoded vectors converge as closely as possible to the word vectors used in the encoding stage. Once they converge to a sufficient degree, the semantic vector produced by the LSTM encoder in the autoencoder model can essentially represent the semantics of the corresponding text.
Identifying a target behavior (such as cheating) usually involves comparing two documents (for example, the answer sheets of two applicants, or one applicant's answer sheet and the standard answer). For ease of description, these two documents are referred to below as the first document and the second document.
Referring to FIG. 1, FIG. 1 shows a behavior recognition method based on natural semantic understanding provided by an embodiment of this application. The method may be implemented based on the autoencoder model shown in FIG. 1. The device that performs the method may be a single hardware device (such as a server) or a cluster of multiple hardware devices (such as a server cluster). The method includes, but is not limited to, the following steps:
Step S101: The device extracts, by the word segmentation algorithm in the autoencoder model, text features from a plurality of sentences in the first document to form a plurality of first vectors.
Specifically, the word segmentation algorithm may be a convolutional neural network (CNN), which can denoise sentences and remove redundancy well (filtering out characters or words in a sentence that have little or no influence). In addition, the model parameters of the word segmentation algorithm may include parameters previously obtained by training on a large number of other documents, and may also include manually configured parameters.
In the embodiments of this application, text features are extracted from the first document sentence by sentence to form feature vectors. For example, if the first document includes 20 sentences, text features may be extracted from each sentence, and the text features in each sentence form one feature vector. To distinguish them from the feature vectors extracted later from the second document, the feature vectors formed from text features extracted from the first document are called first vectors, and those formed from text features extracted from the second document are called second vectors. Optionally, if the first document includes 20 sentences, text features may be extracted from only some of them (for example, 18 sentences selected from the 20 by a predefined algorithm); the text features in each selected sentence still form one feature vector.
For example, suppose the first document contains the sentence "我的爱好是打篮球和乒乓球" ("My hobbies are playing basketball and table tennis"), and the text features extracted from it by the word segmentation algorithm are "我" (I), "的" (possessive particle), "爱好" (hobby), "是" (is), "打" (play), "篮球" (basketball), "和" (and), and "乒乓球" (table tennis). When the first vector is determined from these text features (that is, words), all of the words may be used directly, or only some of them may be selected. Words may be converted to vectors using one-hot encoding or pre-trained word vectors. Optionally, if all of the words are converted, the feature vector obtained from these 8 text features may be a first vector X11 = (t1, t2, t3, t4, t5, t6, t7, t8), where t1 represents "我", t2 represents "的", t3 represents "爱好", t4 represents "是", t5 represents "打", t6 represents "篮球", t7 represents "和", and t8 represents "乒乓球". A plurality of first vectors can be obtained in this way.
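The one-hot option mentioned above can be sketched as follows. This is only an illustration under assumptions: the helper names `build_vocabulary` and `one_hot` are hypothetical, and the vocabulary is built from the single example sentence rather than from a real corpus.

```python
from typing import Dict, List


def build_vocabulary(words: List[str]) -> Dict[str, int]:
    """Assign each distinct word an index in order of first appearance."""
    vocab: Dict[str, int] = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def one_hot(word: str, vocab: Dict[str, int]) -> List[int]:
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


# The eight text features t1..t8 from the example sentence.
features = ["我", "的", "爱好", "是", "打", "篮球", "和", "乒乓球"]
vocab = build_vocabulary(features)
# One one-hot row per text feature; together they play the role of X11.
x11 = [one_hot(word, vocab) for word in features]
```

Pre-trained word vectors would replace `one_hot` with a lookup into an embedding table; the rest of the flow would stay the same.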
Step S102: The device trains the plurality of first vectors through the attention network in the autoencoder model to obtain the attention weight of each first vector of the plurality of first vectors.
Specifically, the attention network characterizes the importance of the different first vectors. The model parameters of the attention network may include parameters obtained by training on a large number of other vectors (including important and unimportant vectors), and may also include manually set parameters. Therefore, after the plurality of first vectors are input into the attention network, the attention weight of each first vector can be obtained. The higher the attention weight of a first vector, the greater its role in expressing the semantics.
For example, suppose the plurality of first vectors are X11, X12, X13, X14, X15, X16, X17, X18, X19, and X10, and the attention weights of these first vectors obtained through attention network training are as shown in Table 1:
Table 1
First vector | Attention weight
X11 | 0.01
X12 | 0.05
X13 | 0.1
X14 | 0.2
X15 | 0.05
X16 | 0.09
X17 | 0.091
X18 | 0.009
X19 | 0.3
X10 | 0.1
As can be seen from Table 1, the first vectors X19, X14, X13, and X10 have relatively large attention weights. These first vectors are therefore expected to carry more information than the other first vectors in expressing the semantics of the first document.
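As a small check of the reading of Table 1, the sketch below ranks the listed weights and picks the four largest. The function name `top_weighted` is a hypothetical helper; the weights are copied from Table 1 and are not produced by a trained attention network here.

```python
from typing import Dict, List

# Attention weights copied from Table 1.
attention_weights: Dict[str, float] = {
    "X11": 0.01, "X12": 0.05, "X13": 0.1, "X14": 0.2, "X15": 0.05,
    "X16": 0.09, "X17": 0.091, "X18": 0.009, "X19": 0.3, "X10": 0.1,
}


def top_weighted(weights: Dict[str, float], k: int) -> List[str]:
    """Return the k vector names with the largest weights, highest first.

    Ties keep their original table order because Python's sort is stable.
    """
    return sorted(weights, key=weights.get, reverse=True)[:k]


top4 = top_weighted(attention_weights, 4)
total_weight = sum(attention_weights.values())
```

With these values, the ten weights add up to 1, as would be expected if the attention network normalizes its scores (for example with a softmax, which is an assumption and is not stated in the text).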
Step S103: The device inputs the plurality of first vectors and the attention weight of each first vector of the plurality of first vectors into the Long Short-Term Memory (LSTM) network in the autoencoder model for training, to generate a first semantic vector.
Specifically, the LSTM can generate a semantic vector from feature vectors that represent words. In the embodiments of this application, the LSTM generates the first semantic vector based not only on the input first vectors but also on the attention weight of each first vector, so that the semantics lean more toward the first vectors with larger attention weights. For example, if the first vector X19 mainly expresses a meaning such as "like" while the first vector X15 mainly expresses a meaning such as "dislike", and the attention weight of the first vector X19 is much greater than that of the first vector X15, then the generated first semantic vector is more inclined to express the meaning of "like".
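The effect of the attention weights can be illustrated with a deliberately simplified stand-in. The sketch below uses an attention-weighted average instead of an LSTM, so it is not the encoder described here; the two-dimensional "like/dislike" vectors and the function name `weighted_summary` are hypothetical.

```python
from typing import List, Sequence


def weighted_summary(
    vectors: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Attention-weighted average of equally sized vectors.

    A simplified stand-in for the LSTM encoder: it only shows how larger
    weights pull the summary toward the corresponding vectors.
    """
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per vector")
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


# Hypothetical 2-D axes: [like, dislike].
x19 = [1.0, 0.0]  # mainly expresses "like"
x15 = [0.0, 1.0]  # mainly expresses "dislike"
# Weights 0.3 and 0.05 are the Table 1 values for X19 and X15.
semantic = weighted_summary([x19, x15], [0.3, 0.05])
```

Because X19 carries six times the weight of X15, the first component of `semantic` dominates, which mirrors the "like" outcome in the example.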
The process by which the LSTM obtains the first semantic vector from the plurality of first vectors and the corresponding attention weights can be regarded as an encoding process: before encoding there are multiple vectors, and after encoding there is one vector. Table 2 illustrates the vectors before and after encoding.
Table 2
Step S104: The device decodes the first semantic vector through the LSTM to obtain a plurality of first decoded vectors.
Specifically, after the first semantic vector is obtained through the LSTM in the encoding layer, it is decoded through the LSTM in the decoding layer. For ease of description, the vectors obtained by decoding are called first decoded vectors. Before decoding there is one vector, and after decoding there are multiple vectors. Table 3 illustrates the vectors before and after decoding.
Table 3
The goal of the autoencoder in the embodiments of this application is to make the plurality of first decoded vectors obtained by the LSTM in the decoding layer converge toward the plurality of first vectors obtained by the word segmentation algorithm, that is, to make the first decoded vectors as close as possible to the first vectors (the required degree of convergence may be specified by a predefined loss function). Generally, steps S101-S104 need to be performed multiple times. After each execution of steps S101-S104, if the first decoded vectors and the first vectors do not meet the expected similarity condition, the model parameters of at least one of the word segmentation algorithm, the attention network, and the LSTM in the autoencoder model are optimized, and steps S101-S104 are performed again. This loop repeats until the first decoded vectors and the first vectors meet the expected similarity condition.
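The encode, decode, check, and optimize loop can be sketched with a toy model. The example below is an assumption-heavy illustration: it replaces the CNN, attention network, and LSTM with a linear autoencoder that has one scalar parameter per layer, uses mean squared error as the loss function, and uses plain gradient descent as the optimizer. None of these choices comes from the text.

```python
from typing import Sequence, Tuple


def train_until_similar(
    vectors: Sequence[Sequence[float]],
    tol: float = 1e-4,
    lr: float = 0.05,
    max_rounds: int = 10_000,
) -> Tuple[float, float, float, int]:
    """Toy linear autoencoder: encode z = a * x, decode x_hat = b * z.

    Repeats encode/decode, checks the reconstruction loss, and updates the
    parameters by gradient descent until the loss is at most `tol`.
    Returns (a, b, final_loss, rounds_used).
    """
    a, b = 0.5, 0.5  # arbitrary starting parameters
    n = sum(len(v) for v in vectors)
    for round_no in range(1, max_rounds + 1):
        loss = grad_a = grad_b = 0.0
        for v in vectors:
            for x in v:
                err = b * a * x - x  # decoded value minus original value
                loss += err * err
                grad_a += 2 * err * b * x
                grad_b += 2 * err * a * x
        loss /= n
        if loss <= tol:  # similarity condition met: stop iterating
            return a, b, loss, round_no
        a -= lr * grad_a / n  # otherwise optimize the parameters and repeat
        b -= lr * grad_b / n
    raise RuntimeError("did not converge within max_rounds")


a, b, final_loss, rounds = train_until_similar([[0.2, 0.9], [0.5, 0.7]])
```

The loop stops only when the reconstruction is close enough, which is the role the expected similarity condition plays for steps S101-S104.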
The expected similarity condition (also called the preset similarity condition) may be configured in the autoencoder model, so that the model is able to judge whether the expected similarity condition has been met. A simple case is described below to illustrate how it is judged whether the first decoded vectors and the first vectors meet the expected similarity condition (more complex rules may be configured in practice).
For example, it is defined that if 70% or more of the first decoded vectors obtained by decoding are identical to the corresponding first vectors, the first decoded vectors and the first vectors are considered to meet the expected similarity condition. Suppose there are 10 first vectors and 10 first decoded vectors after decoding, of which 8 first vectors are identical, in one-to-one correspondence, to 8 first decoded vectors, and only the remaining 2 first decoded vectors have no identical first vector. The match rate is 80%, which is greater than the specified 70%, so these 10 first decoded vectors and these 10 first vectors are considered to meet the expected similarity condition.
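The 70% rule in this example can be expressed as a short check. This is a minimal sketch: the function names are hypothetical, vectors are compared position by position within a small numeric tolerance (an assumption, since the text only says "identical"), and the boundary value of exactly 70% is treated as passing.

```python
import math
from typing import Sequence


def vectors_match(a: Sequence[float], b: Sequence[float], tol: float = 1e-6) -> bool:
    """True if both vectors have the same length and elementwise-close values."""
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=tol) for x, y in zip(a, b)
    )


def meets_similarity_condition(
    originals: Sequence[Sequence[float]],
    decoded: Sequence[Sequence[float]],
    min_ratio: float = 0.7,
) -> bool:
    """True if at least `min_ratio` of decoded vectors equal their originals."""
    if not originals or len(originals) != len(decoded):
        raise ValueError("need the same non-zero number of vectors on both sides")
    matches = sum(vectors_match(o, d) for o, d in zip(originals, decoded))
    return matches / len(originals) >= min_ratio


# Ten hypothetical first vectors; eight are reproduced exactly by the decoder.
originals = [[float(i), float(i + 1)] for i in range(10)]
decoded = [list(v) for v in originals]
decoded[8] = [0.0, 0.0]
decoded[9] = [-1.0, -1.0]
condition_met = meets_similarity_condition(originals, decoded)
```

Here 8 of 10 decoded vectors match, giving a rate of 80%, so `condition_met` is true under the 70% rule.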
Step S105: The device extracts, by the word segmentation algorithm in the autoencoder model, text features from a plurality of sentences in the second document to form a plurality of second vectors.
Specifically, in the embodiments of this application, text features are extracted from the second document sentence by sentence to form feature vectors. For example, if the second document includes 16 sentences, text features may be extracted from each sentence, and the text features in each sentence form one feature vector. To distinguish them from the feature vectors extracted from the first document, the feature vectors formed from text features extracted from the second document are called second vectors, and those formed from text features extracted from the first document are called first vectors. Optionally, if the second document includes 16 sentences, text features may be extracted from only some of them (for example, 15 sentences selected from the 16 by a predefined algorithm); the text features in each selected sentence still form one feature vector.
For example, suppose the second document contains the sentence "我的爱好是打篮球和羽毛球" ("My hobbies are playing basketball and badminton"), and the text features extracted from it by the word segmentation algorithm are "爱好" (hobby), "篮球" (basketball), and "羽毛球" (badminton). The feature vector obtained from these three text features may then be a second vector X21 = (t1, t2, t4), where t1 represents "爱好", t2 represents "篮球", and t4 represents "羽毛球". A plurality of second vectors can be obtained in this way.
Step S106: The plurality of second vectors are trained through the attention network in the autoencoder model to obtain the attention weight of each second vector of the plurality of second vectors.
Specifically, the attention network characterizes the importance of the different second vectors. The model parameters of the attention network may include parameters obtained by training on a large number of other vectors (including important and unimportant vectors), and may also include manually set parameters. Therefore, after the plurality of second vectors are input into the attention network, the attention weight of each second vector can be obtained. The higher the attention weight of a second vector, the greater its role in expressing the semantics.
For example, suppose the plurality of second vectors are X21, X22, X23, X24, X25, X26, X27, X28, X29, and X20, and the attention weights of these second vectors obtained through attention network training are as shown in Table 4:
Table 4
Second vector | Attention weight
X21 | 0.02
X22 | 0.04
X23 | 0.15
X24 | 0.15
X25 | 0.04
X26 | 0.1
X27 | 0.09
X28 | 0.01
X29 | 0.3
X20 | 0.1
As can be seen from Table 4, the second vectors X29, X24, X23, and X20 have relatively large attention weights. These second vectors are therefore expected to carry more information than the other second vectors in expressing the semantics of the second document.
Step S107: The plurality of second vectors and the attention weight of each second vector of the plurality of second vectors are input into the Long Short-Term Memory (LSTM) network in the autoencoder model for training, to generate a second semantic vector.
Specifically, the LSTM can generate a semantic vector from feature vectors that represent words. In the embodiments of this application, the LSTM generates the second semantic vector based not only on the input second vectors but also on the attention weight of each second vector, so that the semantics lean more toward the second vectors with larger attention weights. For example, if the second vector X29 mainly expresses a meaning such as "happy" while the second vector X25 mainly expresses a meaning such as "irritable", and the attention weight of the second vector X29 is much greater than that of the second vector X25, then the generated second semantic vector is more inclined to express the meaning of "happy".
The process by which the LSTM obtains the second semantic vector from the plurality of second vectors and the corresponding attention weights can be regarded as an encoding process: before encoding there are multiple vectors, and after encoding there is one vector. Table 5 illustrates the vectors before and after encoding.
Table 5
Step S108: The second semantic vector is decoded through the LSTM to obtain a plurality of second decoded vectors.
Specifically, after the second semantic vector is obtained through the LSTM in the encoding layer, it is decoded through the LSTM in the decoding layer. For ease of description, the vectors obtained by decoding are called second decoded vectors. Before decoding there is one vector, and after decoding there are multiple vectors. Table 6 illustrates the vectors before and after decoding.
Table 6
The goal of the autoencoder in the embodiments of this application is to make the plurality of second decoded vectors obtained by the LSTM in the decoding layer converge toward the plurality of second vectors obtained by the word segmentation algorithm, that is, to make the second decoded vectors as close as possible to the second vectors. Generally, steps S105-S108 need to be performed multiple times. After each execution of steps S105-S108, if the second decoded vectors and the second vectors do not meet the expected similarity condition, the model parameters of at least one of the word segmentation algorithm, the attention network, and the LSTM in the autoencoder model are optimized, and steps S105-S108 are performed again. This loop repeats until the second decoded vectors and the second vectors meet the expected similarity condition.
The expected similarity condition (also called the preset similarity condition) may be configured in the autoencoder model, so that the model is able to judge whether the expected similarity condition has been met. A simple case is described below to illustrate how it is judged whether the second decoded vectors and the second vectors meet the expected similarity condition (more complex rules may be configured in practice).
For example, it is defined that if 70% or more of the second decoded vectors obtained by decoding are identical to the corresponding second vectors, the second decoded vectors and the second vectors are considered to meet the expected similarity condition. Suppose there are 10 second vectors and 10 second decoded vectors after decoding, of which 8 second vectors are identical, in one-to-one correspondence, to 8 second decoded vectors, and only the remaining 2 second decoded vectors have no identical second vector. The match rate is 80%, which is greater than the specified 70%, so these 10 second decoded vectors and these 10 second vectors are considered to meet the expected similarity condition.
Step S109: The device compares the first semantic vector with the second semantic vector of the second document to determine whether a target behavior (such as cheating) exists.
Specifically, when the multiple first decoded vectors and the multiple first vectors satisfy the preset similarity condition, the first semantic vector reflects the semantics of the first document well; when the multiple second decoded vectors and the multiple second vectors satisfy the preset similarity condition, the second semantic vector reflects the semantics of the second document well. Therefore, when both conditions are satisfied, the similarity between the first semantic vector and the second semantic vector reflects the similarity between the first document and the second document. There are many ways to compare the similarity of the first semantic vector and the second semantic vector; an example is given below.
For example, comparing the first semantic vector with the second semantic vector of the second document to determine whether the target behavior exists may specifically be: determining the cosine of the angle between the first semantic vector and the second semantic vector; if the cosine value is greater than or equal to a preset threshold, the semantics of the first document and the second document are considered very similar, and it is therefore determined that the target behavior exists. The preset threshold can be set according to actual needs; optionally, it can be set to a value between 0.6 and 0.9.
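The cosine comparison described above can be sketched as follows. The function names and the default threshold of 0.8 are illustrative assumptions; the text only states that the threshold may optionally lie between 0.6 and 0.9.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; returns 0.0 if either vector is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def target_behavior_detected(
    first_semantic: Sequence[float],
    second_semantic: Sequence[float],
    threshold: float = 0.8,  # assumed value inside the 0.6-0.9 range from the text
) -> bool:
    """Flag the target behavior when the cosine value reaches the preset threshold."""
    return cosine_similarity(first_semantic, second_semantic) >= threshold
```

A lower threshold flags more document pairs and a higher one flags fewer, so the value would be tuned to the scenario.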
In an optional solution, keyword replacement is performed on the first document before step S101 is performed, and on the second document before step S102 is performed. Replacing synonymous keywords makes it easier for the device to extract word segments and also facilitates comparison between different documents. For example, suppose the first document contains the sentence "I am proficient in front-end development" and the second document contains the sentence "I am good at front-end development". Here "proficient in" and "good at" are synonyms and the two sentences have the same meaning; without synonym replacement, there is some risk that the device recognizes the two sentences as having different meanings.
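A minimal sketch of such keyword replacement, assuming a simple lookup table, is shown below. The table `SYNONYMS` and the function name are hypothetical; the application does not specify how synonyms are stored or matched.

```python
from typing import Dict

# Hypothetical synonym table: each variant phrase maps to one canonical phrase.
SYNONYMS: Dict[str, str] = {"good at": "proficient in"}


def normalize_keywords(text: str, synonyms: Dict[str, str] = SYNONYMS) -> str:
    """Replace every known variant with its canonical phrase.

    Longer variants are replaced first so that a short variant cannot
    break up a longer one that contains it.
    """
    for variant in sorted(synonyms, key=len, reverse=True):
        text = text.replace(variant, synonyms[variant])
    return text


first = normalize_keywords("I am proficient in front-end development")
second = normalize_keywords("I am good at front-end development")
print(first == second)
```

After normalization the two example sentences are identical, so the later feature extraction sees the same wording in both documents.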
It should be noted that the first document and the second document above may be the answer sheets of two different job applicants, the answer sheets of two different examinees in an examination, or two comparable documents in some other scenario.
By implementing the above method, word features are extracted sentence by sentence, so that a separate feature vector is generated for each sentence. This preserves the important semantics of each sentence as far as possible, so that the semantic vector generated later better reflects the semantics of the document.
The method of the embodiments of the present application has been described in detail above; the apparatus of the embodiments of the present application is described below.
Refer to FIG. 2, which is a schematic structural diagram of a device 20 provided by an embodiment of the present application. The device 20 may include a first extraction unit 201, a first training unit 202, a first generation unit 203, a first decoding unit 204, and a comparison unit 205. Each unit is described in detail below.
The first extraction unit 201 is configured to extract text features from multiple sentences in a first document through the word segmentation algorithm in the autoencoder model to form multiple first vectors, where the text features in each sentence form one first vector;
The first training unit 202 is configured to train the multiple first vectors through the attention network in the autoencoder model to obtain an attention weight for each of the multiple first vectors;
The first generation unit 203 is configured to input the multiple first vectors and the attention weight of each of the multiple first vectors into the long short-term memory (LSTM) network in the autoencoder model for training, to generate a first semantic vector;
The first decoding unit 204 is configured to decode the first semantic vector through the LSTM to obtain multiple first decoded vectors;
The comparison unit 205 is configured to, if the multiple first decoded vectors and the multiple first vectors satisfy the preset similarity condition, compare the first semantic vector with the second semantic vector of a second document to determine whether the target behavior exists.
By running the above units, word features are extracted sentence by sentence, so that a separate feature vector is generated for each sentence instead of a single feature vector being built from the word features of the entire document. This preserves the important semantics of each sentence as far as possible, so that the semantic vector generated later better reflects the semantics of the document. In addition, the encoding layer of the autoencoder model uses a CNN to extract word features; a CNN performs well at noise reduction and redundancy removal, so the extracted text features better reflect the semantics of the sentence itself. Furthermore, the attention network of the encoding layer trains an attention weight for each feature vector rather than for each word feature, which significantly reduces the training load for the attention weights, improves training efficiency, and makes the trained attention weights more informative. The encoding layer also uses an LSTM to generate the semantic vector, which describes the semantics of the document better.
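To make the idea of one attention weight per sentence vector concrete, the following is a minimal sketch under stated assumptions. The application does not give the form of the attention network; here each sentence vector is scored by a dot product with a hypothetical parameter vector `query` (which would be learned in practice) and the scores are normalized with a softmax. This is an illustrative stand-in, not the attention network of the application.

```python
import math
from typing import List, Sequence


def attention_weights(
    sentence_vectors: Sequence[Sequence[float]],
    query: Sequence[float],
) -> List[float]:
    """Return one weight per sentence vector; the weights are positive and sum to 1.

    `query` is a hypothetical parameter vector assumed for this sketch.
    """
    if not sentence_vectors:
        raise ValueError("at least one sentence vector is required")
    scores = [sum(a * b for a, b in zip(vec, query)) for vec in sentence_vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three sentence vectors yield three weights, regardless of how many
# word features went into each sentence vector.
weights = attention_weights([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], query=[1.0, 1.0])
print(len(weights))
```

Because the number of weights equals the number of sentences rather than the number of word features, far fewer weights have to be learned, which is the efficiency point made in the paragraph above.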
In a possible implementation, the device 20 further includes:
a second extraction unit, configured to extract text features from multiple sentences in the second document through the word segmentation algorithm in the autoencoder model to form multiple second vectors, where the text features in each sentence form one second vector;
a second training unit, configured to train the multiple second vectors through the attention network in the autoencoder model to obtain an attention weight for each of the multiple second vectors;
a second generation unit, configured to input the multiple second vectors and the attention weight of each of the multiple second vectors into the long short-term memory (LSTM) network in the autoencoder model for training, to generate the second semantic vector;
a second decoding unit, configured to decode the second semantic vector through the LSTM to obtain multiple second decoded vectors, where the multiple second decoded vectors and the multiple second vectors satisfy the preset similarity condition.
In yet another possible implementation, the comparison unit comparing the first semantic vector with the second semantic vector of the second document to determine whether the target behavior exists includes:
determining the cosine of the angle between the first semantic vector and the second semantic vector;
if the cosine value is greater than or equal to a preset threshold, determining that the target behavior exists.
In yet another possible implementation, the device 20 further includes:
an adjustment unit, configured to, before the first extraction unit extracts the text features from the multiple sentences in the first document through the word segmentation algorithm in the autoencoder model to form the multiple first vectors, adjust the parameters of at least one of the word segmentation algorithm, the attention network, and the LSTM in the autoencoder model, so that the output of the autoencoder model converges toward the input of the autoencoder model.
In yet another possible implementation, the first extraction unit being configured to extract text features from multiple sentences in the first document through the word segmentation algorithm in the autoencoder model to form multiple first vectors is specifically:
extracting the text features from the multiple sentences in the first document through a convolutional neural network (CNN) in the autoencoder model to form the multiple first vectors.
It should be noted that, for the implementation of each unit, reference may also be made to the corresponding description of the method embodiment shown in FIG. 1.
Refer to FIG. 3, which shows a device 30 provided by an embodiment of the present application. The device 30 includes a processor 301, a memory 302, and a communication interface 303, which are connected to each other through a bus.
The memory 302 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM). The memory 302 is used to store related instructions and data. The communication interface 303 is used to receive and send data.
The processor 301 may be one or more central processing units (CPUs). When the processor 301 is a single CPU, that CPU may be a single-core CPU or a multi-core CPU.
The processor 301 in the device 30 is configured to read the program code stored in the memory 302 and perform the following operations:
extracting text features from multiple sentences in a first document through the word segmentation algorithm in the autoencoder model to form multiple first vectors, where the text features in each sentence form one first vector;
training the multiple first vectors through the attention network in the autoencoder model to obtain an attention weight for each of the multiple first vectors;
inputting the multiple first vectors and the attention weight of each of the multiple first vectors into the long short-term memory (LSTM) network in the autoencoder model for training, to generate a first semantic vector;
decoding the first semantic vector through the LSTM to obtain multiple first decoded vectors;
if the multiple first decoded vectors and the multiple first vectors satisfy the preset similarity condition, comparing the first semantic vector with the second semantic vector of a second document to determine whether the target behavior exists.
By implementing the above method, word features are extracted sentence by sentence, so that a separate feature vector is generated for each sentence instead of a single feature vector being built from the word features of the entire document. This preserves the important semantics of each sentence as far as possible, so that the semantic vector generated later better reflects the semantics of the document. In addition, the encoding layer of the autoencoder model uses a CNN to extract word features; a CNN performs well at noise reduction and redundancy removal, so the extracted text features better reflect the semantics of the sentence itself. Furthermore, the attention network of the encoding layer trains an attention weight for each feature vector rather than for each word feature, which significantly reduces the training load for the attention weights, improves training efficiency, and makes the trained attention weights more informative. The encoding layer also uses an LSTM to generate the semantic vector, which describes the semantics of the document better.
In a possible implementation, before comparing the first semantic vector with the second semantic vector of the second document to determine whether the target behavior exists, the processor is further configured to perform the following:
extracting text features from multiple sentences in the second document through the word segmentation algorithm in the autoencoder model to form multiple second vectors, where the text features in each sentence form one second vector;
training the multiple second vectors through the attention network in the autoencoder model to obtain an attention weight for each of the multiple second vectors;
inputting the multiple second vectors and the attention weight of each of the multiple second vectors into the long short-term memory (LSTM) network in the autoencoder model for training, to generate the second semantic vector;
decoding the second semantic vector through the LSTM to obtain multiple second decoded vectors, where the multiple second decoded vectors and the multiple second vectors satisfy the preset similarity condition.
In yet another possible implementation, the processor comparing the first semantic vector with the second semantic vector of the second document to determine whether the target behavior exists is specifically:
determining the cosine of the angle between the first semantic vector and the second semantic vector;
if the cosine value is greater than or equal to a preset threshold, determining that the target behavior exists.
In yet another possible implementation, before extracting the text features from the multiple sentences in the first document through the word segmentation algorithm in the autoencoder model to form the multiple first vectors, the processor is further configured to perform the following:
adjusting the parameters of at least one of the word segmentation algorithm, the attention network, and the LSTM in the autoencoder model, so that the output of the autoencoder model converges toward the input of the autoencoder model.
In yet another possible implementation, the processor extracting text features from multiple sentences in the first document through the word segmentation algorithm in the autoencoder model to form multiple first vectors is specifically:
extracting the text features from the multiple sentences in the first document through a convolutional neural network (CNN) in the autoencoder model to form the multiple first vectors.
It should be noted that, for the implementation of each operation, reference may also be made to the corresponding description of the method embodiment shown in FIG. 1.
An embodiment of the present application further provides a computer-readable storage medium. The computer-readable storage medium stores instructions which, when run on a processor, implement the method flow shown in FIG. 1.
An embodiment of the present application further provides a non-volatile computer-readable storage medium. The non-volatile computer-readable storage medium stores instructions which, when run on a processor, implement the method flow shown in FIG. 1.
An embodiment of the present application further provides a computer program product. When the computer program product runs on a processor, the method flow shown in FIG. 1 is implemented.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.