WO2021051503A1 - Semantic representation model-based text classification method and apparatus, and computer device - Google Patents

Semantic representation model-based text classification method and apparatus, and computer device Download PDF

Info

Publication number
WO2021051503A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
word
sequence
embedding vector
layer
Prior art date
Application number
PCT/CN2019/116339
Other languages
French (fr)
Chinese (zh)
Inventor
邓悦
金戈
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021051503A1 publication Critical patent/WO2021051503A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/353 - Clustering; Classification into predefined classes

Definitions

  • This application relates to the computer field, and in particular to a text classification method, device, computer equipment and storage medium based on a semantic representation model.
  • Text classification is an important part of natural language processing, and text classification models are generally used for text classification.
  • the performance of the text classification model largely depends on its semantic representation model.
  • Common semantic representation models, such as models based on the word2vec algorithm or on a bidirectional LSTM network, only consider the word itself and/or its context.
  • in a professional question-and-answer scenario, such as a professional interview, this shows up as follows: the questions raised in the interview are specialized (professional vocabulary, professional relationship expressions, etc.) and often examine whether the candidate has a clear grasp of a certain concept or definition, that is, the questions have a knowledge background; traditional semantic representation models therefore cannot accurately reflect professional vocabulary and the relationships between professional terms (i.e. entities and entity relationships), and thus cannot accurately represent the input text, which reduces the accuracy of the final text classification.
  • the main purpose of this application is to provide a text classification method, device, computer equipment and storage medium based on a semantic representation model, aiming to improve the accuracy of text classification.
  • this application proposes a text classification method based on a semantic representation model, which includes the following steps:
  • the final text embedding vector sequence and the final entity embedding vector sequence are input into a preset classification model for processing to obtain a text classification result.
  • This application proposes a text classification device based on a semantic representation model, including:
  • the text acquisition unit is configured to acquire the input original text and preprocess the original text to obtain a word sequence, wherein the preprocessing includes at least sentence division and word division;
  • the first embedding calculation unit is configured to obtain, according to a preset word vector generation method, the correspondence between the position in the original text of the sentence to which the i-th word belongs and the sentence segmentation vector, and the correspondence between the position of the i-th word in the word sequence and the position vector, the word vector ai, the sentence segmentation vector bi and the position vector ci corresponding to the i-th word in the word sequence, and to calculate, according to the formula wi = ai + bi + ci, the text embedding vector wi corresponding to the i-th word, where the word vector ai, the sentence segmentation vector bi and the position vector ci have the same dimensions;
  • the first sequence generating unit is used to generate a text embedding vector sequence {w1, w2,..., wn}, wherein there are a total of n words in the word sequence;
  • the second sequence generating unit is used to input the word sequence into the preset knowledge embedding model to obtain the entity embedding vector sequence {e1, e2,..., en}, where en is the entity embedding vector corresponding to the n-th word;
  • the intermediate sequence generating unit is used to input the text embedding vector sequence into a preset M-layer word granularity encoder for calculation, so as to obtain the intermediate text embedding vector sequence output by the last layer of word granularity encoder; wherein the M-layer word granularity encoder and the preset N-layer knowledge granularity encoder are sequentially connected to form a semantic representation model, where M and N are both greater than or equal to 2;
  • the final sequence generating unit is used to input the intermediate text embedding vector sequence and the entity embedding vector sequence into the N-layer knowledge granularity encoder for calculation, so as to obtain the final text embedding vector sequence and the final entity embedding vector sequence output by the last layer of knowledge granularity encoder;
  • the text classification unit is used to input the final text embedding vector sequence and the final entity embedding vector sequence into a preset classification model for processing to obtain a text classification result.
  • the present application provides a computer device, including a memory and a processor, the memory stores a computer program, and the processor implements the steps of any one of the above methods when the computer program is executed.
  • the present application provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of any one of the above-mentioned methods are implemented.
  • FIG. 1 is a schematic flowchart of a text classification method based on a semantic representation model according to an embodiment of this application;
  • FIG. 2 is a schematic block diagram of the structure of a text classification device based on a semantic representation model according to an embodiment of the application;
  • FIG. 3 is a schematic block diagram of the structure of a computer device according to an embodiment of the application.
  • an embodiment of the present application provides a method for text classification based on a semantic representation model, including the following steps:
  • This application introduces the entity embedding vector sequence into the semantic representation model, so that the semantic representation model and the text classification model can handle more complex situations (for example, processing text containing professional vocabulary and the relationships between professional terms), improving the accuracy of the final text classification.
  • the input original text is obtained, and the original text is preprocessed to obtain a word sequence, wherein the preprocessing includes at least sentence division and word division.
  • the original text may include multiple sentences, and each sentence includes multiple words. Therefore, the word sequence is obtained by preprocessing including at least sentence division and word division.
  • sentence division and word division can use open-source segmentation tools, such as jieba, SnowNLP, etc.
  • the original text may be any feasible text, preferably text with designated words, wherein the designated words are knowledge nodes in a preset knowledge graph, and the designated words are professional vocabulary in a preset field.
  • the word vector generation method can adopt any feasible method, for example querying a preset word vector library to obtain the word vectors corresponding to the words in the word sequence, where the word vector library can be an existing database or can be obtained by training a collected corpus with, for example, the word2vec model; alternatively, the word vector generation method can be: before training the semantic representation model, initialize the word vector corresponding to each word to a random value, and then optimize it together with the other network parameters during training, thereby obtaining the word vector corresponding to each word. Since the text embedding vector wi is composed not only of the word vector ai but also of the sentence segmentation vector bi and the position vector ci, it also reflects the sentence position and word position of the i-th word.
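For concreteness, the following Python sketch shows one way the per-word text embedding wi = ai + bi + ci could be assembled from lookup tables; the dimension, table sizes and random initial values are illustrative assumptions, not values specified by the application.

```python
import numpy as np

# Illustrative assumptions: only the rule w_i = a_i + b_i + c_i comes from the text.
DIM = 128            # shared dimension of word, sentence-segmentation and position vectors
VOCAB_SIZE = 30000
MAX_SENTENCES = 16
MAX_POSITIONS = 512

rng = np.random.default_rng(0)
word_table = rng.normal(size=(VOCAB_SIZE, DIM))        # word vectors a_i
segment_table = rng.normal(size=(MAX_SENTENCES, DIM))  # sentence segmentation vectors b_i
position_table = rng.normal(size=(MAX_POSITIONS, DIM)) # position vectors c_i

def text_embedding_sequence(word_ids, sentence_ids):
    """Build the text embedding vector sequence {w1, ..., wn} with wi = ai + bi + ci."""
    columns = []
    for pos, (wid, sid) in enumerate(zip(word_ids, sentence_ids)):
        a = word_table[wid]        # word vector of the i-th word
        b = segment_table[sid]     # vector of the sentence the word belongs to
        c = position_table[pos]    # vector of the word's position in the word sequence
        columns.append(a + b + c)
    # One column per word, matching the "matrix with n columns" view in the text.
    return np.stack(columns, axis=1)

# Usage: three words, the first two in sentence 0 and the last in sentence 1.
W = text_embedding_sequence(word_ids=[12, 981, 7], sentence_ids=[0, 0, 1])
print(W.shape)  # (128, 3)
```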
  • a text embedding vector sequence {w1, w2,..., wn} is generated, wherein there are a total of n words in the word sequence.
  • the text embedding vector sequence {w1, w2,..., wn} is composed of the text embedding vectors corresponding to the n words; since each text embedding vector is written as a column vector, the text embedding vector sequence {w1, w2,..., wn} can also be treated as a matrix with n columns;
  • the word sequence is input into the preset knowledge embedding model to obtain the entity embedding vector sequence {e1, e2,..., en}, where en is the entity embedding vector corresponding to the n-th word.
  • the knowledge embedding model is, for example, the TransE model, which can extract the entities and relationships in a knowledge graph in the form of vectors; because the knowledge nodes and relationships in the knowledge graph are more specialized (a suitable knowledge graph can be selected for the target domain), the entity embedding vector corresponding to each word can be obtained.
  • the knowledge embedding model, such as the TransE model, is a conventional model and is not described in detail here. Further, if a word is not an entity, the entity embedding vector corresponding to that word is set to 0.
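A minimal sketch of the entity embedding lookup described above, assuming the entity vectors have already been produced by a TransE-style knowledge embedding model; the example entities and values are hypothetical.

```python
import numpy as np

DIM = 128  # assumed entity embedding dimension

# Hypothetical entity vectors as a TransE-style knowledge embedding model might
# produce them; real vectors would be trained on a domain knowledge graph.
entity_vectors = {
    "overfitting": np.full(DIM, 0.1),
    "regularization": np.full(DIM, 0.2),
}

def entity_embedding_sequence(words):
    """Return {e1, ..., en}; a word that is not an entity gets a zero vector."""
    return np.stack([entity_vectors.get(w, np.zeros(DIM)) for w in words], axis=1)

E = entity_embedding_sequence(["what", "is", "overfitting"])
print(E[:, 0].any(), E[:, 2].any())  # False True
```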
  • the text embedding vector sequence is input into the preset M-layer word granularity encoder for calculation, so as to obtain the intermediate text embedding vector sequence output by the last layer of word granularity encoder; wherein the M-layer word granularity encoder and the preset N-layer knowledge granularity encoder are sequentially connected to form a semantic representation model, where M and N are both greater than or equal to 2.
  • in step S6, the intermediate text embedding vector sequence and the entity embedding vector sequence are input into the N-layer knowledge granularity encoder for calculation, so as to obtain the final text embedding vector sequence and the final entity embedding vector sequence output by the last layer of knowledge granularity encoder.
  • the calculation process in the N-layer knowledge granularity encoder is, for example: input the intermediate text embedding vector sequence and the entity embedding vector sequence into the multi-head self-attention mechanism layer of the first-layer knowledge granularity encoder to obtain the first vector sequence and the second vector sequence; then input the first vector sequence and the second vector sequence into the information aggregation layer of the first-layer knowledge granularity encoder, so as to obtain the final text embedding vector mj and the final entity embedding vector pj corresponding to the j-th word, where the calculation formulas in the information aggregation layer are mj = gelu(W3hj + b3) and pj = gelu(W4hj + b4).
  • the final text embedding vector sequence and the final entity embedding vector sequence are input into a preset classification model for processing, and a text classification result is obtained.
  • the classification model may be any feasible classification model, such as a softmax classifier. Since the final text embedding vector sequence and the final entity embedding vector sequence utilize the entity embedding vector, the final text classification result is more suitable for professional situations and the classification is more accurate.
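The application only states that the two final sequences are fed into a preset classification model such as a softmax classifier; the mean-pooling and concatenation used below to turn them into classifier features are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def classify(final_text_seq, final_entity_seq, W_c, b_c):
    """Pool the two final sequences and score them with a softmax classifier."""
    # Mean-pool each sequence over its columns (one column per word) and
    # concatenate the two pooled vectors into a single feature vector.
    features = np.concatenate([final_text_seq.mean(axis=1),
                               final_entity_seq.mean(axis=1)])
    logits = W_c @ features + b_c
    probs = softmax(logits)
    return int(np.argmax(probs)), probs

# Usage with random stand-in inputs: dimension 8, 5 words, 3 classes.
rng = np.random.default_rng(0)
label, probs = classify(rng.normal(size=(8, 5)), rng.normal(size=(8, 5)),
                        W_c=rng.normal(size=(3, 16)), b_c=np.zeros(3))
```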
  • each layer of word granularity encoder is composed of a multi-head self-attention mechanism layer and a feedforward fully connected layer connected in sequence, and the step S5 of inputting the text embedding vector sequence into the preset M-layer word granularity encoder for calculation to obtain the intermediate text embedding vector sequence output by the last layer of word granularity encoder includes:
  • in the multi-head self-attention mechanism layer of the first-layer word granularity encoder, the text embedding vector sequence is respectively multiplied by the trained h first parameter matrix groups to obtain the first matrices {Q1, Q2,..., Qh}, the second matrices {K1, K2,..., Kh} and the third matrices {V1, V2,..., Vh}, where each first parameter matrix group includes three q×k first parameter matrices;
  • according to the formula Multihead({w1, w2,..., wn}) = Concat(head1, head2,..., headh)W, the multi-head self-attention matrix Multihead is calculated, where W is the preset second parameter matrix and the Concat function concatenates the matrices directly along the column direction;
  • the temporary text embedding vectors corresponding to all words are formed into a temporary text embedding vector sequence, and the temporary text embedding vector sequence is input into the next layer of word granularity encoder, until the intermediate text embedding vector sequence output by the last layer of word granularity encoder is obtained.
  • since each layer of word granularity encoder is composed of a multi-head self-attention mechanism layer and a feedforward fully connected layer connected in sequence, the relationships between words (contextual relationships) are captured.
  • the multi-head self-attention matrix is then input into the feedforward fully connected layer to obtain a temporary text embedding vector, and the temporary text embedding vectors corresponding to all words form a temporary text embedding vector sequence; the output of the first-layer word granularity encoder is therefore the temporary text embedding vector sequence. Since this application provides an M-layer word granularity encoder, the above calculation process is repeated to obtain the intermediate text embedding vector sequence output by the last layer of word granularity encoder.
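A sketch of one word granularity encoder layer under the formulas quoted above, Multihead = Concat(head1, ..., headh)W and FFN(x) = gelu(xW1 + b1)W2 + b2. Scaled dot-product attention is assumed for the per-head computation, since the published text shows that formula only as an image, and a row-per-word layout is used for convenience.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the gelu activation used in the feedforward layer
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def word_granularity_layer(X, heads, W, W1, b1, W2, b2):
    """One word-granularity encoder layer: multi-head self-attention then FFN.

    X     : (n, d) matrix, one row per word embedding (row layout chosen for
            convenience; the text treats the sequence as an n-column matrix)
    heads : list of h parameter groups (Wq, Wk, Wv), each matrix of shape (d, k)
    W     : second parameter matrix applied to Concat(head_1, ..., head_h)
    """
    outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        # Scaled dot-product attention is assumed for each head; the published
        # text only shows the per-head formula as an image.
        att = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        outputs.append(att @ V)
    multihead = np.concatenate(outputs, axis=-1) @ W   # Concat(head_1..head_h) W
    return gelu(multihead @ W1 + b1) @ W2 + b2         # FFN(x) = gelu(xW1 + b1)W2 + b2

# Usage with random parameters (illustrative only): n=5 words, d=16, h=2 heads, k=8.
rng = np.random.default_rng(0)
n, d, k, h = 5, 16, 8, 2
heads = [tuple(rng.normal(size=(d, k)) for _ in range(3)) for _ in range(h)]
out = word_granularity_layer(rng.normal(size=(n, d)), heads,
                             W=rng.normal(size=(h * k, d)),
                             W1=rng.normal(size=(d, 4 * d)), b1=np.zeros(4 * d),
                             W2=rng.normal(size=(4 * d, d)), b2=np.zeros(d))
print(out.shape)  # (5, 16)
```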
  • each layer of knowledge granularity encoder includes a multi-head self-attention mechanism layer and an information aggregation layer, and the step S6 of inputting the intermediate text embedding vector sequence and the entity embedding vector sequence into the N-layer knowledge granularity encoder to obtain the final text embedding vector sequence and the final entity embedding vector sequence output by the last layer of knowledge granularity encoder includes:
  • Each level of knowledge granularity encoder includes a multi-head self-attention mechanism layer and an information aggregation layer.
  • the calculation method of the multi-head self-attention mechanism layer is the same as that of the multi-head self-attention mechanism layer in the aforementioned word granularity encoder, but because the parameter matrices used are obtained by training, the parameter matrices can be different.
  • the information aggregation layer is used to obtain the final text embedding vector mj and the final entity embedding vector pj by using the activation function gelu.
  • the calculation formulas in the information aggregation layer are mj = gelu(W3hj + b3) and pj = gelu(W4hj + b4), from which the first text embedding vector sequence {m1, m2,..., mn} and the first entity embedding vector sequence {p1, p2,..., pn} output by the first-layer knowledge granularity encoder can be obtained.
  • the calculation process of the knowledge granularity encoder is repeated until the final text embedding vector sequence and the final entity embedding vector sequence output by the last layer of knowledge granularity encoder are obtained.
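A sketch of the information aggregation step of one knowledge granularity encoder layer. The formulas mj = gelu(W3hj + b3) and pj = gelu(W4hj + b4) follow the text; how the token and entity attention outputs are fused into hj is an assumption, since that formula appears only as an image in the original publication. A row-per-word layout is used for convenience.

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def information_aggregation(w_att, e_att, Wt, We, b5, W3, b3, W4, b4):
    """Information aggregation layer of one knowledge-granularity encoder layer.

    w_att : (n, d) token outputs of the layer's multi-head self-attention
    e_att : (n, d) entity outputs of the same attention layer
    """
    h = gelu(w_att @ Wt + e_att @ We + b5)  # assumed fusion of the two sequences
    m = gelu(h @ W3 + b3)                   # text embedding vectors m_j
    p = gelu(h @ W4 + b4)                   # entity embedding vectors p_j
    return m, p

# Usage with random stand-in parameters: n = 4 words, d = 8.
rng = np.random.default_rng(0)
n, d = 4, 8
m, p = information_aggregation(rng.normal(size=(n, d)), rng.normal(size=(n, d)),
                               Wt=rng.normal(size=(d, d)), We=rng.normal(size=(d, d)),
                               b5=np.zeros(d),
                               W3=rng.normal(size=(d, d)), b3=np.zeros(d),
                               W4=rng.normal(size=(d, d)), b4=np.zeros(d))
print(m.shape, p.shape)  # (4, 8) (4, 8)
```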
  • according to the formula: total loss function value = first loss function value + second loss function value, the total loss function value is calculated, and it is determined whether the total loss function value is greater than a preset loss function threshold;
  • training of the semantic representation model is thereby achieved.
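A toy sketch of the training control flow described above: the total loss is the sum of the first and second loss function values, and the model parameters are adjusted until the total loss falls below the preset threshold. The finite-difference gradient step and the quadratic stand-in losses are placeholders for the real optimizer and loss functions.

```python
import numpy as np

def numerical_gradient(loss_fn, params, eps=1e-6):
    """Finite-difference gradient; a stand-in for whatever optimizer is actually used."""
    grad = np.zeros_like(params)
    for i in range(params.size):
        step = np.zeros_like(params)
        step.flat[i] = eps
        grad.flat[i] = (loss_fn(params + step) - loss_fn(params - step)) / (2 * eps)
    return grad

def train_until_threshold(params, first_loss, second_loss, loss_threshold,
                          lr=0.1, max_steps=1000):
    """Adjust parameters until total loss = first loss + second loss drops below the threshold."""
    total_fn = lambda p: first_loss(p) + second_loss(p)
    total = total_fn(params)
    for _ in range(max_steps):
        if total <= loss_threshold:
            break
        params = params - lr * numerical_gradient(total_fn, params)
        total = total_fn(params)
    return params, total

# Toy usage with quadratic stand-in losses for the two encoder stacks.
params, total = train_until_threshold(np.array([2.0, -3.0]),
                                      first_loss=lambda p: float((p[0] - 1.0) ** 2),
                                      second_loss=lambda p: float((p[1] + 1.0) ** 2),
                                      loss_threshold=0.01)
print(round(total, 4))
```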
  • the step S42 of generating a training text embedding vector sequence corresponding to the training text according to a preset text embedding vector sequence generating method includes:
  • according to the preset word vector library, the correspondence between the position in the training text of the sentence to which the i-th word belongs and the sentence segmentation vector, and the correspondence between the position of the i-th word in the training word sequence and the position vector, the training word vector di, the training sentence segmentation vector fi and the training position vector gi corresponding to the i-th word in the training word sequence are correspondingly obtained;
  • the training text embedding vector sequence corresponding to the training text is generated according to the preset text embedding vector sequence generating method.
  • random words in the training text are replaced with mask marks, and the mask-marked training text is preprocessed to obtain the training word sequence; that is, training is performed through mask embedding, and the model is expected to predict the corresponding word at each mask mark based on the context. Since it is the semantic representation model that is being trained, the preprocessing method and the method of generating the training text embedding vector sequence are the same as the preprocessing method and the method of generating the text embedding vector sequence when the semantic representation model operates normally.
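A small sketch of the mask-marking step, assuming a "[MASK]" token and a 15% masking ratio; neither value is specified by the application.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask mark; the application only says "mask marks"

def mask_random_words(words, mask_ratio=0.15, seed=0):
    """Replace random words of a training text with mask marks.

    The 15% ratio is an assumption borrowed from common masked-language-model
    practice; the application does not state a specific ratio.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, word in enumerate(words):
        if rng.random() < mask_ratio:
            masked.append(MASK_TOKEN)
            targets[i] = word   # the model should predict this word from context
        else:
            masked.append(word)
    return masked, targets

masked, targets = mask_random_words(
    "gradient descent is an iterative optimization algorithm".split())
print(masked, targets)
```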
  • before the step S42 of generating a training text embedding vector sequence corresponding to the training text according to the preset text embedding vector sequence generation method, inputting the training text embedding vector sequence into the preset M-layer word granularity encoder for calculation to obtain the first sub-attention matrix output by the M-layer word granularity encoder, and then inputting the first sub-attention matrix into the preset first loss function to obtain the first loss function value, the method includes:
  • output of the designated answer sentence is thereby realized. Since this application is particularly suitable for the interview question-and-answer process in a professional context, the original text should be the candidate's answer to a question, and the text classification result is the analysis of that answer. Since it is an interview question-and-answer process, this application also obtains the designated answer sentence corresponding to the text classification result according to the preset correspondence between classification results and answer sentences, and outputs the designated answer sentence, completing the final interaction with the candidate in the question-and-answer process. The designated answer sentence is, for example, "Congratulations, you passed the interview", etc.
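A minimal sketch of the final interaction step, using a hypothetical mapping from classification results to designated answer sentences.

```python
# Hypothetical correspondence between classification results and answer sentences;
# real mappings would be configured per interview scenario.
answer_sentences = {
    "concept_mastered": "Congratulations, you passed the interview.",
    "concept_not_mastered": "Thank you for your time; we will be in touch.",
}

def respond(text_classification_result):
    """Output the designated answer sentence for a text classification result."""
    return answer_sentences.get(text_classification_result,
                                "Thank you for your answer.")

print(respond("concept_mastered"))
```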
  • an embodiment of the present application provides a text classification device based on a semantic representation model, including:
  • the text acquisition unit 10 is configured to acquire the input original text and preprocess the original text to obtain a word sequence, wherein the preprocessing includes at least sentence division and word division;
  • the first sequence generating unit 30 is configured to generate a text embedding vector sequence {w1, w2,..., wn}, wherein there are a total of n words in the word sequence;
  • the second sequence generating unit 40 is configured to input the word sequence into a preset knowledge embedding model to obtain an entity embedding vector sequence {e1, e2,..., en}, where en is the entity embedding vector corresponding to the n-th word;
  • the intermediate sequence generating unit 50 is configured to input the text embedding vector sequence into a preset M-layer word granularity encoder for calculation, so as to obtain the intermediate text embedding vector sequence output by the last layer of word granularity encoder; wherein the M-layer word granularity encoder and the preset N-layer knowledge granularity encoder are sequentially connected to form a semantic representation model, where M and N are both greater than or equal to 2;
  • the final sequence generating unit 60 is configured to input the intermediate text embedding vector sequence and the entity embedding vector sequence into the N-layer knowledge granularity encoder for calculation, so as to obtain the final text embedding vector sequence and the final entity embedding vector sequence output by the last layer of knowledge granularity encoder;
  • the text classification unit 70 is configured to input the final text embedding vector sequence and the final entity embedding vector sequence into a preset classification model for processing to obtain a text classification result.
  • each layer of word granularity encoder is composed of a multi-head self-attention mechanism layer and a feedforward fully connected layer connected in sequence, and the intermediate sequence generating unit 50 includes:
  • the matrix group calculation subunit is used to multiply, in the multi-head self-attention mechanism layer of the first-layer word granularity encoder, the text embedding vector sequence by the trained h first parameter matrix groups respectively, so as to obtain the first matrices {Q1, Q2,..., Qh}, the second matrices {K1, K2,..., Kh} and the third matrices {V1, V2,..., Vh}, where each first parameter matrix group includes three q×k first parameter matrices;
  • the first matrix acquisition subunit is used to calculate, according to the preset formula (shown as a formula image in the original publication), the z-th sub-attention matrix, where z is greater than or equal to 1 and less than or equal to h;
  • the second matrix acquisition subunit is used to calculate, according to the formula Multihead({w1, w2,..., wn}) = Concat(head1, head2,..., headh)W, the multi-head self-attention matrix Multihead, where W is the preset second parameter matrix and the Concat function concatenates the matrices directly along the column direction;
  • the intermediate sequence generation subunit is used to form the temporary text embedding vectors corresponding to all words into a temporary text embedding vector sequence, and to input the temporary text embedding vector sequence into the next layer of word granularity encoder, until the intermediate text embedding vector sequence output by the last layer of word granularity encoder is obtained.
  • each layer of knowledge granularity encoder includes a multi-head self-attention mechanism layer and an information aggregation layer, and the final sequence generating unit 60 includes:
  • the first acquisition subunit is used to input the intermediate text embedding vector sequence and the entity embedding vector sequence into the multi-head self-attention mechanism layer in the first-layer knowledge granularity encoder, thereby obtaining the first vector sequence and the second vector sequence;
  • the first calculation subunit is used to input the first vector sequence and the second vector sequence into the information aggregation layer in the first-layer knowledge granularity encoder, so as to obtain the final text embedding vector mj and the final entity embedding vector pj corresponding to the j-th word, where the calculation formulas in the information aggregation layer are mj = gelu(W3hj + b3) and pj = gelu(W4hj + b4);
  • the second calculation subunit is used to generate the first text embedding vector sequence {m1, m2,..., mn} and the first entity embedding vector sequence {p1, p2,..., pn}, and to input the first text embedding vector sequence and the first entity embedding vector sequence into the next layer of knowledge granularity encoder, until the final text embedding vector sequence and the final entity embedding vector sequence output by the last layer of knowledge granularity encoder are obtained.
  • the device includes:
  • the text calling unit is used to call the pre-collected training text
  • the first acquiring unit is configured to generate a training text embedding vector sequence corresponding to the training text according to a preset text embedding vector sequence generation method, input the training text embedding vector sequence into the preset M-layer word granularity encoder for calculation to obtain the first sub-attention matrix output by the M-layer word granularity encoder, and then input the first sub-attention matrix into the preset first loss function, thereby obtaining the first loss function value;
  • the second acquiring unit is configured to generate a training entity embedding vector sequence corresponding to the training text according to a preset entity embedding vector sequence generation method, input the training entity embedding vector sequence and the training text embedding vector sequence into the preset N-layer knowledge granularity encoder for calculation to obtain the second sub-attention matrix output by the N-layer knowledge granularity encoder, and then input the second sub-attention matrix into the preset second loss function, thereby obtaining the second loss function value;
  • the adjustment unit is configured to, if the total loss function value is greater than a preset loss function threshold, adjust the semantic representation model parameters so that the total loss function value is less than the loss function threshold.
  • the first acquiring unit includes:
  • the second acquisition subunit is used to replace random words in the training text with mask marks, and to preprocess the mask-marked training text to obtain a training word sequence, wherein the preprocessing includes at least sentence division and word division;
  • the third acquisition subunit is used to obtain, according to the preset word vector library, the correspondence between the position in the training text of the sentence to which the i-th word belongs and the sentence segmentation vector, and the correspondence between the position of the i-th word in the training word sequence and the position vector, the training word vector di, the training sentence segmentation vector fi and the training position vector gi corresponding to the i-th word in the training word sequence;
  • the fifth acquisition subunit is used to generate a training text embedding vector sequence {t1, t2,..., tn}, wherein there are a total of n words in the training word sequence.
  • the device includes:
  • An attention matrix, Xi is the first sub-attention matrix
  • the device includes:
  • the sentence acquisition unit is configured to acquire the designated answer sentence corresponding to the text classification result according to the preset correspondence between the classification result and the answer sentence;
  • the sentence output unit is used to output the specified answer sentence.
  • an embodiment of the present application also provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in the figure.
  • the computer device includes a processor, a memory, a network interface and a database connected through a system bus, where the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer equipment is used to store the data used in the text classification method based on the semantic representation model.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize a text classification method based on the semantic representation model.
  • the above-mentioned processor executes the above-mentioned semantic representation model-based text classification method, wherein the steps included in the method respectively correspond to the steps of executing the semantic representation model-based text classification method of the foregoing embodiment, and will not be repeated here.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored.
  • a computer program is executed by a processor to implement a text classification method based on a semantic representation model, wherein the steps included in the method correspond one-to-one to the steps of the text classification method based on the semantic representation model of the foregoing embodiment, and will not be repeated here.
  • the computer-readable storage medium is, for example, a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.

Abstract

A semantic representation model-based text classification method and apparatus, a computer device and a storage medium. The method comprises: acquiring inputted original text, and preprocessing the original text so as to obtain a word sequence; calculating a text embedding vector wi; generating a text embedding vector sequence {w1, w2,..., wn}; inputting the word sequence into a preset knowledge embedding model to acquire an entity embedding vector sequence {e1, e2,..., en}; inputting the text embedding vector sequence into an M-layer word granularity encoder for calculation to obtain an intermediate text embedding vector sequence; inputting the intermediate text embedding vector sequence and the entity embedding vector sequence into an N-layer knowledge granularity encoder for calculation to obtain a final text embedding vector sequence and a final entity embedding vector sequence; and inputting the final text embedding vector sequence and the final entity embedding vector sequence into a classification model to obtain a text classification result. Thus, the accuracy of text classification is improved.

Description

Text classification method, device and computer equipment based on semantic representation model

This application claims the priority of the Chinese patent application filed with the Chinese Patent Office on September 19, 2019, with application number 2019108866221 and the invention title "Text Classification Method, Apparatus and Computer Equipment Based on Semantic Representation Model", the entire content of which is incorporated into this application by reference.
Technical field

This application relates to the computer field, and in particular to a text classification method, apparatus, computer device and storage medium based on a semantic representation model.

Background technique

Text classification is an important part of natural language processing, and text classification models are generally used for text classification. The performance of a text classification model largely depends on its semantic representation model. Common semantic representation models, such as models based on the word2vec algorithm or on a bidirectional LSTM network, only consider the word itself and/or its context. In a professional question-and-answer scenario, such as a professional interview, this shows up as follows: the questions raised in the interview are specialized (professional vocabulary, professional relationship expressions, etc.) and often examine whether the candidate has a clear grasp of a certain concept or definition, that is, the questions have a knowledge background. Traditional semantic representation models therefore cannot accurately reflect professional vocabulary and the relationships between professional terms (i.e. entities and entity relationships), and thus cannot accurately represent the input text, which reduces the accuracy of the final text classification.
Technical problem

The main purpose of this application is to provide a text classification method, apparatus, computer device and storage medium based on a semantic representation model, aiming to improve the accuracy of text classification.

Technical solution

In order to achieve the above objective, this application proposes a text classification method based on a semantic representation model, which includes the following steps:

acquiring the input original text, and preprocessing the original text to obtain a word sequence, wherein the preprocessing includes at least sentence division and word division;

according to a preset word vector generation method, the correspondence between the position in the original text of the sentence to which the i-th word belongs and the sentence segmentation vector, and the correspondence between the position of the i-th word in the word sequence and the position vector, correspondingly obtaining the word vector ai, the sentence segmentation vector bi and the position vector ci corresponding to the i-th word in the word sequence, and calculating the text embedding vector wi corresponding to the i-th word according to the formula wi = ai + bi + ci, where the word vector ai, the sentence segmentation vector bi and the position vector ci have the same dimensions;

generating a text embedding vector sequence {w1, w2,..., wn}, where there are n words in total in the word sequence;

inputting the word sequence into a preset knowledge embedding model to obtain an entity embedding vector sequence {e1, e2,..., en}, where en is the entity embedding vector corresponding to the n-th word;

inputting the text embedding vector sequence into a preset M-layer word granularity encoder for calculation, thereby obtaining the intermediate text embedding vector sequence output by the last layer of word granularity encoder, where the M-layer word granularity encoder and a preset N-layer knowledge granularity encoder are sequentially connected to form the semantic representation model, and M and N are both greater than or equal to 2;

inputting the intermediate text embedding vector sequence and the entity embedding vector sequence into the N-layer knowledge granularity encoder for calculation, thereby obtaining the final text embedding vector sequence and the final entity embedding vector sequence output by the last layer of knowledge granularity encoder;

inputting the final text embedding vector sequence and the final entity embedding vector sequence into a preset classification model for processing to obtain a text classification result.
This application proposes a text classification device based on a semantic representation model, including:

a text acquisition unit, configured to acquire the input original text and preprocess the original text to obtain a word sequence, wherein the preprocessing includes at least sentence division and word division;

a first embedding calculation unit, configured to obtain, according to a preset word vector generation method, the correspondence between the position in the original text of the sentence to which the i-th word belongs and the sentence segmentation vector, and the correspondence between the position of the i-th word in the word sequence and the position vector, the word vector ai, the sentence segmentation vector bi and the position vector ci corresponding to the i-th word in the word sequence, and to calculate the text embedding vector wi corresponding to the i-th word according to the formula wi = ai + bi + ci, where the word vector ai, the sentence segmentation vector bi and the position vector ci have the same dimensions;

a first sequence generating unit, configured to generate a text embedding vector sequence {w1, w2,..., wn}, where there are n words in total in the word sequence;

a second sequence generating unit, configured to input the word sequence into a preset knowledge embedding model to obtain an entity embedding vector sequence {e1, e2,..., en}, where en is the entity embedding vector corresponding to the n-th word;

an intermediate sequence generating unit, configured to input the text embedding vector sequence into a preset M-layer word granularity encoder for calculation, so as to obtain the intermediate text embedding vector sequence output by the last layer of word granularity encoder, where the M-layer word granularity encoder and a preset N-layer knowledge granularity encoder are sequentially connected to form the semantic representation model, and M and N are both greater than or equal to 2;

a final sequence generating unit, configured to input the intermediate text embedding vector sequence and the entity embedding vector sequence into the N-layer knowledge granularity encoder for calculation, so as to obtain the final text embedding vector sequence and the final entity embedding vector sequence output by the last layer of knowledge granularity encoder;

a text classification unit, configured to input the final text embedding vector sequence and the final entity embedding vector sequence into a preset classification model for processing to obtain a text classification result.
This application provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of any one of the above methods when executing the computer program.

This application provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of any one of the above methods are implemented.
Beneficial effects

The text classification method, device, computer equipment and storage medium based on a semantic representation model of this application acquire the input original text and preprocess it to obtain a word sequence; obtain the word vector ai, the sentence segmentation vector bi and the position vector ci, and calculate the vector wi according to the formula wi = ai + bi + ci; generate a text embedding vector sequence {w1, w2,..., wn}; input the word sequence into a preset knowledge embedding model to obtain an entity embedding vector sequence {e1, e2,..., en}; input the text embedding vector sequence into the preset M-layer word granularity encoder for calculation to obtain an intermediate text embedding vector sequence; input the intermediate text embedding vector sequence and the entity embedding vector sequence into the N-layer knowledge granularity encoder for calculation to obtain the final text embedding vector sequence and the final entity embedding vector sequence; and input the final text embedding vector sequence and the final entity embedding vector sequence into a preset classification model for processing to obtain a text classification result. The entity embedding vectors are thus introduced into the classification process, which improves the accuracy of text classification.
Description of the drawings

FIG. 1 is a schematic flowchart of a text classification method based on a semantic representation model according to an embodiment of this application;

FIG. 2 is a schematic structural block diagram of a text classification device based on a semantic representation model according to an embodiment of this application;

FIG. 3 is a schematic structural block diagram of a computer device according to an embodiment of this application.
The best implementation of this application

Referring to FIG. 1, an embodiment of this application provides a text classification method based on a semantic representation model, including the following steps:

S1. Acquire the input original text, and preprocess the original text to obtain a word sequence, where the preprocessing includes at least sentence division and word division;

S2. According to a preset word vector generation method, the correspondence between the position in the original text of the sentence to which the i-th word belongs and the sentence segmentation vector, and the correspondence between the position of the i-th word in the word sequence and the position vector, correspondingly obtain the word vector ai, the sentence segmentation vector bi and the position vector ci corresponding to the i-th word in the word sequence, and calculate the text embedding vector wi corresponding to the i-th word according to the formula wi = ai + bi + ci, where the word vector ai, the sentence segmentation vector bi and the position vector ci have the same dimensions;

S3. Generate a text embedding vector sequence {w1, w2,..., wn}, where there are n words in total in the word sequence;

S4. Input the word sequence into a preset knowledge embedding model to obtain an entity embedding vector sequence {e1, e2,..., en}, where en is the entity embedding vector corresponding to the n-th word;

S5. Input the text embedding vector sequence into a preset M-layer word granularity encoder for calculation, so as to obtain the intermediate text embedding vector sequence output by the last layer of word granularity encoder, where the M-layer word granularity encoder and a preset N-layer knowledge granularity encoder are sequentially connected to form the semantic representation model, and M and N are both greater than or equal to 2;

S6. Input the intermediate text embedding vector sequence and the entity embedding vector sequence into the N-layer knowledge granularity encoder for calculation, so as to obtain the final text embedding vector sequence and the final entity embedding vector sequence output by the last layer of knowledge granularity encoder;

S7. Input the final text embedding vector sequence and the final entity embedding vector sequence into a preset classification model for processing to obtain a text classification result.

By introducing the entity embedding vector sequence into the semantic representation model, this application enables the semantic representation model and the text classification model to handle more complex situations (for example, processing text containing professional vocabulary and the relationships between professional terms), thereby improving the accuracy of the final text classification.
As described in step S1 above, the input original text is acquired and preprocessed to obtain a word sequence, where the preprocessing includes at least sentence division and word division. The original text may include multiple sentences, and each sentence includes multiple words; the word sequence is therefore obtained by preprocessing that includes at least sentence division and word division. Sentence division and word division can use open-source segmentation tools, such as jieba, SnowNLP, etc. The original text can be any feasible text, preferably text containing designated words, where the designated words are knowledge nodes in a preset knowledge graph and are professional vocabulary in a preset field.

As described in step S2 above, according to a preset word vector generation method, the correspondence between the position in the original text of the sentence to which the i-th word belongs and the sentence segmentation vector, and the correspondence between the position of the i-th word in the word sequence and the position vector, the word vector ai, the sentence segmentation vector bi and the position vector ci corresponding to the i-th word in the word sequence are obtained, and the text embedding vector wi corresponding to the i-th word is calculated according to the formula wi = ai + bi + ci, where the word vector ai, the sentence segmentation vector bi and the position vector ci have the same dimensions. The word vector generation method can adopt any feasible method, for example querying a preset word vector library to obtain the word vectors corresponding to the words in the word sequence, where the word vector library can be an existing database or can be obtained by training a collected corpus with, for example, the word2vec model; alternatively, the word vector corresponding to each word can be initialized to a random value before the semantic representation model is trained and then optimized together with the other network parameters during training. Since the text embedding vector wi is composed not only of the word vector ai but also of the sentence segmentation vector bi and the position vector ci, it also reflects the sentence position and word position of the i-th word.

As described in step S3 above, the text embedding vector sequence {w1, w2,..., wn} is generated, where there are n words in total in the word sequence. The text embedding vector sequence {w1, w2,..., wn} is composed of the text embedding vectors corresponding to the n words; since each text embedding vector is written as a column vector, the text embedding vector sequence {w1, w2,..., wn} can also be treated as a matrix with n columns.

As described in step S4 above, the word sequence is input into the preset knowledge embedding model to obtain the entity embedding vector sequence {e1, e2,..., en}, where en is the entity embedding vector corresponding to the n-th word. The knowledge embedding model is, for example, the TransE model, which can extract the entities and relationships in a knowledge graph in the form of vectors; because the knowledge nodes and relationships in the knowledge graph are more specialized (a suitable knowledge graph can be selected for the target domain), the entity embedding vector corresponding to each word can be obtained. The knowledge embedding model, such as the TransE model, is a conventional model and is not described in detail here. Further, if a word is not an entity, the entity embedding vector corresponding to that word is set to 0.
As described in step S5 above, the text embedding vector sequence is input into the preset M-layer word granularity encoder for calculation, so as to obtain the intermediate text embedding vector sequence output by the last layer of word granularity encoder; the M-layer word granularity encoder and the preset N-layer knowledge granularity encoder are sequentially connected to form the semantic representation model, where M and N are both greater than or equal to 2. The calculation process in the M-layer word granularity encoder is, for example: in the multi-head self-attention mechanism layer of the first-layer word granularity encoder, multiply the text embedding vector sequence by the trained h first parameter matrix groups respectively, so as to obtain the first matrices {Q1, Q2,..., Qh}, the second matrices {K1, K2,..., Kh} and the third matrices {V1, V2,..., Vh}, where each first parameter matrix group includes three q×k first parameter matrices; calculate the z-th sub-attention matrix according to the per-head formula (shown as formula image PCTCN2019116339-appb-000001 in the original publication), where z is greater than or equal to 1 and less than or equal to h; calculate the multi-head self-attention matrix Multihead according to the formula Multihead({w1, w2,..., wn}) = Concat(head1, head2,..., headh)W, where W is the preset second parameter matrix and the Concat function concatenates the matrices directly along the column direction; input the multi-head self-attention matrix into the feedforward fully connected layer to obtain the temporary text embedding vector FFN(x), where the calculation formula in the feedforward fully connected layer is FFN(x) = gelu(xW1 + b1)W2 + b2, x is the multi-head self-attention matrix, W1 and W2 are preset parameter matrices, and b1 and b2 are preset bias values; form the temporary text embedding vectors corresponding to all words into a temporary text embedding vector sequence, and input the temporary text embedding vector sequence into the next layer of word granularity encoder, until the intermediate text embedding vector sequence output by the last layer of word granularity encoder is obtained.
As described in step S6 above, the intermediate text embedding vector sequence and the entity embedding vector sequence are input into the N-layer knowledge granularity encoder for calculation, so as to obtain the final text embedding vector sequence and the final entity embedding vector sequence output by the last layer of knowledge granularity encoder. The calculation process in the N-layer knowledge granularity encoder is, for example: input the intermediate text embedding vector sequence and the entity embedding vector sequence into the multi-head self-attention mechanism layer of the first-layer knowledge granularity encoder, thereby obtaining the first vector sequence and the second vector sequence (their notation is given as formula images PCTCN2019116339-appb-000002 and PCTCN2019116339-appb-000003 in the original publication); input the first vector sequence and the second vector sequence into the information aggregation layer of the first-layer knowledge granularity encoder, so as to obtain the final text embedding vector mj and the final entity embedding vector pj corresponding to the j-th word, where the calculation formulas in the information aggregation layer are mj = gelu(W3hj + b3) and pj = gelu(W4hj + b4), W3, W4 and the parameter matrices shown in formula images PCTCN2019116339-appb-000004 and PCTCN2019116339-appb-000005 of the original publication are preset parameter matrices, and b3, b4 and b5 are preset bias values; generate the first text embedding vector sequence {m1, m2,..., mn} and the first entity embedding vector sequence {p1, p2,..., pn}, and input the first text embedding vector sequence and the first entity embedding vector sequence into the next layer of knowledge granularity encoder, until the final text embedding vector sequence and the final entity embedding vector sequence output by the last layer of knowledge granularity encoder are obtained.
As described in step S7 above, the final text embedding vector sequence and the final entity embedding vector sequence are input into a preset classification model for processing to obtain a text classification result. The classification model can be any feasible classification model, such as a softmax classifier. Since the final text embedding vector sequence and the final entity embedding vector sequence make use of the entity embedding vectors, the final text classification result is better suited to professional scenarios and the classification is more accurate.
在一个实施方式中,每一层词粒度编码器由一个多头自注意力机制层和一个前馈全连接层顺序连接构成,所述将所述文本嵌入向量序列输入到预设的M层词粒度编码器中进行计算,从而得到最后一层词粒度编码器输出的中间文本嵌入向量序列的步骤S5,包括:In one embodiment, each layer of word granularity encoder is composed of a multi-head self-attention mechanism layer and a feedforward fully connected layer connected in sequence, and the text embedding vector sequence is input to the preset M layer word granularity. The step S5 of calculating in the encoder to obtain the intermediate text embedding vector sequence output by the last layer of word granularity encoder includes:
S501. In the multi-head self-attention mechanism layer of the first-layer word granularity encoder, multiply the text embedding vector sequence by the h trained first parameter matrix groups respectively, so as to obtain a first matrix {Q1, Q2, …, Qh}, a second matrix {K1, K2, …, Kh} and a third matrix {V1, V2, …, Vh}, where each first parameter matrix group includes three q×k first parameter matrices;
S502. Calculate the z-th sub-attention matrix headz from Qz, Kz and Vz according to the per-head attention formula (the explicit formula is rendered as an image in the original filing), where z is greater than or equal to 1 and less than or equal to h;
S503. Calculate the multi-head self-attention matrix Multihead according to the formula Multihead({w1, w2, …, wn}) = Concat(head1, head2, …, headh)·W, where W is a preset second parameter matrix and the Concat function means splicing the matrices directly in the column direction;
S504. Input the multi-head self-attention matrix into the feedforward fully connected layer, so as to obtain a temporary text embedding vector FFN(x), where the calculation formula in the feedforward fully connected layer is FFN(x) = gelu(x·W1 + b1)·W2 + b2, in which x is the multi-head self-attention matrix, W1 and W2 are preset parameter matrices, and b1 and b2 are preset bias values;
S505. Compose the temporary text embedding vectors corresponding to all the words into a temporary text embedding vector sequence, and input the temporary text embedding vector sequence into the next layer of word granularity encoder, until the intermediate text embedding vector sequence output by the last layer of word granularity encoder is obtained.
As described above, the intermediate text embedding vector sequence output by the last layer of word granularity encoder is obtained. Since each layer of word granularity encoder is composed of a multi-head self-attention mechanism layer and a feedforward fully connected layer connected in sequence, the relationship between words (the context relationship) is reflected. In order to improve the performance of self-attention, this application calculates the multi-head self-attention matrix Multihead according to the formula Multihead({w1, w2, …, wn}) = Concat(head1, head2, …, headh)·W, where W is a preset second parameter matrix and the Concat function splices the matrices directly in the column direction into a combined matrix, which is then multiplied by the second parameter matrix W to obtain the multi-head self-attention matrix, thereby improving the performance of self-attention (multiple self-attention groups are used). The multi-head self-attention matrix is then input into the feedforward fully connected layer to obtain a temporary text embedding vector, and the temporary text embedding vectors corresponding to all the words are composed into a temporary text embedding vector sequence. The output of the first-layer word granularity encoder is therefore the temporary text embedding vector sequence. Since this application is provided with M layers of word granularity encoders, repeating the above calculation process yields the intermediate text embedding vector sequence output by the last layer of word granularity encoder.
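The following is a minimal NumPy sketch of one word granularity encoder layer as described in steps S501 to S505. The scaled dot-product form of the per-head attention is an assumption (the exact per-head formula appears only as an image in the original filing), and all parameter names and shapes are illustrative.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit used in the formulas
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def word_granularity_layer(w_seq, heads, W, W1, b1, W2, b2):
    """One word granularity encoder layer (steps S501-S505).

    w_seq : (n, q) matrix whose rows are the text embedding vectors {w1..wn}
    heads : h triples (Wq, Wk, Wv) of q x k first parameter matrices
    W     : second parameter matrix applied to the concatenated heads
    """
    head_outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = w_seq @ Wq, w_seq @ Wk, w_seq @ Wv               # S501
        # S502: per-head attention; the scaled dot-product form below is assumed
        head_outputs.append(softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V)
    multihead = np.concatenate(head_outputs, axis=-1) @ W          # S503: Concat(...)W
    return gelu(multihead @ W1 + b1) @ W2 + b2                     # S504: FFN(x)
```

Feeding each layer's output into the next layer and repeating this M times (step S505) yields the intermediate text embedding vector sequence.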
In one embodiment, each layer of knowledge granularity encoder includes a multi-head self-attention mechanism layer and an information aggregation layer, and the step S6 of inputting the intermediate text embedding vector sequence and the entity embedding vector sequence into the N-layer knowledge granularity encoder for calculation, so as to obtain the final text embedding vector sequence and the final entity embedding vector sequence output by the last layer of knowledge granularity encoder, includes:
S601. Input the intermediate text embedding vector sequence and the entity embedding vector sequence into the multi-head self-attention mechanism layer in the first-layer knowledge granularity encoder, so as to obtain a first vector sequence and a second vector sequence;
S602. Input the first vector sequence and the second vector sequence into the information aggregation layer in the first-layer knowledge granularity encoder, so as to obtain the final text embedding vector mj and the final entity embedding vector pj corresponding to the j-th word, where the calculation formulas in the information aggregation layer are: mj = gelu(W3·hj + b3); pj = gelu(W4·hj + b4); in which W3 and W4 (together with the remaining aggregation-layer parameter matrices) are preset parameter matrices, and b3, b4 and b5 are preset bias values;
S603. Generate a first text embedding vector sequence {m1, m2, …, mn} and a first entity embedding vector sequence {p1, p2, …, pn}, and input the first text embedding vector sequence and the first entity embedding vector sequence into the next layer of knowledge granularity encoder, until the final text embedding vector sequence and the final entity embedding vector sequence output by the last layer of knowledge granularity encoder are obtained.
As described above, the final text embedding vector sequence and the final entity embedding vector sequence output by the last layer of knowledge granularity encoder are obtained. Each layer of knowledge granularity encoder includes a multi-head self-attention mechanism layer and an information aggregation layer, where the calculation method of the multi-head self-attention mechanism layer may be the same as that of the multi-head self-attention mechanism layer in the aforementioned word granularity encoder; however, since the parameter matrices are obtained through training, the parameter matrices may differ. The information aggregation layer uses the activation function gelu to obtain the final text embedding vector mj and the final entity embedding vector pj, with the calculation formulas mj = gelu(W3·hj + b3) and pj = gelu(W4·hj + b4), where the matrices involved are preset parameter matrices and b3, b4 and b5 are preset bias values. The first text embedding vector sequence {m1, m2, …, mn} and the first entity embedding vector sequence {p1, p2, …, pn} output by the first-layer knowledge granularity encoder are thus obtained. The calculation process of the knowledge granularity encoder is repeated until the last layer of knowledge granularity encoder outputs the final text embedding vector sequence and the final entity embedding vector sequence.
In one embodiment, before the step S5 of inputting the text embedding vector sequence into the preset M-layer word granularity encoder for calculation, so as to obtain the intermediate text embedding vector sequence output by the last layer of word granularity encoder, where the M-layer word granularity encoder and the preset N-layer knowledge granularity encoder are sequentially connected to form the semantic representation model, the method includes:
S41. Call pre-collected training text;
S42. Generate a training text embedding vector sequence corresponding to the training text according to a preset text embedding vector sequence generation method, input the training text embedding vector sequence into the preset M-layer word granularity encoder for calculation, so as to obtain a first sub-attention matrix output by the M-layer word granularity encoder, and then input the first sub-attention matrix into a preset first loss function, so as to obtain a first loss function value;
S43. Generate a training entity embedding vector sequence corresponding to the training text according to a preset entity embedding vector sequence generation method, input the training entity embedding vector sequence and the training text embedding vector sequence into the preset N-layer knowledge granularity encoder for calculation, so as to obtain a second sub-attention matrix output by the N-layer knowledge granularity encoder, and then input the second sub-attention matrix into a preset second loss function, so as to obtain a second loss function value;
S44. Calculate a total loss function value according to the formula: total loss function value = the first loss function value + the second loss function value, and determine whether the total loss function value is greater than a preset loss function threshold;
S45. If the total loss function value is greater than the preset loss function threshold, adjust the parameters of the semantic representation model so that the total loss function value becomes smaller than the loss function threshold.
As described above, training of the semantic representation model is realized. The M-layer word granularity encoder and the preset N-layer knowledge granularity encoder are sequentially connected to form the semantic representation model, so this application considers the first loss function and the second loss function together and trains the M-layer word granularity encoder and the N-layer knowledge granularity encoder at the same time. Accordingly, the total loss function value is set to the first loss function value plus the second loss function value, and it is determined whether the total loss function value is greater than the preset loss function threshold. Since the total loss function measures how far the output deviates from the expectation, a small total loss function value indicates that the semantic representation model fits the current training data; otherwise the parameters need to be adjusted. Therefore, if the total loss function value is greater than the preset loss function threshold, the parameters of the semantic representation model are adjusted so that the total loss function value becomes smaller than the loss function threshold.
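A minimal sketch of this joint training criterion is shown below; the callables compute_losses and adjust_parameters are illustrative placeholders for the actual loss evaluation over a training batch and the parameter update, and max_rounds is an assumed safeguard not mentioned in the text.

```python
def fit(compute_losses, adjust_parameters, loss_threshold, max_rounds=100):
    # compute_losses() returns (first loss function value, second loss function value);
    # adjust_parameters(total) updates the semantic representation model parameters.
    total = float("inf")
    for _ in range(max_rounds):
        loss1, loss2 = compute_losses()
        total = loss1 + loss2              # total loss function value = loss1 + loss2
        if total <= loss_threshold:        # the model fits the current training data
            break
        adjust_parameters(total)           # otherwise adjust the model parameters
    return total
```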
In one embodiment, the step S42 of generating a training text embedding vector sequence corresponding to the training text according to the preset text embedding vector sequence generation method includes:
S421. Replace random words in the training text with mask marks, and preprocess the mask-marked training text, so as to obtain a training word sequence, where the preprocessing includes at least sentence division and word division;
S422. According to a preset word vector library, the correspondence between the position, in the training text, of the sentence to which the i-th word belongs and sentence segmentation vectors, and the correspondence between the position of the i-th word in the training word sequence and position vectors, correspondingly obtain a training word vector di, a training sentence segmentation vector fi and a training position vector gi corresponding to the i-th word in the training word sequence;
S423. Calculate, according to the formula ti = di + fi + gi, the training text embedding vector ti corresponding to the i-th word, where the training word vector di, the training sentence segmentation vector fi and the training position vector gi have the same dimension;
S424. Generate a training text embedding vector sequence {t1, t2, …, tn}, where there are n words in total in the training word sequence.
As described above, the training text embedding vector sequence corresponding to the training text is generated according to the preset text embedding vector sequence generation method. Random words in the training text are replaced with mask marks, and the mask-marked training text is preprocessed to obtain the training word sequence; that is, training is performed with mask embedding, in the expectation that the model can predict the words at the mask marks from the context. Since it is the semantic representation model that is being trained, the preprocessing method and the method of generating the training text embedding vector sequence are the same as the preprocessing method and the method of generating the text embedding vector sequence when the semantic representation model operates normally.
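The sketch below illustrates steps S421 to S424 under stated assumptions: the word-vector library, sentence segmentation vectors and position vectors are passed in as simple lookup structures, the masking probability of 0.15 and the mask_vector used for masked positions are illustrative choices not specified in the text.

```python
import random

def build_training_embeddings(words, sentence_index, word_vectors, sentence_vectors,
                              position_vectors, mask_vector, mask_prob=0.15):
    """Steps S421-S424: mask random words, then form t_i = d_i + f_i + g_i."""
    t_sequence = []
    for i, word in enumerate(words):
        # S421: replace random words in the training text with a mask mark
        d_i = mask_vector if random.random() < mask_prob else word_vectors[word]
        f_i = sentence_vectors[sentence_index[i]]   # S422: sentence segmentation vector f_i
        g_i = position_vectors[i]                   # S422: position vector g_i
        t_sequence.append(d_i + f_i + g_i)          # S423: t_i = d_i + f_i + g_i (same dimension)
    return t_sequence                               # S424: {t1, t2, ..., tn}
```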
In one embodiment, before the step S42 of generating the training text embedding vector sequence corresponding to the training text according to the preset text embedding vector sequence generation method, inputting the training text embedding vector sequence into the preset M-layer word granularity encoder for calculation so as to obtain the first sub-attention matrix output by the M-layer word granularity encoder, and then inputting the first sub-attention matrix into the preset first loss function so as to obtain the first loss function value, the method includes:
S411. Set the first loss function to: LOSS1 = -∑ Yi·log Xi, where LOSS1 is the first loss function, Yi is the expected first sub-attention matrix corresponding to the training text, and Xi is the first sub-attention matrix;
S412. Set the second loss function to: LOSS2 = -∑ (Gi·log Hi + (1-Gi)·log(1-Hi)), where LOSS2 is the second loss function, Gi is the expected second sub-attention matrix corresponding to the training text, and Hi is the second sub-attention matrix.
As described above, the first loss function and the second loss function are set. A loss function measures the difference between the values generated from the training data and the expected values, and thus reflects whether the parameters of the model need to be adjusted. This application sets the first loss function to LOSS1 = -∑ Yi·log Xi, where Yi is the expected first sub-attention matrix corresponding to the training text and Xi is the first sub-attention matrix, and sets the second loss function to LOSS2 = -∑ (Gi·log Hi + (1-Gi)·log(1-Hi)), where Gi is the expected second sub-attention matrix corresponding to the training text and Hi is the second sub-attention matrix, so as to measure how far the first sub-attention matrix and the second sub-attention matrix deviate from their expected values.
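The two loss formulas can be transcribed directly into NumPy as shown below; the small eps term inside the logarithms is added only for numerical stability and is not part of the original formulas.

```python
import numpy as np

def loss1_value(Y, X, eps=1e-12):
    # LOSS1 = -sum(Yi * log Xi); Y is the expected first sub-attention matrix,
    # X the first sub-attention matrix output by the M-layer word granularity encoder.
    return -np.sum(Y * np.log(X + eps))

def loss2_value(G, H, eps=1e-12):
    # LOSS2 = -sum(Gi * log Hi + (1 - Gi) * log(1 - Hi)); G is the expected second
    # sub-attention matrix, H the one output by the N-layer knowledge granularity encoder.
    return -np.sum(G * np.log(H + eps) + (1.0 - G) * np.log(1.0 - H + eps))
```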
In one embodiment, after the step S7 of inputting the final text embedding vector sequence and the final entity embedding vector sequence into the preset classification model for processing to obtain the text classification result, the method includes:
S71. Obtain a designated answer sentence corresponding to the text classification result according to a preset correspondence between classification results and answer sentences;
S72. Output the designated answer sentence.
As described above, output of the designated answer sentence is realized. Since this application is particularly suitable for the interview question-and-answer process in professional contexts, the original text should be the interviewee's answer to a question, and the text classification result is the analysis of that answer. Because this is an interview question-and-answer process, this application further obtains the designated answer sentence corresponding to the text classification result according to the preset correspondence between classification results and answer sentences, and outputs the designated answer sentence, thereby completing the final interaction with the interviewee in the question-and-answer process. The designated answer sentence is, for example, "Congratulations, you have passed the interview."
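A minimal sketch of steps S71 and S72 is given below; the correspondence table contents, the default reply and the function name are illustrative assumptions.

```python
def designated_answer(classification_result, answer_table, default_reply):
    # Steps S71-S72: look up the answer sentence mapped to the classification result.
    return answer_table.get(classification_result, default_reply)

# Illustrative correspondence table for an interview question-and-answer flow:
answers = {"concept_mastered": "Congratulations, you have passed the interview.",
           "concept_not_mastered": "Thank you; let us move on to the next question."}
print(designated_answer("concept_mastered", answers, "Thank you for your answer."))
```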
Referring to Figure 2, an embodiment of the present application provides a text classification apparatus based on a semantic representation model, including:
a text acquisition unit 10, configured to acquire input original text and preprocess the original text to obtain a word sequence, where the preprocessing includes at least sentence division and word division;
a first embedding calculation unit 20, configured to, according to a preset word vector generation method, the correspondence between the position, in the original text, of the sentence to which the i-th word belongs and sentence segmentation vectors, and the correspondence between the position of the i-th word in the word sequence and position vectors, correspondingly acquire a word vector ai, a sentence segmentation vector bi and a position vector ci corresponding to the i-th word in the word sequence, and calculate, according to the formula wi = ai + bi + ci, the text embedding vector wi corresponding to the i-th word, where the word vector ai, the sentence segmentation vector bi and the position vector ci have the same dimension;
a first sequence generation unit 30, configured to generate a text embedding vector sequence {w1, w2, …, wn}, where there are n words in total in the word sequence;
a second sequence generation unit 40, configured to input the word sequence into a preset knowledge embedding model, so as to acquire an entity embedding vector sequence {e1, e2, …, en}, where en is the entity embedding vector corresponding to the n-th word;
an intermediate sequence generation unit 50, configured to input the text embedding vector sequence into a preset M-layer word granularity encoder for calculation, so as to obtain an intermediate text embedding vector sequence output by the last layer of word granularity encoder, where the M-layer word granularity encoder and a preset N-layer knowledge granularity encoder are sequentially connected to form a semantic representation model, and M and N are both greater than or equal to 2;
a final sequence generation unit 60, configured to input the intermediate text embedding vector sequence and the entity embedding vector sequence into the N-layer knowledge granularity encoder for calculation, so as to obtain a final text embedding vector sequence and a final entity embedding vector sequence output by the last layer of knowledge granularity encoder;
a text classification unit 70, configured to input the final text embedding vector sequence and the final entity embedding vector sequence into a preset classification model for processing to obtain a text classification result.
The operations respectively performed by the above units correspond one-to-one to the steps of the text classification method based on the semantic representation model of the foregoing embodiment, and are not repeated here.
In one embodiment, each layer of word granularity encoder is composed of a multi-head self-attention mechanism layer and a feedforward fully connected layer connected in sequence, and the intermediate sequence generation unit 50 includes:
a matrix group calculation subunit, configured to, in the multi-head self-attention mechanism layer of the first-layer word granularity encoder, multiply the text embedding vector sequence by the h trained first parameter matrix groups respectively, so as to obtain a first matrix {Q1, Q2, …, Qh}, a second matrix {K1, K2, …, Kh} and a third matrix {V1, V2, …, Vh}, where each first parameter matrix group includes three q×k first parameter matrices;
a first matrix acquisition subunit, configured to calculate the z-th sub-attention matrix headz from Qz, Kz and Vz according to the per-head attention formula, where z is greater than or equal to 1 and less than or equal to h;
a second matrix acquisition subunit, configured to calculate the multi-head self-attention matrix Multihead according to the formula Multihead({w1, w2, …, wn}) = Concat(head1, head2, …, headh)·W, where W is a preset second parameter matrix and the Concat function means splicing the matrices directly in the column direction;
a temporary vector acquisition subunit, configured to input the multi-head self-attention matrix into the feedforward fully connected layer, so as to obtain a temporary text embedding vector FFN(x), where the calculation formula in the feedforward fully connected layer is FFN(x) = gelu(x·W1 + b1)·W2 + b2, in which x is the multi-head self-attention matrix, W1 and W2 are preset parameter matrices, and b1 and b2 are preset bias values;
an intermediate sequence generation subunit, configured to compose the temporary text embedding vectors corresponding to all the words into a temporary text embedding vector sequence, and input the temporary text embedding vector sequence into the next layer of word granularity encoder, until the intermediate text embedding vector sequence output by the last layer of word granularity encoder is obtained.
The operations respectively performed by the above subunits correspond one-to-one to the steps of the text classification method based on the semantic representation model of the foregoing embodiment, and are not repeated here.
In one embodiment, each layer of knowledge granularity encoder includes a multi-head self-attention mechanism layer and an information aggregation layer, and the final sequence generation unit 60 includes:
a first acquisition subunit, configured to input the intermediate text embedding vector sequence and the entity embedding vector sequence into the multi-head self-attention mechanism layer in the first-layer knowledge granularity encoder, so as to obtain a first vector sequence and a second vector sequence;
a first calculation subunit, configured to input the first vector sequence and the second vector sequence into the information aggregation layer in the first-layer knowledge granularity encoder, so as to obtain the final text embedding vector mj and the final entity embedding vector pj corresponding to the j-th word, where the calculation formulas in the information aggregation layer are: mj = gelu(W3·hj + b3); pj = gelu(W4·hj + b4); in which W3 and W4 (together with the remaining aggregation-layer parameter matrices) are preset parameter matrices, and b3, b4 and b5 are preset bias values;
a second calculation subunit, configured to generate a first text embedding vector sequence {m1, m2, …, mn} and a first entity embedding vector sequence {p1, p2, …, pn}, and input the first text embedding vector sequence and the first entity embedding vector sequence into the next layer of knowledge granularity encoder, until the final text embedding vector sequence and the final entity embedding vector sequence output by the last layer of knowledge granularity encoder are obtained.
The operations respectively performed by the above subunits correspond one-to-one to the steps of the text classification method based on the semantic representation model of the foregoing embodiment, and are not repeated here.
In one embodiment, the apparatus includes:
a text calling unit, configured to call pre-collected training text;
a first acquisition unit, configured to generate a training text embedding vector sequence corresponding to the training text according to a preset text embedding vector sequence generation method, input the training text embedding vector sequence into the preset M-layer word granularity encoder for calculation so as to obtain a first sub-attention matrix output by the M-layer word granularity encoder, and then input the first sub-attention matrix into a preset first loss function so as to obtain a first loss function value;
a second acquisition unit, configured to generate a training entity embedding vector sequence corresponding to the training text according to a preset entity embedding vector sequence generation method, input the training entity embedding vector sequence and the training text embedding vector sequence into the preset N-layer knowledge granularity encoder for calculation so as to obtain a second sub-attention matrix output by the N-layer knowledge granularity encoder, and then input the second sub-attention matrix into a preset second loss function so as to obtain a second loss function value;
a third acquisition unit, configured to calculate a total loss function value according to the formula: total loss function value = the first loss function value + the second loss function value, and determine whether the total loss function value is greater than a preset loss function threshold;
an adjustment unit, configured to, if the total loss function value is greater than the preset loss function threshold, adjust the parameters of the semantic representation model so that the total loss function value becomes smaller than the loss function threshold.
The operations respectively performed by the above units correspond one-to-one to the steps of the text classification method based on the semantic representation model of the foregoing embodiment, and are not repeated here.
In one embodiment, the first acquisition unit includes:
a second acquisition subunit, configured to replace random words in the training text with mask marks, and preprocess the mask-marked training text, so as to obtain a training word sequence, where the preprocessing includes at least sentence division and word division;
a third acquisition subunit, configured to, according to a preset word vector library, the correspondence between the position, in the training text, of the sentence to which the i-th word belongs and sentence segmentation vectors, and the correspondence between the position of the i-th word in the training word sequence and position vectors, correspondingly acquire a training word vector di, a training sentence segmentation vector fi and a training position vector gi corresponding to the i-th word in the training word sequence;
a fourth acquisition subunit, configured to calculate, according to the formula ti = di + fi + gi, the training text embedding vector ti corresponding to the i-th word, where the training word vector di, the training sentence segmentation vector fi and the training position vector gi have the same dimension;
a fifth acquisition subunit, configured to generate a training text embedding vector sequence {t1, t2, …, tn}, where there are n words in total in the training word sequence.
The operations respectively performed by the above subunits correspond one-to-one to the steps of the text classification method based on the semantic representation model of the foregoing embodiment, and are not repeated here.
In one embodiment, the apparatus includes:
a first setting unit, configured to set the first loss function to: LOSS1 = -∑ Yi·log Xi, where LOSS1 is the first loss function, Yi is the expected first sub-attention matrix corresponding to the training text, and Xi is the first sub-attention matrix;
a second setting unit, configured to set the second loss function to: LOSS2 = -∑ (Gi·log Hi + (1-Gi)·log(1-Hi)), where LOSS2 is the second loss function, Gi is the expected second sub-attention matrix corresponding to the training text, and Hi is the second sub-attention matrix.
The operations respectively performed by the above units correspond one-to-one to the steps of the text classification method based on the semantic representation model of the foregoing embodiment, and are not repeated here.
In one embodiment, the apparatus includes:
a sentence acquisition unit, configured to acquire a designated answer sentence corresponding to the text classification result according to a preset correspondence between classification results and answer sentences;
a sentence output unit, configured to output the designated answer sentence.
The operations respectively performed by the above units correspond one-to-one to the steps of the text classification method based on the semantic representation model of the foregoing embodiment, and are not repeated here.
Referring to Figure 3, an embodiment of the present application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in the figure. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device is used to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store the data used by the text classification method based on the semantic representation model. The network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer program is executed by the processor, a text classification method based on the semantic representation model is implemented.
The above processor executes the above text classification method based on the semantic representation model, where the steps included in the method correspond one-to-one to the steps of the text classification method based on the semantic representation model of the foregoing embodiment, and are not repeated here.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, a text classification method based on the semantic representation model is implemented, where the steps included in the method correspond one-to-one to the steps of the text classification method based on the semantic representation model of the foregoing embodiment, and are not repeated here. The computer-readable storage medium is, for example, a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.

Claims (20)

  1. A text classification method based on a semantic representation model, comprising:
    acquiring input original text, and preprocessing the original text to obtain a word sequence, wherein the preprocessing comprises at least sentence division and word division;
    according to a preset word vector generation method, a correspondence between the position, in the original text, of the sentence to which the i-th word belongs and sentence segmentation vectors, and a correspondence between the position of the i-th word in the word sequence and position vectors, correspondingly acquiring a word vector ai, a sentence segmentation vector bi and a position vector ci corresponding to the i-th word in the word sequence, and calculating, according to the formula wi = ai + bi + ci, a text embedding vector wi corresponding to the i-th word, wherein the word vector ai, the sentence segmentation vector bi and the position vector ci have the same dimension;
    generating a text embedding vector sequence {w1, w2, …, wn}, wherein there are n words in total in the word sequence;
    inputting the word sequence into a preset knowledge embedding model, so as to acquire an entity embedding vector sequence {e1, e2, …, en}, wherein en is the entity embedding vector corresponding to the n-th word;
    inputting the text embedding vector sequence into a preset M-layer word granularity encoder for calculation, so as to obtain an intermediate text embedding vector sequence output by the last layer of word granularity encoder, wherein the M-layer word granularity encoder and a preset N-layer knowledge granularity encoder are sequentially connected to form a semantic representation model, and M and N are both greater than or equal to 2;
    inputting the intermediate text embedding vector sequence and the entity embedding vector sequence into the N-layer knowledge granularity encoder for calculation, so as to obtain a final text embedding vector sequence and a final entity embedding vector sequence output by the last layer of knowledge granularity encoder;
    inputting the final text embedding vector sequence and the final entity embedding vector sequence into a preset classification model for processing to obtain a text classification result.
  2. The text classification method based on a semantic representation model according to claim 1, wherein each layer of word granularity encoder is composed of a multi-head self-attention mechanism layer and a feedforward fully connected layer connected in sequence, and the step of inputting the text embedding vector sequence into the preset M-layer word granularity encoder for calculation, so as to obtain the intermediate text embedding vector sequence output by the last layer of word granularity encoder, comprises:
    in the multi-head self-attention mechanism layer of the first-layer word granularity encoder, multiplying the text embedding vector sequence by the h trained first parameter matrix groups respectively, so as to obtain a first matrix {Q1, Q2, …, Qh}, a second matrix {K1, K2, …, Kh} and a third matrix {V1, V2, …, Vh}, wherein each first parameter matrix group comprises three q×k first parameter matrices;
    calculating the z-th sub-attention matrix headz from Qz, Kz and Vz according to the per-head attention formula, wherein z is greater than or equal to 1 and less than or equal to h;
    calculating a multi-head self-attention matrix Multihead according to the formula Multihead({w1, w2, …, wn}) = Concat(head1, head2, …, headh)·W, wherein W is a preset second parameter matrix and the Concat function means splicing the matrices directly in the column direction;
    inputting the multi-head self-attention matrix into the feedforward fully connected layer, so as to obtain a temporary text embedding vector FFN(x), wherein the calculation formula in the feedforward fully connected layer is FFN(x) = gelu(x·W1 + b1)·W2 + b2, in which x is the multi-head self-attention matrix, W1 and W2 are preset parameter matrices, and b1 and b2 are preset bias values;
    composing the temporary text embedding vectors corresponding to all the words into a temporary text embedding vector sequence, and inputting the temporary text embedding vector sequence into the next layer of word granularity encoder, until the intermediate text embedding vector sequence output by the last layer of word granularity encoder is obtained.
  3. The text classification method based on a semantic representation model according to claim 1, wherein each layer of knowledge granularity encoder comprises a multi-head self-attention mechanism layer and an information aggregation layer, and the step of inputting the intermediate text embedding vector sequence and the entity embedding vector sequence into the N-layer knowledge granularity encoder for calculation, so as to obtain the final text embedding vector sequence and the final entity embedding vector sequence output by the last layer of knowledge granularity encoder, comprises:
    inputting the intermediate text embedding vector sequence and the entity embedding vector sequence into the multi-head self-attention mechanism layer in the first-layer knowledge granularity encoder, so as to obtain a first vector sequence and a second vector sequence;
    inputting the first vector sequence and the second vector sequence into the information aggregation layer in the first-layer knowledge granularity encoder, so as to obtain a final text embedding vector mj and a final entity embedding vector pj corresponding to the j-th word, wherein the calculation formulas in the information aggregation layer are:
    mj = gelu(W3·hj + b3); pj = gelu(W4·hj + b4); wherein W3 and W4 (together with the remaining aggregation-layer parameter matrices) are preset parameter matrices, and b3, b4 and b5 are preset bias values;
    generating a first text embedding vector sequence {m1, m2, …, mn} and a first entity embedding vector sequence {p1, p2, …, pn}, and inputting the first text embedding vector sequence and the first entity embedding vector sequence into the next layer of knowledge granularity encoder, until the final text embedding vector sequence and the final entity embedding vector sequence output by the last layer of knowledge granularity encoder are obtained.
  4. The text classification method based on a semantic representation model according to claim 1, wherein before the step of inputting the text embedding vector sequence into the preset M-layer word granularity encoder for calculation so as to obtain the intermediate text embedding vector sequence output by the last layer of word granularity encoder, wherein the M-layer word granularity encoder and the preset N-layer knowledge granularity encoder are sequentially connected to form the semantic representation model, the method comprises:
    calling pre-collected training text;
    generating a training text embedding vector sequence corresponding to the training text according to a preset text embedding vector sequence generation method, inputting the training text embedding vector sequence into the preset M-layer word granularity encoder for calculation so as to obtain a first sub-attention matrix output by the M-layer word granularity encoder, and then inputting the first sub-attention matrix into a preset first loss function so as to obtain a first loss function value;
    generating a training entity embedding vector sequence corresponding to the training text according to a preset entity embedding vector sequence generation method, inputting the training entity embedding vector sequence and the training text embedding vector sequence into the preset N-layer knowledge granularity encoder for calculation so as to obtain a second sub-attention matrix output by the N-layer knowledge granularity encoder, and then inputting the second sub-attention matrix into a preset second loss function so as to obtain a second loss function value;
    calculating a total loss function value according to the formula: total loss function value = the first loss function value + the second loss function value, and determining whether the total loss function value is greater than a preset loss function threshold;
    if the total loss function value is greater than the preset loss function threshold, adjusting the parameters of the semantic representation model so that the total loss function value becomes smaller than the loss function threshold.
  5. The text classification method based on a semantic representation model according to claim 4, wherein the step of generating the training text embedding vector sequence corresponding to the training text according to the preset text embedding vector sequence generation method comprises:
    replacing random words in the training text with mask marks, and preprocessing the mask-marked training text, so as to obtain a training word sequence, wherein the preprocessing comprises at least sentence division and word division;
    according to a preset word vector library, a correspondence between the position, in the training text, of the sentence to which the i-th word belongs and sentence segmentation vectors, and a correspondence between the position of the i-th word in the training word sequence and position vectors, correspondingly acquiring a training word vector di, a training sentence segmentation vector fi and a training position vector gi corresponding to the i-th word in the training word sequence;
    calculating, according to the formula ti = di + fi + gi, a training text embedding vector ti corresponding to the i-th word, wherein the training word vector di, the training sentence segmentation vector fi and the training position vector gi have the same dimension;
    generating a training text embedding vector sequence {t1, t2, …, tn}, wherein there are n words in total in the training word sequence.
  6. The text classification method based on a semantic representation model according to claim 4, wherein before the step of generating the training text embedding vector sequence corresponding to the training text according to the preset text embedding vector sequence generation method, inputting the training text embedding vector sequence into the preset M-layer word granularity encoder for calculation so as to obtain the first sub-attention matrix output by the M-layer word granularity encoder, and then inputting the first sub-attention matrix into the preset first loss function so as to obtain the first loss function value, the method comprises:
    setting the first loss function to: LOSS1 = -∑ Yi·log Xi, wherein LOSS1 is the first loss function, Yi is the expected first sub-attention matrix corresponding to the training text, and Xi is the first sub-attention matrix;
    setting the second loss function to: LOSS2 = -∑ (Gi·log Hi + (1-Gi)·log(1-Hi)), wherein LOSS2 is the second loss function, Gi is the expected second sub-attention matrix corresponding to the training text, and Hi is the second sub-attention matrix.
  7. The text classification method based on a semantic representation model according to claim 1, wherein after the step of inputting the final text embedding vector sequence and the final entity embedding vector sequence into the preset classification model for processing to obtain the text classification result, the method comprises:
    obtaining a designated answer sentence corresponding to the text classification result according to a preset correspondence between classification results and answer sentences;
    outputting the designated answer sentence.
  8. A text classification apparatus based on a semantic representation model, comprising:
    a text acquisition unit, configured to acquire input original text and preprocess the original text to obtain a word sequence, wherein the preprocessing comprises at least sentence division and word division;
    a first embedding calculation unit, configured to, according to a preset word vector generation method, a correspondence between the position, in the original text, of the sentence to which the i-th word belongs and sentence segmentation vectors, and a correspondence between the position of the i-th word in the word sequence and position vectors, correspondingly acquire a word vector ai, a sentence segmentation vector bi and a position vector ci corresponding to the i-th word in the word sequence, and calculate, according to the formula wi = ai + bi + ci, a text embedding vector wi corresponding to the i-th word, wherein the word vector ai, the sentence segmentation vector bi and the position vector ci have the same dimension;
    a first sequence generation unit, configured to generate a text embedding vector sequence {w1, w2, …, wn}, wherein there are n words in total in the word sequence;
    a second sequence generation unit, configured to input the word sequence into a preset knowledge embedding model, so as to acquire an entity embedding vector sequence {e1, e2, …, en}, wherein en is the entity embedding vector corresponding to the n-th word;
    an intermediate sequence generation unit, configured to input the text embedding vector sequence into a preset M-layer word granularity encoder for calculation, so as to obtain an intermediate text embedding vector sequence output by the last layer of word granularity encoder, wherein the M-layer word granularity encoder and a preset N-layer knowledge granularity encoder are sequentially connected to form a semantic representation model, and M and N are both greater than or equal to 2;
    a final sequence generation unit, configured to input the intermediate text embedding vector sequence and the entity embedding vector sequence into the N-layer knowledge granularity encoder for calculation, so as to obtain a final text embedding vector sequence and a final entity embedding vector sequence output by the last layer of knowledge granularity encoder;
    a text classification unit, configured to input the final text embedding vector sequence and the final entity embedding vector sequence into a preset classification model for processing to obtain a text classification result.
  9. The text classification apparatus based on a semantic representation model according to claim 8, wherein each layer of word granularity encoder is composed of a multi-head self-attention mechanism layer and a feedforward fully connected layer connected in sequence, and the intermediate sequence generation unit comprises:
    a matrix group calculation subunit, configured to, in the multi-head self-attention mechanism layer of the first-layer word granularity encoder, multiply the text embedding vector sequence by the h trained first parameter matrix groups respectively, so as to obtain a first matrix {Q1, Q2, …, Qh}, a second matrix {K1, K2, …, Kh} and a third matrix {V1, V2, …, Vh}, wherein each first parameter matrix group comprises three q×k first parameter matrices;
    第一矩阵获取子单元,用于根据公式:
    Figure PCTCN2019116339-appb-100006
    计算得到第z个子注意力矩阵,其中z大于等于1且小于等于h;
    The first matrix obtains subunits, which are used according to the formula:
    Figure PCTCN2019116339-appb-100006
    Calculate the z-th sub-attention matrix, where z is greater than or equal to 1 and less than or equal to h;
    第二矩阵获取子单元,用于根据公式:The second matrix obtains subunits, which are used according to the formula:
    Multihead({w 1,w 2,…,w n})=Concat(head 1,head 2,…,head h)W,计算得到多头自注意力矩阵Multihead,其中W为预设的第二参数矩阵,Concat函数指将矩阵按列方向直接拼接; Multihead({w 1 ,w 2 ,…,w n })=Concat(head 1 ,head 2 ,…,head h )W, calculate the multihead self-attention matrix Multihead, where W is the preset second parameter matrix , Concat function refers to directly splicing the matrix in the column direction;
    暂时向量获取子单元,用于将所述多头自注意力矩阵输入所述前馈全连接层中,从而得到暂时文本嵌入向量FFN(x),其中所述前馈全连接层中的计算公式为:FFN(x)=gelu(xW 1+b 1)W 2+b 2,其中x为所述多头自注意力矩阵,W 1、W 2为预设的参数矩阵,b 1、b 2为预设的偏置值; The temporary vector acquisition subunit is used to input the multi-head self-attention matrix into the feedforward fully connected layer to obtain a temporary text embedding vector FFN(x), wherein the calculation formula in the feedforward fully connected layer is : FFN(x)=gelu(xW 1 +b 1 )W 2 +b 2 , where x is the multi-head self-attention matrix, W 1 , W 2 are preset parameter matrices, and b 1 , b 2 are preset parameter matrices. Set offset value;
    中间序列生成子单元,用于将所有单词对应的暂时文本嵌入向量组成暂时文本嵌入向量序列,并将所述暂时文本嵌入向量序列输入下一层词粒度编码器中,直至获取最后一层词粒度编码器输出的中间文本嵌入向量序列。The intermediate sequence generation subunit is used to compose the temporary text embedding vector corresponding to all words into a temporary text embedding vector sequence, and input the temporary text embedding vector sequence into the next layer of word granularity encoder until the last layer of word granularity is obtained The intermediate text output by the encoder is embedded in the vector sequence.
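The following Python/numpy sketch illustrates one word granularity encoder layer as described in claim 9, under the assumption that the z-th sub-attention matrix is the standard scaled dot-product attention softmax(Q_z K_z^T / √k) V_z. The helper names, parameter shapes and random weights are illustrative assumptions, not the trained parameters referred to in the claim.

    import numpy as np

    def gelu(x):
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def word_granularity_layer(W_seq, params):
        # One layer: multi-head self-attention followed by the feed-forward network.
        # W_seq: (n, q) text embedding vector sequence; params: hypothetical trained weights.
        heads = []
        for Wq, Wk, Wv in params["head_groups"]:         # h groups of three q×k matrices
            Q, K, V = W_seq @ Wq, W_seq @ Wk, W_seq @ Wv
            k_dim = K.shape[-1]
            heads.append(softmax(Q @ K.T / np.sqrt(k_dim)) @ V)   # z-th sub-attention matrix
        multihead = np.concatenate(heads, axis=-1) @ params["W"]  # Concat(head1..headh)W
        # FFN(x) = gelu(xW1 + b1)W2 + b2 applied to the multi-head self-attention matrix
        return gelu(multihead @ params["W1"] + params["b1"]) @ params["W2"] + params["b2"]

    # Example with random stand-in parameters (n=5 words, q=4, k=2, h=2 heads).
    rng = np.random.default_rng(1)
    n, q, k, h = 5, 4, 2, 2
    params = {
        "head_groups": [tuple(rng.normal(size=(q, k)) for _ in range(3)) for _ in range(h)],
        "W": rng.normal(size=(h * k, q)),
        "W1": rng.normal(size=(q, 8)), "b1": np.zeros(8),
        "W2": rng.normal(size=(8, q)), "b2": np.zeros(q),
    }
    out = word_granularity_layer(rng.normal(size=(n, q)), params)
    print(out.shape)  # (5, 4): temporary text embedding vectors, fed to the next layer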
  10. The text classification apparatus based on a semantic representation model according to claim 8, wherein each knowledge granularity encoder layer includes a multi-head self-attention mechanism layer and an information aggregation layer, and the final sequence generation unit includes:
    a first acquisition subunit, configured to input the intermediate text embedding vector sequence and the entity embedding vector sequence into the multi-head self-attention mechanism layer of the first knowledge granularity encoder layer to obtain a first vector sequence {w'1, w'2, ..., w'n} and a second vector sequence {e'1, e'2, ..., e'n};
    a first calculation subunit, configured to input the first vector sequence and the second vector sequence into the information aggregation layer of the first knowledge granularity encoder layer to obtain the final text embedding vector mj and the final entity embedding vector pj corresponding to the j-th word, wherein the calculation formulas of the information aggregation layer are: mj = gelu(W3·hj + b3); pj = gelu(W4·hj + b4); hj = gelu(W5·w'j + W6·e'j + b5); W3, W4, W5 and W6 are all preset parameter matrices, and b3, b4 and b5 are all preset bias values;
    a second calculation subunit, configured to generate a first text embedding vector sequence {m1, m2, ..., mn} and a first entity embedding vector sequence {p1, p2, ..., pn}, and to input the first text embedding vector sequence {m1, m2, ..., mn} and the first entity embedding vector sequence {p1, p2, ..., pn} into the next knowledge granularity encoder layer, until the final text embedding vector sequence and the final entity embedding vector sequence output by the last knowledge granularity encoder layer are obtained.
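A minimal sketch of the information aggregation layer of claim 10, assuming the fusion formula hj = gelu(W5 w'j + W6 e'j + b5) reconstructed above; the weight names W5 and W6 and all random values are illustrative placeholders rather than the preset parameters of the claim.

    import numpy as np

    def gelu(x):
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

    def information_aggregation(w_seq, e_seq, W3, W4, W5, W6, b3, b4, b5):
        # hj fuses the token-side vector w'j with the entity-side vector e'j, then is
        # projected back into the text space (mj) and the entity space (pj).
        H = gelu(w_seq @ W5.T + e_seq @ W6.T + b5)
        M = gelu(H @ W3.T + b3)   # first text embedding vector sequence {m1, ..., mn}
        P = gelu(H @ W4.T + b4)   # first entity embedding vector sequence {p1, ..., pn}
        return M, P

    rng = np.random.default_rng(2)
    n, d = 5, 4
    M, P = information_aggregation(
        rng.normal(size=(n, d)), rng.normal(size=(n, d)),
        W3=rng.normal(size=(d, d)), W4=rng.normal(size=(d, d)),
        W5=rng.normal(size=(d, d)), W6=rng.normal(size=(d, d)),
        b3=np.zeros(d), b4=np.zeros(d), b5=np.zeros(d))
    print(M.shape, P.shape)  # (5, 4) (5, 4): inputs to the next knowledge granularity layer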
  11. The text classification apparatus based on a semantic representation model according to claim 8, wherein the apparatus includes:
    a text calling unit, configured to call pre-collected training text;
    a first acquisition unit, configured to generate a training text embedding vector sequence corresponding to the training text according to a preset text embedding vector sequence generation method, input the training text embedding vector sequence into the preset M-layer word granularity encoder for calculation to obtain a first sub-attention matrix output by the M-layer word granularity encoder, and input the first sub-attention matrix into a preset first loss function to obtain a first loss function value;
    a second acquisition unit, configured to generate a training entity embedding vector sequence corresponding to the training text according to a preset entity embedding vector sequence generation method, input the training entity embedding vector sequence and the training text embedding vector sequence into the preset N-layer knowledge granularity encoder for calculation to obtain a second sub-attention matrix output by the N-layer knowledge granularity encoder, and input the second sub-attention matrix into a preset second loss function to obtain a second loss function value;
    a third acquisition unit, configured to compute a total loss function value according to the formula: total loss function value = the first loss function value + the second loss function value, and to determine whether the total loss function value is greater than a preset loss function threshold;
    an adjustment unit, configured to, if the total loss function value is greater than the preset loss function threshold, adjust the parameters of the semantic representation model so that the total loss function value becomes less than the loss function threshold.
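An illustrative training-control sketch for claim 11: the two loss callables and the single scalar parameter are hypothetical stand-ins for the first and second loss functions and for the semantic representation model parameters; only the control flow (total loss = first loss value + second loss value, compared against a preset threshold, then parameter adjustment) mirrors the claim.

    # Hypothetical first and second loss values as functions of one scalar parameter theta.
    loss1_fn = lambda theta: (theta - 1.0) ** 2
    loss2_fn = lambda theta: 0.5 * (theta - 1.0) ** 2

    theta, threshold, lr = 5.0, 0.05, 0.1
    for _ in range(1000):
        total = loss1_fn(theta) + loss2_fn(theta)   # total loss = first loss + second loss
        if total <= threshold:                      # below the preset threshold: stop adjusting
            break
        grad = 2 * (theta - 1.0) + (theta - 1.0)    # gradient of the total loss w.r.t. theta
        theta -= lr * grad                          # adjust the model parameter
    print(round(total, 4), round(theta, 3))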
  12. The text classification apparatus based on a semantic representation model according to claim 11, wherein the first acquisition unit includes:
    a second acquisition subunit, configured to replace random words in the training text with mask tokens and to preprocess the mask-marked training text to obtain a training word sequence, wherein the preprocessing includes at least sentence division and word division;
    a third acquisition subunit, configured to obtain, according to a preset word vector library, a correspondence between sentence segmentation vectors and the position in the training text of the sentence to which the i-th word belongs, and a correspondence between position vectors and the position of the i-th word in the training word sequence, the training word vector di, the training sentence segmentation vector fi and the training position vector gi corresponding to the i-th word in the training word sequence;
    a fourth acquisition subunit, configured to compute the training text embedding vector ti corresponding to the i-th word according to the formula ti = di + fi + gi, wherein the training word vector di, the training sentence segmentation vector fi and the training position vector gi have the same dimension;
    a fifth acquisition subunit, configured to generate a training text embedding vector sequence {t1, t2, ..., tn}, wherein the training word sequence contains n words in total.
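A small sketch of the masking step in claim 12; the mask token string and the masking probability are illustrative choices, not values fixed by the claim.

    import random

    def mask_random_words(words, mask_token="[MASK]", mask_prob=0.15, seed=0):
        # Replace a random subset of words with a mask token before building the
        # training word sequence.
        rng = random.Random(seed)
        return [mask_token if rng.random() < mask_prob else w for w in words]

    print(mask_random_words(["explain", "the", "dropout", "regularization", "method"]))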
  13. The text classification apparatus based on a semantic representation model according to claim 11, wherein the apparatus includes:
    a first setting unit, configured to set the first loss function as LOSS1 = -Σ Yi·log(Xi), wherein LOSS1 is the first loss function, Yi is the expected first sub-attention matrix corresponding to the training text, and Xi is the first sub-attention matrix;
    a second setting unit, configured to set the second loss function as LOSS2 = -Σ (Gi·log(Hi) + (1 - Gi)·log(1 - Hi)), wherein LOSS2 is the second loss function, Gi is the expected second sub-attention matrix corresponding to the training text, and Hi is the second sub-attention matrix.
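The two loss functions of claim 13 can be written directly in Python/numpy as follows; the toy matrices at the end are invented examples, and the small epsilon/clipping is added only for numerical safety.

    import numpy as np

    def loss1(X, Y):
        # LOSS1 = -Σ Yi·log(Xi): cross-entropy between the first sub-attention matrix X
        # and its expected matrix Y (elementwise sum).
        return float(-np.sum(Y * np.log(X + 1e-9)))

    def loss2(H, G):
        # LOSS2 = -Σ (Gi·log(Hi) + (1-Gi)·log(1-Hi)): binary cross-entropy between the
        # second sub-attention matrix H and its expected matrix G.
        H = np.clip(H, 1e-9, 1 - 1e-9)
        return float(-np.sum(G * np.log(H) + (1 - G) * np.log(1 - H)))

    # Toy matrices standing in for real sub-attention outputs and their expectations.
    Y = np.eye(3); X = np.full((3, 3), 1 / 3)
    G = np.eye(3); H = np.full((3, 3), 0.5)
    print(round(loss1(X, Y), 3), round(loss2(H, G), 3))  # 3.296 6.238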
  14. The text classification apparatus based on a semantic representation model according to claim 8, wherein the apparatus includes:
    a sentence acquisition unit, configured to obtain a designated answer sentence corresponding to the text classification result according to a preset correspondence between classification results and answer sentences;
    a sentence output unit, configured to output the designated answer sentence.
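A trivial sketch of the correspondence lookup in claim 14; the category labels and answer sentences below are invented examples rather than the preset correspondence used by the apparatus.

    # Hypothetical classification-result-to-answer mapping.
    answer_for = {
        "concept_correct": "Your definition of overfitting is accurate.",
        "concept_incomplete": "Please also mention how regularization mitigates overfitting.",
    }

    def respond(classification_result):
        return answer_for.get(classification_result, "No preset answer for this category.")

    print(respond("concept_incomplete"))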
  15. A computer device, comprising a memory and a processor, wherein the memory stores a computer program, and when the processor executes the computer program, a text classification method based on a semantic representation model is implemented, the method comprising:
    acquiring input original text and preprocessing the original text to obtain a word sequence, wherein the preprocessing includes at least sentence division and word division;
    obtaining, according to a preset word vector generation method, a correspondence between sentence segmentation vectors and the position in the original text of the sentence to which the i-th word belongs, and a correspondence between position vectors and the position of the i-th word in the word sequence, the word vector ai, the sentence segmentation vector bi and the position vector ci corresponding to the i-th word in the word sequence, and computing the text embedding vector wi corresponding to the i-th word according to the formula wi = ai + bi + ci, wherein the word vector ai, the sentence segmentation vector bi and the position vector ci have the same dimension;
    generating a text embedding vector sequence {w1, w2, ..., wn}, wherein the word sequence contains n words in total;
    inputting the word sequence into a preset knowledge embedding model to obtain an entity embedding vector sequence {e1, e2, ..., en}, wherein en is the entity embedding vector corresponding to the n-th word;
    inputting the text embedding vector sequence into a preset M-layer word granularity encoder for calculation to obtain an intermediate text embedding vector sequence output by the last word granularity encoder layer, wherein the M-layer word granularity encoder and a preset N-layer knowledge granularity encoder are connected in sequence to form the semantic representation model, and M and N are both greater than or equal to 2;
    inputting the intermediate text embedding vector sequence and the entity embedding vector sequence into the N-layer knowledge granularity encoder for calculation to obtain a final text embedding vector sequence and a final entity embedding vector sequence output by the last knowledge granularity encoder layer;
    inputting the final text embedding vector sequence and the final entity embedding vector sequence into a preset classification model for processing to obtain a text classification result.
  16. The computer device according to claim 15, wherein each word granularity encoder layer is formed by a multi-head self-attention mechanism layer and a feed-forward fully connected layer connected in sequence, and the step of inputting the text embedding vector sequence into the preset M-layer word granularity encoder for calculation to obtain the intermediate text embedding vector sequence output by the last word granularity encoder layer includes:
    in the multi-head self-attention mechanism layer of the first word granularity encoder layer, multiplying the text embedding vector sequence by h trained first parameter matrix groups respectively to obtain first matrices {Q1, Q2, ..., Qh}, second matrices {K1, K2, ..., Kh} and third matrices {V1, V2, ..., Vh}, wherein each first parameter matrix group includes three q×k first parameter matrices;
    computing the z-th sub-attention matrix according to the formula: head_z = softmax(Q_z · K_z^T / √k) · V_z, wherein z is greater than or equal to 1 and less than or equal to h;
    computing the multi-head self-attention matrix Multihead according to the formula: Multihead({w1, w2, ..., wn}) = Concat(head1, head2, ..., headh)W, wherein W is a preset second parameter matrix and the Concat function concatenates the matrices directly along the column direction;
    inputting the multi-head self-attention matrix into the feed-forward fully connected layer to obtain a temporary text embedding vector FFN(x), wherein the calculation formula of the feed-forward fully connected layer is FFN(x) = gelu(xW1 + b1)W2 + b2, x is the multi-head self-attention matrix, W1 and W2 are preset parameter matrices, and b1 and b2 are preset bias values;
    composing the temporary text embedding vectors corresponding to all the words into a temporary text embedding vector sequence and inputting the temporary text embedding vector sequence into the next word granularity encoder layer, until the intermediate text embedding vector sequence output by the last word granularity encoder layer is obtained.
  17. The computer device according to claim 15, wherein each knowledge granularity encoder layer includes a multi-head self-attention mechanism layer and an information aggregation layer, and the step of inputting the intermediate text embedding vector sequence and the entity embedding vector sequence into the N-layer knowledge granularity encoder for calculation to obtain the final text embedding vector sequence and the final entity embedding vector sequence output by the last knowledge granularity encoder layer includes:
    inputting the intermediate text embedding vector sequence and the entity embedding vector sequence into the multi-head self-attention mechanism layer of the first knowledge granularity encoder layer to obtain a first vector sequence {w'1, w'2, ..., w'n} and a second vector sequence {e'1, e'2, ..., e'n};
    inputting the first vector sequence and the second vector sequence into the information aggregation layer of the first knowledge granularity encoder layer to obtain the final text embedding vector mj and the final entity embedding vector pj corresponding to the j-th word, wherein the calculation formulas of the information aggregation layer are: mj = gelu(W3·hj + b3); pj = gelu(W4·hj + b4); hj = gelu(W5·w'j + W6·e'j + b5); W3, W4, W5 and W6 are all preset parameter matrices, and b3, b4 and b5 are all preset bias values;
    generating a first text embedding vector sequence {m1, m2, ..., mn} and a first entity embedding vector sequence {p1, p2, ..., pn}, and inputting the first text embedding vector sequence {m1, m2, ..., mn} and the first entity embedding vector sequence {p1, p2, ..., pn} into the next knowledge granularity encoder layer, until the final text embedding vector sequence and the final entity embedding vector sequence output by the last knowledge granularity encoder layer are obtained.
  18. A non-volatile computer-readable storage medium on which a computer program is stored, wherein when the computer program is executed by a processor, a text classification method based on a semantic representation model is implemented, the method comprising:
    acquiring input original text and preprocessing the original text to obtain a word sequence, wherein the preprocessing includes at least sentence division and word division;
    obtaining, according to a preset word vector generation method, a correspondence between sentence segmentation vectors and the position in the original text of the sentence to which the i-th word belongs, and a correspondence between position vectors and the position of the i-th word in the word sequence, the word vector ai, the sentence segmentation vector bi and the position vector ci corresponding to the i-th word in the word sequence, and computing the text embedding vector wi corresponding to the i-th word according to the formula wi = ai + bi + ci, wherein the word vector ai, the sentence segmentation vector bi and the position vector ci have the same dimension;
    generating a text embedding vector sequence {w1, w2, ..., wn}, wherein the word sequence contains n words in total;
    inputting the word sequence into a preset knowledge embedding model to obtain an entity embedding vector sequence {e1, e2, ..., en}, wherein en is the entity embedding vector corresponding to the n-th word;
    inputting the text embedding vector sequence into a preset M-layer word granularity encoder for calculation to obtain an intermediate text embedding vector sequence output by the last word granularity encoder layer, wherein the M-layer word granularity encoder and a preset N-layer knowledge granularity encoder are connected in sequence to form the semantic representation model, and M and N are both greater than or equal to 2;
    inputting the intermediate text embedding vector sequence and the entity embedding vector sequence into the N-layer knowledge granularity encoder for calculation to obtain a final text embedding vector sequence and a final entity embedding vector sequence output by the last knowledge granularity encoder layer;
    inputting the final text embedding vector sequence and the final entity embedding vector sequence into a preset classification model for processing to obtain a text classification result.
  19. The non-volatile computer-readable storage medium according to claim 18, wherein each word granularity encoder layer is formed by a multi-head self-attention mechanism layer and a feed-forward fully connected layer connected in sequence, and the step of inputting the text embedding vector sequence into the preset M-layer word granularity encoder for calculation to obtain the intermediate text embedding vector sequence output by the last word granularity encoder layer includes:
    in the multi-head self-attention mechanism layer of the first word granularity encoder layer, multiplying the text embedding vector sequence by h trained first parameter matrix groups respectively to obtain first matrices {Q1, Q2, ..., Qh}, second matrices {K1, K2, ..., Kh} and third matrices {V1, V2, ..., Vh}, wherein each first parameter matrix group includes three q×k first parameter matrices;
    computing the z-th sub-attention matrix according to the formula: head_z = softmax(Q_z · K_z^T / √k) · V_z, wherein z is greater than or equal to 1 and less than or equal to h;
    computing the multi-head self-attention matrix Multihead according to the formula: Multihead({w1, w2, ..., wn}) = Concat(head1, head2, ..., headh)W, wherein W is a preset second parameter matrix and the Concat function concatenates the matrices directly along the column direction;
    inputting the multi-head self-attention matrix into the feed-forward fully connected layer to obtain a temporary text embedding vector FFN(x), wherein the calculation formula of the feed-forward fully connected layer is FFN(x) = gelu(xW1 + b1)W2 + b2, x is the multi-head self-attention matrix, W1 and W2 are preset parameter matrices, and b1 and b2 are preset bias values;
    composing the temporary text embedding vectors corresponding to all the words into a temporary text embedding vector sequence and inputting the temporary text embedding vector sequence into the next word granularity encoder layer, until the intermediate text embedding vector sequence output by the last word granularity encoder layer is obtained.
  20. The non-volatile computer-readable storage medium according to claim 18, wherein each knowledge granularity encoder layer includes a multi-head self-attention mechanism layer and an information aggregation layer, and the step of inputting the intermediate text embedding vector sequence and the entity embedding vector sequence into the N-layer knowledge granularity encoder for calculation to obtain the final text embedding vector sequence and the final entity embedding vector sequence output by the last knowledge granularity encoder layer includes:
    inputting the intermediate text embedding vector sequence and the entity embedding vector sequence into the multi-head self-attention mechanism layer of the first knowledge granularity encoder layer to obtain a first vector sequence {w'1, w'2, ..., w'n} and a second vector sequence {e'1, e'2, ..., e'n};
    inputting the first vector sequence and the second vector sequence into the information aggregation layer of the first knowledge granularity encoder layer to obtain the final text embedding vector mj and the final entity embedding vector pj corresponding to the j-th word, wherein the calculation formulas of the information aggregation layer are: mj = gelu(W3·hj + b3); pj = gelu(W4·hj + b4); hj = gelu(W5·w'j + W6·e'j + b5); W3, W4, W5 and W6 are all preset parameter matrices, and b3, b4 and b5 are all preset bias values;
    generating a first text embedding vector sequence {m1, m2, ..., mn} and a first entity embedding vector sequence {p1, p2, ..., pn}, and inputting the first text embedding vector sequence {m1, m2, ..., mn} and the first entity embedding vector sequence {p1, p2, ..., pn} into the next knowledge granularity encoder layer, until the final text embedding vector sequence and the final entity embedding vector sequence output by the last knowledge granularity encoder layer are obtained.
PCT/CN2019/116339 2019-09-19 2019-11-07 Semantic representation model-based text classification method and apparatus, and computer device WO2021051503A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910886622.1 2019-09-19
CN201910886622.1A CN110781312B (en) 2019-09-19 2019-09-19 Text classification method and device based on semantic representation model and computer equipment

Publications (1)

Publication Number Publication Date
WO2021051503A1 true WO2021051503A1 (en) 2021-03-25

Family

ID=69383591

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116339 WO2021051503A1 (en) 2019-09-19 2019-11-07 Semantic representation model-based text classification method and apparatus, and computer device

Country Status (2)

Country Link
CN (1) CN110781312B (en)
WO (1) WO2021051503A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111581929B (en) * 2020-04-22 2022-09-27 腾讯科技(深圳)有限公司 Text generation method based on table and related device
CN111694936B (en) * 2020-04-26 2023-06-06 平安科技(深圳)有限公司 Method, device, computer equipment and storage medium for identification of AI intelligent interview
CN113672737A (en) * 2020-05-13 2021-11-19 复旦大学 Knowledge graph entity concept description generation system
CN111563166B (en) * 2020-05-28 2024-02-13 浙江学海教育科技有限公司 Pre-training model method for classifying mathematical problems
CN111737995B (en) * 2020-05-29 2024-04-05 北京百度网讯科技有限公司 Method, device, equipment and medium for training language model based on multiple word vectors
CN112241631A (en) * 2020-10-23 2021-01-19 平安科技(深圳)有限公司 Text semantic recognition method and device, electronic equipment and storage medium
CN112307752A (en) * 2020-10-30 2021-02-02 平安科技(深圳)有限公司 Data processing method and device, electronic equipment and storage medium
CN113032567B (en) * 2021-03-29 2022-03-29 广东众聚人工智能科技有限公司 Position embedding interpretation method and device, computer equipment and storage medium
CN112948633B (en) * 2021-04-01 2023-09-05 北京奇艺世纪科技有限公司 Content tag generation method and device and electronic equipment
CN113449081A (en) * 2021-07-08 2021-09-28 平安国际智慧城市科技股份有限公司 Text feature extraction method and device, computer equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2518946C1 (en) * 2012-11-27 2014-06-10 Александр Александрович Харламов Method for automatic semantic indexing of natural language text
CN105005556A (en) * 2015-07-29 2015-10-28 成都理工大学 Index keyword extraction method and system based on big geological data
US10255269B2 (en) * 2016-12-30 2019-04-09 Microsoft Technology Licensing, Llc Graph long short term memory for syntactic relationship discovery
CN109871451B (en) * 2019-01-25 2021-03-19 中译语通科技股份有限公司 Method and system for extracting relation of dynamic word vectors

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150039331A1 (en) * 2013-08-02 2015-02-05 Real Endpoints LLC Assessing pharmaceuticals
CN108829722A (en) * 2018-05-08 2018-11-16 国家计算机网络与信息安全管理中心 A kind of Dual-Attention relationship classification method and system of remote supervisory
CN109271516A (en) * 2018-09-26 2019-01-25 清华大学 Entity type classification method and system in a kind of knowledge mapping

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "BERT Analysis and its Application in Text Classification", 7 May 2019 (2019-05-07), XP055792801, Retrieved from the Internet <URL:https://www.cnblogs.com/xlturing/p/10824400.html> [retrieved on 20200618] *

Cited By (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239192A (en) * 2021-04-29 2021-08-10 湘潭大学 Text structuring technology based on sliding window and random discrete sampling
CN113239192B (en) * 2021-04-29 2024-04-16 湘潭大学 Text structuring technology based on sliding window and random discrete sampling
CN113379032A (en) * 2021-06-08 2021-09-10 全球能源互联网研究院有限公司 Layered bidirectional LSTM sequence model training method and system
CN113468874A (en) * 2021-06-09 2021-10-01 大连理工大学 Biomedical relation extraction method based on graph convolution self-coding
CN113468874B (en) * 2021-06-09 2024-04-16 大连理工大学 Biomedical relation extraction method based on graph convolution self-coding
CN113420121B (en) * 2021-06-24 2023-07-28 中国科学院声学研究所 Text processing model training method, voice text processing method and device
CN113420121A (en) * 2021-06-24 2021-09-21 中国科学院声学研究所 Text processing model training method, voice text processing method and device
CN113378973B (en) * 2021-06-29 2023-08-08 沈阳雅译网络技术有限公司 Image classification method based on self-attention mechanism
CN113378973A (en) * 2021-06-29 2021-09-10 沈阳雅译网络技术有限公司 Image classification method based on self-attention mechanism
CN113486669B (en) * 2021-07-06 2024-03-29 上海市东方医院(同济大学附属东方医院) Semantic recognition method for emergency rescue input voice
CN113626537A (en) * 2021-07-06 2021-11-09 南京邮电大学 Entity relationship extraction method and system for knowledge graph construction
CN113626537B (en) * 2021-07-06 2023-10-17 南京邮电大学 Knowledge graph construction-oriented entity relation extraction method and system
CN113486669A (en) * 2021-07-06 2021-10-08 上海市东方医院(同济大学附属东方医院) Semantic recognition method for emergency rescue input voice
CN113741886A (en) * 2021-08-02 2021-12-03 扬州大学 Statement level program repairing method and system based on graph
CN113741886B (en) * 2021-08-02 2023-09-26 扬州大学 Sentence-level program repairing method and system based on graph
CN113535984B (en) * 2021-08-11 2023-05-26 华侨大学 Knowledge graph relation prediction method and device based on attention mechanism
CN113535984A (en) * 2021-08-11 2021-10-22 华侨大学 Attention mechanism-based knowledge graph relation prediction method and device
CN113657257A (en) * 2021-08-16 2021-11-16 浙江大学 End-to-end sign language translation method and system
CN113657257B (en) * 2021-08-16 2023-12-19 浙江大学 End-to-end sign language translation method and system
CN113779192A (en) * 2021-08-23 2021-12-10 河海大学 Text classification algorithm of bidirectional dynamic route based on labeled constraint
CN113742188A (en) * 2021-08-25 2021-12-03 宁波大学 BERT-based non-invasive computer behavior monitoring method and system
CN113821636A (en) * 2021-08-27 2021-12-21 上海快确信息科技有限公司 Financial text joint extraction and classification scheme based on knowledge graph
CN113837233A (en) * 2021-08-30 2021-12-24 厦门大学 Image description method of self-attention mechanism based on sample self-adaptive semantic guidance
CN113837233B (en) * 2021-08-30 2023-11-17 厦门大学 Image description method of self-attention mechanism based on sample self-adaptive semantic guidance
CN114003730A (en) * 2021-10-29 2022-02-01 福州大学 Open world knowledge complementing method and system based on relation specific gate filtering
CN114281986B (en) * 2021-11-15 2024-03-26 国网吉林省电力有限公司 Enterprise file dense point labeling method based on self-attention network
CN114281986A (en) * 2021-11-15 2022-04-05 国网吉林省电力有限公司 Enterprise file secret point marking method based on self-attention network
CN114357176A (en) * 2021-11-26 2022-04-15 永中软件股份有限公司 Method for automatically extracting entity knowledge, computer device and computer readable medium
CN114357176B (en) * 2021-11-26 2023-11-21 永中软件股份有限公司 Entity knowledge automatic extraction method, computer device and computer readable medium
CN114357158B (en) * 2021-12-09 2024-04-09 南京中孚信息技术有限公司 Long text classification technology based on sentence granularity semantics and relative position coding
CN114357158A (en) * 2021-12-09 2022-04-15 南京中孚信息技术有限公司 Long text classification technology based on sentence granularity semantics and relative position coding
CN114781356A (en) * 2022-03-14 2022-07-22 华南理工大学 Text abstract generation method based on input sharing
CN114925742B (en) * 2022-03-24 2024-05-14 华南理工大学 Symbol music emotion classification system and method based on auxiliary task
CN114925742A (en) * 2022-03-24 2022-08-19 华南理工大学 Symbolic music emotion classification system and method based on auxiliary task
CN115422477B (en) * 2022-09-16 2023-09-05 哈尔滨理工大学 Track neighbor query system, method, computer and storage medium
CN115422477A (en) * 2022-09-16 2022-12-02 哈尔滨理工大学 Track neighbor query system, method, computer and storage medium
CN115357690A (en) * 2022-10-19 2022-11-18 有米科技股份有限公司 Text repetition removing method and device based on text mode self-supervision
CN115357690B (en) * 2022-10-19 2023-04-07 有米科技股份有限公司 Text repetition removing method and device based on text mode self-supervision
CN117151121A (en) * 2023-10-26 2023-12-01 安徽农业大学 Multi-intention spoken language understanding method based on fluctuation threshold and segmentation
CN117132997B (en) * 2023-10-26 2024-03-12 国网江西省电力有限公司电力科学研究院 Handwriting form recognition method based on multi-head attention mechanism and knowledge graph
CN117151121B (en) * 2023-10-26 2024-01-12 安徽农业大学 Multi-intention spoken language understanding method based on fluctuation threshold and segmentation
CN117132997A (en) * 2023-10-26 2023-11-28 国网江西省电力有限公司电力科学研究院 Handwriting form recognition method based on multi-head attention mechanism and knowledge graph
CN117744635A (en) * 2024-02-12 2024-03-22 长春职业技术学院 English text automatic correction system and method based on intelligent AI
CN117744635B (en) * 2024-02-12 2024-04-30 长春职业技术学院 English text automatic correction system and method based on intelligent AI
CN117763190A (en) * 2024-02-22 2024-03-26 彩讯科技股份有限公司 Intelligent picture text matching method and system
CN117763190B (en) * 2024-02-22 2024-05-14 彩讯科技股份有限公司 Intelligent picture text matching method and system

Also Published As

Publication number Publication date
CN110781312A (en) 2020-02-11
CN110781312B (en) 2022-07-15

Similar Documents

Publication Publication Date Title
WO2021051503A1 (en) Semantic representation model-based text classification method and apparatus, and computer device
Rastogi et al. Scalable multi-domain dialogue state tracking
Tuan et al. Capturing greater context for question generation
CN109062907B (en) Neural machine translation method integrating dependency relationship
Chollampatt et al. Neural quality estimation of grammatical error correction
US20210232753A1 (en) Ml using n-gram induced input representation
WO2022134793A1 (en) Method and apparatus for extracting semantic information in video frame, and computer device
Arslan et al. Doubly attentive transformer machine translation
CN115374270A (en) Legal text abstract generation method based on graph neural network
CN110263143A (en) Improve the neurologic problems generation method of correlation
CN109840506B (en) Method for solving video question-answering task by utilizing video converter combined with relational interaction
CN112000788A (en) Data processing method and device and computer readable storage medium
WO2021147405A1 (en) Customer-service statement quality detection method and related device
Han et al. A coordinated representation learning enhanced multimodal machine translation approach with multi-attention
CN114048282A (en) Text tree local matching-based image-text cross-modal retrieval method and system
CN112463935A (en) Open domain dialogue generation method and model with strong generalized knowledge selection
WO2022178950A1 (en) Method and apparatus for predicting statement entity, and computer device
Panesar et al. Improving visual question answering by leveraging depth and adapting explainability
US11954429B2 (en) Automated notebook completion using sequence-to-sequence transformer
Olive et al. Practical high breakdown regression
Banerjee et al. Attr2style: A transfer learning approach for inferring fashion styles via apparel attributes
JP7161974B2 (en) Quality control method
Frieder et al. Large language models for mathematicians
Ren Scalable and accurate dialogue state tracking via hierarchical sequence generation
Chen et al. Efficient Artificial Intelligence-Teaching Assistant Based on ChatGPT

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19946020

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19946020

Country of ref document: EP

Kind code of ref document: A1