WO2023005968A1 - Text category recognition method and apparatus, and electronic device and storage medium - Google Patents


Info

Publication number
WO2023005968A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
subtext
text
sequence
sample
Prior art date
Application number
PCT/CN2022/108224
Other languages
French (fr)
Chinese (zh)
Inventor
马玉昆
卜英桐
程大川
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023005968A1 publication Critical patent/WO2023005968A1/en


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/355 - Class or cluster creation or modification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/335 - Filtering based on additional data, e.g. user or group profiles
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • Embodiments of the present disclosure relate to the technical field of information processing, and specifically relate to a text category recognition method, device, electronic device, and storage medium.
  • Text category recognition refers to determining, for a piece of text, whether the text belongs to a preset category, or giving a probability value that the text belongs to a preset category.
  • For example, an e-commerce platform needs to examine the product introduction text uploaded by merchants to determine whether the product introduction text meets the requirements and whether it contains inappropriate expressions.
  • Similarly, a literary works platform needs to examine the text content of novels uploaded by users to determine whether the novel text includes vulgar or indecent content.
  • Embodiments of the present disclosure provide a text category recognition method, device, electronic device and storage medium.
  • an embodiment of the present disclosure provides a text category recognition method, the method comprising:
  • the following first calculation operation is performed: for each sentence in the subtext, based on the sentence feature vector corresponding to each sentence in the sentence sequence corresponding to the subtext, calculate the attention feature vector of the sentence with respect to the subtext; based on the attention feature vector of each sentence relative to the subtext, calculate the attention feature vector of the subtext with respect to the text to be identified;
  • the feature extraction model and the classification model are pre-trained through the following training steps:
  • acquiring a training sample set, wherein the training samples include sample text and a sample label for representing whether the sample text belongs to a preset category of text;
  • splitting the sample text in the training sample to obtain a sample subtext sequence, and splitting each subtext in the sample subtext sequence to obtain a corresponding sentence sequence; for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence, performing feature extraction according to the initial feature extraction model to obtain the sentence feature vector corresponding to the sentence; for each sample subtext in the sample subtext sequence, performing a second calculation operation to obtain the attention feature vector of the sample subtext relative to the sample text: calculating, based on the sentence feature vector corresponding to each sentence in the sentence sequence corresponding to the sample subtext, the attention feature vector of each sentence relative to the sample subtext, and calculating, based on the attention feature vectors of the sentences relative to the sample subtext, the attention feature vector of the sample subtext relative to the sample text; and splicing the attention feature vectors of the sample subtexts in the sample subtext sequence relative to the sample text to obtain the sample text feature vector corresponding to the sample text.
  • the feature extraction model includes a word vector feature extraction model and a sentence vector feature extraction model
  • the word vectors corresponding to the word segments in the word segmentation sequence are combined to form the sentence feature matrix corresponding to the sentence, and feature extraction is performed on the sentence feature matrix corresponding to the sentence according to the sentence vector feature extraction model to obtain the sentence feature vector corresponding to the sentence.
  • the word vector feature extraction model includes at least one of the following: a long short-term memory network, and a Transformer model.
  • the sentence vector feature extraction model includes at least one of the following: a convolutional neural network, and a bidirectional long-short-term memory network.
  • each word segment in the word segmentation sequence corresponding to the sentence is subjected to feature extraction according to the word vector feature extraction model to obtain a corresponding word vector;
  • before combining the word vectors corresponding to the word segments in the word segmentation sequence corresponding to the sentence to form the sentence feature matrix corresponding to the sentence, the training step also includes:
  • for each word segment in the word segmentation sequence, in response to determining that the word segment matches a keyword in the preset text category keyword set, the word vector corresponding to the word segment is set as the preset word vector.
  • the method also includes:
  • first recognition result information for indicating that the text to be recognized is a preset text category is generated.
  • the method also includes:
  • second recognition result information for indicating that the text to be recognized is not a preset text category is generated.
  • the method also includes:
  • the method also includes:
  • an embodiment of the present disclosure provides a text category recognition device, which includes:
  • the splitting unit is configured to split the text to be recognized to obtain a subtext sequence, and split each subtext in the subtext sequence to obtain a corresponding sentence sequence;
  • the feature extraction unit is configured to perform feature extraction for each sentence in the sentence sequence corresponding to each subtext according to a pre-trained feature extraction model to obtain a sentence feature vector corresponding to the sentence;
  • the calculation unit is configured to perform the following first calculation operation for each subtext in the subtext sequence: for each sentence in the subtext, based on the sentence feature vector corresponding to each sentence in the sentence sequence corresponding to the subtext, calculate the attention feature vector of the sentence relative to the subtext; and calculate the attention feature vector of the subtext relative to the text to be identified based on the attention feature vector of each sentence relative to the subtext;
  • the splicing unit is configured to splice the attention feature vectors of the subtexts in the subtext sequence relative to the text to be recognized to obtain the text feature vector to be recognized corresponding to the text to be recognized;
  • the recognition unit is configured to input the feature vector of the text to be recognized into a pre-trained classification model to obtain a probability value that the text to be recognized belongs to a preset category of text.
  • the feature extraction model and the classification model are pre-trained as follows:
  • acquiring a training sample set, wherein the training samples include sample text and a sample label for representing whether the sample text belongs to a preset category of text;
  • splitting the sample text in the training sample to obtain a sample subtext sequence, and splitting each subtext in the sample subtext sequence to obtain a corresponding sentence sequence; for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence, performing feature extraction according to the initial feature extraction model to obtain the sentence feature vector corresponding to the sentence; for each sample subtext in the sample subtext sequence, performing a second calculation operation to obtain the attention feature vector of the sample subtext relative to the sample text: calculating, based on the sentence feature vector corresponding to each sentence in the sentence sequence corresponding to the sample subtext, the attention feature vector of each sentence relative to the sample subtext, and calculating, based on the attention feature vectors of the sentences relative to the sample subtext, the attention feature vector of the sample subtext relative to the sample text; and splicing the attention feature vectors of the sample subtexts in the sample subtext sequence relative to the sample text to obtain the sample text feature vector corresponding to the sample text.
  • the feature extraction model includes a word vector feature extraction model and a sentence vector feature extraction model
  • the feature extraction unit is further configured to:
  • the word vectors corresponding to the word segments in the word segmentation sequence are combined to form the sentence feature matrix corresponding to the sentence, and feature extraction is performed on the sentence feature matrix corresponding to the sentence according to the sentence vector feature extraction model to obtain the sentence feature vector corresponding to the sentence.
  • the word vector feature extraction model includes at least one of the following: a long short-term memory network, and a Transformer model.
  • the sentence vector feature extraction model includes at least one of the following: a convolutional neural network, and a bidirectional long-short-term memory network.
  • each word segment in the word segmentation sequence corresponding to the sentence is subjected to feature extraction according to the word vector feature extraction model to obtain a corresponding word vector;
  • before combining the word vectors corresponding to the word segments in the word segmentation sequence corresponding to the sentence to form the sentence feature matrix corresponding to the sentence, the training step also includes:
  • for each word segment in the word segmentation sequence, in response to determining that the word segment matches a keyword in the preset text category keyword set, the word vector corresponding to the word segment is set as the preset word vector.
  • the device also includes:
  • a determining unit configured to determine whether the probability value is greater than a preset probability threshold
  • the first generating unit is configured to generate, in response to determining that the probability value is greater than the preset probability threshold, first recognition result information for indicating that the text to be recognized belongs to the preset text category.
  • the device also includes:
  • the second generating unit is configured to, in response to determining that the probability value is not greater than the preset probability threshold, generate second recognition result information for indicating that the text to be recognized does not belong to the preset text category.
  • the device also includes:
  • the first presentation unit is configured to, for each sentence in the sentence sequence corresponding to each subtext in the subtext sequence, calculate the probability value that the sentence belongs to the preset text category based on the attention feature vector of the sentence relative to the subtext, determine the presentation manner corresponding to the sentence according to the calculated probability value, and present the sentence according to the determined presentation manner.
  • the device also includes:
  • the second presentation unit is configured to, for each subtext in the subtext sequence, calculate the probability value that the subtext belongs to the preset text category based on the attention feature vector of the subtext relative to the text to be recognized, determine the presentation manner corresponding to the subtext according to the calculated probability value, and present the subtext according to the determined presentation manner.
  • an embodiment of the present disclosure provides an electronic device, including: one or more processors; and a storage device on which one or more programs are stored, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation manner of the first aspect.
  • embodiments of the present disclosure provide a computer-readable storage medium on which a computer program is stored, wherein, when the computer program is executed by one or more processors, the method described in any implementation manner of the first aspect is implemented.
  • if the volume of text increases by dozens or even hundreds of times, manual review will likewise incur a large labor cost;
  • using the bag-of-words model to directly model long text relies only on statistics of word frequencies in the long text, and cannot give a probability value that specific content in the long text is related to a specific category, so it cannot meet richer business needs; and if a deep semantic model is used instead, the text needs to be truncated, in which case the range of text that can be covered is small, which may also cause omissions.
  • The text category recognition method, apparatus, electronic device, and storage medium provided by the embodiments of the present disclosure split the text to be recognized into subtexts.
  • Each subtext is split into sentences, a sentence feature vector is generated for each sentence, the attention feature vector of each sentence relative to its subtext and the attention feature vector of each subtext relative to the text to be recognized are then generated, and the attention feature vectors of the subtexts relative to the text to be recognized are spliced to obtain the feature vector of the text to be recognized.
  • The feature vector of the text to be recognized is input into the pre-trained classification model to obtain the probability value that the text to be recognized belongs to the preset category of text.
  • In this way, a hierarchical attention relationship among the sentences, the subtexts, and the text to be recognized is established, the feature vector of the text to be recognized is generated, and the probability value that the text belongs to the preset text category is calculated, realizing automatic classification of the text to be recognized and reducing the labor cost of text classification.
  • Optionally, the probability value that a sentence belongs to the preset text category can also be calculated using the attention feature vector of the sentence relative to its subtext; the presentation manner corresponding to the sentence is determined according to the calculated probability value, and the sentence is presented accordingly, so that sentences with different probability values of belonging to the preset text category are presented in corresponding manners for reference during manual labeling, reducing the possibility of omissions.
  • FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present disclosure can be applied;
  • FIG. 2 is a flowchart of an embodiment of the text category recognition method according to the present disclosure;
  • FIG. 3 is a flowchart of the training steps of a feature extraction model and a classification model according to the present disclosure;
  • FIG. 4 is a schematic diagram of an application scenario of a text category recognition method according to the present disclosure;
  • FIG. 5 is a schematic structural diagram of an embodiment of a text category recognition device according to the present disclosure.
  • FIG. 6 is a structural schematic diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present disclosure.
  • FIG. 1 shows an exemplary system architecture 100 to which embodiments of the text category recognition method, device, electronic device and storage medium of the present disclosure can be applied.
  • a system architecture 100 may include terminal devices 101 , 102 , 103 , a network 104 and a server 105 .
  • the network 104 is used as a medium for providing communication links between the terminal devices 101 , 102 , 103 and the server 105 .
  • Network 104 may include various connection types, such as wires, wireless communication links, or fiber optic cables, among others.
  • Users can use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like.
  • Various communication client applications can be installed on the terminal devices 101, 102, and 103, such as text category recognition applications, speech recognition applications, short video social applications, audio and video conference applications, live video applications, document editing applications, input method applications, web browser applications, shopping applications, search applications, instant messaging tools, email clients, and social platform software.
  • the terminal devices 101, 102, and 103 may be hardware or software.
  • the terminal devices 101, 102, and 103 can be various electronic devices with display screens, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
  • If the terminal devices 101, 102, and 103 are software, they can be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, for providing text category recognition services) or as a single piece of software or software module. No specific limitation is made here.
  • the text category recognition method provided in the present disclosure may be executed by the terminal devices 101 , 102 , 103 , and correspondingly, the text category recognition apparatus may be set in the terminal devices 101 , 102 , 103 .
  • the system architecture 100 may not include the server 105 .
  • the text category recognition method provided by the present disclosure can be jointly executed by the terminal devices 101, 102, 103 and the server 105; for example, the step of "obtaining the text to be recognized" can be executed by the terminal devices 101, 102, 103, and steps such as "for each sentence in the sentence sequence corresponding to each subtext, perform feature extraction according to a pre-trained feature extraction model to obtain a sentence feature vector corresponding to the sentence" can be performed by the server 105.
  • the means for identifying text categories can also be respectively set in the terminal devices 101, 102, 103 and the server 105.
  • the text category recognition method provided by the present disclosure can be executed by the server 105, and correspondingly, the text category recognition device can also be set in the server 105.
  • the system architecture 100 may not include the terminal devices 101, 102 , 103.
  • the server 105 may be hardware or software.
  • the server 105 can be implemented as a distributed server cluster composed of multiple servers, or as a single server.
  • If the server 105 is software, it can be implemented as multiple pieces of software or software modules (for example, for providing distributed services) or as a single piece of software or software module. No specific limitation is made here.
  • The numbers of terminal devices, networks and servers in FIG. 1 are only illustrative; there can be any number of terminal devices, networks and servers according to implementation needs.
  • FIG. 2 shows a process 200 of an embodiment of a text category recognition method according to the present disclosure.
  • the text category recognition method includes the following steps:
  • Step 201 splitting the text to be recognized to obtain a subtext sequence, and splitting each subtext in the subtext sequence to obtain a corresponding sentence sequence.
  • The execution body of the text category recognition method (such as the server 105 shown in FIG. 1) can first obtain the text to be recognized locally, or remotely from other electronic devices connected to the execution body (such as the terminal devices 101, 102, 103 shown in FIG. 1).
  • the text to be recognized may be composed of characters of the same language, or may be composed of characters of more than one language, which is not specifically limited in the present disclosure.
  • the text to be recognized may be text in various situations, which is not specifically limited in the present disclosure.
  • the text to be recognized may be any of the following: a part of the news text, some chapters of the novel text, and the like.
  • the text to be recognized may be relatively long text, for example, the text to be recognized may include at least 400 sentences.
  • the above execution subject can use various implementation methods to split the text to be recognized to obtain subtext sequences.
  • the above-mentioned execution body may split the text to be recognized into a first preset number (for example, 20) of subtexts, where the number of sentences in each subtext may be within a preset number range (for example, greater than or equal to 20 and less than or equal to 25).
  • the subtext sequence can be obtained.
  • each subtext in the subtext sequence is split to obtain a corresponding sentence sequence.
  • the sentence sequence can be obtained by splitting according to the punctuation marks in the subtext.
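  • For illustration, the following is a minimal Python sketch of such a split: sentences are cut at common Chinese and English sentence-final punctuation and grouped into consecutive subtexts. The punctuation set, grouping rule, and function name are assumptions for illustration and are not prescribed by the disclosure.

```python
import re

def split_text(text, num_subtexts=20):
    # Split the text into sentences after sentence-ending punctuation marks.
    parts = re.split(r"(?<=[。！？.!?])", text)
    sentences = [s.strip() for s in parts if s.strip()]
    # Group the sentences into roughly equal runs of consecutive sentences.
    size = max(1, -(-len(sentences) // num_subtexts))  # ceiling division
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]
```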
  • Step 202 for each sentence in the sentence sequence corresponding to each subtext, perform feature extraction according to a pre-trained feature extraction model to obtain a sentence feature vector corresponding to the sentence.
  • For each sentence in the sentence sequence corresponding to each subtext, the execution subject can perform feature extraction according to the pre-trained feature extraction model to obtain the sentence feature vector corresponding to the sentence.
  • the feature extraction model is used to represent the correspondence between the sentence and the feature vector corresponding to the sentence.
  • the feature extraction model may include a word vector feature extraction model and a sentence vector feature extraction model.
  • Step 202 can be performed as follows: for each sentence in the sentence sequence corresponding to each subtext, first, perform feature extraction on each word segment in the word segmentation sequence corresponding to the sentence according to the word vector feature extraction model to obtain the corresponding word vector; then, combine the word vectors corresponding to the word segments in the word segmentation sequence corresponding to the sentence to form the sentence feature matrix corresponding to the sentence; finally, perform feature extraction on the sentence feature matrix corresponding to the sentence according to the sentence vector feature extraction model to obtain the sentence feature vector corresponding to the sentence.
  • various currently known or future word segmentation processing methods can be used to perform word segmentation processing on the sentence to obtain the word segmentation sequence corresponding to the sentence, which will not be repeated here.
  • the word vector feature extraction model is used to represent the correspondence between words and word vectors corresponding to the words, that is, the word vector feature extraction model is used to map words to word vectors.
  • the word vector feature extraction model can be a bag-of-words model (BOW, Bag of Words).
  • the word vector feature extraction model can include at least one of the following: a long short-term memory (LSTM, Long Short-Term Memory) network, or a Transformer model (for example, a BERT model or an ALBERT model).
  • The word vector corresponding to each word segment can be a V-dimensional vector, where V is a positive integer. Assuming the word segmentation sequence corresponding to the sentence includes W word segments, the word vectors corresponding to the word segments in the word segmentation sequence can be combined to obtain a W*V matrix, where each row is the word vector of one word segment.
  • The obtained matrix can then be expanded to U rows, where U is greater than or equal to W, by padding: the elements of each row beyond the W-th row are set to 0. In this way, the sentence feature matrix corresponding to each sentence is a U*V matrix.
  • the sentence vector feature extraction model is used to represent the correspondence between the sentence feature matrix and the sentence feature vector corresponding to the sentence, that is, the sentence vector feature extraction model is used to map the sentence feature matrix to the sentence feature vector.
  • the sentence vector feature extraction model may include at least one of the following: convolutional neural network (CNN, Convolutional Neural Networks), bidirectional long-term short-term memory network (BiLSTM, Bi-directional Long Short-Term Memory).
  • In this way, the sentence feature vector corresponding to the sentence can be extracted. Since a word vector is first extracted for each word segment in the sentence, the word vectors are combined according to the positions of the word segments to obtain the sentence feature matrix, and feature extraction is then performed on the sentence feature matrix, the extracted sentence feature vector can represent not only the word information in the sentence but also the context between words in the sentence, that is, semantic information, which is more conducive to the text category recognition in the subsequent process.
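  • As a concrete illustration of the pipeline above, the following minimal NumPy sketch builds the U*V sentence feature matrix; `embed` stands in for the word vector feature extraction model and is a hypothetical callable mapping a word segment to a V-dimensional vector. Padding rows are all zeros, as described above.

```python
import numpy as np

def sentence_feature_matrix(word_segments, embed, U, V):
    rows = [embed(w) for w in word_segments[:U]]  # at most U word vectors
    pad = [np.zeros(V)] * (U - len(rows))         # zero padding rows up to U
    return np.vstack(rows + pad)                  # U x V sentence feature matrix
```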
  • Step 203 for each subtext in the subtext sequence, perform a first calculation operation.
  • the execution subject may perform the first calculation operation for each subtext in the subtext sequence obtained in step 201 .
  • the first computing operation may include sub-step 2031 and sub-step 2032:
  • Sub-step 2031 for each sentence in the subtext, based on the sentence feature vector corresponding to each sentence in the sentence sequence corresponding to the subtext, calculate the attention feature vector of the sentence relative to the subtext.
  • Assume the sentence feature vector corresponding to each sentence is an M-dimensional vector, and that the number of sentences in each subtext is at most S (for example, 32) sentences.
  • The sentence feature vectors of the sentences in the sentence sequence corresponding to the subtext can then form a matrix F of size S*M, and the matrix F can be regarded as the subtext feature matrix corresponding to the subtext.
  • calculating the attention feature vector of each sentence in the subtext relative to the subtext can be expressed as follows:
  • The sentence feature vector F_i corresponding to the i-th sentence is an M-dimensional vector, which can also be considered a 1*M matrix F_i.
  • Assume the attention feature vector of the i-th sentence relative to the subtext is the matrix B_i; then the matrix product of B_i and F_i should be the matrix F, which can be expressed as: B_i × F_i = F, where B_i is of size S*1, F_i is of size 1*M, and F is of size S*M.
  • The matrix B_i is an S*1 matrix, where the element B_{i,j,1} in the j-th row and first column of B_i represents the degree of relevance, importance or attention between the i-th sentence and the j-th sentence in the sentence sequence corresponding to the subtext, j being a positive integer between 1 and S.
  • In this way, the attention feature vector B_i of the i-th sentence in the sentence sequence corresponding to the subtext relative to the subtext can be calculated from the known matrices F and F_i; since B_i is an S*1 matrix, B_i can also be considered the attention feature vector of the i-th sentence relative to the subtext.
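  • One way to realize this calculation is a least-squares solve of B_i × F_i = F for B_i; the closed form below is an assumption for illustration, since the disclosure states the relation but not the solving method.

```python
import numpy as np

def sentence_attention(F, i):
    # F: S x M subtext feature matrix; F_i: 1 x M feature vector of sentence i.
    # Least-squares estimate of B_i in B_i @ F_i = F:
    #     B_i = F @ F_i.T / (F_i @ F_i.T)
    F_i = F[i:i + 1, :]
    denom = (F_i @ F_i.T).item() or 1.0  # scalar; guard against an all-zero row
    return (F @ F_i.T) / denom           # S x 1 attention feature vector B_i
```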
  • Sub-step 2032 based on the attention feature vector of each sentence relative to the subtext, calculate the attention feature vector of the subtext relative to the text to be recognized.
  • At this point, the attention feature vector of each sentence in the sentence sequence corresponding to the subtext relative to the subtext has been obtained. Continuing the above assumption, the attention feature vector B_i of the i-th sentence relative to the subtext is an S*1 matrix whose element B_{i,j,1} in the j-th row and first column represents the degree of relevance, importance or attention between the i-th sentence and the j-th sentence in the sentence sequence corresponding to the subtext.
  • Combining the attention feature vectors B_i of the sentences relative to the subtext yields an attention representation matrix B of size S*S for the subtext, where the element B_{i,j} of the attention representation matrix represents the importance, relevance, or attention between the i-th sentence and the j-th sentence in the subtext.
  • Thus, the attention representation matrix corresponding to each subtext is a matrix of size S*S.
  • Assume the subtext sequence includes P subtexts, and the attention representation matrix corresponding to the p-th subtext is C_p, where C_p is an S*S matrix; combining the attention representation matrices C_p corresponding to the subtexts yields a three-dimensional matrix C of size P*S*S.
  • Assume the attention feature vector of the p-th subtext relative to the text to be recognized is the matrix E_p.
  • Then the product of E_p and the attention representation matrix C_p of the p-th subtext should be the tensor C, which can be expressed as: E_p ⊗ C_p = C, where E_p is of size P*1, C_p is of size S*S, and C is of size P*S*S.
  • The matrix E_p is a P*1 matrix, where the element E_{p,q,1} in the q-th row and first column of E_p represents the degree of relevance, importance or attention between the p-th subtext and the q-th subtext in the text to be recognized.
  • In this way, the attention feature vector E_p of the p-th subtext in the subtext sequence corresponding to the text to be recognized relative to the text to be recognized can be calculated from the known tensor C and matrix C_p; since E_p is a P*1 matrix, E_p can also be considered the attention feature vector of the p-th subtext relative to the text to be recognized.
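  • By analogy with the sentence level, E_p can be estimated in the least-squares sense from the known tensor C and matrix C_p, and step 204's splicing then concatenates the E_p of all P subtexts. This sketch is an assumed realization; the exact formula is elided in the source text.

```python
import numpy as np

def subtext_attention(C, p):
    C_flat = C.reshape(C.shape[0], -1)  # P x (S*S), one flattened C_p per row
    c_p = C_flat[p]                     # flattened S x S matrix C_p
    denom = (c_p @ c_p).item() or 1.0
    return (C_flat @ c_p) / denom       # length-P attention vector E_p

def text_feature_vector(C):
    # Step 204: splice the E_p of all P subtexts into the P*P feature vector E.
    return np.concatenate([subtext_attention(C, p) for p in range(C.shape[0])])
```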
  • Step 204 splicing the attention feature vectors of the subtexts in the subtext sequence relative to the text to be recognized to obtain the text feature vector to be recognized corresponding to the text to be recognized.
  • The execution subject can splice the attention feature vectors of the subtexts in the subtext sequence relative to the text to be recognized to obtain the text feature vector corresponding to the text to be recognized.
  • Splicing the attention feature vectors E_p of the P subtexts in the subtext sequence relative to the text to be recognized yields the text feature vector E to be recognized, whose dimension is P*P.
  • Step 205 input the feature vector of the text to be recognized into the pre-trained classification model to obtain the probability value that the text to be recognized belongs to the preset category of text.
  • the execution subject can input the feature vector of the text to be recognized whose dimension is P*P calculated in step 204 into the pre-trained classification model to obtain the probability value that the text to be recognized belongs to the preset category of text.
  • the classification model is used to characterize the corresponding relationship between the text feature vector and the probability value that the text belongs to the preset category text.
  • The feature extraction model and the classification model may be pre-trained through the training step 300 as shown in Figure 3, and the training step 300 may include the following steps 301 to 304:
  • Step 301 determining an initial feature extraction model and an initial classification model.
  • the subject of the training step may be the same as or different from the subject of the text category recognition method. If they are the same, the execution subject of the training step can store the model structure information and the parameter values of the model parameters of the trained feature extraction model and classification model locally after the training obtains the feature extraction model and classification model. If different, the execution subject of the training step can send the model structure information and the parameter values of the model parameters of the trained feature extraction model and classification model to the execution subject of the text category recognition method after training the feature extraction model and the classification model.
  • the initial feature extraction model and the initial classification model may include various types of calculation models
  • the model structure information that needs to be determined is different for different types of calculation models.
  • each model parameter of the initial feature extraction model and the initial classification model can be initialized with some different small random numbers. "Small random number” is used to ensure that the model will not enter a saturated state due to excessive weight, which will cause training failure, and "different” is used to ensure that the model can learn normally.
  • the initial classification model may be a Softmax classifier.
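  • As an illustration, a Softmax classifier over the P*P text feature vector can be as small as a single linear layer; the sketch below is one assumed form of the initial classification model, with weights initialized using different small random numbers as noted above. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def init_classifier(dim, num_classes=2, scale=0.01):
    W = scale * rng.standard_normal((dim, num_classes))  # small random weights
    b = np.zeros(num_classes)
    return W, b

def classify(E, W, b):
    logits = E @ W + b
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    probs = exp / exp.sum()
    return probs[1]                      # P(text belongs to the preset category)
```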
  • Step 302 acquiring a training sample set.
  • the training samples in the training sample set include sample text and a sample label used to represent whether the sample text belongs to a preset category of text.
  • sample labels can be obtained by manual annotation.
  • Step 303 for the training samples in the training sample set, perform parameter adjustment operations until the preset training end condition is satisfied.
  • parameter adjustment operations may include:
  • Step 3031 splitting the sample text in the training sample to obtain a sample subtext sequence, and splitting each subtext in the sample subtext sequence to obtain a corresponding sentence sequence.
  • the same or similar method in step 201 may be followed.
  • Step 3032 for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence, perform feature extraction according to the initial feature extraction model to obtain a sentence feature vector corresponding to the sentence.
  • Step 3033 for each sample subtext in the sample subtext sequence, perform a second calculation operation to obtain the attention feature vector of the sample subtext relative to the sample text.
  • the second computing operation includes the following first to fourth steps:
  • the first step is to calculate the attention feature vector of the sentence relative to the sample subtext based on the sentence feature vector corresponding to each sentence in the sentence sequence corresponding to the sample subtext.
  • the second step is to calculate the attention feature vector of the sample subtext relative to the sample text based on the attention feature vector of each sentence relative to the sample subtext.
  • The specific operations of the first step and the second step are basically the same as the operations of sub-step 2031 and sub-step 2032, respectively, and will not be repeated here.
  • The third step is to splice the attention feature vectors of the sample subtexts in the sample subtext sequence relative to the sample text to obtain the sample text feature vector corresponding to the sample text.
  • The specific operation of the third step is basically the same as the operation of step 204 and will not be repeated here.
  • the fourth step is to input the obtained sample text feature vector into the initial classification model to obtain the probability value that the sample text belongs to the preset category text.
  • Step 3034 based on the difference between the obtained probability value and the sample label in the training sample, adjust the model parameters of the initial feature extraction model and the initial classification model.
  • various implementation manners may be adopted to adjust model parameters of the initial feature extraction model and the initial classification model based on the difference between the obtained probability value and the sample label in the training sample.
  • For example, stochastic gradient descent (SGD, Stochastic Gradient Descent), Newton's method, quasi-Newton methods, conjugate gradient, heuristic optimization methods, and other currently known or future-developed optimization algorithms may be used.
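  • For concreteness, the following sketch shows one SGD parameter adjustment step restricted to the Softmax classifier sketched earlier, using the cross-entropy between the predicted distribution and the sample label (0 or 1); training the feature extraction model end to end would backpropagate the same loss through the whole network.

```python
import numpy as np

def sgd_step(E, label, W, b, lr=0.01):
    logits = E @ W + b
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    grad_logits = probs.copy()
    grad_logits[label] -= 1.0           # d(cross-entropy) / d(logits)
    W -= lr * np.outer(E, grad_logits)  # gradient with respect to the weights
    b -= lr * grad_logits               # gradient with respect to the bias
    return W, b
```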
  • Step 304 determining the trained initial feature extraction model and initial classification model as the pre-trained feature extraction model and classification model.
  • the initial feature extraction model may include a word vector feature extraction model and a sentence vector feature extraction model.
  • step 3032 for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence, perform feature extraction according to the initial feature extraction model to obtain the sentence feature vector corresponding to the sentence, which can be performed as follows:
  • before combining the word vectors corresponding to the word segments in the word segmentation sequence corresponding to the sentence to form the sentence feature matrix corresponding to the sentence, the execution subject of the training step can also: for each word segment in the word segmentation sequence, in response to determining that the word segment matches a keyword in the preset text category keyword set, set the word vector corresponding to the word segment to the preset word vector.
  • the preset word vector may be a word vector in which all vector components are 0. In this way, by specifying the word vectors corresponding to the words that match the keywords in the preset text category keyword set, the recognition ability of the feature extraction model and classification model for the text of the preset text category can be improved.
  • The preset text category keyword set can be dynamically learned from a large amount of corpus using machine learning or data mining algorithms, or can be manually formulated by technicians according to the needs and experience of specific application scenarios; the preset text category keyword set can also include both dynamically learned keywords and manually specified keywords.
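  • A minimal sketch of this keyword handling, assuming the preset word vector is the all-zeros vector as suggested above; the helper name and data shapes are illustrative.

```python
import numpy as np

def apply_keyword_vectors(word_segments, word_vectors, keyword_set, V):
    # Replace the word vector of any segment matching the keyword set with
    # the preset word vector before building the sentence feature matrix.
    preset = np.zeros(V)
    return [preset if seg in keyword_set else vec
            for seg, vec in zip(word_segments, word_vectors)]
```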
  • the above execution subject may also perform the following step 206 after step 205:
  • Step 206 determine whether the probability value is greater than a preset probability threshold.
  • If the probability value is determined to be greater than the preset probability threshold, proceed to step 207.
  • Step 207 generating first recognition result information for indicating that the text to be recognized is a preset text category.
  • The above-mentioned first recognition result information can be used to determine that the text to be recognized belongs to the preset text category.
  • The execution subject may also proceed to step 208 if it is determined in step 206 that the probability value is not greater than the preset probability threshold.
  • Step 208 generating second recognition result information for indicating that the text to be recognized is not a preset text category.
  • the above execution subject may also perform the following step 209 at other time points after step 2031, for example, before step 204, or before step 205, or after step 205:
  • Step 209 for each sentence in the sentence sequence corresponding to each subtext in the subtext sequence, based on the attention feature vector of the sentence relative to the subtext, calculate the probability value that the sentence belongs to the preset text category, determine the presentation manner corresponding to the sentence according to the calculated probability value, and present the sentence according to the determined presentation manner.
  • Assume the attention feature vector of a certain sentence relative to the subtext to which it belongs is a matrix B_i, where B_i is an S*1 matrix whose element B_{i,j,1} in the j-th row and first column represents the relevance, importance or attention between the i-th sentence and the j-th sentence in the sentence sequence corresponding to the subtext, j being a positive integer between 1 and S.
  • The probability value that the sentence belongs to the preset text category can be calculated, for example, as the sum of the elements of B_i, or as the sum of the squares of the elements of B_i.
  • The corresponding relationship between different probability value ranges and corresponding presentation manners can be set in advance; when the calculated probability value falls within a probability value range, the presentation manner corresponding to that range is determined as the presentation manner corresponding to the sentence. For example, when the probability value is greater than 0.8, the sentence is presented in a red font; when the probability value is greater than 0.5 and less than 0.8, it is presented in a pink font.
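  • The following sketch illustrates step 209's probability and presentation mapping using the sum-of-squares option mentioned above; the thresholds and font styles mirror the example in the text and are otherwise arbitrary. The subtext-level mapping of step 210 below is analogous, with E_p in place of B_i.

```python
import numpy as np

def sentence_presentation(B_i):
    prob = float(np.sum(np.square(B_i)))  # sum of squared elements of B_i
    if prob > 0.8:
        style = "red font"
    elif prob > 0.5:
        style = "pink font"
    else:
        style = "default font"
    return prob, style
```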
  • the above execution subject may also perform the following step 210 at other time points after step 2032, for example, before step 204, or before step 205, or after step 205:
  • Step 210 for each subtext in the subtext sequence, based on the attention feature vector of the subtext relative to the text to be recognized, calculate the probability value that the subtext belongs to the preset text category, determine the presentation manner corresponding to the subtext according to the calculated probability value, and present the subtext according to the determined presentation manner.
  • Assume the attention feature vector of a certain subtext relative to the text to be recognized is a P*1 matrix E_p, where the element E_{p,q,1} in the q-th row and first column represents the degree of relevance, importance or attention between the p-th subtext and the q-th subtext in the text to be recognized.
  • The probability value that the subtext belongs to the preset text category can be calculated, for example, as the sum of the elements of E_p, or as the sum of the squares of the elements of E_p.
  • The presentation manner corresponding to the corresponding probability value range is determined as the presentation manner corresponding to the subtext. For example, when the probability value is greater than 0.9, the subtext is presented in a bold font; when the probability value is greater than 0.6 and less than 0.9, it is presented in a normal font.
  • FIG. 4 is a schematic diagram of an application scenario of the text category recognition method according to this embodiment.
  • the server 41 acquires the text 43 to be recognized from the terminal device 42 .
  • the server 41 splits the text to be recognized 43 to obtain a subtext sequence 44, and splits the subtexts 441, 442 and 443 in the subtext sequence 44 to obtain corresponding sentence sequences 451, 452 and 453.
  • the sentence sequence 451 includes Sentence 45101 to sentence 45120
  • sentence sequence 452 includes sentence 45201 to sentence 45222
  • sentence sequence 453 includes sentence 45301 to sentence 45325.
  • For each sentence in sentence 45101 to sentence 45120, sentence 45201 to sentence 45222, and sentence 45301 to sentence 45325, the server 41 performs feature extraction according to the pre-trained feature extraction model to obtain the corresponding sentence feature vectors 46101 to 46120, 46201 to 46222, and 46301 to 46325. Then, the server 41 performs the first calculation operation on the subtexts 441, 442 and 443 in the subtext sequence 44, respectively obtaining attention feature vectors 471, 472 and 473 of the subtexts 441, 442 and 443 relative to the text to be recognized 43.
  • The server 41 concatenates the attention feature vectors 471, 472 and 473 to obtain the text feature vector 48 corresponding to the text to be recognized. Finally, the server 41 inputs the text feature vector 48 into the pre-trained classification model 49 to obtain the probability value 50 that the text to be recognized belongs to the preset category of text.
  • The text category recognition method provided by the above-mentioned embodiments of the present disclosure establishes, through the attention feature vector of each sentence relative to its subtext and the attention feature vector of each subtext relative to the text to be recognized, a hierarchical attention relationship among the sentences, the subtexts and the text to be recognized, and then generates the feature vector of the text to be recognized to calculate the probability value that the text belongs to the preset text category, realizing automatic classification of the text to be recognized and reducing the labor cost of text classification.
  • The present disclosure provides an embodiment of a text category recognition device, which corresponds to the method embodiment shown in FIG. 2, and the device can specifically be applied to various electronic devices.
  • the text category recognition device 500 of this embodiment includes: a splitting unit 501 , a feature extraction unit 502 , a calculation unit 503 , a splicing unit 504 and a recognition unit 505 .
  • the splitting unit 501 is configured to split the text to be recognized to obtain a subtext sequence, and split each subtext in the subtext sequence to obtain a corresponding sentence sequence;
  • the feature extraction unit 502 is configured to perform feature extraction for each sentence in the sentence sequence corresponding to each subtext according to a pre-trained feature extraction model to obtain a sentence feature vector corresponding to the sentence;
  • the calculation unit 503 is configured to perform the following first calculation operation for each subtext in the subtext sequence: for each sentence in the subtext, based on the sentence feature vector corresponding to each sentence in the sentence sequence corresponding to the subtext, calculate the attention feature vector of the sentence relative to the subtext; and calculate the attention feature vector of the subtext relative to the text to be recognized based on the attention feature vector of each sentence relative to the subtext; the splicing unit 504 is configured to splice the attention feature vectors of the subtexts in the subtext sequence relative to the text to be recognized to obtain the text feature vector corresponding to the text to be recognized; and the recognition unit 505 is configured to input the feature vector of the text to be recognized into a pre-trained classification model to obtain the probability value that the text to be recognized belongs to the preset category of text.
  • For the specific processing of these units, reference may be made to step 201, step 202, step 203, step 204 and step 205 in the foregoing method embodiment, which will not be repeated here.
  • the feature extraction model and the classification model can be obtained through pre-training as follows:
  • acquiring a training sample set, wherein the training samples include sample text and a sample label for representing whether the sample text belongs to a preset category of text;
  • splitting the sample text in the training sample to obtain a sample subtext sequence, and splitting each subtext in the sample subtext sequence to obtain a corresponding sentence sequence; for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence, performing feature extraction according to the initial feature extraction model to obtain the sentence feature vector corresponding to the sentence; for each sample subtext in the sample subtext sequence, performing a second calculation operation to obtain the attention feature vector of the sample subtext relative to the sample text: calculating, based on the sentence feature vector corresponding to each sentence in the sentence sequence corresponding to the sample subtext, the attention feature vector of each sentence relative to the sample subtext, and calculating, based on the attention feature vectors of the sentences relative to the sample subtext, the attention feature vector of the sample subtext relative to the sample text; and splicing the attention feature vectors of the sample subtexts in the sample subtext sequence relative to the sample text to obtain the sample text feature vector corresponding to the sample text.
  • the feature extraction model may include a word vector feature extraction model and a sentence vector feature extraction model
  • the feature extraction unit 502 may be further configured to:
  • the word vectors corresponding to the word segments in the word segmentation sequence are combined to form the sentence feature matrix corresponding to the sentence, and feature extraction is performed on the sentence feature matrix corresponding to the sentence according to the sentence vector feature extraction model to obtain the sentence feature vector corresponding to the sentence.
  • the word vector feature extraction model may include at least one of the following: a long short-term memory network, and a Transformer model.
  • the sentence vector feature extraction model may include at least one of the following: a convolutional neural network, and a bidirectional long-short-term memory network.
  • for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence, performing feature extraction according to the initial feature extraction model to obtain the sentence feature vector corresponding to the sentence may include:
  • each word segment in the word segmentation sequence corresponding to the sentence is subjected to feature extraction according to the word vector feature extraction model to obtain a corresponding word vector;
  • before combining the word vectors corresponding to the word segments in the word segmentation sequence corresponding to the sentence to form the sentence feature matrix corresponding to the sentence, the training step may also include:
  • for each word segment in the word segmentation sequence, in response to determining that the word segment matches a keyword in the preset text category keyword set, setting the word vector corresponding to the word segment to the preset word vector.
  • the device 500 may also include:
  • a determining unit 506, configured to determine whether the probability value is greater than a preset probability threshold
  • the first generating unit 507 is configured to generate, in response to determining that the probability value is greater than the preset probability threshold, first recognition result information for indicating that the text to be recognized belongs to the preset text category.
  • the device 500 may also include:
  • the second generation unit 508 is configured to, in response to determining that the probability value is not greater than the preset probability threshold, generate second recognition result information for indicating that the text to be recognized does not belong to the preset text category.
  • the device 500 may also include:
  • the first presentation unit 509 is configured to, for each sentence in the sentence sequence corresponding to each subtext in the subtext sequence, calculate the probability value that the sentence belongs to the preset text category based on the attention feature vector of the sentence relative to the subtext, determine the presentation manner corresponding to the sentence according to the calculated probability value, and present the sentence according to the determined presentation manner.
  • the device 500 may also include:
  • the second presentation unit 510 is configured to, for each subtext in the subtext sequence, calculate the probability value that the subtext belongs to the preset text category based on the attention feature vector of the subtext relative to the text to be recognized, determine the presentation manner corresponding to the subtext according to the calculated probability value, and present the subtext according to the determined presentation manner.
  • Referring to FIG. 6, it shows a schematic structural diagram of a computer system 600 suitable for implementing the electronic device of the present disclosure.
  • the computer system 600 shown in FIG. 6 is only an example, and should not limit the functions and scope of use of the embodiments of the present disclosure.
  • The computer system 600 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603.
  • In the RAM 603, various programs and data necessary for the operation of the computer system 600 are also stored.
  • the processing device 601, ROM 602, and RAM 603 are connected to each other through a bus 604.
  • An input/output (I/O) interface 605 is also connected to the bus 604 .
  • the following devices can be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, etc.; output devices 607, including, for example, a liquid crystal display (LCD), speaker, vibrator, etc. ; a storage device 608 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 609 .
  • The communication device 609 may allow the computer system 600 to communicate with other devices wirelessly or by wire to exchange data. While FIG. 6 shows a computer system 600 having various devices, it should be understood that it is not required to implement or possess all of the illustrated devices; more or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, where the computer program includes program codes for executing the methods shown in the flowcharts.
  • the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602.
  • When the computer program is executed by the processing device 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the above-mentioned computer-readable medium in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
  • a computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transport a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted by any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs. When the above one or more programs are executed by the electronic device, the electronic device is caused to implement the text category recognition method shown in the embodiment of FIG. 2 and its optional implementations.
  • computer program code for carrying out the operations of the present disclosure can be written in one or more programming languages, or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or may be implemented by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments described in the present disclosure may be implemented by software or by hardware. The name of a unit does not, under certain circumstances, constitute a limitation on the unit itself; for example, the acquisition unit may also be described as "a unit that acquires the text to be recognized".

Abstract

Provided in the present disclosure are a text category recognition method and apparatus, and an electronic device and a storage medium. The method comprises: splitting text to be recognized, so as to obtain a sub-text sequence, and splitting each piece of sub-text in the sub-text sequence, so as to obtain a corresponding sentence sequence; for each sentence in the sentence sequence corresponding to each piece of sub-text, performing feature extraction according to a pre-trained feature extraction model, so as to obtain a sentence feature vector corresponding to the sentence; for each piece of sub-text in the sub-text sequence, executing a first calculation operation to calculate an attention feature vector of the sub-text with respect to the text to be recognized; splicing the attention feature vectors of all the pieces of sub-text in the sub-text sequence with respect to the text to be recognized, so as to obtain a feature vector of the text to be recognized that corresponds to the text to be recognized; and inputting the feature vector of the text to be recognized into a pre-trained classification model, so as to obtain a probability value of the text to be recognized being a preset category of text. Therefore, text to be recognized is automatically classified, thereby reducing the labor costs of text classification.

Description

Text category recognition method, device, electronic device and storage medium
Cross Reference to Related Applications
This application is based on the Chinese patent application with application number 202110849917.9, filed on July 27, 2021 and entitled "Text category recognition method, device, electronic device and storage medium", and claims priority to that Chinese patent application, the entire content of which is hereby incorporated into this application by reference.
Technical Field
Embodiments of the present disclosure relate to the technical field of information processing, and specifically to a text category recognition method, device, electronic device, and storage medium.
Background
Text category recognition refers to, for a piece of text, indicating whether the text belongs to a preset category, or giving a probability value that the text belongs to the preset category. For example, an e-commerce platform needs to recognize product introduction texts uploaded by merchants to determine whether the texts meet requirements and whether they contain inappropriate expressions. As another example, a literary works platform needs to recognize the text content of novels uploaded by users to determine whether the novel text contains vulgar or indecent content.
Summary
Embodiments of the present disclosure provide a text category recognition method, device, electronic device, and storage medium.
In a first aspect, an embodiment of the present disclosure provides a text category recognition method, the method including:
splitting the text to be recognized to obtain a subtext sequence, and splitting each subtext in the subtext sequence to obtain a corresponding sentence sequence;
for each sentence in the sentence sequence corresponding to each subtext, performing feature extraction according to a pre-trained feature extraction model to obtain a sentence feature vector corresponding to the sentence;
for each subtext in the subtext sequence, performing the following first calculation operation: for each sentence in the subtext, calculating the attention feature vector of the sentence relative to the subtext based on the sentence feature vector corresponding to each sentence in the sentence sequence corresponding to the subtext; and calculating the attention feature vector of the subtext relative to the text to be recognized based on the attention feature vector of each sentence relative to the subtext;
splicing the attention feature vectors of the subtexts in the subtext sequence relative to the text to be recognized to obtain a to-be-recognized text feature vector corresponding to the text to be recognized;
inputting the to-be-recognized text feature vector into a pre-trained classification model to obtain a probability value that the text to be recognized belongs to a preset category of text.
In some optional implementations, the feature extraction model and the classification model are pre-trained through the following training steps:
determining an initial feature extraction model and an initial classification model;
obtaining a training sample set, where each training sample includes a sample text and a sample label for representing whether the sample text belongs to the preset category of text;
for the training samples in the training sample set, performing the following parameter adjustment operations until a preset training end condition is met: splitting the sample text in the training sample to obtain a sample subtext sequence, and splitting each subtext in the sample subtext sequence to obtain a corresponding sentence sequence; for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence, performing feature extraction according to the initial feature extraction model to obtain a sentence feature vector corresponding to the sentence; for each sample subtext in the sample subtext sequence, performing a second calculation operation to obtain the attention feature vector of the sample subtext relative to the sample text: calculating the attention feature vector of each sentence relative to the sample subtext based on the sentence feature vector corresponding to each sentence in the sentence sequence corresponding to the sample subtext, and calculating the attention feature vector of the sample subtext relative to the sample text based on the attention feature vector of each sentence relative to the sample subtext; splicing the attention feature vectors of the sample subtexts in the sample subtext sequence relative to the sample text to obtain a sample text feature vector corresponding to the sample text; inputting the obtained sample text feature vector into the initial classification model to obtain a probability value that the sample text belongs to the preset category of text; and adjusting the model parameters of the initial feature extraction model and the initial classification model based on the difference between the obtained probability value and the sample label in the training sample;
determining the trained initial feature extraction model and the trained initial classification model as the pre-trained feature extraction model and classification model.
In some optional implementations, the feature extraction model includes a word vector feature extraction model and a sentence vector feature extraction model; and
the performing feature extraction for each sentence in the sentence sequence corresponding to each subtext according to the pre-trained feature extraction model to obtain the sentence feature vector corresponding to the sentence includes:
for each sentence in the sentence sequence corresponding to each subtext, performing feature extraction on each word segment in the word segmentation sequence corresponding to the sentence according to the word vector feature extraction model to obtain a corresponding word vector, combining the word vectors corresponding to the word segments in the word segmentation sequence corresponding to the sentence to form a sentence feature matrix corresponding to the sentence, and performing feature extraction on the sentence feature matrix corresponding to the sentence according to the sentence vector feature extraction model to obtain the sentence feature vector corresponding to the sentence.
In some optional implementations, the word vector feature extraction model includes at least one of the following: a long short-term memory network, or a Transformer model.
In some optional implementations, the sentence vector feature extraction model includes at least one of the following: a convolutional neural network, or a bidirectional long short-term memory network.
In some optional implementations, the performing feature extraction for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence according to the initial feature extraction model to obtain the sentence feature vector corresponding to the sentence includes:
for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence, performing feature extraction on each word segment in the word segmentation sequence corresponding to the sentence according to the word vector feature extraction model to obtain a corresponding word vector, combining the word vectors corresponding to the word segments in the word segmentation sequence corresponding to the sentence to form a sentence feature matrix corresponding to the sentence, and performing feature extraction on the sentence feature matrix corresponding to the sentence according to the sentence vector feature extraction model to obtain the sentence feature vector corresponding to the sentence.
In some optional implementations, before the combining the word vectors corresponding to the word segments in the word segmentation sequence corresponding to the sentence to form the sentence feature matrix corresponding to the sentence, the training steps further include:
for each word segment in the word segmentation sequence corresponding to the sentence, in response to determining that the word segment matches a keyword in a preset text category keyword set, setting the word vector corresponding to the word segment to a preset word vector.
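A minimal sketch of this keyword override, assuming exact string matching and NumPy-style word vectors; the function and parameter names here are illustrative placeholders, not an API defined by the disclosure:

```python
import numpy as np

def apply_keyword_override(word_segments, word_vectors, keyword_set, preset_vector):
    # Replace the word vector of any word segment that matches a keyword
    # in the preset text category keyword set with the preset word vector.
    return [preset_vector if segment in keyword_set else vector
            for segment, vector in zip(word_segments, word_vectors)]

# Usage sketch: segments whose text appears in the keyword set get the
# shared preset vector, all others keep their extracted vectors.
segments = ["buy", "now", "discount"]
vectors = [np.random.rand(128) for _ in segments]
vectors = apply_keyword_override(segments, vectors, {"discount"}, np.ones(128))
```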
In some optional implementations, the method further includes:
determining whether the probability value is greater than a preset probability threshold; and
in response to determining that the probability value is greater than the preset probability threshold, generating first recognition result information for indicating that the text to be recognized belongs to the preset text category.
In some optional implementations, the method further includes:
in response to determining that the probability value is not greater than the preset probability threshold, generating second recognition result information for indicating that the text to be recognized does not belong to the preset text category.
In some optional implementations, the method further includes:
for each sentence in the sentence sequence corresponding to each subtext in the subtext sequence, calculating the probability value that the sentence belongs to the preset text category based on the attention feature vector of the sentence relative to the subtext, determining the presentation manner corresponding to the sentence according to the calculated probability value, and presenting the sentence according to the determined presentation manner.
In some optional implementations, the method further includes:
for each subtext in the subtext sequence, calculating the probability value that the subtext belongs to the preset text category based on the attention feature vector of the subtext relative to the text to be recognized, determining the presentation manner corresponding to the subtext according to the calculated probability value, and presenting the subtext according to the determined presentation manner.
In a second aspect, an embodiment of the present disclosure provides a text category recognition device, the device including:
a splitting unit configured to split the text to be recognized to obtain a subtext sequence, and to split each subtext in the subtext sequence to obtain a corresponding sentence sequence;
a feature extraction unit configured to, for each sentence in the sentence sequence corresponding to each subtext, perform feature extraction according to a pre-trained feature extraction model to obtain a sentence feature vector corresponding to the sentence;
a calculation unit configured to, for each subtext in the subtext sequence, perform the following first calculation operation: for each sentence in the subtext, calculating the attention feature vector of the sentence relative to the subtext based on the sentence feature vector corresponding to each sentence in the sentence sequence corresponding to the subtext; and calculating the attention feature vector of the subtext relative to the text to be recognized based on the attention feature vector of each sentence relative to the subtext;
a splicing unit configured to splice the attention feature vectors of the subtexts in the subtext sequence relative to the text to be recognized to obtain a to-be-recognized text feature vector corresponding to the text to be recognized;
a recognition unit configured to input the to-be-recognized text feature vector into a pre-trained classification model to obtain a probability value that the text to be recognized belongs to a preset category of text.
In some optional implementations, the feature extraction model and the classification model are pre-trained as follows:
determining an initial feature extraction model and an initial classification model;
obtaining a training sample set, where each training sample includes a sample text and a sample label for representing whether the sample text belongs to the preset category of text;
for the training samples in the training sample set, performing the following parameter adjustment operations until a preset training end condition is met: splitting the sample text in the training sample to obtain a sample subtext sequence, and splitting each subtext in the sample subtext sequence to obtain a corresponding sentence sequence; for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence, performing feature extraction according to the initial feature extraction model to obtain a sentence feature vector corresponding to the sentence; for each sample subtext in the sample subtext sequence, performing a second calculation operation to obtain the attention feature vector of the sample subtext relative to the sample text: calculating the attention feature vector of each sentence relative to the sample subtext based on the sentence feature vector corresponding to each sentence in the sentence sequence corresponding to the sample subtext, and calculating the attention feature vector of the sample subtext relative to the sample text based on the attention feature vector of each sentence relative to the sample subtext; splicing the attention feature vectors of the sample subtexts in the sample subtext sequence relative to the sample text to obtain a sample text feature vector corresponding to the sample text; inputting the obtained sample text feature vector into the initial classification model to obtain a probability value that the sample text belongs to the preset category of text; and adjusting the model parameters of the initial feature extraction model and the initial classification model based on the difference between the obtained probability value and the sample label in the training sample;
determining the trained initial feature extraction model and the trained initial classification model as the pre-trained feature extraction model and classification model.
In some optional implementations, the feature extraction model includes a word vector feature extraction model and a sentence vector feature extraction model; and
the feature extraction unit is further configured to:
for each sentence in the sentence sequence corresponding to each subtext, perform feature extraction on each word segment in the word segmentation sequence corresponding to the sentence according to the word vector feature extraction model to obtain a corresponding word vector, combine the word vectors corresponding to the word segments in the word segmentation sequence corresponding to the sentence to form a sentence feature matrix corresponding to the sentence, and perform feature extraction on the sentence feature matrix corresponding to the sentence according to the sentence vector feature extraction model to obtain the sentence feature vector corresponding to the sentence.
In some optional implementations, the word vector feature extraction model includes at least one of the following: a long short-term memory network, or a Transformer model.
In some optional implementations, the sentence vector feature extraction model includes at least one of the following: a convolutional neural network, or a bidirectional long short-term memory network.
In some optional implementations, the performing feature extraction for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence according to the initial feature extraction model to obtain the sentence feature vector corresponding to the sentence includes:
for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence, performing feature extraction on each word segment in the word segmentation sequence corresponding to the sentence according to the word vector feature extraction model to obtain a corresponding word vector, combining the word vectors corresponding to the word segments in the word segmentation sequence corresponding to the sentence to form a sentence feature matrix corresponding to the sentence, and performing feature extraction on the sentence feature matrix corresponding to the sentence according to the sentence vector feature extraction model to obtain the sentence feature vector corresponding to the sentence.
In some optional implementations, before the combining the word vectors corresponding to the word segments in the word segmentation sequence corresponding to the sentence to form the sentence feature matrix corresponding to the sentence, the training steps further include:
for each word segment in the word segmentation sequence corresponding to the sentence, in response to determining that the word segment matches a keyword in the preset text category keyword set, setting the word vector corresponding to the word segment to the preset word vector.
In some optional implementations, the device further includes:
a determining unit configured to determine whether the probability value is greater than a preset probability threshold;
a first generating unit configured to, in response to determining that the probability value is greater than the preset probability threshold, generate first recognition result information for indicating that the text to be recognized belongs to the preset text category.
In some optional implementations, the device further includes:
a second generating unit configured to, in response to determining that the probability value is not greater than the preset probability threshold, generate second recognition result information for indicating that the text to be recognized does not belong to the preset text category.
In some optional implementations, the device further includes:
a first presentation unit configured to, for each sentence in the sentence sequence corresponding to each subtext in the subtext sequence, calculate the probability value that the sentence belongs to the preset text category based on the attention feature vector of the sentence relative to the subtext, determine the presentation manner corresponding to the sentence according to the calculated probability value, and present the sentence according to the determined presentation manner.
In some optional implementations, the device further includes:
a second presentation unit configured to, for each subtext in the subtext sequence, calculate the probability value that the subtext belongs to the preset text category based on the attention feature vector of the subtext relative to the text to be recognized, determine the presentation manner corresponding to the subtext according to the calculated probability value, and present the subtext according to the determined presentation manner.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; and a storage device on which one or more programs are stored, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by one or more processors, implements the method described in any implementation of the first aspect.
At present, when performing category recognition on long texts (for example, texts longer than 5,000 characters), such as indicating whether a piece of long text involves a specific category, the following approaches are mostly used: 1. manual labeling; 2. keyword-based filtering; 3. splitting the long text into short sentences or paragraphs and then manually labeling the short sentences or paragraphs; 4. using machine learning models to model the long text directly, though this is limited to simple models such as the bag-of-words model, and if a deep semantic model is to be used, the long text must be truncated. Among these: 1. manual labeling has high labor costs; 2. keyword filtering may cause false positives and misses, and is inefficient; 3. after a long text is split into short texts, the volume of text grows by tens or hundreds of times, which likewise incurs a large amount of manual effort; 4. directly modeling a long text with a bag-of-words model relies only on statistics of word frequencies in the long text, cannot indicate which specific content of the long text is more likely to involve the specific category, and thus cannot meet richer business needs, while using a deep semantic model requires truncation, so the range of text that can be covered is small, which may likewise cause misses.
To improve the accuracy of classifying long texts, reduce labor costs, and reduce misses, the text category recognition method, device, electronic device, and storage medium provided by the embodiments of the present disclosure split the text to be recognized into subtexts and split the subtexts into sentences, then generate a sentence feature vector for each sentence, generate the attention feature vector of each sentence relative to the subtext it belongs to and the attention feature vector of each subtext relative to the text to be recognized, and splice the attention feature vectors of the subtexts relative to the text to be recognized to obtain the to-be-recognized text feature vector. Finally, the to-be-recognized text feature vector is input into a pre-trained classification model to obtain the probability value that the text to be recognized belongs to the preset category of text. That is, a hierarchical attention relationship among sentences, subtexts, and the text to be recognized is established through the attention feature vectors of sentences relative to subtexts and of subtexts relative to the text to be recognized, and the resulting text feature vector is used to compute the probability of belonging to the preset text category, which enables automatic classification of the text to be recognized and reduces the labor cost of text classification. Optionally, the attention feature vector of a sentence relative to its subtext can also be used to calculate the probability value that the sentence belongs to the preset text category, determine the presentation manner corresponding to the sentence according to the calculated probability value, and present the sentence accordingly, so that sentences with different probability values of belonging to the preset text category are presented in corresponding manners for reference during manual labeling, reducing the possibility of misses. Alternatively, the attention feature vector of a subtext relative to the text to be recognized can be used to calculate the probability value that the subtext belongs to the preset text category, determine the presentation manner corresponding to the subtext according to the calculated probability value, and present the subtext accordingly, so that subtexts with different probability values of belonging to the preset text category are presented in corresponding manners for reference during manual labeling, reducing the possibility of misses.
Brief Description of the Drawings
Other features, objects, and advantages of the present disclosure will become more apparent by reading the detailed description of non-limiting embodiments made with reference to the following drawings. The drawings are only for the purpose of illustrating specific embodiments and are not to be considered as limiting the invention. In the drawings:
FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present disclosure can be applied;
FIG. 2 is a flowchart of an embodiment of a text category recognition method according to the present disclosure;
FIG. 3 is a schematic diagram of an application scenario of a text category recognition method according to the present disclosure;
FIG. 4 is a flowchart of another embodiment of a text category recognition method according to the present disclosure;
FIG. 5 is a schematic structural diagram of an embodiment of a text category recognition device according to the present disclosure;
FIG. 6 is a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present disclosure.
Detailed Description
The present disclosure will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the related invention, rather than to limit the invention. It should also be noted that, for ease of description, only the parts related to the invention are shown in the drawings.
It should be noted that, in the case of no conflict, the embodiments in the present disclosure and the features in the embodiments can be combined with each other. The present disclosure will be described in detail below with reference to the accompanying drawings and embodiments.
FIG. 1 shows an exemplary system architecture 100 to which embodiments of the text category recognition method, device, electronic device, and storage medium of the present disclosure can be applied.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
Users can use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like. Various communication client applications can be installed on the terminal devices 101, 102, 103, such as text category recognition applications, speech recognition applications, short video social applications, audio and video conference applications, live video applications, document editing applications, input method applications, web browser applications, shopping applications, search applications, instant messaging tools, email clients, and social platform software.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they can be various electronic devices with display screens, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above, and can be implemented as multiple software or software modules (for example, to provide text category recognition services) or as a single software or software module. No specific limitation is made here.
In some cases, the text category recognition method provided in the present disclosure may be executed by the terminal devices 101, 102, 103, and correspondingly, the text category recognition device may be set in the terminal devices 101, 102, 103. In this case, the system architecture 100 may not include the server 105.
In some cases, the text category recognition method provided by the present disclosure may be jointly executed by the terminal devices 101, 102, 103 and the server 105. For example, the step of "obtaining the text to be recognized" may be executed by the terminal devices 101, 102, 103, and steps such as "for each sentence in the sentence sequence corresponding to each subtext, performing feature extraction according to a pre-trained feature extraction model to obtain a sentence feature vector corresponding to the sentence" may be executed by the server 105. The present disclosure does not limit this. Correspondingly, the text category recognition device may also be respectively set in the terminal devices 101, 102, 103 and the server 105.
In some cases, the text category recognition method provided by the present disclosure may be executed by the server 105, and correspondingly, the text category recognition device may also be set in the server 105. In this case, the system architecture 100 may not include the terminal devices 101, 102, 103.
It should be noted that the server 105 may be hardware or software. When the server 105 is hardware, it can be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server 105 is software, it can be implemented as multiple software or software modules (for example, to provide distributed services), or as a single software or software module. No specific limitation is made here.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are only illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.
Continuing to refer to FIG. 2, it shows a flow 200 of an embodiment of a text category recognition method according to the present disclosure. The text category recognition method includes the following steps:
Step 201: splitting the text to be recognized to obtain a subtext sequence, and splitting each subtext in the subtext sequence to obtain a corresponding sentence sequence.
In this embodiment, the execution subject of the text category recognition method (for example, the server 105 shown in FIG. 1) may first obtain the text to be recognized, locally or remotely, from another electronic device connected to the execution subject over a network (for example, the terminal devices 101, 102, 103 shown in FIG. 1).
Here, the text to be recognized may be composed of characters of a single language, or of characters of more than one language, which is not specifically limited in the present disclosure.
The text to be recognized may be text from various situations, which is not specifically limited in the present disclosure.
In some optional implementations, the text to be recognized may be any of the following: a part of a news article, some chapters of a novel, and the like.
The text to be recognized may be a relatively long text; for example, the text to be recognized may include at least 400 sentences.
Then, the execution subject may split the text to be recognized into a subtext sequence in various implementations.
In some optional implementations, the execution subject may split the text to be recognized into a first preset number (for example, 20) of subtexts, where the number of sentences in each subtext may be a random number within a preset range (for example, greater than or equal to 20 and less than or equal to 25). When splitting, two adjacent subtexts may overlap, so that continuous semantic information between subtexts can be preserved in the subsequent process.
Arranging the split subtexts according to their positions in the text to be recognized yields the subtext sequence.
Finally, each subtext in the subtext sequence is split to obtain a corresponding sentence sequence. In practice, for example, the sentence sequence can be obtained by splitting according to the punctuation marks within the subtext.
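A minimal sketch of this splitting step, assuming punctuation-based sentence boundaries and a one-sentence overlap between adjacent subtexts (the disclosure does not fix the exact overlap); all names are illustrative:

```python
import random
import re

def split_into_sentences(text):
    # Split on common end-of-sentence punctuation (Chinese and English).
    return [s for s in re.split(r"(?<=[。！？.!?])", text) if s.strip()]

def split_into_subtexts(sentences, min_size=20, max_size=25, overlap=1):
    # Each subtext holds a random number of sentences in [min_size, max_size];
    # adjacent subtexts share `overlap` sentences to keep semantic continuity.
    subtexts, start = [], 0
    while start < len(sentences):
        size = random.randint(min_size, max_size)
        subtexts.append(sentences[start:start + size])
        if start + size >= len(sentences):
            break
        start += size - overlap
    return subtexts
```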
Step 202: for each sentence in the sentence sequence corresponding to each subtext, performing feature extraction according to a pre-trained feature extraction model to obtain a sentence feature vector corresponding to the sentence.
In this embodiment, for each subtext in the subtext sequence obtained in step 201, the execution subject may perform feature extraction on each sentence in the sentence sequence corresponding to the subtext according to the pre-trained feature extraction model to obtain the sentence feature vector corresponding to the sentence. The feature extraction model is used to represent the correspondence between sentences and the feature vectors corresponding to the sentences.
In some optional implementations, the feature extraction model may include a word vector feature extraction model and a sentence vector feature extraction model. Based on this, step 202 may be performed as follows: for each sentence in the sentence sequence corresponding to each subtext, first, perform feature extraction on each word segment in the word segmentation sequence corresponding to the sentence according to the word vector feature extraction model to obtain a corresponding word vector; then combine the word vectors corresponding to the word segments in the word segmentation sequence corresponding to the sentence to form a sentence feature matrix corresponding to the sentence; finally, perform feature extraction on the sentence feature matrix corresponding to the sentence according to the sentence vector feature extraction model to obtain the sentence feature vector corresponding to the sentence.
Here, various currently known or future-developed word segmentation methods can be used to segment the sentence to obtain the word segmentation sequence corresponding to the sentence, which will not be repeated here.
The word vector feature extraction model is used to represent the correspondence between words and their word vectors, that is, the word vector feature extraction model maps words to word vectors. As an example, the word vector feature extraction model may be a bag-of-words (BOW) model. Optionally, the word vector feature extraction model may include at least one of the following: a long short-term memory (LSTM) network, or a Transformer model (for example, a BERT model or an ALBERT model).
The word vectors corresponding to the word segments in the word segmentation sequence of the sentence are combined in order of the positions of the word segments in the word segmentation sequence to form the sentence feature matrix corresponding to the sentence. For example, the word vector corresponding to each word segment may be a V-dimensional vector, where V is a positive integer, and the word segmentation sequence corresponding to the sentence may include W word segments; combining the word vectors corresponding to the word segments then yields a W*V matrix, in which each row corresponds to the word vector of one word segment. However, to ensure that the sentence feature matrices of all sentences are matrices of the same size, the obtained matrix can be expanded to U rows, with U greater than or equal to W; the rows beyond W can be supplemented by padding, for example, by setting all matrix elements of those rows to 0. In this way, the sentence feature matrix corresponding to each sentence is a U*V matrix.
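A sketch of assembling the padded U*V sentence feature matrix described above, assuming NumPy; `word_vectors` stands in for the output of the word vector feature extraction model and is an assumption, not a named component of the disclosure:

```python
import numpy as np

def build_sentence_matrix(word_vectors, U):
    # word_vectors: list of W vectors, each of dimension V, with W <= U.
    # Rows beyond W are zero-padded so every sentence yields a U x V matrix.
    V = word_vectors[0].shape[0]
    matrix = np.zeros((U, V), dtype=np.float32)
    for row, vector in enumerate(word_vectors):
        matrix[row] = vector
    return matrix
```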
The sentence vector feature extraction model is used to represent the correspondence between sentence feature matrices and the sentence feature vectors corresponding to sentences, that is, the sentence vector feature extraction model maps a sentence feature matrix to a sentence feature vector. Optionally, the sentence vector feature extraction model may include at least one of the following: a convolutional neural network (CNN), or a bidirectional long short-term memory (BiLSTM) network.
By using the word vector feature extraction model and the sentence vector feature extraction model, the sentence feature vector corresponding to the sentence can be extracted. Since word vectors are first extracted for the word segments of the sentence, then combined according to the positions of the word segments to obtain the sentence feature matrix, and feature extraction is then performed on the sentence feature matrix to obtain the sentence feature vector, the extracted sentence feature vector can represent both the word information in the sentence and the context between words in the sentence, that is, semantic information, which is more conducive to the text category recognition in the subsequent process.
Step 203: for each subtext in the subtext sequence, performing a first calculation operation.
Here, the execution subject may perform the first calculation operation for each subtext in the subtext sequence obtained in step 201.
Here, the first calculation operation may include sub-step 2031 and sub-step 2032:
Sub-step 2031: for each sentence in the subtext, calculating the attention feature vector of the sentence relative to the subtext based on the sentence feature vector corresponding to each sentence in the sentence sequence corresponding to the subtext.
Here, suppose that after step 202 the sentence feature vector corresponding to each sentence is an $M$-dimensional vector, and suppose each subtext contains at most $S$ (for example, 32) sentences. The sentence feature vectors of the sentences in the sentence sequence corresponding to the subtext can then form a matrix $F$ of size $S \times M$, which can be regarded as the subtext feature matrix corresponding to the subtext. The attention feature vector of each sentence in the subtext relative to the subtext can be computed as follows:
Suppose that for the $i$-th sentence in the sentence sequence corresponding to the subtext, where $i$ is a positive integer between 1 and $S$, the corresponding sentence feature vector $F_i$ is an $M$-dimensional vector, which can also be regarded as a $1 \times M$ matrix. Let the attention feature vector of the $i$-th sentence relative to the subtext be the matrix $B_i$; then the Cartesian product of the matrix $F_i$ and the matrix $B_i$ should be the matrix $F$, which can be expressed by the following formula:
$$F_i \times B_i = F \qquad \text{Formula (1)}$$
It can be seen from the above formula that $B_i$ is an $S \times 1$ matrix, where the element $B_{i,j,1}$ in the $j$-th row and first column of $B_i$ represents the degree of relevance, importance, or attention between the $i$-th sentence and the $j$-th sentence in the sentence sequence corresponding to the subtext, with $j$ a positive integer between 1 and $S$.
When specifically computing the above matrix $B_i$, the attention feature vector matrix $B_i$ of the $i$-th sentence in the sentence sequence corresponding to the subtext relative to the subtext can be computed from the known matrices $F$ and $F_i$. Since $B_i$ is an $S \times 1$ matrix, $B_i$ can also be regarded as the attention feature vector of the $i$-th sentence in the sentence sequence corresponding to the subtext relative to the subtext.
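The disclosure leaves the solver for Formula (1) open. One hedged reading, sketched below as an assumption rather than the method itself, treats $B_i$ as the least-squares solution of the outer-product relation $B_i F_i \approx F$, which gives $B_i = F F_i^{\top} / (F_i F_i^{\top})$; each entry is then the dot product of sentence $j$'s feature vector with sentence $i$'s, normalized:

```python
import numpy as np

def sentence_attention(F, i):
    # F: S x M matrix whose rows are the sentence feature vectors of one subtext.
    # Returns B_i as an S x 1 matrix: the least-squares solution of the
    # outer-product relation B_i @ F_i ≈ F, so B_i[j] reflects how related
    # sentence j is to sentence i within the subtext.
    F_i = F[i]                      # M-dimensional feature vector of sentence i
    B_i = F @ F_i / (F_i @ F_i)     # S-vector of normalized dot products
    return B_i.reshape(-1, 1)
```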
Sub-step 2032: based on the attention feature vector of each sentence relative to the subtext, calculate the attention feature vector of the subtext relative to the text to be recognized.
Here, after sub-step 2031, the attention feature vector of each sentence in the sentence sequence corresponding to the subtext relative to the subtext has been obtained. Continuing with the above assumptions, the attention feature vector B_i of the i-th sentence in the sentence sequence relative to the subtext is an S×1 matrix, in which the element B_{i,j,1} in the j-th row and first column represents the degree of relevance, importance, or attention between the i-th sentence and the j-th sentence in the sentence sequence corresponding to the subtext. Assuming the subtext contains S sentences, the vectors B_i can be combined according to the positions of their sentences in the sentence sequence to obtain an attention representation matrix B of size S×S for the subtext, in which the element B_{i,j} represents the degree of importance, relevance, or attention between the i-th sentence and the j-th sentence of the subtext.
The attention feature vector of the subtext relative to the text to be recognized can then be computed as follows:
Assume that the subtext sequence corresponding to the text to be recognized contains P subtexts, where P is a positive integer, and that the attention representation matrix of each subtext is of size S×S. For the p-th subtext in the subtext sequence, let its attention representation matrix be C_p, an S×S matrix. Combining the matrices C_p according to the positions of the subtexts in the subtext sequence yields a three-dimensional matrix C of size P×S×S. Let the attention feature vector of the p-th subtext relative to the text to be recognized be the matrix E_p; then the Cartesian product of C_p and E_p should equal the matrix C, which can be expressed as:
C p×E p=C       公式(2) C p ×E p =C formula (2)
As can be seen from the above formula, E_p is a P×1 matrix, in which the element E_{p,q,1} in the q-th row and first column represents the degree of relevance, importance, or attention between the p-th subtext and the q-th subtext of the text to be recognized.
When actually computing the matrix E_p, the attention feature vector of the p-th subtext in the subtext sequence relative to the text to be recognized can be obtained from the known matrices C and C_p. Since E_p is a P×1 matrix, E_p itself can be regarded as the attention feature vector of the p-th subtext relative to the text to be recognized.
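Formula (2) likewise leaves the solving procedure open. Under the same outer-product reading, extended over the stacking axis of the three-dimensional matrix C, each entry of E_p can be taken as the projection of one S×S slice of C onto the slice C_p; this is an assumption for illustration, not a procedure fixed by the disclosure.

```python
import numpy as np

def subtext_attention(C: np.ndarray, p: int) -> np.ndarray:
    """Solve C_p x E_p = C for E_p (Formula 2), analogously to Formula (1):
    entry q of E_p is the projection of the q-th S x S slice of C onto C_p."""
    P = C.shape[0]
    flat = C.reshape(P, -1)                     # P x (S*S), one row per subtext
    c = C[p].reshape(-1)                        # flattened slice C_p
    E_p = (flat @ c) / (c @ c + 1e-12)          # shape (P,)
    return E_p.reshape(P, 1)                    # P x 1 attention feature vector
```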
Step 204: concatenate the attention feature vectors of the subtexts in the subtext sequence relative to the text to be recognized, obtaining the to-be-recognized text feature vector corresponding to the text to be recognized.
In this embodiment, the above execution subject may, for example, concatenate the attention feature vectors of the subtexts relative to the text to be recognized according to the positions of the subtexts in the subtext sequence, obtaining the to-be-recognized text feature vector.
Here, continuing the above example, concatenating the attention feature vectors E_p of the P subtexts in the subtext sequence relative to the text to be recognized yields the to-be-recognized text feature vector E, whose dimension is P*P.
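A short sketch of the concatenation in step 204, reusing subtext_attention from the sketch above; the helper name text_feature_vector is an illustrative assumption.

```python
import numpy as np

def text_feature_vector(C: np.ndarray) -> np.ndarray:
    """Concatenate E_1 .. E_P in subtext order into one P*P vector (step 204)."""
    P = C.shape[0]
    return np.concatenate([subtext_attention(C, p) for p in range(P)]).ravel()
```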
Step 205: input the to-be-recognized text feature vector into a pre-trained classification model, obtaining the probability value that the text to be recognized belongs to the preset category of text.
In this embodiment, the above execution subject may input the P*P-dimensional to-be-recognized text feature vector computed in step 204 into the pre-trained classification model to obtain the probability value that the text to be recognized belongs to the preset category of text. Here, the classification model characterizes the correspondence between text feature vectors and the probability values that texts belong to the preset category of text.
In some optional implementations, the feature extraction model and the classification model may be obtained by pre-training through the training step 300 shown in Fig. 3, which may include the following steps 301 to 304:
Step 301: determine an initial feature extraction model and an initial classification model.
Here, the execution subject of the training step may be the same as or different from the execution subject of the text category recognition method. If they are the same, the execution subject of the training step may, after obtaining the feature extraction model and classification model by training, store the model structure information and the parameter values of the model parameters of the trained models locally. If they are different, the execution subject of the training step may, after training, send the model structure information and the parameter values of the model parameters of the trained feature extraction model and classification model to the execution subject of the text category recognition method.
Here, since the initial feature extraction model and the initial classification model may include various types of computation models, the model structure information to be determined differs for different types of computation models.
The model parameters of the initial feature extraction model and the initial classification model can then be initialized. In practice, each model parameter of the two models may be initialized with distinct small random numbers. "Small" ensures that the model does not enter saturation because of oversized weights, which would cause training to fail; "distinct" ensures that the model can learn normally.
Optionally, the initial classification model may be a Softmax classifier.
Step 302: obtain a set of training samples.
Here, each training sample in the set includes a sample text and a sample label indicating whether the sample text belongs to the preset category of text. In practice, sample labels may be obtained by manual annotation.
Step 303: for the training samples in the training sample set, perform a parameter adjustment operation until a preset training termination condition is met.
Here, the parameter adjustment operation may include:
Step 3031: split the sample text of the training sample to obtain a sample subtext sequence, and split each subtext in the sample subtext sequence to obtain a corresponding sentence sequence. In practice, this may follow the same or a similar method as step 201.
Step 3032: for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence, perform feature extraction with the initial feature extraction model to obtain the sentence feature vector corresponding to that sentence.
Step 3033: for each sample subtext in the sample subtext sequence, perform a second calculation operation to obtain the attention feature vector of the sample subtext relative to the sample text. The second calculation operation includes the following first to fourth steps:
First, based on the sentence feature vector of each sentence in the sentence sequence corresponding to the sample subtext, calculate the attention feature vector of that sentence relative to the sample subtext.
Second, based on the attention feature vector of each sentence relative to the sample subtext, calculate the attention feature vector of the sample subtext relative to the sample text.
Here, the specific operations of the first and second steps are essentially the same as those of steps 2031 and 2032, respectively, and are not repeated.
Third, concatenate the attention feature vectors of the sample subtexts in the sample subtext sequence relative to the sample text to obtain the sample text feature vector corresponding to the sample text.
Here, the specific operation of the third step is essentially the same as that of step 204 and is not repeated.
Fourth, input the obtained sample text feature vector into the initial classification model to obtain the probability value that the sample text belongs to the preset category of text.
Step 3034: based on the difference between the obtained probability value and the sample label of the training sample, adjust the model parameters of the initial feature extraction model and the initial classification model.
Here, various implementations may be used to adjust the model parameters of the initial feature extraction model and the initial classification model based on the difference between the obtained probability value and the sample label, for example stochastic gradient descent (SGD), Newton's method, quasi-Newton methods, the conjugate gradient method, heuristic optimization methods, and other optimization algorithms known now or developed in the future.
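As an illustration of one parameter adjustment operation, the sketch below uses PyTorch with SGD, one of the optimizers named above. The module stand-ins `extractor` and `classifier`, the binary cross-entropy loss, and the learning rate are all assumptions; the disclosure prescribes no specific loss function or framework.

```python
import torch
import torch.nn.functional as F_loss

def parameter_adjustment(extractor, classifier, optimizer, sample_text, label):
    """One parameter adjustment operation (step 3034), sketched under the
    assumption that `extractor(sample_text)` produces the sample text
    feature vector of steps 3031-3033 and `classifier` maps it to a
    probability of the preset category."""
    prob = classifier(extractor(sample_text))          # P(sample is preset category)
    loss = F_loss.binary_cross_entropy(prob, label)    # difference vs. sample label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                   # adjust both models' parameters
    return loss.item()

# e.g. optimizer = torch.optim.SGD(
#     list(extractor.parameters()) + list(classifier.parameters()), lr=1e-3)
```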
Step 304: determine the trained initial feature extraction model and initial classification model as the pre-trained feature extraction model and classification model.
In some optional implementations, the initial feature extraction model may include a word vector feature extraction model and a sentence vector feature extraction model. Correspondingly, step 3032, in which feature extraction is performed with the initial feature extraction model for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence to obtain the sentence feature vector corresponding to that sentence, may be performed as follows:
For each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence: first, perform feature extraction on each segmented word in the word segmentation sequence corresponding to the sentence with the word vector feature extraction model to obtain the corresponding word vector; then combine the word vectors of the segmented words in the sentence's word segmentation sequence to form the sentence feature matrix corresponding to the sentence; and finally perform feature extraction on the sentence feature matrix with the sentence vector feature extraction model to obtain the sentence feature vector corresponding to the sentence. For details, refer to the description of the corresponding optional modes of the word vector and sentence vector feature extraction models in step 202, which is not repeated here.
Based on the above optional implementation, optionally, before combining the word vectors of the segmented words in the sentence's word segmentation sequence to form the sentence feature matrix, the execution subject of the training step may also, for each segmented word in the sentence's word segmentation sequence, set the word vector corresponding to that segmented word to a preset word vector in response to determining that the segmented word matches a keyword in a preset text category keyword set. As an example, the preset word vector may be a word vector whose vector components are all 0. In this way, by specially designating the word vectors of words that match keywords in the preset text category keyword set, the ability of the feature extraction model and the classification model to recognize text of the preset category can be improved.
Here, the preset text category keyword set may be learned dynamically from a large corpus with machine learning or data mining algorithms, may be specified manually by technicians according to the needs and experience of the specific application scenario, or may include both dynamically learned and manually specified keywords.
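A minimal sketch of the keyword-matching replacement described above; the helper name, the token-level membership test, and the vector dimensionality of 128 are assumptions, while the all-zero preset vector follows the example in the text.

```python
import numpy as np

def mask_keyword_vectors(tokens, vectors, keyword_set, dim=128):
    """For each segmented word, replace its word vector with the preset
    word vector (here, all zeros) when the word matches the preset text
    category keyword set; other word vectors pass through unchanged."""
    preset = np.zeros(dim)
    return [preset if tok in keyword_set else vec
            for tok, vec in zip(tokens, vectors)]
```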
With the training steps shown in Fig. 3, the feature extraction model and the classification model can be obtained by automatic training.
Through steps 201 to 205, the probability value that the text to be recognized belongs to the preset category of text can be obtained.
In some optional implementations, the above execution subject may further perform the following step 206 after step 205:
Step 206: determine whether the probability value is greater than a preset probability threshold.
If it is determined to be greater, proceed to step 207.
Step 207: generate first recognition result information indicating that the text to be recognized is of the preset text category.
In this way, the first recognition result information can be used to determine that the text to be recognized belongs to the preset text category.
In some optional implementations, if it is determined in step 206 that the probability value is not greater than the threshold, the above execution subject may proceed to step 208.
Step 208: generate second recognition result information indicating that the text to be recognized is not of the preset text category.
In this way, the second recognition result information can be used to determine that the text to be recognized does not belong to the preset text category.
In some optional implementations, the above execution subject may also perform the following step 209 at another point after step 2031, for example before step 204, before step 205, or after step 205:
Step 209: for each sentence in the sentence sequence corresponding to each subtext in the subtext sequence, calculate the probability value that the sentence belongs to the preset text category based on the attention feature vector of the sentence relative to the subtext, determine the presentation manner corresponding to the sentence according to the calculated probability value, and present the sentence in the determined presentation manner.
Continuing the example of step 2031, assume that the attention feature vector of a sentence relative to the subtext it belongs to is the S×1 matrix B_i, in which the element B_{i,j,1} in the j-th row and first column represents the degree of relevance, importance, or attention between the i-th sentence and the j-th sentence in the sentence sequence corresponding to the subtext, with j a positive integer between 1 and S. Then, based on the attention feature vector B_i of the sentence relative to the subtext, the probability value that the sentence belongs to the preset text category may be calculated, for example, as:
the sum of the elements of B_i, or the sum of the squares of the elements of B_i, taken as the probability value that the sentence belongs to the preset text category.
The presentation manner corresponding to the sentence is then determined from the calculated probability value. For example, correspondences between probability value ranges and presentation manners may be preset, and when the calculated probability value falls within a given range, the presentation manner of that range is determined as the presentation manner corresponding to the sentence. For example, when the probability value is greater than 0.8, the sentence is presented in a red font; when the probability value is greater than 0.5 and less than 0.8, it is presented in a pink font.
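A minimal sketch of step 209's scoring and presentation mapping, assuming the sum-of-elements variant and the red/pink thresholds from the example above; clipping the raw sum into [0, 1] is an added assumption, since a raw sum is not guaranteed to be a probability.

```python
import numpy as np

def sentence_presentation(B_i: np.ndarray) -> str:
    """Score a sentence from its attention feature vector B_i and map the
    score to a presentation manner (step 209)."""
    prob = float(np.clip(B_i.sum(), 0.0, 1.0))
    if prob > 0.8:
        return "red font"
    if prob > 0.5:
        return "pink font"
    return "default"
```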
In some optional implementations, the above execution subject may also perform the following step 210 at another point after step 2032, for example before step 204, before step 205, or after step 205:
Step 210: for each subtext in the subtext sequence, calculate the probability value that the subtext belongs to the preset text category based on the attention feature vector of the subtext relative to the text to be recognized, determine the presentation manner corresponding to the subtext according to the calculated probability value, and present the subtext in the determined presentation manner.
Continuing the example of step 2032, assume that the attention feature vector of a subtext relative to the text to be recognized is the P×1 matrix E_p, in which the element E_{p,q,1} in the q-th row and first column represents the degree of relevance, importance, or attention between the p-th subtext and the q-th subtext of the text to be recognized. Then, based on the attention feature vector E_p of the subtext relative to the text to be recognized, the probability value that the subtext belongs to the preset text category may be calculated, for example, as:
the sum of the elements of E_p, or the sum of the squares of the elements of E_p, taken as the probability value that the subtext belongs to the preset text category.
The presentation manner corresponding to the subtext is then determined from the calculated probability value, for example by presetting correspondences between probability value ranges and presentation manners and, when the calculated probability value falls within a given range, determining the presentation manner of that range as the presentation manner corresponding to the subtext. For example, when the probability value is greater than 0.9, the subtext is presented in a bold font; when the probability value is greater than 0.6 and less than 0.9, it is presented in a normal font.
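A parallel sketch for step 210, assuming the sum-of-squares variant and the bold/normal thresholds from the example above; the clipping is again an added assumption.

```python
import numpy as np

def subtext_presentation(E_p: np.ndarray) -> str:
    """Score a subtext from its attention feature vector E_p and map the
    score to a presentation manner (step 210)."""
    prob = float(np.clip((E_p ** 2).sum(), 0.0, 1.0))
    if prob > 0.9:
        return "bold font"
    if prob > 0.6:
        return "normal font"
    return "default"
```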
Continuing with Fig. 4, Fig. 4 is a schematic diagram of an application scenario of the text category recognition method according to this embodiment. In the application scenario of Fig. 4, the server 41 first obtains the text to be recognized 43 from the terminal device 42. The server 41 then splits the text to be recognized 43 into a subtext sequence 44, and splits the subtexts 441, 442, and 443 in the subtext sequence 44 into the corresponding sentence sequences 451, 452, and 453; the sentence sequence 451 includes sentences 45101 to 45120, the sentence sequence 452 includes sentences 45201 to 45222, and the sentence sequence 453 includes sentences 45301 to 45325. For each of the sentences 45101 to 45120, 45201 to 45222, and 45301 to 45325, the server 41 then performs feature extraction with the pre-trained feature extraction model to obtain the corresponding sentence feature vectors 46101 to 46120, 46201 to 46222, and 46301 to 46325. The server 41 then performs the first calculation operation for each of the subtexts 441, 442, and 443 in the subtext sequence 44, obtaining the attention feature vectors 471, 472, and 473 of 441, 442, and 443 relative to the text to be recognized 43.
Next, the server 41 concatenates the attention feature vectors 471, 472, and 473 to obtain the to-be-recognized text feature vector 48 corresponding to the text to be recognized. Finally, the to-be-recognized text feature vector 48 is input into the pre-trained classification model 49 to obtain the probability value 50 that the text to be recognized belongs to the preset category of text.
In the text category recognition method provided by the above embodiments of the present disclosure, hierarchical attention relationships among sentences, subtexts, and the text to be recognized are established through the attention feature vectors of sentences relative to subtexts and of subtexts relative to the text to be recognized, and the to-be-recognized text feature vector is then generated to calculate the probability value of belonging to the preset text category. This realizes automatic classification of the text to be recognized and reduces the labor cost of text classification.
Referring further to Fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of a text category recognition apparatus. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus may be applied to various electronic devices.
As shown in Fig. 5, the text category recognition apparatus 500 of this embodiment includes: a splitting unit 501, a feature extraction unit 502, a calculation unit 503, a concatenation unit 504, and a recognition unit 505. The splitting unit 501 is configured to split the text to be recognized into a subtext sequence and split each subtext in the subtext sequence into a corresponding sentence sequence. The feature extraction unit 502 is configured to perform feature extraction for each sentence in the sentence sequence corresponding to each subtext with a pre-trained feature extraction model to obtain the sentence feature vector corresponding to the sentence. The calculation unit 503 is configured to perform, for each subtext in the subtext sequence, the following first calculation operation: for each sentence in the subtext, calculate the attention feature vector of the sentence relative to the subtext based on the sentence feature vector of each sentence in the sentence sequence corresponding to the subtext; and, based on the attention feature vector of each sentence relative to the subtext, calculate the attention feature vector of the subtext relative to the text to be recognized. The concatenation unit 504 is configured to concatenate the attention feature vectors of the subtexts in the subtext sequence relative to the text to be recognized to obtain the to-be-recognized text feature vector corresponding to the text to be recognized. The recognition unit 505 is configured to input the to-be-recognized text feature vector into a pre-trained classification model to obtain the probability value that the text to be recognized belongs to the preset category of text.
In this embodiment, for the specific processing of the splitting unit 501, the feature extraction unit 502, the calculation unit 503, the concatenation unit 504, and the recognition unit 505 of the text category recognition apparatus 500 and the technical effects thereof, reference may be made to the descriptions of steps 201, 202, 203, 204, and 205 in the embodiment corresponding to Fig. 2, which are not repeated here.
In some optional implementations, the feature extraction model and the classification model may be obtained by pre-training as follows:
determining an initial feature extraction model and an initial classification model;
obtaining a set of training samples, where each training sample includes a sample text and a sample label indicating whether the sample text belongs to the preset category of text;
for the training samples in the set, performing the following parameter adjustment operation until a preset training termination condition is met: splitting the sample text of the training sample to obtain a sample subtext sequence, and splitting each subtext in the sample subtext sequence to obtain a corresponding sentence sequence; for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence, performing feature extraction with the initial feature extraction model to obtain the sentence feature vector corresponding to the sentence; for each sample subtext in the sample subtext sequence, performing a second calculation operation to obtain the attention feature vector of the sample subtext relative to the sample text, namely: based on the sentence feature vector of each sentence in the sentence sequence corresponding to the sample subtext, calculating the attention feature vector of the sentence relative to the sample subtext, and, based on the attention feature vector of each sentence relative to the sample subtext, calculating the attention feature vector of the sample subtext relative to the sample text; concatenating the attention feature vectors of the sample subtexts in the sample subtext sequence relative to the sample text to obtain the sample text feature vector corresponding to the sample text; inputting the obtained sample text feature vector into the initial classification model to obtain the probability value that the sample text belongs to the preset category of text; and, based on the difference between the obtained probability value and the sample label of the training sample, adjusting the model parameters of the initial feature extraction model and the initial classification model;
determining the trained initial feature extraction model and initial classification model as the pre-trained feature extraction model and classification model.
In some optional implementations, the feature extraction model may include a word vector feature extraction model and a sentence vector feature extraction model; and
the feature extraction unit 502 may be further configured to:
for each sentence in the sentence sequence corresponding to each subtext, perform feature extraction on each segmented word in the word segmentation sequence corresponding to the sentence with the word vector feature extraction model to obtain the corresponding word vector, combine the word vectors of the segmented words in the sentence's word segmentation sequence to form the sentence feature matrix corresponding to the sentence, and perform feature extraction on the sentence feature matrix with the sentence vector feature extraction model to obtain the sentence feature vector corresponding to the sentence.
In some optional implementations, the word vector feature extraction model may include at least one of the following: a long short-term memory network, a translation model.
In some optional implementations, the sentence vector feature extraction model may include at least one of the following: a convolutional neural network, a bidirectional long short-term memory network.
In some optional implementations, performing feature extraction with the initial feature extraction model for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence to obtain the sentence feature vector corresponding to the sentence may include:
for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence, performing feature extraction on each segmented word in the word segmentation sequence corresponding to the sentence with the word vector feature extraction model to obtain the corresponding word vector, combining the word vectors of the segmented words in the sentence's word segmentation sequence to form the sentence feature matrix corresponding to the sentence, and performing feature extraction on the sentence feature matrix with the sentence vector feature extraction model to obtain the sentence feature vector corresponding to the sentence.
In some optional implementations, before the combining of the word vectors of the segmented words in the sentence's word segmentation sequence to form the sentence feature matrix corresponding to the sentence, the training step may further include:
for each segmented word in the word segmentation sequence corresponding to the sentence, in response to determining that the segmented word matches a keyword in a preset text category keyword set, setting the word vector corresponding to the segmented word to a preset word vector.
In some optional implementations, the apparatus 500 may further include:
a determination unit 506, configured to determine whether the probability value is greater than a preset probability threshold;
a first generation unit 507, configured to generate, in response to determining that the probability value is greater, first recognition result information indicating that the text to be recognized is of the preset text category.
In some optional implementations, the apparatus 500 may further include:
a second generation unit 508, configured to generate, in response to determining that the probability value is not greater, second recognition result information indicating that the text to be recognized is not of the preset text category.
In some optional implementations, the apparatus 500 may further include:
a first presentation unit 509, configured to calculate, for each sentence in the sentence sequence corresponding to each subtext in the subtext sequence, the probability value that the sentence belongs to the preset text category based on the attention feature vector of the sentence relative to the subtext, determine the presentation manner corresponding to the sentence according to the calculated probability value, and present the sentence in the determined presentation manner.
In some optional implementations, the apparatus 500 may further include:
a second presentation unit 510, configured to calculate, for each subtext in the subtext sequence, the probability value that the subtext belongs to the preset text category based on the attention feature vector of the subtext relative to the text to be recognized, determine the presentation manner corresponding to the subtext according to the calculated probability value, and present the subtext in the determined presentation manner.
It should be noted that, for the implementation details and technical effects of the units in the text category recognition apparatus provided by the embodiments of the present disclosure, reference may be made to the descriptions of other embodiments in the present disclosure, which are not repeated here.
Referring now to Fig. 6, it shows a schematic structural diagram of a computer system 600 suitable for implementing an electronic device of the present disclosure. The computer system 600 shown in Fig. 6 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in Fig. 6, the computer system 600 may include a processing device (for example, a central processing unit, a graphics processing unit, etc.) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the computer system 600. The processing device 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, etc.; output devices 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 608 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 609. The communication device 609 may allow the computer system 600 to communicate wirelessly or by wire with other devices to exchange data. Although Fig. 6 shows the computer system 600 of an electronic device with various devices, it should be understood that it is not required to implement or possess all of the devices shown; more or fewer devices may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the methods shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 609, installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above functions defined in the methods of the embodiments of the present disclosure are executed.
It should be noted that the computer-readable medium of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on a computer-readable medium may be transmitted by any appropriate medium, including but not limited to: a wire, an optical cable, RF (radio frequency), etc., or any suitable combination of the above.
The above computer-readable medium may be included in the above electronic device, or may exist separately without being assembled into the electronic device.
The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the text category recognition method shown in the embodiment of Fig. 2 and its optional implementations.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architectures, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code containing one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments described in the present disclosure may be implemented by software or by hardware. The name of a unit does not in some cases constitute a limitation on the unit itself; for example, an acquisition unit may also be described as "a unit that acquires the text to be recognized".
The above description is only a preferred embodiment of the present disclosure and an explanation of the applied technical principles. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.

Claims (14)

  1. A text category recognition method, comprising:
    splitting a text to be recognized to obtain a subtext sequence, and splitting each subtext in the subtext sequence to obtain a corresponding sentence sequence;
    for each sentence in the sentence sequence corresponding to each subtext, performing feature extraction according to a pre-trained feature extraction model to obtain a sentence feature vector corresponding to the sentence;
    for each subtext in the subtext sequence, performing the following first calculation operation: for each sentence in the subtext, calculating, based on the sentence feature vector corresponding to each sentence in the sentence sequence corresponding to the subtext, an attention feature vector of the sentence relative to the subtext; and calculating, based on the attention feature vector of each sentence relative to the subtext, an attention feature vector of the subtext relative to the text to be recognized;
    concatenating the attention feature vectors of the subtexts in the subtext sequence relative to the text to be recognized to obtain a to-be-recognized text feature vector corresponding to the text to be recognized;
    inputting the to-be-recognized text feature vector into a pre-trained classification model to obtain a probability value that the text to be recognized belongs to a preset category of text.
  2. The method according to claim 1, wherein the feature extraction model and the classification model are pre-trained through the following training steps:
    determining an initial feature extraction model and an initial classification model;
    obtaining a set of training samples, wherein each training sample includes a sample text and a sample label for representing whether the sample text belongs to the preset category of text;
    for the training samples in the set of training samples, performing the following parameter adjustment operation until a preset training termination condition is met: splitting the sample text of the training sample to obtain a sample subtext sequence, and splitting each subtext in the sample subtext sequence to obtain a corresponding sentence sequence; for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence, performing feature extraction according to the initial feature extraction model to obtain a sentence feature vector corresponding to the sentence; for each sample subtext in the sample subtext sequence, performing a second calculation operation to obtain an attention feature vector of the sample subtext relative to the sample text: calculating, based on the sentence feature vector corresponding to each sentence in the sentence sequence corresponding to the sample subtext, an attention feature vector of the sentence relative to the sample subtext; and calculating, based on the attention feature vector of each sentence relative to the sample subtext, the attention feature vector of the sample subtext relative to the sample text; concatenating the attention feature vectors of the sample subtexts in the sample subtext sequence relative to the sample text to obtain a sample text feature vector corresponding to the sample text; inputting the obtained sample text feature vector into the initial classification model to obtain a probability value that the sample text belongs to the preset category of text; and adjusting, based on a difference between the obtained probability value and the sample label of the training sample, model parameters of the initial feature extraction model and the initial classification model;
    determining the trained initial feature extraction model and initial classification model as the pre-trained feature extraction model and classification model.
  3. The method according to claim 2, wherein the feature extraction model comprises a word vector feature extraction model and a sentence vector feature extraction model; and
    the performing, for each sentence in the sentence sequence corresponding to each subtext, feature extraction according to a pre-trained feature extraction model to obtain a sentence feature vector corresponding to the sentence comprises:
    for each sentence in the sentence sequence corresponding to each subtext, performing feature extraction on each segmented word in a word segmentation sequence corresponding to the sentence according to the word vector feature extraction model to obtain a corresponding word vector, combining the word vectors corresponding to the segmented words in the word segmentation sequence corresponding to the sentence to form a sentence feature matrix corresponding to the sentence, and performing feature extraction on the sentence feature matrix corresponding to the sentence according to the sentence vector feature extraction model to obtain the sentence feature vector corresponding to the sentence.
  4. The method according to claim 3, wherein the word vector feature extraction model comprises at least one of the following: a long short-term memory network, a translation model.
  5. The method according to claim 3, wherein the sentence vector feature extraction model comprises at least one of the following: a convolutional neural network, a bidirectional long short-term memory network.
  6. The method according to claim 3, wherein the performing, for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence, feature extraction according to the initial feature extraction model to obtain the sentence feature vector corresponding to the sentence comprises:
    for each sentence in the sentence sequence corresponding to each sample subtext in the sample subtext sequence, performing feature extraction on each segmented word in the word segmentation sequence corresponding to the sentence according to the word vector feature extraction model to obtain a corresponding word vector, combining the word vectors corresponding to the segmented words in the word segmentation sequence corresponding to the sentence to form a sentence feature matrix corresponding to the sentence, and performing feature extraction on the sentence feature matrix corresponding to the sentence according to the sentence vector feature extraction model to obtain the sentence feature vector corresponding to the sentence.
  7. The method according to claim 6, wherein, before the combining of the word vectors corresponding to the segmented words in the word segmentation sequence corresponding to the sentence to form the sentence feature matrix corresponding to the sentence, the training steps further comprise:
    for each segmented word in the word segmentation sequence corresponding to the sentence, in response to determining that the segmented word matches a keyword in a preset text category keyword set, setting the word vector corresponding to the segmented word to a preset word vector.
  8. The method according to claim 1, wherein the method further comprises:
    determining whether the probability value is greater than a preset probability threshold; and
    in response to determining that the probability value is greater than the preset probability threshold, generating first recognition result information indicating that the text to be recognized belongs to the preset text category.
  9. The method according to claim 8, wherein the method further comprises:
    in response to determining that the probability value is not greater than the preset probability threshold, generating second recognition result information indicating that the text to be recognized does not belong to the preset text category.
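Claims 8 and 9 reduce to a single threshold comparison on the classifier's output; a trivial sketch, with the threshold value and the shape of the result information assumed:

```python
def build_recognition_result(probability, threshold=0.5):
    """Return first or second recognition result information (assumed format)."""
    if probability > threshold:
        # First recognition result information: text is of the preset category.
        return {"is_preset_category": True, "probability": probability}
    # Second recognition result information: text is not of the preset category.
    return {"is_preset_category": False, "probability": probability}
```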
  10. The method according to claim 1, wherein the method further comprises:
    for each sentence in the sentence sequence corresponding to each subtext in the subtext sequence, calculating, based on the attention feature vector of the sentence relative to the subtext, a probability value that the sentence belongs to the preset text category; determining a presentation manner corresponding to the sentence according to the calculated probability value; and presenting the sentence according to the determined presentation manner.
  11. The method according to claim 1, wherein the method further comprises:
    for each subtext in the subtext sequence, calculating, based on the attention feature vector of the subtext relative to the text to be recognized, a probability value that the subtext belongs to the preset text category; determining a presentation manner corresponding to the subtext according to the calculated probability value; and presenting the subtext according to the determined presentation manner.
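Claims 10 and 11 reuse the intermediate attention feature vectors to score and render individual sentences or subtexts. A sketch of one plausible realization; the probability head, the probability bands, and the style names are all assumptions:

```python
import torch

def choose_presentation(prob):
    # Assumed mapping from probability bands to presentation manners.
    if prob >= 0.8:
        return "highlight-strong"
    if prob >= 0.5:
        return "highlight-weak"
    return "plain"

def present(attention_vectors, prob_head):
    """attention_vectors: (n, dim) attention feature vectors of sentences
    relative to their subtext (claim 10) or of subtexts relative to the
    text (claim 11); prob_head: assumed trained linear layer giving 1 logit."""
    with torch.no_grad():
        probs = torch.sigmoid(prob_head(attention_vectors)).squeeze(-1)
    return [(float(p), choose_presentation(float(p))) for p in probs]
```

Rendering high-probability sentences with a stronger highlight lets a reviewer see which passages drove the document-level decision.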
  12. A text category recognition apparatus, comprising:
    a splitting unit configured to split text to be recognized to obtain a subtext sequence, and to split each subtext in the subtext sequence to obtain a corresponding sentence sequence;
    a feature extraction unit configured to perform feature extraction on each sentence in the sentence sequence corresponding to each subtext according to a pre-trained feature extraction model to obtain a sentence feature vector corresponding to the sentence;
    a calculation unit configured to perform, for each subtext in the subtext sequence, the following first calculation operation: for each sentence in the subtext, calculating, based on the sentence feature vectors corresponding to the sentences in the sentence sequence corresponding to the subtext, an attention feature vector of the sentence relative to the subtext; and calculating, based on the attention feature vector of each sentence relative to the subtext, an attention feature vector of the subtext relative to the text to be recognized;
    a splicing unit configured to splice the attention feature vectors of the subtexts in the subtext sequence relative to the text to be recognized to obtain a to-be-recognized text feature vector corresponding to the text to be recognized; and
    a recognition unit configured to input the to-be-recognized text feature vector into a pre-trained classification model to obtain a probability value that the text to be recognized belongs to a preset category of text.
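The apparatus of claim 12 is, in effect, a hierarchical attention pipeline: sentence-level attention within each subtext, subtext-level aggregation, splicing of the subtext vectors, then classification. The sketch below uses learned-context-vector attention, a common parameterization assumed here for illustration; the claim does not fix a particular attention formula.

```python
# Illustrative PyTorch sketch of the calculation, splicing and recognition units.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    """Scores elements against a learned context vector and returns their
    attention-weighted sum (one common attention parameterization)."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.context = nn.Parameter(torch.randn(dim))

    def forward(self, x):                                  # x: (seq_len, dim)
        scores = torch.tanh(self.proj(x)) @ self.context   # (seq_len,)
        weights = F.softmax(scores, dim=0)
        return weights @ x                                 # (dim,)

class HierarchicalAttentionRecognizer(nn.Module):
    def __init__(self, dim, num_subtexts):
        super().__init__()
        self.sent_proj = nn.Linear(dim, dim)               # sentence-level attention
        self.sent_ctx = nn.Parameter(torch.randn(dim))
        self.sub_pool = AttentionPool(dim)                 # subtext-level pooling
        self.classifier = nn.Linear(dim * num_subtexts, 1) # classification model

    def forward(self, subtexts):
        # subtexts: list of (num_sentences, dim) stacks of sentence feature vectors.
        subtext_vectors = []
        for sent_vecs in subtexts:
            # Attention feature vector of each sentence relative to the subtext.
            scores = torch.tanh(self.sent_proj(sent_vecs)) @ self.sent_ctx
            weights = F.softmax(scores, dim=0).unsqueeze(-1)
            sent_attn_vecs = weights * sent_vecs           # (num_sentences, dim)
            # Attention feature vector of the subtext relative to the text.
            subtext_vectors.append(self.sub_pool(sent_attn_vecs))
        # Splice (concatenate) the subtext vectors into the text feature vector.
        text_vector = torch.cat(subtext_vectors, dim=0)    # (num_subtexts * dim,)
        return torch.sigmoid(self.classifier(text_vector)) # probability value

# Example: 3 subtexts with 4, 2 and 5 sentences, 256-dim sentence vectors.
model = HierarchicalAttentionRecognizer(dim=256, num_subtexts=3)
prob = model([torch.randn(4, 256), torch.randn(2, 256), torch.randn(5, 256)])
```

Fixing num_subtexts in the classifier mirrors the splicing unit, which concatenates one attention feature vector per subtext into a single to-be-recognized text feature vector.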
  13. An electronic device, comprising:
    one or more processors; and
    a storage device having one or more programs stored thereon,
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1-11.
  14. A computer-readable storage medium having a computer program stored thereon, wherein, when executed by one or more processors, the computer program implements the method according to any one of claims 1-11.
PCT/CN2022/108224 2021-07-27 2022-07-27 Text category recognition method and apparatus, and electronic device and storage medium WO2023005968A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110849917.9A CN113360660A (en) 2021-07-27 2021-07-27 Text type identification method and device, electronic equipment and storage medium
CN202110849917.9 2021-07-27

Publications (1)

Publication Number Publication Date
WO2023005968A1 (en)

Family

ID=77540362

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/108224 WO2023005968A1 (en) 2021-07-27 2022-07-27 Text category recognition method and apparatus, and electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN113360660A (en)
WO (1) WO2023005968A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113360660A (en) * 2021-07-27 2021-09-07 北京有竹居网络技术有限公司 Text type identification method and device, electronic equipment and storage medium
CN113836303A (en) * 2021-09-26 2021-12-24 平安科技(深圳)有限公司 Text type identification method and device, computer equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145112A (en) * 2018-08-06 2019-01-04 北京航空航天大学 A kind of comment on commodity classification method based on global information attention mechanism
CN110209806A (en) * 2018-06-05 2019-09-06 腾讯科技(深圳)有限公司 File classification method, document sorting apparatus and computer readable storage medium
US20200104369A1 (en) * 2018-09-27 2020-04-02 Apple Inc. Sentiment prediction from textual data
CN111143550A (en) * 2019-11-27 2020-05-12 浙江大学 Method for automatically identifying dispute focus based on hierarchical attention neural network model
CN111984791A (en) * 2020-09-02 2020-11-24 南京信息工程大学 Long text classification method based on attention mechanism
CN113360660A (en) * 2021-07-27 2021-09-07 北京有竹居网络技术有限公司 Text type identification method and device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108536654B (en) * 2018-04-13 2022-05-17 科大讯飞股份有限公司 Method and device for displaying identification text
CN109710940A (en) * 2018-12-28 2019-05-03 安徽知学科技有限公司 A kind of analysis and essay grade method, apparatus of article conception
CN111339288A (en) * 2020-02-25 2020-06-26 北京字节跳动网络技术有限公司 Method, device, equipment and computer readable medium for displaying text

Also Published As

Publication number Publication date
CN113360660A (en) 2021-09-07

Similar Documents

Publication Publication Date Title
US20210081611A1 (en) Methods and systems for language-agnostic machine learning in natural language processing using feature extraction
CN107066449B (en) Information pushing method and device
CN108985358B (en) Emotion recognition method, device, equipment and storage medium
WO2023005968A1 (en) Text category recognition method and apparatus, and electronic device and storage medium
CN107861954B (en) Information output method and device based on artificial intelligence
CN109740167B (en) Method and apparatus for generating information
KR20210154705A (en) Method, apparatus, device and storage medium for matching semantics
US10579655B2 (en) Method and apparatus for compressing topic model
WO2020182123A1 (en) Method and device for pushing statement
CN111159409B (en) Text classification method, device, equipment and medium based on artificial intelligence
US11615241B2 (en) Method and system for determining sentiment of natural language text content
EP3872652A2 (en) Method and apparatus for processing video, electronic device, medium and product
WO2022001888A1 (en) Information generation method and device based on word vector generation model
CN111582360B (en) Method, apparatus, device and medium for labeling data
CN111930792B (en) Labeling method and device for data resources, storage medium and electronic equipment
CN115982376B (en) Method and device for training model based on text, multimode data and knowledge
US11651015B2 (en) Method and apparatus for presenting information
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN111414471B (en) Method and device for outputting information
US20230008897A1 (en) Information search method and device, electronic device, and storage medium
CN112906380A (en) Method and device for identifying role in text, readable medium and electronic equipment
US20200026767A1 (en) System and method for generating titles for summarizing conversational documents
CN110245334B (en) Method and device for outputting information
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
KR20220115482A (en) Apparatus for evaluating latent value of patent based on deep learning and method thereof

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 22848573
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE