CN108090099B

CN108090099B - Text processing method and device

Info

Publication number: CN108090099B
Application number: CN201611045925.3A
Authority: CN
Inventors: 王栋; 宋巍; 付瑞吉; 王士进; 胡国平; 秦兵; 刘挺
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2016-11-22
Filing date: 2016-11-22
Publication date: 2022-02-25
Anticipated expiration: 2036-11-22
Also published as: CN108090099A

Abstract

The embodiment of the invention provides a text processing method and a text processing device, wherein the method comprises the following steps: acquiring text data to be processed; respectively acquiring a candidate category of the text data according to a first text classification model and a second text classification model, wherein the first text classification model is used for classifying the text data according to a title of the text data and the sentences contained in the text data, and the second text classification model is used for classifying the text data according to specified sentences among the sentences contained in the text data; and determining the category of the text data according to the two acquired candidate categories. In the embodiment of the invention, the text to be classified is classified from two angles, title + full text and specified sentences, to obtain two candidate categories, and the category of the text is finally determined on this basis, so that the efficiency and accuracy of text classification are improved and the influence of human subjectivity on the classification result is reduced.

Description

Text processing method and device

Technical Field

The present invention relates to the field of natural language processing, and in particular, to a text processing method and apparatus.

Background

With the development of information technology, the amount of text information people face is increasing dramatically, and text processing technology is evolving continuously. Taking the field of education as an example, automatic marking technology is beginning to show its promise, and more and more schools and educational institutions are adopting it to mark students' examination papers automatically. Many examination papers contain a composition, but a composition is a highly subjective test question, and it is difficult for a machine to give its score directly.

In the process of implementing the invention, the inventors found that whether a composition is off-topic is one of the main points of attention when it is marked. For automatic composition scoring it is therefore critical to judge the category of a student composition before scoring: compositions of different categories correspond to different scoring standards, so determining the category of a student composition is the basis of automatic composition scoring. In the prior art, when texts such as articles need to be classified, a manual method is generally adopted; that is, after relevant persons read the content of an article, they give its category (such as explanatory text, discussion article, and the like). For example, for compositions written by students, a teacher reads each composition and then gives its category. However, when the number of texts is large, the manual workload is heavy and the classification efficiency is low; moreover, different people may understand a text differently, so the labeling of text categories is subjective.

Disclosure of Invention

The invention provides a text processing method and a text processing device, which are used for improving the efficiency of text classification.

According to a first aspect of the embodiments of the present invention, there is provided a text processing method, including:

acquiring text data to be processed;

respectively acquiring a candidate category of the text data according to a first text classification model and a second text classification model, wherein the first text classification model is used for classifying the text data according to a title of the text data and a sentence contained in the text data, and the second text classification model is used for classifying the text data according to a specified sentence in the sentences contained in the text data;

and determining the category of the text data according to the two acquired candidate categories.

Optionally, the first text classification model is a neural network model obtained by training in advance;

the obtaining a candidate category of the text data according to the first text classification model includes:

acquiring a semantic matrix of the text data title and a semantic matrix of each sentence in the text data;

taking the semantic matrix of the title and the semantic matrix of each sentence as the input of the first text classification model;

and determining a candidate category of the text data according to the probability that the text data output by the first text classification model belongs to each preset category.

Optionally, the obtaining the semantic matrix of the text data title and the semantic matrix of each sentence in the text data includes:

acquiring the title and a word vector of each word contained in each sentence;

forming a semantic matrix of the title by taking a word vector of each word contained in the title as a row;

and forming a semantic matrix of each sentence by taking the word vector of each word contained in each sentence as a row.

Optionally, the first text classification model includes a sentence coding layer, a chapter coding layer, an attention layer, a weighted summation layer, and an output layer;

the sentence coding layer is used for carrying out sentence-level coding on the semantic matrix of the title and the semantic matrix of each sentence to obtain sentence-level coding characteristics;

the chapter coding layer is used for taking the sentence-level coding features output by the sentence coding layer as input, and carrying out chapter-level coding on the title and the sentence-level coding features of each sentence from the perspective of the whole text to obtain chapter-level coding features;

the attention layer is used for taking the chapter-level coding features output by the chapter coding layer as input and calculating the importance weight of each sentence according to the title and the chapter-level coding features of each sentence;

the weighted summation layer is used for calculating to obtain a semantic matrix of the text data by taking the importance weight of each sentence output by the attention layer and the corresponding chapter-level coding feature of each sentence as input, wherein the semantic matrix of the text data is the sum of products of the importance weight of each sentence and the corresponding chapter-level coding feature;

and the output layer is used for taking the semantic matrix of the text data output by the weighted summation layer as input and outputting the probability that the text data belongs to each preset category.

Optionally, the attention layer calculates an importance weight of each sentence according to the title and chapter-level coding features of each sentence, and includes:

calculating the attention value of each sentence according to the chapter-level coding features of each sentence and the attention vector of the attention layer;

calculating the similarity between the chapter-level coding features of each sentence and the chapter-level coding features of the title as the main line weight of each sentence;

and calculating the importance weight of each sentence according to the attention value and the main line weight of each sentence.

Optionally, the obtaining a candidate category of the text data according to the second text classification model includes:

acquiring a specified sentence from sentences contained in the text data according to a preset rule;

extracting text classification features of each of the designated sentences, wherein the text classification features at least comprise one of the following features: the sentence-level text classification characteristic is used for describing the self characteristic of the current sentence, the chapter-level text classification characteristic is used for describing the characteristic of the current sentence from the perspective of the whole text, and the sentence context text classification characteristic is used for describing the characteristic of the current sentence from the perspective of the context of the current sentence;

and determining a candidate category of the text data according to the probability that the text data output by the second text classification model belongs to each preset category.

Optionally, the obtaining a specified sentence from the sentences included in the text data according to a preset rule includes:

acquiring the importance weight of each sentence;

normalizing and standardizing the importance weights of all sentences;

and screening out key sentences from all the sentences as the designated sentences according to the relationship between each sentence's importance weight after normalization and standardization and a preset threshold value.
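The screening steps above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the weights are normalized or standardized, so the sketch assumes division by the sum for normalization and a z-score for standardization, and the function name `select_key_sentences` and the default threshold are hypothetical.

```python
import statistics


def select_key_sentences(weights, threshold=0.5):
    """Return the indices of sentences whose processed weight exceeds the threshold.

    Assumptions (not stated in the source): normalization divides by the sum,
    standardization is a z-score, and a sentence is kept when its z-score is
    strictly greater than the threshold.
    """
    total = sum(weights)
    if total == 0:
        return []
    normalized = [w / total for w in weights]      # weights now sum to 1
    mean = statistics.fmean(normalized)
    std = statistics.pstdev(normalized)
    if std == 0:
        return []                                  # all sentences equally important
    z_scores = [(w - mean) / std for w in normalized]
    return [i for i, z in enumerate(z_scores) if z > threshold]


# Example: only the third sentence stands out from the rest.
key_indices = select_key_sentences([0.1, 0.1, 0.6, 0.1, 0.1])
```

A different normalization or comparison rule would fit the same description equally well; the threshold would then need to be chosen on the corresponding scale.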

Optionally, the sentence-level text classification feature includes at least one of the following features:

sentence length, sentence end punctuation, the number of occurrences of emotion words in the sentence, and the number of occurrences of feature words in the sentence;

the chapter-level text classification features include at least one of the following features:

the index of the paragraph in which the sentence is located, whether the sentence appears in the first paragraph of the text, whether the sentence appears in the last paragraph of the text, the index of the sentence within its paragraph, whether the sentence is the first sentence of its paragraph, whether the sentence is the last sentence of its paragraph, the total number of sentences in its paragraph, and the average sentence length of its paragraph;

the sentence context text classification feature comprises at least one of the following features:

the sentence-level text classification characteristic and the chapter-level text classification characteristic of one or more sentences before the current sentence, and the sentence-level text classification characteristic and the chapter-level text classification characteristic of one or more sentences after the current sentence.
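As a rough illustration of the sentence-level and chapter-level features listed above, the following sketch computes them for one sentence. The data layout (a text as a list of paragraphs, each a list of tokenized sentences), the function name `sentence_features`, the feature keys, and the emotion-word lexicon are all assumptions made for the example; the feature-word count is omitted because it would be computed the same way as the emotion-word count.

```python
END_PUNCTUATION = {"。", "！", "？", ".", "!", "?"}


def sentence_features(paragraphs, p, s, emotion_words=frozenset()):
    """Features of sentence s in paragraph p.

    paragraphs: list of paragraphs; each paragraph is a list of sentences;
    each sentence is a list of tokens (the last token may be punctuation).
    """
    paragraph = paragraphs[p]
    sentence = paragraph[s]
    last = sentence[-1] if sentence else ""
    return {
        # sentence-level features
        "length": len(sentence),
        "end_punct": last if last in END_PUNCTUATION else "",
        "emotion_count": sum(1 for w in sentence if w in emotion_words),
        # chapter-level features
        "para_index": p,
        "in_first_para": p == 0,
        "in_last_para": p == len(paragraphs) - 1,
        "sent_index": s,
        "is_first_sent": s == 0,
        "is_last_sent": s == len(paragraph) - 1,
        "para_sent_count": len(paragraph),
        "para_avg_len": sum(len(x) for x in paragraph) / len(paragraph),
    }


text = [[["I", "am", "happy", "."], ["Why", "?"]], [["The", "end", "."]]]
features = sentence_features(text, 0, 1, emotion_words={"happy"})
```

Sentence context features could then be formed by calling the same function for the neighbouring sentences and concatenating the results.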

According to a second aspect of the embodiments of the present invention, there is provided a text processing apparatus, the apparatus including:

the text acquisition unit is used for acquiring text data to be processed;

the first text classification unit is used for acquiring a candidate category of the text data according to a first text classification model, wherein the first text classification model is used for classifying the text data according to a title of the text data and a sentence contained in the text data;

a second text classification unit, configured to obtain a candidate category of the text data according to a second text classification model, where the second text classification model is configured to classify the text data according to a specified sentence in sentences contained in the text data;

and the classification determining unit is used for determining the classification of the text data according to the two acquired candidate classifications.

the first text classification unit includes:

a semantic matrix obtaining subunit, configured to obtain a semantic matrix of the text data title and a semantic matrix of each sentence in the text data;

an input subunit, configured to use the semantic matrix of the title and the semantic matrix of each sentence as input of the first text classification model;

and the output subunit is used for determining a candidate category of the text data according to the probability that the text data output by the first text classification model belongs to each preset category.

Optionally, the semantic matrix obtaining subunit is configured to:

acquiring the title and a word vector of each word contained in each sentence;

Optionally, the second text classification unit includes:

a designated sentence acquisition subunit, configured to acquire a designated sentence from the sentences included in the text data according to a preset rule;

a classification feature extraction subunit, configured to extract a text classification feature of each of the designated sentences, where the text classification feature at least includes one of the following features: the sentence-level text classification characteristic is used for describing the self characteristic of the current sentence, the chapter-level text classification characteristic is used for describing the characteristic of the current sentence from the perspective of the whole text, and the sentence context text classification characteristic is used for describing the characteristic of the current sentence from the perspective of the context of the current sentence;

and the input and output subunit is used for taking the text classification features of all the specified sentences as the input of the second text classification model, and determining a candidate category of the text data according to the probability that the text data output by the second text classification model belongs to each preset category.

Optionally, the specified sentence acquisition subunit is configured to:

acquiring the importance weight of each sentence;

normalizing and standardizing the importance weights of all sentences;

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

in the embodiment of the invention, the text to be classified is analyzed from two angles, namely, the first text classification model is used for classifying the text to be classified from the perspective of a title and a whole text, and the second text classification model is used for classifying the text to be classified from the perspective of a sentence, namely a designated sentence in the text, so as to obtain two candidate categories, and the category of the text is finally determined on the basis, so that the efficiency of text classification is effectively improved, the accuracy of text classification is also improved, and the influence of human subjectivity on classification results is reduced.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise. Furthermore, these descriptions should not be construed as limiting the embodiments, wherein elements having the same reference number designation are identified as similar elements throughout the figures, and the drawings are not to scale unless otherwise specified.

FIG. 1 is a flow diagram illustrating a method of text processing according to an exemplary embodiment of the present invention;

FIG. 2 is a flow diagram illustrating a method of text processing according to an exemplary embodiment of the invention;

FIG. 3 is a schematic diagram illustrating the structure of a first text classification model according to an exemplary embodiment of the present invention;

FIG. 4 is a flow diagram illustrating a method of text processing in accordance with an exemplary embodiment of the present invention;

FIG. 5 is a schematic diagram illustrating a text processing apparatus according to an exemplary embodiment of the present invention;

FIG. 6 is a schematic diagram illustrating a text processing apparatus according to an exemplary embodiment of the present invention;

FIG. 7 is a schematic diagram illustrating a text processing apparatus according to an exemplary embodiment of the present invention.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

Fig. 1 is a flowchart illustrating a text processing method according to an exemplary embodiment of the present invention. The method may be used for devices such as mobile phones, tablet computers, desktop computers, notebook computers, servers, and the like, as examples.

Referring to fig. 1, the method may include the steps of:

Step S101, acquiring text data to be processed.

The embodiment is not limited to a specific form of the text data to be processed, and may be, for example, an article (e.g., student composition), etc.

One or more categories may be preset as the preset categories; for example, Chinese compositions may be divided by mode of expression into an explanatory-text category, a discussion-article category, a narrative category, and the like. The purpose of this embodiment is to determine to which preset category or categories the text data to be processed belongs.

Step S102, a candidate category of the text data is respectively obtained according to a first text classification model and a second text classification model, wherein the first text classification model is used for classifying the text data according to a title of the text data and a sentence contained in the text data, and the second text classification model is used for classifying the text data according to a specified sentence in the sentence contained in the text data.

In order to improve the accuracy of text classification, in the embodiment, the text is analyzed from two angles, that is, the text to be classified is classified from the perspective of a chapter such as title + whole text by using the first text classification model, and from the perspective of a sentence such as a specified sentence in the text by using the second text classification model, so as to obtain two candidate categories, and then the category of the text is finally determined on the basis.

This embodiment does not limit what the specified sentence in the text is; for example, the specified sentences may be the key sentences of the text. Those skilled in the art may choose and design the definition of the specified sentence according to different needs and scenarios without departing from the spirit and scope of the present invention.

As an example, the first and second text classification models may be both neural network models obtained by training in advance. Of course, the present embodiment is not limited to the specific details of the neural network model, and those skilled in the art can design, combine, etc. according to various existing neural network models.

Neural network models can generally be obtained by training. Thus, in this embodiment or some other embodiment of the invention, a large amount of text data may be collected in advance for training of the neural network.

As an example, the text data for training may be collected through a network, or corresponding text obtained by image recognition of text written by the user may be collected as text data. For example, when the collected text is a Chinese composition, the text data of the corresponding Chinese composition, including the title of the composition and the content of the composition, can be obtained by collecting composition test paper written during the examination of the student and performing image recognition.

The collected texts generally carry, or are assigned, corresponding text category labels. The categories can be determined according to application requirements; for example, when the texts are Chinese compositions, the categories can be set as narrative, discussion article, explanatory text, and the like. The text categories may be represented by different symbols; for example, for Chinese compositions, 1 may represent a narrative, 2 a discussion article, and 3 an explanatory text. Other representations may also be used, and the embodiment of the present invention is not limited in this respect.

Step S103, determining the category of the text data according to the two acquired candidate categories.

For example, both text classification models may output the probability of the category to which the current text data belongs, on the basis of which it may be finally determined which category the current text data should belong.

Specifically, when the two obtained candidate categories differ, the candidate category with the higher probability value can be selected directly as the final category of the text to be classified. For example, if the first text classification model outputs "narrative, 80%" and the second text classification model outputs "discussion, 70%", that is, the first model considers that the current text belongs to the narrative category with a probability of 80% and the second model considers that it belongs to the discussion category with a probability of 70%, the narrative category, which has the higher probability, can be selected as the finally determined category of the current text. Alternatively, when the two candidate categories differ, the text to be classified may be marked as undetermined, and its final category may then be determined manually, and so on.
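The two ways of resolving a disagreement described above can be sketched as a small decision function. The function name `decide_category`, the `(category, probability)` pair layout, the `defer_on_conflict` flag, and the tie-breaking in favour of the first model are assumptions made for illustration only.

```python
def decide_category(candidate_1, candidate_2, defer_on_conflict=False):
    """Combine the candidates of the two models into a final category.

    Each candidate is a (category, probability) pair. Returns None when the
    models disagree and the text is left for manual review.
    """
    category_1, prob_1 = candidate_1
    category_2, prob_2 = candidate_2
    if category_1 == category_2:
        return category_1
    if defer_on_conflict:
        return None  # undetermined category, to be decided manually
    # Assumption: on equal probabilities the first model wins.
    return category_1 if prob_1 >= prob_2 else category_2


final = decide_category(("narrative", 0.8), ("discussion", 0.7))
```

With the example values from the text, `final` is the narrative category; passing `defer_on_conflict=True` instead leaves the text undetermined.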

In the embodiment, the text to be classified is analyzed from two angles, that is, the first text classification model is used to classify the text to be classified from the perspective of a chapter such as a title and a whole text, and the second text classification model is used to classify the text to be classified from the perspective of a sentence such as a designated sentence in the text to obtain two candidate categories, and then the category of the text is finally determined on the basis, so that the efficiency of text classification is effectively improved, the accuracy of text classification is also improved, and the influence of human subjectivity on classification results is reduced.

Referring to fig. 2, in this embodiment or some other embodiments of the present invention, the obtaining a candidate category of the text data according to the first text classification model may include:

Step S201, a semantic matrix of the text data title and a semantic matrix of each sentence in the text data are obtained.

For text data, such as a composition, there is usually a title, and the semantic matrix of the title can be obtained. For the content of the text data, it is usually composed of a plurality of sentences, and for each sentence, the semantic matrix of the sentence can be obtained. The embodiment is not limited to the specific content of the semantic matrix, for example, the semantic matrix may be generally composed of word vectors.

As an example, the obtaining the semantic matrix of the text data title and the semantic matrix of each sentence in the text data may include:

1) Acquiring a word vector of each word contained in the title and in each sentence.

For example, the title and the sentences may be segmented into words and the corresponding word vectors obtained. The word segmentation may use, for example, a method based on conditional random fields, and each segmented word may be converted into a word vector using, for example, the word2vec technique; details are not repeated in this embodiment.

2) Forming a semantic matrix of the title by taking the word vector of each word contained in the title as a row.

3) Forming a semantic matrix of each sentence by taking the word vector of each word contained in each sentence as a row.

The word vectors of the words contained in the text title can be used as the rows of the title semantic matrix, giving a title semantic matrix of size $k_t \times m$, where $k_t$ represents the total number of words contained in the title and $m$ represents the dimension of each word vector.

The word vectors of the words contained in each sentence of the text can likewise be used as the rows of that sentence's semantic matrix, giving the semantic matrix of each sentence in the text; the semantic matrix of the $c$-th sentence has size $k_c \times m$, where $k_c$ indicates the number of words contained in the $c$-th sentence of the current text.

In addition, when the text title and the sentence in the text contain different numbers of words, or each sentence in the text contains different numbers of words, the semantic matrix of the text title and/or the semantic matrix of each sentence in the text can be normalized, so that the semantic matrices are normalized into the same size matrix. Of course, no normalization is required, and this embodiment is not limited.
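The construction of a semantic matrix, including the optional size normalization, might look like the sketch below. It assumes the words are already segmented, uses a plain dictionary as a stand-in for trained word2vec vectors, maps unknown words to a zero vector, and normalizes the size by truncating or zero-padding to `max_len` rows; none of these choices is prescribed by the text, and the name `semantic_matrix` is hypothetical.

```python
def semantic_matrix(words, embeddings, dim, max_len=None):
    """Stack the word vectors of a title or sentence as the rows of a matrix.

    words: list of segmented words.
    embeddings: dict mapping a word to its vector of length dim
                (a toy stand-in for word2vec output).
    max_len: if given, truncate or zero-pad so the matrix has max_len rows.
    """
    rows = [list(embeddings.get(word, [0.0] * dim)) for word in words]
    if max_len is not None:
        rows = rows[:max_len]
        rows += [[0.0] * dim for _ in range(max_len - len(rows))]
    return rows


toy_embeddings = {"my": [0.1, 0.2], "teacher": [0.3, 0.4]}
title_matrix = semantic_matrix(["my", "teacher"], toy_embeddings, dim=2, max_len=3)
```

Here `title_matrix` has three rows of dimension two, the last being padding, so a title and sentences of different lengths can share one matrix size.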

Step S202, the semantic matrix of the title and the semantic matrix of each sentence are used as the input of the first text classification model.

Step S203, determining a candidate category of the text data according to the probability that the text data output by the first text classification model belongs to each preset category.

The specific structure of the first text classification model is exemplified below.

Taking text data as an example, referring to fig. 3, the first text classification model may include at least a sentence encoding layer, a chapter encoding layer, an attention layer, a weighted sum layer, and an output layer.

a) The sentence coding layer is used for carrying out sentence-level coding on the semantic matrix of the title and the semantic matrix of each sentence to obtain sentence-level coding features.

The semantic matrix of the current text title and the semantic matrices of the sentences in the text may be used as input (or as an input layer) and may be denoted by $X = \{T, C_1, C_2, \ldots, C_n\}$, where $T$ denotes the title semantic matrix, $C_1, C_2, \ldots, C_n$ are the semantic matrices of the sentences in the current text, and $n$ is the total number of sentences contained in the current text.

The sentence coding layer may include a sentence-level encoder for performing sentence-level coding on the title of the current text and on each sentence in the text to obtain sentence-level coding features. The sentence-level coding features may be denoted by $S = \{s_t, s_1, s_2, \ldots, s_n\}$, where $s_t$ denotes the sentence-level coding feature of the title, obtained by sentence-level coding of the semantic matrix of the text title, and $s_n$ denotes the sentence-level coding feature of the $n$-th sentence, obtained by sentence-level coding of the semantic matrix of the $n$-th sentence. $s_t$ and $s_1, s_2, \ldots, s_n$ are vectors of the same dimension, and the specific dimension can be determined according to application requirements or experimental results. As an example, the sentence coding layer may be implemented using a convolutional neural network, a recurrent or recursive neural network, or the like.
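One way to picture a convolutional sentence encoder is the minimal sketch below: each filter slides over windows of consecutive word vectors, and the maximum activation per filter is kept, so every title or sentence is mapped to a vector whose dimension equals the number of filters. The function name `encode_sentence`, the tanh activation, the max-over-time pooling, and the zero-padding of short inputs are assumptions; the text only states that a convolutional or recurrent network may be used.

```python
import math


def encode_sentence(matrix, filters, window=2):
    """Map a k x m semantic matrix to a fixed-length sentence vector.

    matrix: list of k word vectors, each of length m.
    filters: list of flat weight lists, each of length window * m.
    Returns one feature per filter (max over all window positions).
    """
    k = len(matrix)
    features = []
    for weights in filters:
        best = -math.inf
        for start in range(max(k - window + 1, 1)):
            # Flatten the word vectors inside the current window.
            patch = [x for row in matrix[start:start + window] for x in row]
            patch += [0.0] * (len(weights) - len(patch))  # pad short inputs
            activation = math.tanh(sum(p * w for p, w in zip(patch, weights)))
            best = max(best, activation)
        features.append(best)
    return features


sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_filters = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
s_vector = encode_sentence(sentence, toy_filters, window=2)
```

In a trained model the filter weights would be learned together with the rest of the network rather than written by hand as in this toy example.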

b) The chapter coding layer is used for taking the sentence-level coding features output by the sentence coding layer as input, and carrying out chapter-level coding on the title and the sentence-level coding features of each sentence again from the perspective of the whole text to obtain chapter-level coding features.

The input of the chapter coding layer is the output of the sentence coding layer. The output of the chapter coding layer is the chapter-level coding features, which may be denoted by $H = \{h_t, h_1, h_2, \ldots, h_n\}$, where $h_t$ represents the chapter-level coding feature obtained by chapter-level coding of the sentence-level coding feature of the text title, and $h_n$ represents the chapter-level coding feature obtained by chapter-level coding of the sentence-level coding feature of the $n$-th sentence. $h_t$ and $h_1, h_2, \ldots, h_n$ are vectors of the same dimension, and the specific dimension can be determined according to application requirements or experimental results. The chapter coding layer can adopt a bidirectional recurrent neural network (RNN) structure in which the nodes are connected in both directions, so that the information of the title of the current text and of all sentences of the text is taken into account and chapter-level coding is achieved. The specific encoding process is not described in detail.
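Since the encoding process is not detailed, the following is only a minimal sketch of a bidirectional RNN over the sequence of sentence-level features, assuming a plain tanh recurrence, hand-supplied weights, and concatenation of the forward and backward states. The names `rnn_pass` and `bidirectional_encode` are hypothetical, and placing the title first in the sequence is an assumption.

```python
import math


def rnn_pass(inputs, w_in, w_rec):
    """Run a simple tanh RNN and return the hidden state after every step.

    w_in:  hidden_dim x input_dim weight matrix.
    w_rec: hidden_dim x hidden_dim recurrent weight matrix.
    """
    hidden = [0.0] * len(w_rec)
    states = []
    for x in inputs:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[r], x))
                + sum(w * hj for w, hj in zip(w_rec[r], hidden))
            )
            for r in range(len(w_rec))
        ]
        states.append(hidden)
    return states


def bidirectional_encode(inputs, forward_params, backward_params):
    """Concatenate forward and backward states for each position."""
    forward = rnn_pass(inputs, *forward_params)
    backward = rnn_pass(inputs[::-1], *backward_params)[::-1]
    return [f + b for f, b in zip(forward, backward)]


# S = [s_t, s_1, s_2] with one-dimensional toy features.
S = [[1.0], [0.0], [-1.0]]
params = ([[1.0]], [[0.5]])
H = bidirectional_encode(S, params, params)
```

Each element of `H` depends on everything before it (forward pass) and everything after it (backward pass), which is how the whole text can influence the encoding of a single sentence.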

c) The attention (attention) layer is configured to use the chapter-level coding features output by the chapter coding layer as input, and calculate an importance weight of each sentence according to the title and the chapter-level coding features of each sentence.

The importance weights may be denoted by $P = \{p_1, p_2, \ldots, p_n\}$, where $p_j$ is the importance weight of the $j$-th sentence of the current text.

The attention layer calculates the importance weight of each sentence according to the title and the chapter-level coding features of each sentence, and may include:

c1) Calculating the attention value of each sentence according to the chapter-level coding features of each sentence and the attention vector of the attention layer.

As an example, in the specific calculation, a calculation value obtained by inner product of the chapter-level coding features of each sentence and the attention vector of the attention layer may be directly used as the attention value of each sentence in the current text, and the specific calculation method is shown as follows:

$$a_j = h_j \cdot v^{T}$$

where $a_j$ is the attention value of the $j$-th sentence of the current text, $h_j$ is the chapter-level coding feature of the $j$-th sentence of the current text, and $v$ is an attention vector with the same dimension as $h_j$. The attention vector is a model parameter: its initial value can be obtained by random initialization, and its final value can be obtained in advance by training on a large amount of data.

c2) Calculating the similarity between the chapter-level coding features of each sentence and the chapter-level coding features of the title as the main line weight of each sentence.

As an example, the following formula may be used in the specific calculation:

wherein $t_j$ is the main line weight of the $j$-th sentence of the current text.

c3) Calculating the importance weight of each sentence according to the attention value and the main line weight of each sentence.

As an example, in the specific calculation, the product of the attention value and the main line weight of each sentence is calculated and then normalized, and the value obtained after normalization is used as the importance weight of the sentence, as shown in the following formula:

wherein $p_j$ is the importance weight of the $j$-th sentence of the current text.
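Steps c1) to c3) can be put together in a short sketch. The inner product for the attention value follows the text; the similarity measure and the normalization are not specified in this text, so cosine similarity and a softmax are used here purely as assumptions, and the function name `attention_layer` is hypothetical.

```python
import math


def attention_layer(H, h_title, v):
    """Compute the importance weights of the sentences.

    H: chapter-level coding features of the n sentences (list of vectors).
    h_title: chapter-level coding feature of the title.
    v: attention vector with the same dimension as each feature.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cosine(a, b):
        norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
        return dot(a, b) / norm if norm else 0.0

    attention = [dot(h, v) for h in H]             # c1) inner product with v
    main_line = [cosine(h, h_title) for h in H]    # c2) assumed: cosine similarity
    scores = [a * t for a, t in zip(attention, main_line)]  # c3) product
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]    # assumed: softmax normalization
    total = sum(exps)
    return [e / total for e in exps]


P = attention_layer(H=[[1.0, 0.0], [0.0, 1.0]], h_title=[1.0, 0.0], v=[1.0, 1.0])
```

In this toy input the first sentence points in the same direction as the title and therefore receives the larger weight, while the weights still sum to one.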

d) The weighted summation layer takes as input the importance weight of each sentence output by the attention layer and the corresponding chapter-level coding feature of each sentence, and calculates the semantic matrix of the text data, which is the sum of the products of each sentence's importance weight and its chapter-level coding feature.

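As an example, the following formula may be used in the specific calculation:

A = Σ_j p_j·h_j

where A is the semantic matrix of the text data, p_j is the importance weight of the j-th sentence, and h_j is the chapter-level coding feature of the j-th sentence.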
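The sum of products described for the weighted summation layer can be sketched as below. The function name and the list-of-lists representation of the features are assumptions made for illustration.

```python
def weighted_sum(weights, sentence_feats):
    """Sum over sentences of p_j * h_j, computed component by component."""
    if len(weights) != len(sentence_feats):
        raise ValueError("one weight per sentence is required")
    dim = len(sentence_feats[0]) if sentence_feats else 0
    result = [0.0] * dim
    for p_j, h_j in zip(weights, sentence_feats):
        for k, value in enumerate(h_j):
            result[k] += p_j * value
    return result

# Three sentences with two-dimensional features (toy values).
A = weighted_sum([0.25, 0.25, 0.5], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Because the weights act as a per-sentence scale, sentences with a higher importance weight contribute more to the representation of the whole text.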

e) The output layer takes the semantic matrix of the text data output by the weighted summation layer as input, and outputs the probability that the text data belongs to each preset category.

Once the probability that the current text data belongs to each preset category has been obtained, a candidate category can be determined; for example, the preset category with the highest probability may be used as the candidate category.
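Choosing the preset category with the highest probability can be sketched as follows. The category names and probability values are hypothetical and serve only to show the selection rule.

```python
def candidate_category(probabilities):
    """Return the preset category with the highest probability."""
    if not probabilities:
        raise ValueError("no category probabilities given")
    return max(probabilities, key=probabilities.get)

# Hypothetical output of the output layer for one text.
probs = {"narrative": 0.62, "argumentative": 0.27, "expository": 0.11}
best = candidate_category(probs)
```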

This embodiment does not limit the specific neural network structure adopted by the output layer, and its model parameters may be obtained by pre-training, which is not described again here.

A text often contains some important sentences, such as main line sentences in narrative texts, topic sentences in argumentative texts, and statement sentences describing the subject matter in expository texts. In the course of implementing the invention, the inventors found that the category of a text can largely be determined from these key sentences.

By way of example, the second text classification model may be a classification model commonly used in pattern recognition, such as a support vector machine classification model, a Bayesian classification model, a decision tree classification model, a neural network classification model, and so on.

Referring to fig. 4, in this embodiment or some other embodiments of the present invention, the obtaining a candidate category of the text data according to the second text classification model may include:

Step S401, acquiring a specified sentence from the sentences included in the text data according to a preset rule.

As an example, the specified sentence may be a key sentence. For example, an importance weight may be calculated for each sentence in the text, and the sentences whose importance weight is higher than a preset threshold may then be used as the key sentences. This embodiment does not limit how the importance weight of each sentence is calculated; for example, it may be calculated according to the position of the sentence in the text, the length of the sentence, and the like. Those skilled in the art may make their own selections and designs according to different requirements and scenarios, and any such selections and designs that can be used here do not depart from the spirit and scope of the present invention.

As an example, the obtaining of the specified sentence from the sentences contained in the text data according to the preset rule may include:

i) Acquiring the importance weight of each sentence.

For example, the importance weight of each sentence can be calculated by the attention layer above. Of course, those skilled in the art can perform calculation according to other ways, and the present embodiment is not limited thereto.

ii) Normalizing and standardizing the importance weights of all sentences.

As an example, the normalization may specifically use the following formula:

where max(P) is the maximum value of the importance weights of all sentences in the current text.

Then, the normalized importance weight of each sentence in the current text is standardized to obtain the standardized sentence importance weight; the specific method may be as shown in the following formula:

where sp_j is the standardized importance weight of the j-th sentence in the current text, μ is the mean of all normalized sentence importance weights in the current text, and σ is the standard deviation of all normalized sentence importance weights in the current text.

iii) Screening out key sentences from all sentences to be used as the specified sentences, according to the relationship between each sentence's normalized and standardized importance weight and a preset threshold.
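Steps ii) and iii) can be sketched as below. The two formulas are not reproduced in this text, so the sketch assumes that normalization divides each weight by max(P) and that standardization is a z-score using μ and σ; the use of the population standard deviation, the function name, and the threshold value are likewise assumptions.

```python
import statistics

def select_specified_sentences(weights, threshold):
    """Return indices of sentences whose processed weight exceeds threshold."""
    if not weights:
        return []
    # Normalization (ASSUMED form): divide each weight by max(P).
    max_p = max(weights)
    normalized = [p / max_p for p in weights] if max_p else list(weights)
    # Standardization (ASSUMED z-score): subtract mu, divide by sigma.
    mu = statistics.fmean(normalized)
    sigma = statistics.pstdev(normalized)
    if sigma == 0:
        standardized = [0.0] * len(normalized)
    else:
        standardized = [(x - mu) / sigma for x in normalized]
    # Screening: keep sentences whose weight is above the preset threshold.
    return [j for j, sp in enumerate(standardized) if sp > threshold]

# Sentences 1 and 3 (0-based) stand out from the rest.
picked = select_specified_sentences([0.1, 0.4, 0.1, 0.4], threshold=0.0)
```

Standardizing within each text makes a single threshold comparable across texts of different lengths, since raw importance weights shrink as the number of sentences grows.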

Step S402, extracting the text classification features of each specified sentence, wherein the text classification features include at least one of the following: a sentence-level text classification feature, a chapter-level text classification feature, and a sentence context text classification feature. The sentence-level text classification feature describes the characteristics of the current sentence itself, the chapter-level text classification feature describes the characteristics of the current sentence from the perspective of the whole text, and the sentence context text classification feature describes the characteristics of the current sentence from the perspective of its context.

① By way of example, the sentence-level text classification features may include at least one of the following features:

sentence length, sentence-ending punctuation, the number of occurrences of emotion words in the sentence, and the number of occurrences of feature words in the sentence.

The sentence length refers to the length of the current sentence and can be represented by the number of words contained in the sentence.

The sentence-ending punctuation refers to the punctuation mark with which the current sentence ends in the text, such as a comma "," or a period ".".

The number of occurrences of emotion words in the sentence refers to the number of emotion words contained in the current sentence. The emotion words may be collected in advance according to application requirements. During extraction, each word in the current sentence is checked in turn to determine whether it is an emotion word, which yields the number of emotion words contained in the current sentence, that is, the number of occurrences of emotion words.

The number of occurrences of feature words in the sentence refers to how many times the feature words contained in the current sentence occur. During extraction, the feature words contained in the current sentence are found first, and then the number of occurrences of each feature word in the current sentence is counted. The feature words may be computed from the words or phrases contained in the key sentences of all texts; for example, the information gain or mutual information of each word or phrase with respect to text classification may be calculated, and the words or phrases whose information gain or mutual information is greater than a threshold are used as the feature words. The threshold may be determined according to application requirements. If the current sentence contains no feature word, the number of occurrences of feature words is 0.
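The four sentence-level features can be extracted along the following lines. This is a sketch under assumptions: the sentence is taken to be already segmented into words, the emotion-word and feature-word sets are supplied by the caller, and the function name, dictionary keys, and the choice to report both per-word counts and their total are all hypothetical.

```python
def sentence_level_features(words, end_punct, emotion_words, feature_words):
    """Build the sentence-level features for one segmented sentence."""
    # Count each feature word that actually occurs in the sentence.
    feature_counts = {w: words.count(w) for w in feature_words if w in words}
    return {
        "length": len(words),                       # number of words
        "end_punct": end_punct,                     # e.g. "," or "."
        "emotion_count": sum(1 for w in words if w in emotion_words),
        "feature_word_counts": feature_counts,      # per feature word
        "feature_word_total": sum(feature_counts.values()),  # 0 if none
    }

# Toy sentence with hypothetical emotion and feature word lists.
sent_feats = sentence_level_features(
    words=["I", "love", "my", "happy", "home", "home"],
    end_punct=".",
    emotion_words={"love", "happy"},
    feature_words={"home"},
)
```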

② By way of example, the chapter-level text classification features may include at least one of the following features:

the paragraph index of the sentence in the text, whether the sentence appears in the first paragraph of the text, whether the sentence appears in the last paragraph of the text, the sentence index of the sentence within its paragraph, whether the sentence is the first sentence of its paragraph, whether the sentence is the last sentence of its paragraph, the total number of sentences in its paragraph, and the average sentence length of its paragraph.

The paragraph index may be the sequence number of the current paragraph among all paragraphs, and the sentence index may be the sequence number of the current sentence among all sentences of the current paragraph.
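The chapter-level features listed above can be computed from a text that has been split into paragraphs, sentences, and words. The sketch below assumes 1-based sequence numbers and a nested-list representation; both choices, along with the function name and dictionary keys, are assumptions for illustration.

```python
def chapter_level_features(paragraphs, para_idx, sent_idx):
    """Features of sentence `sent_idx` in paragraph `para_idx` (0-based inputs).

    `paragraphs` is a list of paragraphs; each paragraph is a list of
    sentences; each sentence is a list of words.
    """
    paragraph = paragraphs[para_idx]
    return {
        "paragraph_index": para_idx + 1,           # 1-based (assumed)
        "in_first_paragraph": para_idx == 0,
        "in_last_paragraph": para_idx == len(paragraphs) - 1,
        "sentence_index": sent_idx + 1,            # 1-based (assumed)
        "is_first_sentence": sent_idx == 0,
        "is_last_sentence": sent_idx == len(paragraph) - 1,
        "paragraph_sentence_count": len(paragraph),
        "paragraph_avg_sentence_length":
            sum(len(s) for s in paragraph) / len(paragraph),
    }

# Two paragraphs; the first has two sentences of 3 and 5 words.
text = [
    [["A", "short", "opening"], ["It", "sets", "the", "scene", "well"]],
    [["The", "end"]],
]
chap_feats = chapter_level_features(text, para_idx=0, sent_idx=1)
```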

③ By way of example, the sentence context text classification features may include at least one of the following features:

Step S403, using the text classification features of all the specified sentences as the input of the second text classification model, and determining a candidate category of the text data according to the probability that the text data belongs to each preset category, which is output by the second text classification model.

The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.

Fig. 5 is a schematic diagram illustrating a text processing apparatus according to an exemplary embodiment of the present invention. By way of example, the apparatus may be used in devices such as mobile phones, tablet computers, desktop computers, notebook computers, and servers.

Referring to fig. 5, the apparatus may include:

a text acquiring unit 501, configured to acquire text data to be processed;

a first text classification unit 502, configured to obtain a candidate category of the text data according to a first text classification model, where the first text classification model is configured to classify the text data according to a title of the text data and a sentence included in the text data;

a second text classification unit 503, configured to obtain a candidate category of the text data according to a second text classification model, where the second text classification model is configured to classify the text data according to a specified sentence in sentences contained in the text data;

a classification determining unit 504, configured to determine a category of the text data according to the two acquired candidate categories.
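The relationship between units 501 to 504 can be sketched as a small class in which each unit is an injected callable. This is a structural illustration only: the class name, the callable signatures, and in particular the toy rule used by the determining unit are assumptions, and the stand-in classifiers simply return fixed labels.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TextProcessingApparatus:
    """Wires together units 501-504; each unit is an injected callable."""
    acquire_text: Callable[[], str]         # text acquiring unit 501
    classify_first: Callable[[str], str]    # first text classification unit 502
    classify_second: Callable[[str], str]   # second text classification unit 503
    determine: Callable[[str, str], str]    # classification determining unit 504

    def run(self) -> str:
        text = self.acquire_text()
        first = self.classify_first(text)    # candidate from title + sentences
        second = self.classify_second(text)  # candidate from specified sentences
        return self.determine(first, second)

# Toy stand-ins for the units; the decision rule is a placeholder only.
apparatus = TextProcessingApparatus(
    acquire_text=lambda: "My Home\nI love my home.",
    classify_first=lambda text: "narrative",
    classify_second=lambda text: "narrative",
    determine=lambda a, b: a if a == b else "undetermined",
)
category = apparatus.run()
```

Injecting the units as callables keeps the two classification models independent of each other, which mirrors the separation between units 502 and 503.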

In this embodiment or some other embodiments of the present invention, the first text classification model may be a neural network model obtained by training in advance;

correspondingly, referring to fig. 6, the first text classification unit may include:

a semantic matrix obtaining subunit 601, configured to obtain a semantic matrix of the text data title and a semantic matrix of each sentence in the text data;

an input subunit 602, configured to use the semantic matrix of the title and the semantic matrix of each sentence as input of the first text classification model;

an output subunit 603, configured to determine a candidate category of the text data according to a probability that the text data output by the first text classification model belongs to each preset category.

In this embodiment or some other embodiments of the present invention, the semantic matrix acquiring subunit may be configured to:

acquiring the title and a word vector of each word contained in each sentence;

In this embodiment or some other embodiments of the present invention, the first text classification model may include a sentence coding layer, a chapter coding layer, an attention layer, a weighted sum layer, and an output layer;

In this embodiment or some other embodiments of the present invention, the calculating, by the attention layer, the importance weight of each sentence according to the title and the chapter-level encoding features of each sentence may include:

Referring to fig. 7, in this embodiment or some other embodiments of the present invention, the second text classification unit may include:

a designated sentence acquisition subunit 701, configured to acquire a designated sentence from the sentences included in the text data according to a preset rule;

a classification feature extracting subunit 702, configured to extract the text classification features of each of the specified sentences, where the text classification features include at least one of the following: a sentence-level text classification feature, a chapter-level text classification feature, and a sentence context text classification feature; the sentence-level text classification feature describes the characteristics of the current sentence itself, the chapter-level text classification feature describes the characteristics of the current sentence from the perspective of the whole text, and the sentence context text classification feature describes the characteristics of the current sentence from the perspective of its context;

an input/output subunit 703, configured to use the text classification features of all the specified sentences as the input of the second text classification model, and determine a candidate category of the text data according to the probability that the text data output by the second text classification model belongs to each preset category.

In this embodiment or some other embodiments of the present invention, the specified sentence acquisition subunit may be configured to:

acquiring the importance weight of each sentence;

normalizing and standardizing the importance weights of all sentences;

In this or some other embodiments of the present invention, the sentence-level text classification features may include at least one of the following features:

The specific manner in which each unit/module performs its operations has been described in detail in the method embodiments and will not be elaborated here.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

1. A method of text processing, the method comprising:

acquiring text data to be processed;

determining the category of the text data according to the two acquired candidate categories; wherein, when a specified sentence is acquired from the sentences contained in the text data, the method includes: acquiring the importance weight of each sentence; normalizing and standardizing the importance weights of all sentences; and screening out key sentences from all sentences to be used as the specified sentences, according to the relationship between each sentence's normalized and standardized importance weight and a preset threshold value.

2. The method according to claim 1, wherein the first text classification model is a neural network model obtained by training in advance;

3. The method of claim 2, wherein the obtaining the semantic matrix of the text data title and the semantic matrix of each sentence in the text data comprises:

acquiring the title and a word vector of each word contained in each sentence;

4. The method of claim 2, wherein the first text classification model comprises a sentence coding layer, a chapter coding layer, an attention layer, a weighted sum layer, an output layer;

5. The method of claim 4, wherein the attention layer calculates the importance weight of each sentence according to the title and chapter-level coding features of each sentence, comprising:

6. The method of claim 1, wherein obtaining a candidate category of the text data according to the second text classification model comprises:

7. The method of claim 6, wherein the sentence-level text classification features comprise at least one of the following features:

8. A text processing apparatus, characterized in that the apparatus comprises:

the text acquisition unit is used for acquiring text data to be processed;

a classification determining unit, configured to determine a category of the text data according to the two acquired candidate categories; wherein the second text classification unit, when acquiring a specified sentence from among the sentences contained in the text data, is further configured to: acquire the importance weight of each sentence; normalize and standardize the importance weights of all sentences; and screen out key sentences from all sentences to be used as the specified sentences, according to the relationship between each sentence's normalized and standardized importance weight and a preset threshold value.

9. The apparatus according to claim 8, wherein the first text classification model is a neural network model obtained by training in advance;

the first text classification unit includes:

10. The apparatus of claim 9, wherein the semantic matrix obtaining subunit is configured to:

acquiring the title and a word vector of each word contained in each sentence;

11. The apparatus of claim 9, wherein the first text classification model comprises a sentence coding layer, a chapter coding layer, an attention layer, a weighted sum layer, an output layer;

12. The apparatus of claim 11, wherein the attention layer calculates an importance weight of each sentence according to the title and chapter-level encoding features of each sentence, comprising:

13. The apparatus of claim 8, wherein the second text classification unit comprises:

14. The apparatus of claim 13, wherein the sentence-level text classification features comprise at least one of: