CN108090099A - A kind of text handling method and device - Google Patents

A kind of text handling method and device

Info

Publication number
CN108090099A
CN108090099A (application CN201611045925.3A)
Authority
CN
China
Prior art keywords
sentence
text
text data
chapter
title
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611045925.3A
Other languages
Chinese (zh)
Other versions
CN108090099B (en)
Inventor
王栋
宋巍
付瑞吉
王士进
胡国平
秦兵
刘挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201611045925.3A
Publication of CN108090099A
Application granted
Publication of CN108090099B
Legal status: Active
Anticipated expiration


Classifications

    • G - Physics
    • G06 - Computing; Calculating or Counting
    • G06F - Electric Digital Data Processing
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - Physics
    • G06 - Computing; Calculating or Counting
    • G06F - Electric Digital Data Processing
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Abstract

An embodiment of the present invention provides a text processing method and device. The method includes: obtaining text data to be processed; obtaining one candidate category of the text data from each of a first text classification model and a second text classification model, where the first text classification model classifies the text data according to the title of the text data and the sentences the text data contains, and the second text classification model classifies the text data according to specified sentences among the sentences the text data contains; and determining the category of the text data according to the two candidate categories thus obtained. In embodiments of the present invention, the text to be classified is classified from two perspectives, title plus full text and specified sentences, to obtain two candidate categories, from which the final category of the text is determined. This effectively improves the efficiency of text classification while also improving its accuracy and reducing the influence of human subjectivity on the classification results.

Description

A kind of text handling method and device
Technical field
The present invention relates to the field of natural language processing, and in particular to a text processing method and device.
Background technology
With the development of information technology, the amount of text information people face is growing explosively, and the related text processing technologies are constantly evolving. Taking the education sector as an example, automatic scoring technology is beginning to come to the fore, and more and more schools and educational institutions are using it to grade student papers automatically. Many papers include essay questions, but because an essay is a subjective question, it is difficult for a machine to score it directly.
The inventors found, in the course of implementing the present invention, that whether an essay strays from its topic is one of the primary concerns when scoring it. For automatic essay scoring, therefore, determining the category of the topic before scoring is critical, and essays of different categories often correspond to different grading standards; it can be said that determining the topic category is the foundation of automatic essay scoring. In the prior art, when articles need to be classified, a manual method is generally adopted: after reviewing an article's content, a relevant person assigns its category (such as expository or argumentative). For example, essays written by students are usually categorized by a teacher after reading their content. When the amount of text is large, however, the manual workload is heavy and classification efficiency is low; moreover, different people may understand a text differently, so manual category labels are subjective.
Summary of the invention
The present invention provides a text processing method and device to improve the efficiency of text classification.
According to a first aspect of embodiments of the present invention, a text processing method is provided. The method includes:
obtaining text data to be processed;
obtaining one candidate category of the text data from each of a first text classification model and a second text classification model, where the first text classification model classifies the text data according to the title of the text data and the sentences the text data contains, and the second text classification model classifies the text data according to specified sentences among the sentences the text data contains; and
determining the category of the text data according to the two candidate categories thus obtained.
Optionally, the first text classification model is a neural network model obtained in advance through training.
Obtaining one candidate category of the text data according to the first text classification model includes:
obtaining the semantic matrix of the title of the text data and the semantic matrix of each sentence in the text data;
taking the semantic matrix of the title together with the semantic matrices of the sentences as the input of the first text classification model; and
determining one candidate category of the text data according to the probabilities, output by the first text classification model, that the text data belongs to each preset category.
Optionally, obtaining the semantic matrix of the title of the text data and the semantic matrix of each sentence in the text data includes:
obtaining the word vector of each word contained in the title and in each sentence;
forming the semantic matrix of the title with the word vector of each word contained in the title as a row; and
forming the semantic matrix of each sentence with the word vector of each word contained in that sentence as a row.
Optionally, the first text classification model includes a sentence coding layer, a chapter coding layer, an attention layer, a weighted sum layer, and an output layer.
The sentence coding layer performs sentence-level coding on the semantic matrix of the title and the semantic matrix of each sentence to obtain sentence-level coding features.
The chapter coding layer takes the sentence-level coding features output by the sentence coding layer as input, and re-encodes the sentence-level coding features of the title and of each sentence at the chapter level, from the perspective of the whole text, to obtain chapter-level coding features.
The attention layer takes the chapter-level coding features output by the chapter coding layer as input, and calculates the importance weight of each sentence according to the chapter-level coding features of the title and of each sentence.
The weighted sum layer takes the importance weight of each sentence output by the attention layer and the corresponding chapter-level coding features as input, and calculates the semantic matrix of the text data, where the semantic matrix of the text data is the sum of the products of each sentence's importance weight and its corresponding chapter-level coding feature.
The output layer takes the semantic matrix of the text data output by the weighted sum layer as input, and outputs the probability that the text data belongs to each preset category.
Optionally, the attention layer calculating the importance weight of each sentence according to the chapter-level coding features of the title and of each sentence includes:
calculating the attention value of each sentence according to the chapter-level coding feature of that sentence and the attention vector of the attention layer;
calculating the similarity between the chapter-level coding feature of each sentence and the chapter-level coding feature of the title, as the main-line weight of that sentence; and
calculating the importance weight of each sentence according to its attention value and main-line weight.
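As an illustration only, the combination of attention value and main-line weight described above can be sketched in NumPy. The softmax scoring, the cosine similarity, and the product combination are assumptions; the patent does not fix these formulas:

```python
import numpy as np

def importance_weights(h_title, h_sents, u):
    """Sketch of the attention layer: combine each sentence's attention
    value (from a learned attention vector u) with its main-line weight
    (similarity to the title's chapter-level coding feature)."""
    # attention values: softmax over u . h_i across sentences
    scores = np.array([u @ h for h in h_sents])
    att = np.exp(scores - scores.max())
    att /= att.sum()

    # main-line weights: cosine similarity with the title, mapped to [0, 1]
    def cos(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    main = np.array([(cos(h, h_title) + 1.0) / 2.0 for h in h_sents])

    # one possible combination: product, renormalized to sum to 1
    w = att * main
    return w / (w.sum() + 1e-9)

rng = np.random.default_rng(0)
h_title = rng.standard_normal(8)          # chapter-level feature of the title
h_sents = rng.standard_normal((5, 8))     # chapter-level features of 5 sentences
u = rng.standard_normal(8)                # attention vector (trained in practice)
w = importance_weights(h_title, h_sents, u)
print(round(w.sum(), 6))  # 1.0
```

The renormalization keeps the weights usable directly by the weighted sum layer; any monotone combination of the two signals would fit the claim equally well.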
Optionally, obtaining one candidate category of the text data according to the second text classification model includes:
obtaining specified sentences from the sentences contained in the text data according to preset rules;
extracting the text classification features of each specified sentence, where the text classification features include at least one of the following: sentence-level text classification features describing the characteristics of the current sentence itself, chapter-level text classification features describing the characteristics of the current sentence from the perspective of the whole text, and sentence-context text classification features describing the characteristics of the current sentence from the perspective of its context; and
taking the text classification features of all specified sentences as the input of the second text classification model, and determining one candidate category of the text data according to the probabilities, output by the second text classification model, that the text data belongs to each preset category.
Optionally, obtaining specified sentences from the sentences contained in the text data according to preset rules includes:
obtaining the importance weight of each sentence;
normalizing and standardizing the importance weights of all sentences; and
filtering key sentences out of all sentences as the specified sentences, according to the relation between each sentence's normalized and standardized importance weight and a preset threshold.
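The selection rule above can be sketched as follows. The min-max normalization plus z-score standardization is one assumed reading of "normalized and standardized"; the patent does not fix the formulas:

```python
import numpy as np

def select_key_sentences(weights, threshold=0.5):
    """Sketch of specified-sentence selection: normalize and standardize
    the importance weights, then keep the indices of sentences whose
    standardized weight exceeds a preset threshold."""
    w = np.asarray(weights, dtype=float)
    # min-max normalization to [0, 1]
    norm = (w - w.min()) / (w.max() - w.min() + 1e-9)
    # z-score standardization of the normalized weights
    std = (norm - norm.mean()) / (norm.std() + 1e-9)
    # keep sentences above the threshold as the "specified" (key) sentences
    return [i for i, v in enumerate(std) if v > threshold]

keys = select_key_sentences([0.05, 0.30, 0.10, 0.40, 0.15], threshold=0.5)
print(keys)  # [1, 3]
```

With these example weights, only the two sentences well above the mean survive the threshold, which matches the intent of keeping key sentences.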
Optionally, the sentence-level text classification features include at least one of the following:
sentence length, sentence-ending punctuation, the number of occurrences of emotion words in the sentence, and the number of occurrences of feature words in the sentence.
The chapter-level text classification features include at least one of the following:
the paragraph label of the sentence in the text, whether the sentence appears in the first paragraph of the text, whether the sentence appears in the last paragraph of the text, the sentence's label within its paragraph, whether the sentence is the first sentence of its paragraph, whether the sentence is the last sentence of its paragraph, the total number of sentences in its paragraph, and the average sentence length of its paragraph.
The sentence-context text classification features include at least one of the following:
the sentence-level and chapter-level text classification features of one or more sentences before the current sentence, and the sentence-level and chapter-level text classification features of one or more sentences after the current sentence.
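The sentence-level and chapter-level features listed above are simple lexical and positional statistics. A hypothetical extractor covering a subset of them is sketched below; the field names and the whitespace tokenization are illustrative, not from the patent:

```python
def sentence_features(paragraphs, emotion_words=frozenset(), feature_words=frozenset()):
    """Sketch: compute a subset of the listed sentence-level and
    chapter-level features for every sentence of a text, given as a list
    of paragraphs, each paragraph a list of sentence strings."""
    feats = []
    for p_idx, para in enumerate(paragraphs):
        for s_idx, sent in enumerate(para):
            words = sent.rstrip('.!?').split()
            feats.append({
                # sentence-level features
                'length': len(words),
                'end_punct': sent[-1] if sent and sent[-1] in '.!?' else '',
                'emotion_count': sum(w in emotion_words for w in words),
                'feature_count': sum(w in feature_words for w in words),
                # chapter-level features
                'para_label': p_idx,
                'in_first_para': p_idx == 0,
                'in_last_para': p_idx == len(paragraphs) - 1,
                'sent_label': s_idx,
                'is_para_first': s_idx == 0,
                'is_para_last': s_idx == len(para) - 1,
                'para_sent_count': len(para),
            })
    return feats

doc = [["I love spring."], ["Flowers bloom.", "Birds sing loudly!"]]
f = sentence_features(doc, emotion_words={"love"})
print(len(f), f[0]['emotion_count'], f[2]['end_punct'])  # 3 1 !
```

The sentence-context features then follow for free: for sentence i, reuse the dictionaries at indices i-1 and i+1.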
According to a second aspect of embodiments of the present invention, a text processing device is provided. The device includes:
a text acquiring unit, configured to obtain text data to be processed;
a first text classification unit, configured to obtain one candidate category of the text data according to a first text classification model, where the first text classification model classifies the text data according to the title of the text data and the sentences the text data contains;
a second text classification unit, configured to obtain one candidate category of the text data according to a second text classification model, where the second text classification model classifies the text data according to specified sentences among the sentences the text data contains; and
a category determination unit, configured to determine the category of the text data according to the two candidate categories thus obtained.
Optionally, the first text classification model is a neural network model obtained in advance through training.
The first text classification unit includes:
a semantic matrix obtaining subunit, configured to obtain the semantic matrix of the title of the text data and the semantic matrix of each sentence in the text data;
an input subunit, configured to take the semantic matrix of the title together with the semantic matrices of the sentences as the input of the first text classification model; and
an output subunit, configured to determine one candidate category of the text data according to the probabilities, output by the first text classification model, that the text data belongs to each preset category.
Optionally, the semantic matrix obtaining subunit is configured to:
obtain the word vector of each word contained in the title and in each sentence;
form the semantic matrix of the title with the word vector of each word contained in the title as a row; and
form the semantic matrix of each sentence with the word vector of each word contained in that sentence as a row.
Optionally, the first text classification model includes a sentence coding layer, a chapter coding layer, an attention layer, a weighted sum layer, and an output layer.
The sentence coding layer performs sentence-level coding on the semantic matrix of the title and the semantic matrix of each sentence to obtain sentence-level coding features.
The chapter coding layer takes the sentence-level coding features output by the sentence coding layer as input, and re-encodes the sentence-level coding features of the title and of each sentence at the chapter level, from the perspective of the whole text, to obtain chapter-level coding features.
The attention layer takes the chapter-level coding features output by the chapter coding layer as input, and calculates the importance weight of each sentence according to the chapter-level coding features of the title and of each sentence.
The weighted sum layer takes the importance weight of each sentence output by the attention layer and the corresponding chapter-level coding features as input, and calculates the semantic matrix of the text data, where the semantic matrix of the text data is the sum of the products of each sentence's importance weight and its corresponding chapter-level coding feature.
The output layer takes the semantic matrix of the text data output by the weighted sum layer as input, and outputs the probability that the text data belongs to each preset category.
Optionally, the attention layer calculating the importance weight of each sentence according to the chapter-level coding features of the title and of each sentence includes:
calculating the attention value of each sentence according to the chapter-level coding feature of that sentence and the attention vector of the attention layer;
calculating the similarity between the chapter-level coding feature of each sentence and the chapter-level coding feature of the title, as the main-line weight of that sentence; and
calculating the importance weight of each sentence according to its attention value and main-line weight.
Optionally, the second text classification unit includes:
a specified sentence obtaining subunit, configured to obtain specified sentences from the sentences contained in the text data according to preset rules;
a classification feature extraction subunit, configured to extract the text classification features of each specified sentence, where the text classification features include at least one of the following: sentence-level text classification features describing the characteristics of the current sentence itself, chapter-level text classification features describing the characteristics of the current sentence from the perspective of the whole text, and sentence-context text classification features describing the characteristics of the current sentence from the perspective of its context; and
an input-output subunit, configured to take the text classification features of all specified sentences as the input of the second text classification model, and to determine one candidate category of the text data according to the probabilities, output by the second text classification model, that the text data belongs to each preset category.
Optionally, the specified sentence obtaining subunit is configured to:
obtain the importance weight of each sentence;
normalize and standardize the importance weights of all sentences; and
filter key sentences out of all sentences as the specified sentences, according to the relation between each sentence's normalized and standardized importance weight and a preset threshold.
Optionally, the sentence-level text classification features include at least one of the following:
sentence length, sentence-ending punctuation, the number of occurrences of emotion words in the sentence, and the number of occurrences of feature words in the sentence.
The chapter-level text classification features include at least one of the following:
the paragraph label of the sentence in the text, whether the sentence appears in the first paragraph of the text, whether the sentence appears in the last paragraph of the text, the sentence's label within its paragraph, whether the sentence is the first sentence of its paragraph, whether the sentence is the last sentence of its paragraph, the total number of sentences in its paragraph, and the average sentence length of its paragraph.
The sentence-context text classification features include at least one of the following:
the sentence-level and chapter-level text classification features of one or more sentences before the current sentence, and the sentence-level and chapter-level text classification features of one or more sentences after the current sentence.
The technical solutions provided by the embodiments of the present invention can include the following beneficial effects:
In embodiments of the present invention, the text to be classified is analyzed from two perspectives at once: the first text classification model classifies it from the chapter perspective of title plus full text, and the second text classification model classifies it from the sentence perspective of specified sentences in the text. Two candidate categories are obtained, and the category of the text is then finally determined on that basis. This effectively improves the efficiency of text classification while also improving its accuracy and reducing the influence of human subjectivity on the classification results.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present invention.
Description of the drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is apparent that those of ordinary skill in the art can also obtain other drawings from these drawings without creative labor. In addition, these drawings do not constitute a limitation on the embodiments; elements with the same reference numerals in the drawings denote similar elements, and unless specifically stated, the figures in the drawings do not constitute a limitation of scale.
Fig. 1 is a flowchart of a text processing method according to an exemplary embodiment of the present invention;
Fig. 2 is a flowchart of a text processing method according to an exemplary embodiment of the present invention;
Fig. 3 is a structural diagram of the first text classification model according to an exemplary embodiment of the present invention;
Fig. 4 is a flowchart of a text processing method according to an exemplary embodiment of the present invention;
Fig. 5 is a schematic diagram of a text processing device according to an exemplary embodiment of the present invention;
Fig. 6 is a schematic diagram of a text processing device according to an exemplary embodiment of the present invention;
Fig. 7 is a schematic diagram of a text processing device according to an exemplary embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments will be described in detail here, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatuses and methods consistent with some aspects of the invention as detailed in the appended claims.
Fig. 1 is a flowchart of a text processing method according to an exemplary embodiment of the present invention. As an example, the method can be used in devices such as mobile phones, tablet computers, desktop computers, laptops, and servers.
As shown in Fig. 1, the method may include the following steps:
Step S101: obtain text data to be processed.
The present embodiment does not limit the specific form of the text data to be processed; for example, it may be an article (such as an essay).
One or more categories can be set in advance as preset categories. Taking Chinese composition as an example, according to differences in mode of expression, the preset categories may be divided into expository, argumentative, narrative, and so on. The purpose of the present embodiment is to determine which preset category or categories the text data to be processed belongs to.
Step S102: obtain one candidate category of the text data from each of a first text classification model and a second text classification model, where the first text classification model classifies the text data according to the title of the text data and the sentences the text data contains, and the second text classification model classifies the text data according to specified sentences among the sentences the text data contains.
In order to improve the accuracy of text classification, the present embodiment analyzes the text from two perspectives at once: the first text classification model classifies the text to be classified from the chapter perspective of title plus full text, and the second text classification model classifies it from the sentence perspective of specified sentences in the text. Two candidate categories are thus obtained, and the category of the text is then finally determined on that basis.
The present embodiment does not limit which sentences the specified sentences in the text refer to; for example, the specified sentences may be the key sentences of the text. Those skilled in the art can select and design the definition of the specified sentences according to different needs and different scenarios, and all such selections and designs may be used here without departing from the spirit and scope of protection of the present invention.
As an example, the first and second text classification models may be neural network models obtained in advance through training. The present embodiment certainly does not limit the details of the neural network models, and those skilled in the art can design and combine them based on various existing neural network models.
A neural network model can generally be obtained through training. Therefore, in the present embodiment or some other embodiments of the invention, a large amount of text data can be collected in advance for training the neural network.
As an example, the text data used for training may be text written by users and collected over a network, or text obtained through image recognition. For example, when the collected text consists of Chinese compositions, the composition papers written by students during examinations can be collected, and the text data of the corresponding compositions, including the composition titles and contents, can be obtained after image recognition.
The collected texts can generally carry or be given corresponding text category labels, and the categories can be determined according to the application demand; for example, when the text is a Chinese composition, the categories can be set as expository, argumentative, and narrative. The text categories can be represented by distinct symbols; for example, for Chinese compositions, 1 can represent expository, 2 argumentative, and 3 narrative. Of course, other representations can also be used; the embodiment of the present invention is not restricted in this regard.
Step S103: determine the category of the text data according to the two candidate categories thus obtained.
For example, the two text classification models can each output the probability of the category to which the current text data belongs, on which basis it can finally be determined which category the current text data should belong to.
Specifically, when the two candidate categories differ, the candidate category with the larger probability value can be directly selected as the final category of the text to be classified. For example, if the output of the first text classification model is "narrative 80%" and the output of the second text classification model is "argumentative 70%", that is, the first model considers that the current text belongs to the narrative category with probability 80% while the second model considers that it belongs to the argumentative category with probability 70%, then the category with the larger probability can be selected as the finally determined category of the current text. Alternatively, when the two candidate categories differ, the text to be classified can be marked as having no determined category, and its final category can subsequently be determined manually, and so on.
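The selection rule of step S103 can be sketched directly; the function name and the (category, probability) tuple representation are illustrative, not from the patent:

```python
def final_category(cand1, cand2):
    """Sketch of step S103: each candidate is a (category, probability)
    pair output by one of the two classification models. When the
    categories agree, that category is returned; when they differ, the
    candidate with the larger probability wins. (Returning None instead
    would model the deferral-to-manual-review variant described above.)"""
    (cat1, p1), (cat2, p2) = cand1, cand2
    if cat1 == cat2:
        return cat1
    return cat1 if p1 >= p2 else cat2

# the example from the description: narrative 80% vs argumentative 70%
print(final_category(("narrative", 0.80), ("argumentative", 0.70)))  # narrative
```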
In the present embodiment, the text to be classified is analyzed from two perspectives at once: the first text classification model classifies it from the chapter perspective of title plus full text, and the second text classification model classifies it from the sentence perspective of specified sentences in the text. Two candidate categories are obtained, and the category of the text is then finally determined on that basis. This effectively improves the efficiency of text classification while also improving its accuracy and reducing the influence of human subjectivity on the classification results.
As shown in Fig. 2, in the present embodiment or some other embodiments of the invention, obtaining one candidate category of the text data according to the first text classification model may include:
Step S201: obtain the semantic matrix of the title of the text data and the semantic matrix of each sentence in the text data.
Text data such as a composition will usually have a title, and the semantic matrix of that title can be obtained. The content of the text data is generally composed of multiple sentences, and for each sentence its semantic matrix can also be obtained. The present embodiment does not limit the specific content of a semantic matrix; for example, a semantic matrix can usually be composed of word vectors.
As an example, obtaining the semantic matrix of the title of the text data and the semantic matrix of each sentence in the text data can include:
1) Obtain the word vector of each word contained in the title and in each sentence.
For example, the title and sentences can be segmented into words and the corresponding word vectors obtained. The word segmentation can use methods such as those based on conditional random fields, and when each segmented word is converted to a word vector, techniques such as word2vec can be used to obtain the word vector of each word; the present embodiment does not repeat the details.
2) Form the semantic matrix of the title with the word vector of each word contained in the title as a row.
3) Form the semantic matrix of each sentence with the word vector of each word contained in that sentence as a row.
The semantic matrix of the title can be obtained by taking the word vectors of the words contained in the text title as the rows of the title semantic matrix; its size is kt × m, where kt denotes the total number of words in the title and m denotes the dimension of each word vector.
The semantic matrix of each sentence in the text can be obtained by taking the word vectors of the words contained in that sentence as its rows, where the semantic matrix of each sentence has size kc × m, and kc denotes the number of words contained in the c-th sentence of the current text.
In addition, when the title and the sentences of the text, or the different sentences within the text, contain different numbers of words, the semantic matrix of the title and/or the semantic matrices of the sentences can be regularized so that all the semantic matrices have the same size. Of course, this regularization can also be omitted; the present embodiment places no restriction on this.
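Steps 1) to 3) together with the optional size regularization can be sketched with NumPy. The lookup table of random vectors stands in for trained word2vec vectors, and zero-row padding is one assumed form of the regularization:

```python
import numpy as np

def semantic_matrix(tokens, vectors, max_len, dim):
    """Sketch of semantic-matrix construction: one word vector per row,
    padded with zero rows so every matrix has the same max_len x dim
    size (the optional regularization step). `vectors` stands in for a
    word2vec-style lookup table."""
    mat = np.zeros((max_len, dim))
    for i, tok in enumerate(tokens[:max_len]):
        mat[i] = vectors[tok]
    return mat

dim = 4
rng = np.random.default_rng(1)
vectors = {w: rng.standard_normal(dim)
           for w in ("spring", "rain", "falls", "softly")}

title = semantic_matrix(["spring", "rain"], vectors, max_len=6, dim=dim)
sent = semantic_matrix(["rain", "falls", "softly"], vectors, max_len=6, dim=dim)
print(title.shape, sent.shape)  # (6, 4) (6, 4)
```

After padding, the title matrix and every sentence matrix share the same shape, which is what lets them be fed together into one input layer.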
Step S202: take the semantic matrix of the title together with the semantic matrices of the sentences as the input of the first text classification model.
Step S203: determine one candidate category of the text data according to the probabilities, output by the first text classification model, that the text data belongs to each preset category.
The concrete structure of the first text classification model is illustrated below.
As shown in Figure 3, taking an essay as an example of the text data, the first text classification model can include at least a sentence coding layer, a chapter coding layer, an attention layer, a weighted-sum layer, and an output layer.
A) The sentence coding layer performs sentence-level coding on the semantic matrix of the title and the semantic matrices of the sentences to obtain sentence-level coding features.
The semantic matrix of the current text's title and the semantic matrices of the sentences in the text serve as the input (in other words, as the input layer), which can be denoted X = {T, C1, C2, ..., Cn}, where T is the title semantic matrix, C1, C2, ..., Cn are the semantic matrices of the sentences of the current text, and n is the total number of sentences the current text contains.
The sentence coding layer can include a sentence-level encoder that performs sentence-level coding on the title and each sentence of the current text, producing the sentence-level coding features S = {st, s1, s2, ..., sn}, where st is the sentence-level coding feature obtained by sentence-coding the semantic matrix of the title, and sn is the sentence-level coding feature obtained by sentence-coding the semantic matrix of the n-th sentence. st and s1, s2, ..., sn are vectors of identical dimension; the specific dimension can be determined according to application demands or experimental results. As an example, the sentence coding layer may be implemented with structures such as convolutional neural networks or recurrent/recursive neural networks.
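As a minimal, dimension-only illustration of what a sentence-level encoder does, the sketch below maps each k × m semantic matrix to a fixed-length vector by mean-pooling the word vectors; the patent suggests convolutional or recurrent structures, so mean-pooling here is only a stand-in:

```python
import numpy as np

def encode_sentence(mat):
    """Stand-in sentence-level encoder: mean-pool the word-vector rows.
    Maps a k x m semantic matrix to a fixed m-dimensional vector, so the
    title and sentences of different lengths become same-size codes."""
    return mat.mean(axis=0)

title = np.array([[1.0, 3.0], [3.0, 5.0]])                 # 2 words, dim 2
s1 = np.array([[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]])        # 3 words, dim 2
S = np.stack([encode_sentence(title), encode_sentence(s1)])  # codes st, s1
print(S)  # [[2. 4.] [2. 4.]]
```

Whatever encoder is chosen, the key property is the same: every title/sentence, regardless of word count, comes out as a vector of one shared dimension.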
B) The chapter coding layer takes the sentence-level coding features output by the sentence coding layer as input and re-encodes the sentence-level coding features of the title and of each sentence from the perspective of the entire text to obtain chapter-level coding features.
The input of the chapter coding layer is the output of the sentence coding layer. Its output, the chapter-level coding features, can be denoted H = {ht, h1, h2, ..., hn}, where ht is the chapter-level coding feature obtained by chapter-level coding of the title's sentence-level coding feature, and hn is the chapter-level coding feature obtained by chapter-level coding of the n-th sentence's sentence-level coding feature. ht and h1, h2, ..., hn are vectors of identical dimension; the specific dimension can be determined according to application demands or experimental results. The chapter coding layer may adopt a bidirectional recurrent neural network (RNN) structure, in which bidirectional connections exist between the nodes, so that the information of the title and of all sentences of the current text is taken into consideration, thereby realizing chapter-level coding. The specific coding process is not repeated here.
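A minimal numpy sketch of a bidirectional recurrent pass over the sentence-level codes is given below; the random weights, the vanilla (ungated) RNN cell, and the dimensions are illustrative assumptions only, since the patent does not fix the architecture's details:

```python
import numpy as np

def birnn_encode(S, Wf, Wb, U):
    """Minimal bidirectional RNN over the row-stacked sentence-level codes S
    (title first, then sentences). Forward and backward hidden states are
    concatenated, so every output row depends on the whole chapter."""
    n, _ = S.shape
    h_dim = U.shape[0]
    fwd, bwd = np.zeros((n, h_dim)), np.zeros((n, h_dim))
    h = np.zeros(h_dim)
    for i in range(n):                    # left-to-right pass
        h = np.tanh(Wf @ h + U @ S[i])
        fwd[i] = h
    h = np.zeros(h_dim)
    for i in reversed(range(n)):          # right-to-left pass
        h = np.tanh(Wb @ h + U @ S[i])
        bwd[i] = h
    return np.concatenate([fwd, bwd], axis=1)   # n x (2 * h_dim)

rng = np.random.default_rng(0)
S = rng.normal(size=(4, 5))               # title + 3 sentences, code dim 5
H = birnn_encode(S, rng.normal(size=(3, 3)), rng.normal(size=(3, 3)),
                 rng.normal(size=(3, 5)))
print(H.shape)  # (4, 6): chapter-level codes ht, h1, h2, h3
```

A trained model would learn Wf, Wb, U from data; the point of the sketch is only that each output row mixes left and right context, which is what makes the coding "chapter-level".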
C) The attention layer takes the chapter-level coding features output by the chapter coding layer as input and computes the importance weight of each sentence from the chapter-level coding features of the title and of the sentences.
The importance weights can be denoted P = {p1, p2, ..., pn}, where pj is the importance weight of the j-th sentence of the current text.
The attention layer computing the importance weight of each sentence from the chapter-level coding features of the title and of the sentences can include:
C1) The attention value of each sentence is computed from the sentence's chapter-level coding feature and the attention vector of the attention layer.
As an example, the inner product of each sentence's chapter-level coding feature and the attention vector of the attention layer can be taken directly as the attention value of that sentence in the current text, computed as follows:
a_j = h_j · v^T
where a_j is the attention value of the j-th sentence of the current text, h_j is the chapter-level coding feature of the j-th sentence, and v is an attention vector of the same dimension as h_j. v is a model parameter; its initial value can be obtained by random initialization, and its final value can be trained in advance on a large amount of data.
C2) The similarity between each sentence's chapter-level coding feature and the title's chapter-level coding feature is computed, to serve as the main-line weight of that sentence.
As an example, a similarity measure such as the cosine similarity can be used (the original formula is not reproduced here):
t_j = (h_j · ht) / (‖h_j‖ ‖ht‖)
where t_j is the main-line weight of the j-th sentence of the current text.
C3) The importance weight of each sentence is computed from its attention value and main-line weight.
As an example, the product of each sentence's attention value and main-line weight is first computed, and the products are then normalized; the normalized values serve as the importance weights:
p_j = (a_j · t_j) / Σ_k (a_k · t_k)
where p_j is the importance weight of the j-th sentence of the current text.
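Steps C1 to C3 can be sketched together as follows; taking cosine similarity as the main-line weight and sum-normalizing the products are plausible readings of the text, since the patent's similarity and normalization formulas are given only as examples, and the random inputs are made up:

```python
import numpy as np

def importance_weights(H, ht, v):
    """C1: attention value a_j = h_j . v (inner product with learned vector).
    C2: main-line weight t_j = cosine similarity of h_j with the title code ht.
    C3: normalize the products a_j * t_j into importance weights p_j."""
    a = H @ v                                                   # C1
    t = (H @ ht) / (np.linalg.norm(H, axis=1) * np.linalg.norm(ht))  # C2
    scores = a * t                                              # C3
    return scores / scores.sum()

rng = np.random.default_rng(1)
H = rng.uniform(0.1, 1.0, size=(3, 4))   # chapter-level codes of 3 sentences
ht = rng.uniform(0.1, 1.0, size=4)       # chapter-level code of the title
v = rng.uniform(0.1, 1.0, size=4)        # attention vector (model parameter)
p = importance_weights(H, ht, v)
print(p.sum())  # 1.0 (weights are normalized)
```

A sentence thus scores highly only when both the learned attention vector and the title's chapter-level code agree that it matters, which is exactly the "main line" intuition above.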
D) The weighted-sum layer takes as input the importance weight of each sentence output by the attention layer and the corresponding chapter-level coding feature of each sentence, and computes the semantic matrix of the text data, wherein the semantic matrix of the text data is the sum of the products of each sentence's importance weight and its corresponding chapter-level coding feature.
As an example, the following formula can be used:
A = Σ_{j=1..n} p_j · h_j
where A is the semantic matrix of the text data.
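For row-stacked chapter-level codes, the weighted sum A = Σ_j p_j · h_j reduces to a single vector-matrix product; the numbers below are made up for illustration:

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])     # importance weights of 3 sentences
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])        # chapter-level codes, one row per sentence

A = p @ H                         # sum_j p_j * h_j
print(A)                          # [0.7 0.8]
```

Because the weights sum to 1, A is a convex combination of the sentence codes, i.e. a single fixed-size representation of the whole text dominated by its most important sentences.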
E) The output layer takes the semantic matrix of the text data output by the weighted-sum layer as input, and outputs the probability that the text data belongs to each preset category.
Once the probabilities that the current text data belongs to the preset categories have been obtained, a candidate category can be determined; for example, the preset category with the highest probability can be taken as one candidate category.
The specific neural network structure adopted for the output layer is not limited in this embodiment; the model parameters can be obtained by training in advance, and the details are not repeated here.
A text usually contains some key sentences, for example the main-line sentences in a narrative, the thesis sentences in an argumentative essay, and the declarative sentences about the main subject in an expository essay. The inventors found, in the course of implementing the present invention, that the category of a text can also be largely determined from these key sentences.
As an example, the second text classification model can be a common classification model in pattern recognition, such as a support vector machine classification model, a Bayesian classification model, a decision-tree classification model, or a neural network classification model.
As shown in Figure 4, in this embodiment or some other embodiments of the invention, obtaining a candidate category of the text data according to the second text classification model can include:
Step S401: specified sentences are obtained, according to preset rules, from the sentences contained in the text data.
As an example, the specified sentences can be key sentences. For instance, the importance weight of every sentence in the text can be computed, and the sentences whose importance weight exceeds a preset threshold can be taken as key sentences. How the importance weight of a sentence is computed is not limited in this embodiment; for example, it can be computed from the position of the sentence in the text, the length of the sentence itself, and so on. Those skilled in the art can select and design freely according to different demands and scenarios, and these selections and designs can all be used here without departing from the spirit and scope of the present invention.
As an example, obtaining specified sentences from the sentences contained in the text data according to preset rules can include:
i) The importance weight of each sentence is obtained.
For example, the importance weight of each sentence can be computed by the attention layer described above. Of course, those skilled in the art can also compute it in other ways; this embodiment imposes no limitation in this regard.
ii) The importance weights of all sentences are normalized and standardized.
As an example, the normalization can use the following formula:
p'_j = p_j / max(P)
where p'_j is the normalized importance weight of the j-th sentence of the current text, and max(P) is the maximum of all sentence importance weights in the current text.
The normalized importance weight of each sentence is then standardized to obtain the standardized sentence importance weights, for example as follows:
sp_j = (p'_j − μ) / σ
where sp_j is the standardized importance weight of the j-th sentence of the current text, μ is the mean of the normalized importance weights of all sentences in the current text, and σ is the standard deviation of the normalized importance weights of all sentences in the current text.
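The two-step procedure, dividing by the maximum weight and then z-scoring with the μ and σ defined above, can be sketched with made-up weights:

```python
import numpy as np

def normalize_standardize(p):
    """Divide by the maximum importance weight, then z-score the result."""
    p_norm = p / p.max()
    return (p_norm - p_norm.mean()) / p_norm.std()

p = np.array([0.1, 0.2, 0.4, 0.8])   # made-up sentence importance weights
sp = normalize_standardize(p)
print(np.isclose(sp.mean(), 0.0), np.isclose(sp.std(), 1.0))  # True True
```

After standardization the weights have zero mean and unit standard deviation, so a single threshold (e.g. sp_j > 1) selects "unusually important" sentences regardless of the text's overall weight scale.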
iii) Key sentences are selected from all sentences to serve as the specified sentences, according to the relation between each sentence's normalized and standardized importance weight and a preset threshold.
Step S402: the text classification features of each specified sentence are extracted, wherein the text classification features include at least one of the following kinds: sentence-level text classification features describing the characteristics of the current sentence itself, chapter-level text classification features describing the characteristics of the current sentence from the perspective of the entire text, and sentence-context text classification features describing the characteristics of the current sentence from the perspective of its context.
1. As an example, the sentence-level text classification features can include at least one of the following:
sentence length, sentence-ending punctuation, the number of occurrences of emotion words in the sentence, and the number of occurrences of feature words in the sentence.
The sentence length is the length of the current sentence, which can be represented by the number of words it contains.
The sentence-ending punctuation is the punctuation mark at the end of the current sentence in the text, such as a comma "," or a period "。".
The number of emotion words in the sentence is the number of emotion words the current sentence contains. The emotion words can be determined in advance according to application demands; during extraction, each word of the current sentence is judged in turn as to whether it is an emotion word, yielding the number of emotion words contained in the current sentence, i.e. the number of emotion-word occurrences.
The number of feature-word occurrences in the sentence is the number of times the feature words contained in the current sentence occur in it. During extraction, the feature words contained in the current sentence are first found, and the number of occurrences of each feature word in the current sentence is then counted. The feature words can be computed from the words or phrases contained in the key sentences of all texts; for example, the information gain or mutual information of a word or phrase with respect to text classification can be computed, and the words or phrases whose information gain or mutual information exceeds a threshold are taken as feature words. The threshold can be determined according to application demands. If the current sentence contains no feature words, the number of feature-word occurrences is 0.
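A toy extractor for the four sentence-level features can look like the following; whitespace tokenization, the example word lists, and the dict layout are illustrative simplifications rather than the patent's implementation:

```python
def sentence_level_features(sentence, emotion_words, feature_words):
    """Extract the four sentence-level features listed above:
    length, ending punctuation, emotion-word count, feature-word count."""
    tokens = sentence[:-1].split()   # drop ending punctuation, then tokenize
    return {
        "length": len(tokens),
        "end_punct": sentence[-1],
        "emotion_count": sum(t in emotion_words for t in tokens),
        "feature_count": sum(t in feature_words for t in tokens),
    }

f = sentence_level_features("I love this wonderful book.",
                            emotion_words={"love", "wonderful"},
                            feature_words={"book"})
print(f)  # {'length': 5, 'end_punct': '.', 'emotion_count': 2, 'feature_count': 1}
```

For Chinese text a word segmenter would replace the whitespace split, and the emotion/feature word lists would come from the lexicon construction described above.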
2. As an example, the chapter-level text classification features can include at least one of the following:
the paragraph label of the sentence in the text, whether the sentence appears in the first paragraph of the text, whether the sentence appears in the last paragraph of the text, the sentence label within its paragraph, whether the sentence is the first sentence of its paragraph, whether the sentence is the last sentence of its paragraph, the total number of sentences in its paragraph, and the average sentence length of its paragraph.
The paragraph label can be the sequence number of the current paragraph among all paragraphs, and the sentence label can be the sequence number of the current sentence among all sentences of the current paragraph.
3. As an example, the sentence-context text classification features include at least one of the following:
the sentence-level and chapter-level text classification features of one or more sentences preceding the current sentence, and the sentence-level and chapter-level text classification features of one or more sentences following the current sentence.
Step S403: the text classification features of all specified sentences are taken as the input of the second text classification model, and a candidate category of the text data is determined according to the probability, output by the second text classification model, that the text data belongs to each preset category.
In this embodiment, the text to be classified is analyzed from two angles simultaneously: the first text classification model classifies it from the chapter angle of the title plus the full text, and the second text classification model classifies it from the sentence angle of the specified sentences in the text. Two candidate categories are obtained, and the category of the text is then finally determined on that basis. This effectively improves the efficiency of text classification while also improving its accuracy and reducing the influence of human subjectivity on the classification results.
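This passage leaves open exactly how the two candidate categories are merged into one final category; one plausible rule, agreement first and otherwise the higher model confidence, can be sketched as (the rule itself is an assumption, not the patent's prescribed strategy):

```python
def final_category(cand1, prob1, cand2, prob2):
    """Combine the two candidate categories: if the two models agree,
    keep that category; otherwise fall back to the candidate whose
    model assigned the higher probability."""
    if cand1 == cand2:
        return cand1
    return cand1 if prob1 >= prob2 else cand2

print(final_category("narrative", 0.8, "narrative", 0.7))  # narrative
print(final_category("narrative", 0.6, "argument", 0.9))   # argument
```

Other combination rules (e.g. averaging the two probability distributions over categories) fit the same interface.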
The following are embodiments of the apparatus of the present invention, which can be used to perform the method embodiments of the present invention. For details not disclosed in the apparatus embodiments, refer to the method embodiments of the present invention.
Fig. 5 is a schematic diagram of a text processing apparatus according to an exemplary embodiment of the present invention. As an example, the apparatus can be used in devices such as mobile phones, tablet computers, desktop computers, laptops, and servers.
As shown in Figure 5, the apparatus can include:
a text acquiring unit 501, configured to obtain text data to be processed;
a first text classification unit 502, configured to obtain a candidate category of the text data according to a first text classification model, wherein the first text classification model is used to classify the text data according to the title of the text data and the sentences the text data contains;
a second text classification unit 503, configured to obtain a candidate category of the text data according to a second text classification model, wherein the second text classification model is used to classify the text data according to specified sentences among the sentences the text data contains;
a category determination unit 504, configured to determine the category of the text data according to the two acquired candidate categories.
In this embodiment or some other embodiments of the invention, the first text classification model can be a neural network model obtained by training in advance;
correspondingly, as shown in Figure 6, the first text classification unit can include:
a semantic matrix obtaining subelement 601, configured to obtain the semantic matrix of the title of the text data and the semantic matrix of each sentence in the text data;
an input subelement 602, configured to take the semantic matrix of the title and the semantic matrices of the sentences together as the input of the first text classification model;
an output subelement 603, configured to determine a candidate category of the text data according to the probability, output by the first text classification model, that the text data belongs to each preset category.
In this embodiment or some other embodiments of the invention, the semantic matrix obtaining subelement can be used to:
obtain the word vector of each word contained in the title and in each sentence;
take the word vectors of the words contained in the title as rows to form the semantic matrix of the title;
take the word vectors of the words contained in each sentence as rows to form the semantic matrix of that sentence.
In this embodiment or some other embodiments of the invention, the first text classification model can include a sentence coding layer, a chapter coding layer, an attention layer, a weighted-sum layer, and an output layer;
the sentence coding layer is used to perform sentence-level coding on the semantic matrix of the title and the semantic matrices of the sentences to obtain sentence-level coding features;
the chapter coding layer is used to take the sentence-level coding features output by the sentence coding layer as input and re-encode the sentence-level coding features of the title and of each sentence from the perspective of the entire text to obtain chapter-level coding features;
the attention layer is used to take the chapter-level coding features output by the chapter coding layer as input and compute the importance weight of each sentence from the chapter-level coding features of the title and of the sentences;
the weighted-sum layer is used to take the importance weight of each sentence output by the attention layer and the corresponding chapter-level coding feature of each sentence as input and compute the semantic matrix of the text data, wherein the semantic matrix of the text data is the sum of the products of each sentence's importance weight and its corresponding chapter-level coding feature;
the output layer is used to take the semantic matrix of the text data output by the weighted-sum layer as input and output the probability that the text data belongs to each preset category.
In this embodiment or some other embodiments of the invention, the attention layer computing the importance weight of each sentence from the chapter-level coding features of the title and of the sentences can include:
computing the attention value of each sentence from the sentence's chapter-level coding feature and the attention vector of the attention layer;
computing the similarity between each sentence's chapter-level coding feature and the title's chapter-level coding feature, to serve as the main-line weight of that sentence;
computing the importance weight of each sentence from its attention value and main-line weight.
As shown in Figure 7, in this embodiment or some other embodiments of the invention, the second text classification unit can include:
a specified-sentence obtaining subelement 701, configured to obtain specified sentences, according to preset rules, from the sentences contained in the text data;
a classification feature extraction subelement 702, configured to extract the text classification features of each specified sentence, wherein the text classification features include at least one of the following kinds: sentence-level text classification features describing the characteristics of the current sentence itself, chapter-level text classification features describing the characteristics of the current sentence from the perspective of the entire text, and sentence-context text classification features describing the characteristics of the current sentence from the perspective of its context;
an input-output subelement 703, configured to take the text classification features of all specified sentences as the input of the second text classification model and determine a candidate category of the text data according to the probability, output by the second text classification model, that the text data belongs to each preset category.
In this embodiment or some other embodiments of the invention, the specified-sentence obtaining subelement can be used to:
obtain the importance weight of each sentence;
normalize and standardize the importance weights of all sentences;
select key sentences from all sentences as the specified sentences according to the relation between each sentence's normalized and standardized importance weight and a preset threshold.
In this embodiment or some other embodiments of the invention, the sentence-level text classification features can include at least one of the following:
sentence length, sentence-ending punctuation, the number of occurrences of emotion words in the sentence, and the number of occurrences of feature words in the sentence;
the chapter-level text classification features include at least one of the following:
the paragraph label of the sentence in the text, whether the sentence appears in the first paragraph of the text, whether the sentence appears in the last paragraph of the text, the sentence label within its paragraph, whether the sentence is the first sentence of its paragraph, whether the sentence is the last sentence of its paragraph, the total number of sentences in its paragraph, and the average sentence length of its paragraph;
the sentence-context text classification features include at least one of the following:
the sentence-level and chapter-level text classification features of one or more sentences preceding the current sentence, and the sentence-level and chapter-level text classification features of one or more sentences following the current sentence.
In this embodiment, the text to be classified is analyzed from two angles simultaneously: the first text classification model classifies it from the chapter angle of the title plus the full text, and the second text classification model classifies it from the sentence angle of the specified sentences in the text. Two candidate categories are obtained, and the category of the text is then finally determined on that basis. This effectively improves the efficiency of text classification while also improving its accuracy and reducing the influence of human subjectivity on the classification results.
As for the apparatus in the above embodiment, the concrete manner in which each unit and module performs its operations has been described in detail in the embodiments of the method, and will not be elaborated here.
Those skilled in the art will readily conceive of other embodiments of the present invention after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present invention that follow its general principles and include common knowledge or conventional techniques in the art not disclosed herein. The specification and embodiments are to be considered exemplary only, with the true scope and spirit of the invention being indicated by the appended claims.
It should be appreciated that the present invention is not limited to the precise constructions described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (16)

1. A text processing method, characterized in that the method includes:
obtaining text data to be processed;
obtaining a candidate category of the text data according to a first text classification model and according to a second text classification model respectively, wherein the first text classification model is used to classify the text data according to the title of the text data and the sentences the text data contains, and the second text classification model is used to classify the text data according to specified sentences among the sentences the text data contains;
determining the category of the text data according to the two acquired candidate categories.
2. The method according to claim 1, characterized in that the first text classification model is a neural network model obtained by training in advance;
obtaining a candidate category of the text data according to the first text classification model includes:
obtaining the semantic matrix of the title of the text data and the semantic matrix of each sentence in the text data;
taking the semantic matrix of the title and the semantic matrices of the sentences together as the input of the first text classification model;
determining a candidate category of the text data according to the probability, output by the first text classification model, that the text data belongs to each preset category.
3. The method according to claim 2, characterized in that obtaining the semantic matrix of the title of the text data and the semantic matrix of each sentence in the text data includes:
obtaining the word vector of each word contained in the title and in each sentence;
taking the word vectors of the words contained in the title as rows to form the semantic matrix of the title;
taking the word vectors of the words contained in each sentence as rows to form the semantic matrix of that sentence.
4. The method according to claim 2, characterized in that the first text classification model includes a sentence coding layer, a chapter coding layer, an attention layer, a weighted-sum layer, and an output layer;
the sentence coding layer is used to perform sentence-level coding on the semantic matrix of the title and the semantic matrices of the sentences to obtain sentence-level coding features;
the chapter coding layer is used to take the sentence-level coding features output by the sentence coding layer as input and re-encode the sentence-level coding features of the title and of each sentence from the perspective of the entire text to obtain chapter-level coding features;
the attention layer is used to take the chapter-level coding features output by the chapter coding layer as input and compute the importance weight of each sentence from the chapter-level coding features of the title and of the sentences;
the weighted-sum layer is used to take the importance weight of each sentence output by the attention layer and the corresponding chapter-level coding feature of each sentence as input and compute the semantic matrix of the text data, wherein the semantic matrix of the text data is the sum of the products of each sentence's importance weight and its corresponding chapter-level coding feature;
the output layer is used to take the semantic matrix of the text data output by the weighted-sum layer as input and output the probability that the text data belongs to each preset category.
5. The method according to claim 4, characterized in that the attention layer computing the importance weight of each sentence from the chapter-level coding features of the title and of the sentences includes:
computing the attention value of each sentence from the sentence's chapter-level coding feature and the attention vector of the attention layer;
computing the similarity between each sentence's chapter-level coding feature and the title's chapter-level coding feature, to serve as the main-line weight of that sentence;
computing the importance weight of each sentence from its attention value and main-line weight.
6. The method according to claim 1, characterized in that obtaining a candidate category of the text data according to the second text classification model includes:
obtaining specified sentences, according to preset rules, from the sentences contained in the text data;
extracting the text classification features of each specified sentence, wherein the text classification features include at least one of the following kinds: sentence-level text classification features describing the characteristics of the current sentence itself, chapter-level text classification features describing the characteristics of the current sentence from the perspective of the entire text, and sentence-context text classification features describing the characteristics of the current sentence from the perspective of its context;
taking the text classification features of all specified sentences as the input of the second text classification model, and determining a candidate category of the text data according to the probability, output by the second text classification model, that the text data belongs to each preset category.
7. The method according to claim 6, characterized in that obtaining specified sentences from the sentences contained in the text data according to preset rules includes:
obtaining the importance weight of each sentence;
normalizing and standardizing the importance weights of all sentences;
selecting key sentences from all sentences as the specified sentences according to the relation between each sentence's normalized and standardized importance weight and a preset threshold.
8. The method according to claim 6, characterized in that the sentence-level text classification features include at least one of the following:
sentence length, sentence-ending punctuation, the number of occurrences of emotion words in the sentence, and the number of occurrences of feature words in the sentence;
the chapter-level text classification features include at least one of the following:
the paragraph label of the sentence in the text, whether the sentence appears in the first paragraph of the text, whether the sentence appears in the last paragraph of the text, the sentence label within its paragraph, whether the sentence is the first sentence of its paragraph, whether the sentence is the last sentence of its paragraph, the total number of sentences in its paragraph, and the average sentence length of its paragraph;
the sentence-context text classification features include at least one of the following:
the sentence-level and chapter-level text classification features of one or more sentences preceding the current sentence, and the sentence-level and chapter-level text classification features of one or more sentences following the current sentence.
9. A text processing apparatus, characterized in that the apparatus includes:
a text acquiring unit, configured to obtain text data to be processed;
a first text classification unit, configured to obtain a candidate category of the text data according to a first text classification model, wherein the first text classification model is used to classify the text data according to the title of the text data and the sentences the text data contains;
a second text classification unit, configured to obtain a candidate category of the text data according to a second text classification model, wherein the second text classification model is used to classify the text data according to specified sentences among the sentences the text data contains;
a category determination unit, configured to determine the category of the text data according to the two acquired candidate categories.
10. The apparatus according to claim 9, characterized in that the first text classification model is a neural network model obtained by training in advance;
the first text classification unit includes:
a semantic matrix obtaining subelement, configured to obtain the semantic matrix of the title of the text data and the semantic matrix of each sentence in the text data;
an input subelement, configured to take the semantic matrix of the title and the semantic matrices of the sentences together as the input of the first text classification model;
an output subelement, configured to determine a candidate category of the text data according to the probability, output by the first text classification model, that the text data belongs to each preset category.
11. The apparatus according to claim 10, characterized in that the semantic matrix acquiring subunit is configured to:
acquire the word vector of each word contained in the title and in each sentence;
form the semantic matrix of the title by taking the word vector of each word contained in the title as one row;
form the semantic matrix of each sentence by taking the word vector of each word contained in that sentence as one row.
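The row-stacking construction of claim 11 can be sketched in a few lines. This is an illustrative sketch, not the patent's implementation: the function name `semantic_matrix`, the toy word vectors, and the zero-vector fallback for out-of-vocabulary words are all assumptions.

```python
import numpy as np

def semantic_matrix(tokens, word_vectors, dim=4):
    """Stack the word vector of each token as one row of the matrix.
    Unknown words fall back to a zero vector (an assumption; the claim
    does not specify out-of-vocabulary handling)."""
    rows = [word_vectors.get(tok, np.zeros(dim)) for tok in tokens]
    return np.vstack(rows)

# Toy 4-dimensional word vectors (hypothetical values for illustration).
vecs = {"text": np.ones(4), "classification": np.full(4, 0.5)}
m = semantic_matrix(["text", "classification"], vecs)
print(m.shape)  # one row per word -> (2, 4)
```

The same function serves for the title and for each sentence, so the model's inputs are a list of variable-height matrices sharing one embedding dimension.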
12. The apparatus according to claim 10, characterized in that the first text classification model comprises a sentence encoding layer, a chapter encoding layer, an attention layer, a weighted summation layer, and an output layer;
the sentence encoding layer is configured to perform sentence-level encoding on the semantic matrix of the title and the semantic matrix of each sentence to obtain sentence-level encoding features;
the chapter encoding layer is configured to take the sentence-level encoding features output by the sentence encoding layer as input, and to re-encode the sentence-level encoding features of the title and each sentence at the chapter level, from the perspective of the whole text, to obtain chapter-level encoding features;
the attention layer is configured to take the chapter-level encoding features output by the chapter encoding layer as input, and to calculate the importance weight of each sentence according to the chapter-level encoding features of the title and each sentence;
the weighted summation layer is configured to take the importance weight of each sentence output by the attention layer and the corresponding chapter-level encoding features as input, and to calculate the semantic matrix of the text data, wherein the semantic matrix of the text data is the sum of the products of the importance weight of each sentence and the corresponding chapter-level encoding feature;
the output layer is configured to take the semantic matrix of the text data output by the weighted summation layer as input, and to output the probability that the text data belongs to each preset category.
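The five-layer pipeline of claim 12 can be sketched end to end as follows. This is an illustrative reduction, not the claimed model: both encoders are replaced by simple pooling placeholders (a real implementation would use trained recurrent or convolutional encoders), and all names (`classify`, `attn_vec`, `W_out`) are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify(title_mat, sent_mats, W_out, attn_vec):
    # Sentence encoding layer: one vector per title/sentence
    # (placeholder: mean pooling over the word-vector rows).
    enc = [m.mean(axis=0) for m in [title_mat] + list(sent_mats)]
    # Chapter encoding layer: re-encode each vector from the whole-text
    # perspective (placeholder: blend with the document mean).
    doc_mean = np.mean(enc, axis=0)
    chap = [0.5 * e + 0.5 * doc_mean for e in enc]
    sent_chap = chap[1:]  # chapter-level features of the body sentences
    # Attention layer: one importance weight per sentence.
    weights = softmax(np.array([e @ attn_vec for e in sent_chap]))
    # Weighted-summation layer: document vector = sum of weight * encoding.
    doc_vec = sum(w * e for w, e in zip(weights, sent_chap))
    # Output layer: probability of each preset category.
    return softmax(W_out @ doc_vec)
```

The output is a probability distribution over the preset categories, from which the output subunit of claim 10 would pick the candidate category.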
13. The apparatus according to claim 12, characterized in that the attention layer calculating the importance weight of each sentence according to the chapter-level encoding features of the title and each sentence comprises:
calculating the attention value of each sentence according to the chapter-level encoding feature of the sentence and the attention vector of the attention layer;
calculating the similarity between the chapter-level encoding feature of each sentence and the chapter-level encoding feature of the title, as the main-line weight of the sentence;
calculating the importance weight of each sentence according to the attention value and the main-line weight of the sentence.
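A minimal sketch of the claim 13 computation: the attention value is a dot product with the layer's attention vector, and the main-line weight is the cosine similarity to the title encoding. The claim does not fix how the two are combined; the product used here is one plausible choice, and all names are hypothetical.

```python
import numpy as np

def importance_weights(title_enc, sent_encs, attn_vec):
    """Combine per-sentence attention values with title-similarity
    ('main-line') weights into normalized importance weights."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    attn = np.exp([e @ attn_vec for e in sent_encs])          # attention values
    main = np.array([cos(e, title_enc) for e in sent_encs])   # title similarity
    w = attn * np.clip(main, 0.0, None)  # assumed combination: product
    return w / (w.sum() + 1e-12)
```

Sentences that are both salient under the attention vector and semantically close to the title receive the largest weights.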
14. The apparatus according to claim 9, characterized in that the second text classification unit comprises:
a specified sentence acquiring subunit, configured to acquire specified sentences from the sentences contained in the text data according to preset rules;
a classification feature extracting subunit, configured to extract the text classification features of each specified sentence, wherein the text classification features include at least one of the following kinds of features: sentence-level text classification features describing the characteristics of the current sentence itself, chapter-level text classification features describing the characteristics of the current sentence from the perspective of the whole text, and sentence context text classification features describing the characteristics of the current sentence from the perspective of its context;
an input-output subunit, configured to take the text classification features of all specified sentences as the input of the second text classification model, and to determine one candidate category of the text data according to the probabilities, output by the second text classification model, that the text data belongs to each preset category.
15. The apparatus according to claim 14, characterized in that the specified sentence acquiring subunit is configured to:
acquire the importance weight of each sentence;
normalize and standardize the importance weights of all sentences;
filter out key sentences from all sentences as the specified sentences, according to the relation between the normalized and standardized importance weight of each sentence and a preset threshold.
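The claim 15 selection step can be sketched as below. Reading "normalized and standardized" as min-max normalization followed by z-score standardization is an assumption, as are the function name and the default threshold.

```python
import numpy as np

def select_key_sentences(weights, threshold=0.0):
    """Return the indices of sentences whose normalized and standardized
    importance weight exceeds the preset threshold."""
    w = np.asarray(weights, dtype=float)
    w = (w - w.min()) / (w.max() - w.min() + 1e-9)  # min-max normalization
    w = (w - w.mean()) / (w.std() + 1e-9)           # z-score standardization
    return [i for i, s in enumerate(w) if s > threshold]
```

With the default threshold of 0, this keeps every sentence whose weight is above the document average, i.e. the "key" sentences fed to the second text classification model.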
16. The apparatus according to claim 14, characterized in that the sentence-level text classification features include at least one of the following features:
sentence length, sentence-ending punctuation, the number of occurrences of emotion words in the sentence, and the number of occurrences of feature words in the sentence;
the chapter-level text classification features include at least one of the following features:
the paragraph index of the sentence within the text, whether the sentence appears in the first paragraph of the text, whether the sentence appears in the last paragraph of the text, the index of the sentence within its paragraph, whether the sentence is the first sentence of its paragraph, whether the sentence is the last sentence of its paragraph, the total number of sentences in the paragraph containing the sentence, and the average sentence length of the paragraph containing the sentence;
the sentence context text classification features include at least one of the following features:
the sentence-level and chapter-level text classification features of one or more sentences preceding the current sentence, and the sentence-level and chapter-level text classification features of one or more sentences following the current sentence.
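Several of the claim 16 features can be computed directly from a paragraph-segmented text. The feature names, the whitespace tokenization, and the emotion-word lexicon below are illustrative assumptions; context features would reuse the same function on the neighboring sentence positions.

```python
def sentence_features(paragraphs, p_idx, s_idx, emotion_words):
    """A subset of the listed sentence-level and chapter-level features
    for the sentence at paragraph p_idx, position s_idx."""
    para = paragraphs[p_idx]
    sent = para[s_idx]
    words = sent.split()
    return {
        # sentence-level features
        "length": len(words),
        "end_punct": sent[-1] if sent else "",
        "emotion_word_count": sum(w.strip(".!?") in emotion_words for w in words),
        # chapter-level features
        "para_index": p_idx,
        "in_first_para": p_idx == 0,
        "in_last_para": p_idx == len(paragraphs) - 1,
        "is_first_sent": s_idx == 0,
        "is_last_sent": s_idx == len(para) - 1,
        "para_sent_count": len(para),
        "para_avg_len": sum(len(s.split()) for s in para) / len(para),
    }
```

The resulting feature dictionaries, one per specified sentence, would be vectorized and concatenated as the input of the second text classification model.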
CN201611045925.3A 2016-11-22 2016-11-22 Text processing method and device Active CN108090099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611045925.3A CN108090099B (en) 2016-11-22 2016-11-22 Text processing method and device

Publications (2)

Publication Number Publication Date
CN108090099A true CN108090099A (en) 2018-05-29
CN108090099B CN108090099B (en) 2022-02-25

Family

ID=62171118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611045925.3A Active CN108090099B (en) 2016-11-22 2016-11-22 Text processing method and device

Country Status (1)

Country Link
CN (1) CN108090099B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101576928A (en) * 2009-06-11 2009-11-11 腾讯科技(深圳)有限公司 Method and device for selecting related article
CN104881458A (en) * 2015-05-22 2015-09-02 国家计算机网络与信息安全管理中心 Labeling method and device for web page topics
CN104899322A (en) * 2015-06-18 2015-09-09 百度在线网络技术(北京)有限公司 Search engine and implementation method thereof
CN105022725A (en) * 2015-07-10 2015-11-04 河海大学 Text emotional tendency analysis method applied to field of financial Web
CN105760526A (en) * 2016-03-01 2016-07-13 网易(杭州)网络有限公司 News classification method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BA, Zhichao et al.: "Research on keyword selection and semantic measurement methods in co-occurrence analysis", Journal of the China Society for Scientific and Technical Information (《情报学报》) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284406A (en) * 2018-09-03 2019-01-29 四川长虹电器股份有限公司 Intension recognizing method based on difference Recognition with Recurrent Neural Network
CN109284406B (en) * 2018-09-03 2021-12-03 四川长虹电器股份有限公司 Intention identification method based on difference cyclic neural network
CN110276070A (en) * 2019-05-22 2019-09-24 广州多益网络股份有限公司 A kind of corpus processing method, device and storage medium
CN110276070B (en) * 2019-05-22 2023-04-07 广州多益网络股份有限公司 Corpus processing method, apparatus and storage medium
CN112380349A (en) * 2020-12-04 2021-02-19 有米科技股份有限公司 Commodity gender classification method and device and electronic equipment
CN113449109A (en) * 2021-07-06 2021-09-28 广州华多网络科技有限公司 Security class label detection method and device, computer equipment and storage medium
CN115358206A (en) * 2022-10-19 2022-11-18 上海浦东华宇信息技术有限公司 Text typesetting method and system

Also Published As

Publication number Publication date
CN108090099B (en) 2022-02-25

Similar Documents

Publication Publication Date Title
US10430689B2 (en) Training a classifier algorithm used for automatically generating tags to be applied to images
US20230195773A1 (en) Text classification method, apparatus and computer-readable storage medium
CN108363790A Method, apparatus, device and storage medium for assessment
CN106445919A (en) Sentiment classifying method and device
CN108573047A Training method and device for an automatic Chinese text classification module
CN107025284A Method for recognizing the sentiment tendency of online comment text, and convolutional neural network model
CN108090099A Text processing method and device
CN109977199B Reading comprehension method based on attention pooling mechanism
CN107085581A (en) Short text classification method and device
CN111966917A (en) Event detection and summarization method based on pre-training language model
CN107122349A Text feature word extraction method based on word2vec-LDA models
CN108090098A Text processing method and device
CN108090048A University evaluation system based on multivariate data analysis
CN109446423B (en) System and method for judging sentiment of news and texts
CN108108468A Short text sentiment analysis method and apparatus based on concepts and text emotion
CN110309114A Media information processing method, apparatus, storage medium and electronic device
CN108009248A Data classification method and system
CN112434520A (en) Named entity recognition method and device and readable storage medium
CN111475651B (en) Text classification method, computing device and computer storage medium
CN112667813A Method for identifying sensitive identity information in judgment documents
CN109359198A Text classification method and device
CN107797981B (en) Target text recognition method and device
CN107783958B (en) Target statement identification method and device
CN110309355A Content tag generation method, apparatus, device and storage medium
CN110889412B Method and device for locating and classifying long medical text in physical examination reports

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant