CN107783958A - Target sentence recognition method and device - Google Patents

Target sentence recognition method and device Download PDF

Info

Publication number
CN107783958A
CN107783958A (application CN201610792978.5A)
Authority
CN
China
Prior art keywords
sentence
word
current sentence
repeated word
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610792978.5A
Other languages
Chinese (zh)
Other versions
CN107783958B (en)
Inventor
施亮亮
付瑞吉
胡国平
宋巍
秦兵
刘挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201610792978.5A priority Critical patent/CN107783958B/en
Publication of CN107783958A publication Critical patent/CN107783958A/en
Application granted granted Critical
Publication of CN107783958B publication Critical patent/CN107783958B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the invention provide a target sentence recognition method and device. The method includes: obtaining a text to be processed, where the text contains one or more natural-language sentences; extracting recognition features of each sentence, where the recognition features include a first feature and/or a second feature, the first feature being used to indicate semantic characteristics of the sentence and the second feature being used to indicate word-level characteristics of the sentence; and identifying the target sentences in the text according to a pre-built target sentence recognition model and the recognition features of each sentence in the text. The invention can automatically find sentences that belong to the target category (such as elegant sentences), thereby greatly improving the efficiency of target sentence recognition; at the same time, the recognition criterion of the invention is based on the target's features and a model, so the recognition result is more objective, avoiding the subjectivity of manual recognition.

Description

Target sentence recognition method and device
Technical field
The present invention relates to the field of natural language processing, and in particular to a target sentence recognition method and device.
Background technology
When reading an article (for example, a student composition or other text content), people often look for certain target sentences in the article for a particular purpose, for example elegant sentences. In the existing approach to target sentence recognition, a person usually reads the article and then points out the target sentences in it. For example, when correcting a composition, a teacher may mark the elegant sentences in the composition and give corresponding comments, which is of great value in helping students improve their writing; here an elegant sentence generally refers to a sentence that is elegantly expressed, offers an original view, and so on, for example a sentence that uses several idioms or quotes from the classics.
However, the inventors found in the course of making the present invention that, with the rapid development of information technology, the education sector has entered the information age: numerous online education platforms have emerged, and more and more students are becoming accustomed to online education. On a single online education platform, a large number of students study and take tests online as users, so a teacher no longer faces the traditional class of a few dozen students but tens of thousands of platform users. Under these new circumstances, the teacher's workload multiplies, and correcting compositions in particular is time-consuming and laborious. At the same time, teachers are often rather subjective when correcting compositions: different teachers may judge different sentences of the same composition to be the target sentences, i.e. the recognition result depends entirely on the person reading the article, which is not conducive to improving students' writing. Therefore, industries such as online education currently need a method that can recognize target sentences efficiently and objectively.
Summary of the invention
The present invention provides a target sentence recognition method and device, so as to improve the efficiency of recognizing target sentences in a text.
According to a first aspect of the embodiments of the present invention, a target sentence recognition method is provided, the method including:
obtaining a text to be processed, where the text contains one or more natural-language sentences;
extracting recognition features of each sentence, where the recognition features include a first feature and/or a second feature, the first feature being used to indicate semantic characteristics of the sentence, and the second feature being used to indicate word-level characteristics of the sentence;
identifying the target sentences in the text according to a pre-built target sentence recognition model and the recognition features of each sentence in the text.
Optionally, when the recognition features include the first feature, extracting the first feature of each sentence includes:
performing word segmentation on the current sentence;
obtaining the word vector of each word after segmentation;
obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and a pre-built first recognition model, where the first recognition model comprises, in order, an LSTM-RNN layer, a pA operation layer, a weighted-sum layer and an output layer.
Optionally, obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and the pre-built first recognition model includes:
inputting the word vectors of the words of the current sentence into the LSTM-RNN layer;
using the output of the LSTM-RNN layer as the input of the pA operation layer, and performing, in the pA operation layer, a dot-product operation between a pA vector and the value of each node so as to enhance the historical information kept by each node;
using the input and the output of the pA operation layer together as the input of the weighted-sum layer, and performing, in the weighted-sum layer, a weighted summation over the node values and the pA-enhanced node values;
inputting the result of the weighted summation into the output layer, obtaining, through a preset formula in the output layer, a preliminary probability that the current sentence belongs to the target sentences, and taking the preliminary probability as the first feature of the sentence.
Optionally, the second feature includes one or more of the following:
a part-of-speech distribution, used to indicate the proportion of words of each part of speech among the words of the current sentence;
an average word frequency, used to indicate the average number of occurrences, in all collected texts, of the words in the current sentence;
a maximum word frequency and a minimum word frequency, used to indicate the maximum and minimum numbers of occurrences, in all collected texts, of the words in the current sentence;
whether an idiom is included;
a non-repeated-word ratio, used to indicate the proportion of non-repeated words among the words of the current sentence;
a repeated-word type count, used to indicate the number of types of repeated words in the current sentence, where identical repeated words count as one type.
Optionally:
extracting the part-of-speech distribution of the current sentence includes:
counting the total number of words in the current sentence, and calculating the ratio of the number of words of each part of speech to the total number of words, so as to obtain the part-of-speech distribution of the current sentence;
extracting the average word frequency of the current sentence includes:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and calculating the average of these counts, so as to obtain the average word frequency of the current sentence;
extracting the maximum word frequency and the minimum word frequency of the current sentence includes:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and taking the maximum and the minimum of these counts as the maximum word frequency and the minimum word frequency of the current sentence, respectively;
extracting the non-repeated-word ratio of the current sentence includes:
finding the non-repeated words in the current sentence, where non-repeated words are words that differ in written form, counting the total number of non-repeated words, and taking the ratio of this total to the total number of words of the current sentence as the non-repeated-word ratio of the current sentence;
extracting the repeated-word type count of the current sentence includes:
finding the repeated words in the current sentence, where repeated words are words with identical written form, and taking the number of types of repeated words in the current sentence as the repeated-word type count, where identical repeated words count as one type.
Optionally, identifying the target sentences in the text according to the pre-built target sentence recognition model and the recognition features of each sentence in the text includes:
using the recognition features of the current sentence as the input of the target sentence recognition model;
receiving the output of the target sentence recognition model, where the output is the probability that the current sentence belongs to the target sentences;
when the probability is greater than a preset threshold, determining that the current sentence belongs to the target sentences.
Optionally, after the target sentences in the text are identified, the method further includes:
marking the target sentences in the text in a predetermined manner.
According to a second aspect of the embodiments of the present invention, a target sentence recognition device is provided, the device including:
an input module, configured to obtain a text to be processed, where the text contains one or more natural-language sentences;
a feature extraction module, configured to extract recognition features of each sentence, where the recognition features include a first feature and/or a second feature, the first feature being used to indicate semantic characteristics of the sentence, and the second feature being used to indicate word-level characteristics of the sentence;
a recognition module, configured to identify the target sentences in the text according to a pre-built target sentence recognition model and the recognition features of each sentence in the text.
Optionally, when the recognition features include the first feature, extracting the first feature of each sentence includes:
performing word segmentation on the current sentence;
obtaining the word vector of each word after segmentation;
obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and a pre-built first recognition model, where the first recognition model comprises, in order, an LSTM-RNN layer, a pA operation layer, a weighted-sum layer and an output layer.
Optionally, obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and the pre-built first recognition model includes:
inputting the word vectors of the words of the current sentence into the LSTM-RNN layer;
using the output of the LSTM-RNN layer as the input of the pA operation layer, and performing, in the pA operation layer, a dot-product operation between a pA vector and the value of each node so as to enhance the historical information kept by each node;
using the input and the output of the pA operation layer together as the input of the weighted-sum layer, and performing, in the weighted-sum layer, a weighted summation over the node values and the pA-enhanced node values;
inputting the result of the weighted summation into the output layer, obtaining, through a preset formula in the output layer, a preliminary probability that the current sentence belongs to the target sentences, and taking the preliminary probability as the first feature of the sentence.
Optionally, the second feature includes one or more of the following:
a part-of-speech distribution, used to indicate the proportion of words of each part of speech among the words of the current sentence;
an average word frequency, used to indicate the average number of occurrences, in all collected texts, of the words in the current sentence;
a maximum word frequency and a minimum word frequency, used to indicate the maximum and minimum numbers of occurrences, in all collected texts, of the words in the current sentence;
whether an idiom is included;
a non-repeated-word ratio, used to indicate the proportion of non-repeated words among the words of the current sentence;
a repeated-word type count, used to indicate the number of types of repeated words in the current sentence, where identical repeated words count as one type.
Optionally:
extracting the part-of-speech distribution of the current sentence includes:
counting the total number of words in the current sentence, and calculating the ratio of the number of words of each part of speech to the total number of words, so as to obtain the part-of-speech distribution of the current sentence;
extracting the average word frequency of the current sentence includes:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and calculating the average of these counts, so as to obtain the average word frequency of the current sentence;
extracting the maximum word frequency and the minimum word frequency of the current sentence includes:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and taking the maximum and the minimum of these counts as the maximum word frequency and the minimum word frequency of the current sentence, respectively;
extracting the non-repeated-word ratio of the current sentence includes:
finding the non-repeated words in the current sentence, where non-repeated words are words that differ in written form, counting the total number of non-repeated words, and taking the ratio of this total to the total number of words of the current sentence as the non-repeated-word ratio of the current sentence;
extracting the repeated-word type count of the current sentence includes:
finding the repeated words in the current sentence, where repeated words are words with identical written form, and taking the number of types of repeated words in the current sentence as the repeated-word type count, where identical repeated words count as one type.
Optionally, the recognition module is configured to:
use the recognition features of the current sentence as the input of the target sentence recognition model;
receive the output of the target sentence recognition model, where the output is the probability that the current sentence belongs to the target sentences;
when the probability is greater than a preset threshold, determine that the current sentence belongs to the target sentences.
Optionally, the device further includes:
a marking module, configured to mark the target sentences in the text in a predetermined manner.
The technical solutions provided by the embodiments of the present invention may have the following beneficial effects:
The present invention recognizes each natural-language sentence in the text according to its semantic and/or word-level features by means of a target sentence recognition model built in advance through training, so that sentences belonging to the target category (for example, elegant sentences) can be found automatically, which greatly improves the recognition efficiency; at the same time, the recognition criterion of the invention is based on the target's features and a model, so the recognition result is more objective, avoiding the subjectivity of manual recognition.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory and do not limit the present invention.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; obviously, those of ordinary skill in the art can obtain other drawings from these drawings without creative effort. In addition, these drawings do not limit the embodiments; elements with the same reference numerals in the drawings denote similar elements, and unless otherwise stated, the figures in the drawings are not drawn to scale.
Fig. 1 is a flowchart of a target sentence recognition method according to an exemplary embodiment of the present invention;
Fig. 2 is a flowchart of a target sentence recognition method according to an exemplary embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a first recognition model according to an exemplary embodiment of the present invention;
Fig. 4 is a flowchart of a target sentence recognition method according to an exemplary embodiment of the present invention;
Fig. 5 is a flowchart of a target sentence recognition method according to an exemplary embodiment of the present invention;
Fig. 6 is a schematic diagram of a target sentence recognition device according to an exemplary embodiment of the present invention;
Fig. 7 is a schematic diagram of a target sentence recognition device according to an exemplary embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments are described in detail here, examples of which are shown in the drawings. Where the following description refers to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present invention as detailed in the appended claims.
Fig. 1 is a flowchart of a target sentence recognition method according to an exemplary embodiment of the present invention. The method can be used, for example, in a terminal such as a mobile phone or a computer, or in a server.
Referring to Fig. 1, the method may include:
Step S101: obtain a text to be processed, where the text contains one or more natural-language sentences.
For example, a student composition or the like may be received as the text to be processed. In the present invention, a natural-language sentence may be referred to simply as a sentence. The text can be split into sentences according to the punctuation in the text, i.e. a run of content ending with a period, question mark, exclamation mark, ellipsis or the like is taken as one sentence.
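As a purely illustrative sketch (not part of the patented method itself; the function name and the exact punctuation set are assumptions), such punctuation-based sentence splitting might look like this in Python:

```python
import re

# Sentence-final punctuation mentioned above: period, question mark,
# exclamation mark, ellipsis (full-width Chinese forms).
_SENT_END = r"([。？！…]+)"

def split_sentences(text: str) -> list:
    """Split a text into sentences at sentence-final punctuation."""
    parts = re.split(_SENT_END, text)
    sentences = []
    # re.split keeps the captured punctuation as separate items;
    # glue each punctuation run back onto the content before it.
    for content, punct in zip(parts[0::2], parts[1::2] + [""]):
        s = (content + punct).strip()
        if s:
            sentences.append(s)
    return sentences

print(split_sentences("小草开始偷偷地从地里钻出来。你好，你好，欢迎！"))
```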
Step S102: extract recognition features of each sentence, where the recognition features include a first feature and/or a second feature, the first feature being used to indicate semantic characteristics of the sentence, and the second feature being used to indicate word-level characteristics of the sentence.
The first feature and the second feature describe the sentence from two different angles, semantics and words, respectively. In use, the recognition features of a sentence may include the first feature or the second feature, or a combination of the first feature and the second feature. This embodiment does not limit the specific contents of the first feature and the second feature; those skilled in the art can design them for different needs and different scenarios, and such designs do not depart from the spirit and scope of protection of the present invention.
Step S103: identify the target sentences in the text according to a pre-built target sentence recognition model and the recognition features of each sentence in the text.
For example, a large number of texts can be collected and labeled manually in advance to serve as training samples, and the target sentence recognition model is built in advance through training. In use, the recognition features of a sentence are input into the target sentence recognition model, and whether the sentence belongs to the target sentences is judged according to the output. For example, the output may be the probability that the sentence belongs to the target sentences; for the scenario of elegant sentences, this probability may be called the elegance degree of the sentence.
In this embodiment, each natural-language sentence in the text is recognized according to its semantic and/or word-level features by means of a target sentence recognition model built in advance through training, so that sentences belonging to the target category (for example, elegant sentences) can be found automatically, which greatly improves the recognition efficiency; at the same time, the recognition criterion of this embodiment is based on the target's features and a model, so the recognition result is more objective, avoiding the subjectivity of manual recognition.
Referring to Fig. 2, in this embodiment or some other embodiments of the present invention, when the recognition features include the first feature, extracting the first feature of each sentence may include:
Step S201: perform word segmentation on the current sentence.
This embodiment does not limit the specific segmentation technique; for example, a conditional-random-field-based method may be used to segment the text, etc.
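As a purely illustrative sketch (the embodiment does not name a particular toolkit; the jieba segmenter is used here only as a stand-in for a CRF-based segmenter):

```python
import jieba  # third-party Chinese word segmenter, used here only as an example

sentence = "小草开始偷偷地从地里钻出来"
words = list(jieba.cut(sentence))
print(words)  # e.g. ['小草', '开始', '偷偷', '地', '从', '地里', '钻', '出来']
```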
Step S202: obtain the word vector of each word after segmentation.
For example, the word vector of each word can be obtained by training with the word2vec method.
For a sentence, its word vectors can be written as (w1, w2, ..., wn).
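As a purely illustrative sketch (the embodiment names word2vec but no particular implementation; the gensim toolkit, the tiny corpus and the vector size are assumptions):

```python
from gensim.models import Word2Vec

# Assumed corpus: segmented sentences from the collected texts.
corpus = [
    ["小草", "开始", "偷偷", "地", "从", "地", "里", "钻", "出来"],
    ["你好", "你好", "欢迎"],
]

model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)

# Word vectors (w1, w2, ..., wn) of one sentence:
vectors = [model.wv[w] for w in corpus[0]]
print(len(vectors), vectors[0].shape)  # n vectors, each of dimension 100
```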
Step S203: obtain the first feature of the current sentence according to the word vectors of the words of the current sentence and a pre-built first recognition model, where the first recognition model comprises, in order, an LSTM-RNN layer, a pA operation layer, a weighted-sum layer and an output layer. Here RNN stands for recurrent neural network, and LSTM stands for Long Short-Term Memory.
As an example, reference can be made to Fig. 3, which shows an example structure of the first recognition model, comprising an LSTM-RNN layer, a pA (pseudo-Attention) operation layer, a weighted-sum layer and an output layer.
As an example, obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and the pre-built first recognition model may specifically include:
i) Input the word vectors of the words of the current sentence into the LSTM-RNN layer.
With the word vectors (w1, w2, ..., wn) of a sentence as the input of the LSTM-RNN layer, the LSTM-RNN layer encodes the current sentence and preserves the historical information of each word during encoding; the value of the t-th node of the LSTM-RNN layer is h_t = LSTM(w_t, h_{t-1}), where LSTM(·) is the function that encodes the input word vectors and h_{t-1} is the value of the (t-1)-th node, i.e. the historical information at the t-th node. LSTM-RNN itself belongs to the prior art and is not described further here.
ii) Use the output of the LSTM-RNN layer as the input of the pA operation layer, and in the pA operation layer perform a dot-product operation between the pA vector and the value of each node, so as to enhance the historical information kept by each node.
The output of the LSTM-RNN layer is the input of the pA operation layer. Because the node values are dot-multiplied with the pA vector, the layer is called the pA operation layer. Enhancing the historical information kept by each node prevents the historical information of a node from decaying over time. The value of the enhanced t-th node is α_t = dot(h_t, a), where dot(·) is the dot-product function and a is the pA vector; the pA vector is a model parameter whose specific value can be obtained by training on a large amount of text data. In addition, the notion of a node belongs to the prior art in fields such as neural networks and is not further elaborated in the present invention.
iii) Use the input and the output of the pA operation layer together as the input of the weighted-sum layer, and in the weighted-sum layer perform a weighted summation over the node values and the pA-enhanced node values.
Before the weighted summation, the pA-enhanced node values may first be normalized to obtain the normalized value β_t of the t-th node; the node values h_t are then summed with the weights β_t, i.e. h = Σ_t β_t·h_t.
iv) Input the result of the weighted summation into the output layer, obtain, through a preset formula in the output layer, a preliminary probability that the current sentence belongs to the target sentences, and take the preliminary probability as the first feature of the sentence.
As an example, the preset formula may be p = sigmoid(W·h + b), where p is the output, and W and b are model parameters whose specific values can be obtained by training on a large amount of text data.
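As a purely illustrative sketch of a first recognition model with this layer structure (the PyTorch framework, the hidden size and the use of softmax for the normalization step are assumptions not stated in the patent):

```python
import torch
import torch.nn as nn

class FirstRecognitionModel(nn.Module):
    """LSTM-RNN layer -> pA operation layer -> weighted-sum layer -> output layer."""

    def __init__(self, embed_dim: int = 100, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # LSTM-RNN layer
        self.pa_vector = nn.Parameter(torch.randn(hidden_dim))        # pA vector a
        self.out = nn.Linear(hidden_dim, 1)                           # parameters W and b

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        # word_vectors: (batch, n, embed_dim), i.e. the vectors (w1, ..., wn)
        h, _ = self.lstm(word_vectors)                 # h_t = LSTM(w_t, h_{t-1})
        alpha = h @ self.pa_vector                     # alpha_t = dot(h_t, a)
        beta = torch.softmax(alpha, dim=1)             # normalization (softmax assumed)
        sent = (beta.unsqueeze(-1) * h).sum(dim=1)     # h = sum_t beta_t * h_t
        p = torch.sigmoid(self.out(sent))              # p = sigmoid(W*h + b)
        return p.squeeze(-1)                           # preliminary probability per sentence

model = FirstRecognitionModel()
dummy = torch.randn(2, 7, 100)   # 2 sentences of 7 words, 100-dim word vectors
print(model(dummy))              # two values in (0, 1)
```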
Of course, in other embodiments of the present invention, the first recognition model may also be described using other models, such as a CNN (convolutional neural network) or an LSTM (Long Short-Term Memory) network. Several different neural network models may also be used to describe the first recognition model separately, a first feature of the current sentence obtained from each, and these first features then used together as the first feature of the current sentence.
In this embodiment or some other embodiments of the present invention, the second feature may include one or more of the following (a combined sketch of these word-level features is given after item 6) below):
1) A part-of-speech distribution, used to indicate the proportion of words of each part of speech among the words of the current sentence.
In specific implementation, extracting the part-of-speech distribution of the current sentence may include:
counting the total number of words in the current sentence, and calculating, for each part of speech (such as noun, verb, adjective, adverb, conjunction, etc.), the ratio of the number of words of that part of speech to the total number of words, so as to obtain the part-of-speech distribution of the current sentence.
For example, if the current sentence is "小草开始偷偷地从地里钻出来" ("the little grass begins to quietly drill its way out of the ground"), segmentation yields 10 words in total, of which 2 are nouns, 3 are verbs, 1 is an adjective, 1 is an adverb, 0 are conjunctions and 3 are other words; the part-of-speech distribution of this sentence over nouns, verbs, adjectives, adverbs, conjunctions and other words is therefore: 0.2, 0.3, 0.1, 0.1, 0.0, 0.3.
2) An average word frequency, used to indicate the average number of occurrences, in all collected texts, of the words in the current sentence.
In specific implementation, extracting the average word frequency of the current sentence may include:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and calculating the average of these counts, so as to obtain the average word frequency of the current sentence.
3) A maximum word frequency and a minimum word frequency, used to indicate the maximum and minimum numbers of occurrences, in all collected texts, of the words in the current sentence.
In specific implementation, extracting the maximum word frequency and the minimum word frequency of the current sentence may include:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and taking the maximum and the minimum of these counts as the maximum word frequency and the minimum word frequency of the current sentence, respectively.
4) Whether an idiom is included.
In specific implementation, whether each word in the current sentence is an idiom can be checked in turn against a pre-built idiom table; if any word in the current sentence is an idiom, the current sentence is considered to contain an idiom, otherwise it is considered not to contain an idiom. This can further be represented by 0 or 1, for example with 1 indicating that the current sentence contains an idiom and 0 indicating that it does not.
5) A non-repeated-word ratio, used to indicate the proportion of non-repeated words among the words of the current sentence.
In specific implementation, extracting the non-repeated-word ratio of the current sentence may include:
finding the non-repeated words in the current sentence, where non-repeated words are words that differ in written form, counting the total number of non-repeated words, and taking the ratio of this total to the total number of words of the current sentence as the non-repeated-word ratio of the current sentence.
For example, for the current sentence "小草开始偷偷地从地里钻出来", segmentation yields 10 words in total, among which 2 words are identical, namely the earlier "地" and the later "地", while the other 8 words are mutually different, so the non-repeated-word ratio of this sentence is 8/10 = 0.8.
6) A repeated-word type count, used to indicate the number of types of repeated words in the current sentence, where identical repeated words count as one type.
In specific implementation, extracting the repeated-word type count of the current sentence may include:
finding the repeated words in the current sentence, where repeated words are words with identical written form, and taking the number of types of repeated words in the current sentence as the repeated-word type count, where identical repeated words count as one type.
For example, if the current sentence is "你好，你好，欢迎" ("hello, hello, welcome"), the words "你" and "好" each occur twice and are therefore repeated words, and the two differ from each other in written form, so the repeated-word type count of the current sentence is 2.
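As a purely illustrative sketch of the word-level features 1)-6) above (the function name and tag labels are assumptions; the segmented words, part-of-speech tags, corpus word-frequency table and idiom table are assumed to come from earlier steps, and the example tags are illustrative only):

```python
from collections import Counter

def second_features(words, pos_tags, word_freq, idiom_table):
    """words/pos_tags: segmented current sentence and its part-of-speech tags;
    word_freq: occurrence counts over all collected texts; idiom_table: set of idioms."""
    n = len(words)

    # 1) part-of-speech distribution over noun, verb, adjective, adverb, conjunction, other
    pos_order = ["n", "v", "a", "d", "c", "other"]
    pos_counts = Counter(t if t in pos_order else "other" for t in pos_tags)
    pos_dist = [pos_counts[p] / n for p in pos_order]

    # 2)-3) average, maximum and minimum word frequency over all collected texts
    freqs = [word_freq.get(w, 0) for w in words]
    avg_freq, max_freq, min_freq = sum(freqs) / n, max(freqs), min(freqs)

    # 4) whether the sentence contains an idiom (1 = yes, 0 = no)
    has_idiom = int(any(w in idiom_table for w in words))

    # 5) non-repeated-word ratio: words occurring only once / total words (8/10 in the example)
    counts = Counter(words)
    non_repeated_ratio = sum(1 for c in counts.values() if c == 1) / n

    # 6) repeated-word type count: number of distinct words occurring more than once
    repeated_types = sum(1 for c in counts.values() if c > 1)

    return pos_dist + [avg_freq, max_freq, min_freq, has_idiom,
                       non_repeated_ratio, repeated_types]

words = ["小", "草", "开始", "偷偷", "地", "从", "地", "里", "钻", "出来"]
pos_tags = ["a", "n", "v", "d", "other", "other", "n", "other", "v", "v"]
print(second_features(words, pos_tags, {"小草": 3, "地": 50}, {"引经据典"}))
```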
Referring to Fig. 4, in this embodiment or some other embodiments of the present invention, identifying the target sentences in the text according to the pre-built target sentence recognition model and the recognition features of each sentence in the text may include:
Step S401: use the recognition features of the current sentence as the input of the target sentence recognition model.
Step S402: receive the output of the target sentence recognition model, where the output is the probability that the current sentence belongs to the target sentences.
Step S403: when the probability is greater than a preset threshold, determine that the current sentence belongs to the target sentences.
As an example, the target sentence recognition model may be a common classification model, such as a support vector machine model or a decision tree model.
The target sentence recognition model can be obtained by training in advance. For example, the recognition features of a sentence and a manually annotated label indicating whether the sentence belongs to the target sentences can be used as a training sample, and the parameters of the model are trained and updated accordingly.
The manual labels can be of two kinds, i.e. the current sentence is a target sentence or the current sentence is not a target sentence, represented for example by 1 or 0: a label of 1 means that the current sentence is a target sentence, and a label of 0 means that it is not. During annotation, the same sentence can be given to two annotators to label independently; if their annotations agree, the label is considered correct, otherwise the sentence can be handed to a domain expert for labeling, and the domain expert's annotation is taken as final. The parameters of the model are updated with the training samples, and after training ends the parameter values of the target sentence recognition model are obtained. The specific training process is not repeated here.
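As a purely illustrative sketch of training and applying such a classifier (the patent names support vector machines and decision trees but no toolkit; scikit-learn, the placeholder data and the 0.5 threshold are assumptions):

```python
import numpy as np
from sklearn.svm import SVC

# X: one row of recognition features (first feature and/or second features) per sentence;
# y: manual labels, 1 = target sentence, 0 = not a target sentence.
X = np.random.rand(200, 12)             # placeholder training features
y = np.random.randint(0, 2, size=200)   # placeholder labels

model = SVC(probability=True)           # support vector machine classifier
model.fit(X, y)

threshold = 0.5                         # preset threshold (assumed value)
features = np.random.rand(1, 12)        # recognition features of the current sentence
prob = model.predict_proba(features)[0, 1]   # probability of being a target sentence
print(prob, prob > threshold)
```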
In addition, referring to Fig. 5, in this embodiment or some other embodiments of the present invention, after the target sentences in the text have been identified, the method may further include:
Step S104: mark the target sentences in the text in a predetermined manner.
For example, taking elegant sentences as the target sentences, after the elegant sentences in an article have been identified, the corresponding elegant sentences can be marked in the article. The specific marking manner is not limited by the present invention; for example, the elegant sentences may be marked with a different font color, bold type, underlining, etc., or a framing manner may be used, i.e. each elegant sentence is placed in a box, etc.
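As a purely illustrative sketch of one possible marking manner (HTML bold tags; the patent does not limit the marking manner, so this choice is an assumption):

```python
def mark_targets(sentences, is_target):
    """Rebuild the text, wrapping each identified target sentence in <b> tags."""
    return "".join(f"<b>{s}</b>" if flag else s
                   for s, flag in zip(sentences, is_target))

print(mark_targets(["你好，你好，欢迎。", "小草开始偷偷地从地里钻出来。"], [False, True]))
```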
In this embodiment, each natural-language sentence in the text is recognized according to its semantic and/or word-level features by means of a target sentence recognition model built in advance through training, so that sentences belonging to the target category (for example, elegant sentences) can be found automatically, which greatly improves the recognition efficiency; at the same time, the recognition criterion of this embodiment is based on the target's features and a model, so the recognition result is more objective, avoiding the subjectivity of manual recognition.
The following are device embodiments of the present invention, which can be used to carry out the method embodiments of the present invention. For details not disclosed in the device embodiments, reference is made to the method embodiments of the present invention.
Fig. 6 is a schematic diagram of a target sentence recognition device according to an exemplary embodiment of the present invention. The device can be used, for example, in a terminal such as a mobile phone or a computer, or in a server.
Referring to Fig. 6, the device may include:
an input module 601, configured to obtain a text to be processed, where the text contains one or more natural-language sentences;
a feature extraction module 602, configured to extract recognition features of each sentence, where the recognition features include a first feature and/or a second feature, the first feature being used to indicate semantic characteristics of the sentence, and the second feature being used to indicate word-level characteristics of the sentence;
a recognition module 603, configured to identify the target sentences in the text according to a pre-built target sentence recognition model and the recognition features of each sentence in the text.
In this embodiment or some other embodiments of the present invention, when the recognition features include the first feature, extracting the first feature of each sentence may include:
performing word segmentation on the current sentence;
obtaining the word vector of each word after segmentation;
obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and a pre-built first recognition model, where the first recognition model comprises, in order, an LSTM-RNN layer, a pA operation layer, a weighted-sum layer and an output layer.
In this embodiment or some other embodiments of the present invention, obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and the pre-built first recognition model may include:
inputting the word vectors of the words of the current sentence into the LSTM-RNN layer;
using the output of the LSTM-RNN layer as the input of the pA operation layer, and performing, in the pA operation layer, a dot-product operation between a pA vector and the value of each node so as to enhance the historical information kept by each node;
using the input and the output of the pA operation layer together as the input of the weighted-sum layer, and performing, in the weighted-sum layer, a weighted summation over the node values and the pA-enhanced node values;
inputting the result of the weighted summation into the output layer, obtaining, through a preset formula in the output layer, a preliminary probability that the current sentence belongs to the target sentences, and taking the preliminary probability as the first feature of the sentence.
In this embodiment or some other embodiments of the present invention, the second feature may include one or more of the following:
a part-of-speech distribution, used to indicate the proportion of words of each part of speech among the words of the current sentence;
an average word frequency, used to indicate the average number of occurrences, in all collected texts, of the words in the current sentence;
a maximum word frequency and a minimum word frequency, used to indicate the maximum and minimum numbers of occurrences, in all collected texts, of the words in the current sentence;
whether an idiom is included;
a non-repeated-word ratio, used to indicate the proportion of non-repeated words among the words of the current sentence;
a repeated-word type count, used to indicate the number of types of repeated words in the current sentence, where identical repeated words count as one type.
In this embodiment or some other embodiments of the present invention:
extracting the part-of-speech distribution of the current sentence may include:
counting the total number of words in the current sentence, and calculating the ratio of the number of words of each part of speech to the total number of words, so as to obtain the part-of-speech distribution of the current sentence;
extracting the average word frequency of the current sentence may include:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and calculating the average of these counts, so as to obtain the average word frequency of the current sentence;
extracting the maximum word frequency and the minimum word frequency of the current sentence may include:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and taking the maximum and the minimum of these counts as the maximum word frequency and the minimum word frequency of the current sentence, respectively;
extracting the non-repeated-word ratio of the current sentence may include:
finding the non-repeated words in the current sentence, where non-repeated words are words that differ in written form, counting the total number of non-repeated words, and taking the ratio of this total to the total number of words of the current sentence as the non-repeated-word ratio of the current sentence;
extracting the repeated-word type count of the current sentence may include:
finding the repeated words in the current sentence, where repeated words are words with identical written form, and taking the number of types of repeated words in the current sentence as the repeated-word type count, where identical repeated words count as one type.
In this embodiment or some other embodiments of the present invention, the recognition module may be configured to:
use the recognition features of the current sentence as the input of the target sentence recognition model;
receive the output of the target sentence recognition model, where the output is the probability that the current sentence belongs to the target sentences;
when the probability is greater than a preset threshold, determine that the current sentence belongs to the target sentences.
Referring to Fig. 7, in this embodiment or some other embodiments of the present invention, the device may further include:
a marking module 604, configured to mark the target sentences in the text in a predetermined manner.
In this embodiment, each natural-language sentence in the text is recognized according to its semantic and/or word-level features by means of a target sentence recognition model built in advance through training, so that sentences belonging to the target category (for example, elegant sentences) can be found automatically, which greatly improves the recognition efficiency; at the same time, the recognition criterion of this embodiment is based on the target's features and a model, so the recognition result is more objective, avoiding the subjectivity of manual recognition.
With regard to the devices in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.
Other embodiments of the present invention will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of the present invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present invention being indicated by the appended claims.
It should be understood that the present invention is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (14)

1. A target sentence recognition method, characterized in that the method comprises:
obtaining a text to be processed, wherein the text contains one or more natural-language sentences;
extracting recognition features of each sentence, wherein the recognition features comprise a first feature and/or a second feature, the first feature being used to indicate semantic characteristics of the sentence, and the second feature being used to indicate word-level characteristics of the sentence;
identifying the target sentences in the text according to a pre-built target sentence recognition model and the recognition features of each sentence in the text.
2. The method according to claim 1, characterized in that, when the recognition features comprise the first feature, extracting the first feature of each sentence comprises:
performing word segmentation on the current sentence;
obtaining the word vector of each word after segmentation;
obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and a pre-built first recognition model, wherein the first recognition model comprises, in order, an LSTM-RNN layer, a pA operation layer, a weighted-sum layer and an output layer.
3. The method according to claim 2, characterized in that obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and the pre-built first recognition model comprises:
inputting the word vectors of the words of the current sentence into the LSTM-RNN layer;
using the output of the LSTM-RNN layer as the input of the pA operation layer, and performing, in the pA operation layer, a dot-product operation between a pA vector and the value of each node so as to enhance the historical information kept by each node;
using the input and the output of the pA operation layer together as the input of the weighted-sum layer, and performing, in the weighted-sum layer, a weighted summation over the node values and the pA-enhanced node values;
inputting the result of the weighted summation into the output layer, obtaining, through a preset formula in the output layer, a preliminary probability that the current sentence belongs to the target sentences, and taking the preliminary probability as the first feature of the sentence.
4. The method according to claim 1, characterized in that the second feature comprises one or more of the following:
a part-of-speech distribution, used to indicate the proportion of words of each part of speech among the words of the current sentence;
an average word frequency, used to indicate the average number of occurrences, in all collected texts, of the words in the current sentence;
a maximum word frequency and a minimum word frequency, used to indicate the maximum and minimum numbers of occurrences, in all collected texts, of the words in the current sentence;
whether an idiom is included;
a non-repeated-word ratio, used to indicate the proportion of non-repeated words among the words of the current sentence;
a repeated-word type count, used to indicate the number of types of repeated words in the current sentence, wherein identical repeated words count as one type.
5. The method according to claim 4, characterized in that:
extracting the part-of-speech distribution of the current sentence comprises:
counting the total number of words in the current sentence, and calculating the ratio of the number of words of each part of speech to the total number of words, so as to obtain the part-of-speech distribution of the current sentence;
extracting the average word frequency of the current sentence comprises:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and calculating the average of these counts, so as to obtain the average word frequency of the current sentence;
extracting the maximum word frequency and the minimum word frequency of the current sentence comprises:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and taking the maximum and the minimum of these counts as the maximum word frequency and the minimum word frequency of the current sentence, respectively;
extracting the non-repeated-word ratio of the current sentence comprises:
finding the non-repeated words in the current sentence, wherein non-repeated words are words that differ in written form, counting the total number of non-repeated words, and taking the ratio of this total to the total number of words of the current sentence as the non-repeated-word ratio of the current sentence;
extracting the repeated-word type count of the current sentence comprises:
finding the repeated words in the current sentence, wherein repeated words are words with identical written form, and taking the number of types of repeated words in the current sentence as the repeated-word type count, wherein identical repeated words count as one type.
6. The method according to claim 1, characterized in that identifying the target sentences in the text according to the pre-built target sentence recognition model and the recognition features of each sentence in the text comprises:
using the recognition features of the current sentence as the input of the target sentence recognition model;
receiving the output of the target sentence recognition model, wherein the output is the probability that the current sentence belongs to the target sentences;
when the probability is greater than a preset threshold, determining that the current sentence belongs to the target sentences.
7. The method according to claim 1, characterized in that, after the target sentences in the text are identified, the method further comprises:
marking the target sentences in the text in a predetermined manner.
8. A target sentence recognition device, characterized in that the device comprises:
an input module, configured to obtain a text to be processed, wherein the text contains one or more natural-language sentences;
a feature extraction module, configured to extract recognition features of each sentence, wherein the recognition features comprise a first feature and/or a second feature, the first feature being used to indicate semantic characteristics of the sentence, and the second feature being used to indicate word-level characteristics of the sentence;
a recognition module, configured to identify the target sentences in the text according to a pre-built target sentence recognition model and the recognition features of each sentence in the text.
9. The device according to claim 8, characterized in that, when the recognition features comprise the first feature, extracting the first feature of each sentence comprises:
performing word segmentation on the current sentence;
obtaining the word vector of each word after segmentation;
obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and a pre-built first recognition model, wherein the first recognition model comprises, in order, an LSTM-RNN layer, a pA operation layer, a weighted-sum layer and an output layer.
10. The device according to claim 9, characterized in that obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and the pre-built first recognition model comprises:
inputting the word vectors of the words of the current sentence into the LSTM-RNN layer;
using the output of the LSTM-RNN layer as the input of the pA operation layer, and performing, in the pA operation layer, a dot-product operation between a pA vector and the value of each node so as to enhance the historical information kept by each node;
using the input and the output of the pA operation layer together as the input of the weighted-sum layer, and performing, in the weighted-sum layer, a weighted summation over the node values and the pA-enhanced node values;
inputting the result of the weighted summation into the output layer, obtaining, through a preset formula in the output layer, a preliminary probability that the current sentence belongs to the target sentences, and taking the preliminary probability as the first feature of the sentence.
11. The device according to claim 8, characterized in that the second feature comprises one or more of the following:
a part-of-speech distribution, used to indicate the proportion of words of each part of speech among the words of the current sentence;
an average word frequency, used to indicate the average number of occurrences, in all collected texts, of the words in the current sentence;
a maximum word frequency and a minimum word frequency, used to indicate the maximum and minimum numbers of occurrences, in all collected texts, of the words in the current sentence;
whether an idiom is included;
a non-repeated-word ratio, used to indicate the proportion of non-repeated words among the words of the current sentence;
a repeated-word type count, used to indicate the number of types of repeated words in the current sentence, wherein identical repeated words count as one type.
12. The device according to claim 11, characterized in that:
extracting the part-of-speech distribution of the current sentence comprises:
counting the total number of words in the current sentence, and calculating the ratio of the number of words of each part of speech to the total number of words, so as to obtain the part-of-speech distribution of the current sentence;
extracting the average word frequency of the current sentence comprises:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and calculating the average of these counts, so as to obtain the average word frequency of the current sentence;
extracting the maximum word frequency and the minimum word frequency of the current sentence comprises:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and taking the maximum and the minimum of these counts as the maximum word frequency and the minimum word frequency of the current sentence, respectively;
extracting the non-repeated-word ratio of the current sentence comprises:
finding the non-repeated words in the current sentence, wherein non-repeated words are words that differ in written form, counting the total number of non-repeated words, and taking the ratio of this total to the total number of words of the current sentence as the non-repeated-word ratio of the current sentence;
extracting the repeated-word type count of the current sentence comprises:
finding the repeated words in the current sentence, wherein repeated words are words with identical written form, and taking the number of types of repeated words in the current sentence as the repeated-word type count, wherein identical repeated words count as one type.
13. The device according to claim 8, characterized in that the recognition module is configured to:
use the recognition features of the current sentence as the input of the target sentence recognition model;
receive the output of the target sentence recognition model, wherein the output is the probability that the current sentence belongs to the target sentences;
when the probability is greater than a preset threshold, determine that the current sentence belongs to the target sentences.
14. The device according to claim 8, characterized in that the device further comprises:
a marking module, configured to mark the target sentences in the text in a predetermined manner.
CN201610792978.5A 2016-08-31 2016-08-31 Target statement identification method and device Active CN107783958B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610792978.5A CN107783958B (en) 2016-08-31 2016-08-31 Target statement identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610792978.5A CN107783958B (en) 2016-08-31 2016-08-31 Target statement identification method and device

Publications (2)

Publication Number Publication Date
CN107783958A true CN107783958A (en) 2018-03-09
CN107783958B CN107783958B (en) 2021-07-02

Family

ID=61451435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610792978.5A Active CN107783958B (en) 2016-08-31 2016-08-31 Target statement identification method and device

Country Status (1)

Country Link
CN (1) CN107783958B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004764A (en) * 2010-11-04 2011-04-06 中国科学院计算机网络信息中心 Internet bad information detection method and system
US20150310862A1 (en) * 2014-04-24 2015-10-29 Microsoft Corporation Deep learning for semantic parsing including semantic utterance classification
WO2015165372A1 (en) * 2014-04-29 2015-11-05 Tencent Technology (Shenzhen) Company Limited Method and apparatus for classifying object based on social networking service, and storage medium
CN104391837A (en) * 2014-11-19 2015-03-04 熊玮 Intelligent grammatical analysis method based on case semantics
CN104850540A (en) * 2015-05-29 2015-08-19 北京京东尚科信息技术有限公司 Sentence recognizing method and sentence recognizing device
CN105427858A (en) * 2015-11-06 2016-03-23 科大讯飞股份有限公司 Method and system for achieving automatic voice classification
CN105550291A (en) * 2015-12-10 2016-05-04 百度在线网络技术(北京)有限公司 Text classification method and device
CN105808689A (en) * 2016-03-03 2016-07-27 中国地质大学(武汉) Drainage system entity semantic similarity measurement method based on artificial neural network
CN105787461A (en) * 2016-03-15 2016-07-20 浙江大学 Text-classification-and-condition-random-field-based adverse reaction entity identification method in traditional Chinese medicine literature

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
胡晓辉 等: "基于语义特征的自动文本分类方法" [Automatic text classification method based on semantic features], 《计算机与现代化》 [Computer and Modernization] *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325234A (en) * 2018-10-10 2019-02-12 深圳前海微众银行股份有限公司 Sentence processing method, equipment and computer readable storage medium
CN109325234B (en) * 2018-10-10 2023-06-20 深圳前海微众银行股份有限公司 Sentence processing method, sentence processing device and computer readable storage medium
CN111767709A (en) * 2019-03-27 2020-10-13 武汉慧人信息科技有限公司 Logic method for carrying out error correction and syntactic analysis on English text
CN110147542A (en) * 2019-05-23 2019-08-20 联想(北京)有限公司 A kind of information processing method and electronic equipment

Also Published As

Publication number Publication date
CN107783958B (en) 2021-07-02

Similar Documents

Publication Publication Date Title
CN108737406B (en) Method and system for detecting abnormal flow data
CN108984530A (en) A kind of detection method and detection system of network sensitive content
CN108763216A (en) A kind of text emotion analysis method based on Chinese data collection
CN107590134A (en) Text sentiment classification method, storage medium and computer
CN104598611B (en) The method and system being ranked up to search entry
CN104809103A (en) Man-machine interactive semantic analysis method and system
CN103577989B (en) A kind of information classification approach and information classifying system based on product identification
CN108549658A (en) A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN106547737A (en) Based on the sequence labelling method in the natural language processing of deep learning
CN107515873A (en) A kind of junk information recognition methods and equipment
CN108491386A (en) natural language understanding method and system
CN107862087A (en) Sentiment analysis method, apparatus and storage medium based on big data and deep learning
CN107870964A (en) A kind of sentence sort method and system applied to answer emerging system
CN110781663A (en) Training method and device of text analysis model and text analysis method and device
CN107247751B (en) LDA topic model-based content recommendation method
CN111581966A (en) Context feature fusion aspect level emotion classification method and device
CN108108349A (en) Long text error correction method, device and computer-readable medium based on artificial intelligence
CN107145573A (en) The problem of artificial intelligence customer service robot, answers method and system
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
CN107797981A (en) A kind of target text recognition methods and device
CN111475615A (en) Fine-grained emotion prediction method, device and system for emotion enhancement and storage medium
CN107783958A (en) A kind of object statement recognition methods and device
CN108364066B (en) Artificial neural network chip and its application method based on N-GRAM and WFST model
CN105243053B (en) Extract the method and device of document critical sentence
CN111191461B (en) Remote supervision relation extraction method based on course learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant