CN107783958A - Target sentence recognition method and device - Google Patents

Target sentence recognition method and device Download PDF

Info

Publication number
CN107783958A
CN107783958A (application CN201610792978.5A)
Authority
CN
China
Prior art keywords
sentence
word
current sentence
repeated word
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610792978.5A
Other languages
Chinese (zh)
Other versions
CN107783958B (en)
Inventor
施亮亮
付瑞吉
胡国平
宋巍
秦兵
刘挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201610792978.5A priority Critical patent/CN107783958B/en
Publication of CN107783958A publication Critical patent/CN107783958A/en
Application granted granted Critical
Publication of CN107783958B publication Critical patent/CN107783958B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the invention provide a target sentence recognition method and device. The method includes: obtaining a text to be processed, where the text contains one or more natural-language sentences; extracting recognition features of each sentence, where the recognition features include a first feature and/or a second feature, the first feature being used to indicate semantic characteristics of the sentence and the second feature being used to indicate word-level characteristics of the sentence; and identifying the target sentences in the text according to a pre-built target sentence recognition model and the recognition features of each sentence in the text. The invention can automatically find sentences that belong to the target category (such as elegant sentences), thereby greatly improving the efficiency of target sentence recognition; at the same time, the recognition criterion of the invention is based on the target's features and a model, so the recognition result is more objective, avoiding the subjectivity of manual recognition.

Description

Target sentence recognition method and device
Technical field
The present invention relates to the field of natural language processing, and in particular to a target sentence recognition method and device.
Background technology
When reading an article (for example, a student composition or other text content), people often look for certain target sentences in the article for a particular purpose, for example elegant sentences. In the existing approach to target sentence recognition, a person usually reads the article and then points out the target sentences in it. For example, when correcting a composition, a teacher may mark the elegant sentences in the composition and give corresponding comments, which is of great value in helping students improve their writing; here an elegant sentence generally refers to a sentence that is elegantly expressed, offers an original view, and so on, for example a sentence that uses several idioms or quotes from the classics.
However, the inventors found in the course of making the present invention that, with the rapid development of information technology, the education sector has entered the information age: numerous online education platforms have emerged, and more and more students are becoming accustomed to online education. On a single online education platform, a large number of students study and take tests online as users, so a teacher no longer faces the traditional class of a few dozen students but tens of thousands of platform users. Under these new circumstances, the teacher's workload multiplies, and correcting compositions in particular is time-consuming and laborious. At the same time, teachers are often rather subjective when correcting compositions: different teachers may judge different sentences of the same composition to be the target sentences, i.e. the recognition result depends entirely on the person reading the article, which is not conducive to improving students' writing. Therefore, industries such as online education currently need a method that can recognize target sentences efficiently and objectively.
Summary of the invention
The present invention provides a target sentence recognition method and device, so as to improve the efficiency of recognizing target sentences in a text.
According to a first aspect of the embodiments of the present invention, a target sentence recognition method is provided, the method including:
obtaining a text to be processed, where the text contains one or more natural-language sentences;
extracting recognition features of each sentence, where the recognition features include a first feature and/or a second feature, the first feature being used to indicate semantic characteristics of the sentence, and the second feature being used to indicate word-level characteristics of the sentence;
identifying the target sentences in the text according to a pre-built target sentence recognition model and the recognition features of each sentence in the text.
Optionally, when the recognition features include the first feature, extracting the first feature of each sentence includes:
performing word segmentation on the current sentence;
obtaining the word vector of each word after segmentation;
obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and a pre-built first recognition model, where the first recognition model comprises, in order, an LSTM-RNN layer, a pA operation layer, a weighted-sum layer and an output layer.
Optionally, obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and the pre-built first recognition model includes:
inputting the word vectors of the words of the current sentence into the LSTM-RNN layer;
using the output of the LSTM-RNN layer as the input of the pA operation layer, and performing, in the pA operation layer, a dot-product operation between a pA vector and the value of each node so as to enhance the historical information kept by each node;
using the input and the output of the pA operation layer together as the input of the weighted-sum layer, and performing, in the weighted-sum layer, a weighted summation over the node values and the pA-enhanced node values;
inputting the result of the weighted summation into the output layer, obtaining, through a preset formula in the output layer, a preliminary probability that the current sentence belongs to the target sentences, and taking the preliminary probability as the first feature of the sentence.
Optionally, the second feature includes one or more of the following:
a part-of-speech distribution, used to indicate the proportion of words of each part of speech among the words of the current sentence;
an average word frequency, used to indicate the average number of occurrences, in all collected texts, of the words in the current sentence;
a maximum word frequency and a minimum word frequency, used to indicate the maximum and minimum numbers of occurrences, in all collected texts, of the words in the current sentence;
whether an idiom is included;
a non-repeated-word ratio, used to indicate the proportion of non-repeated words among the words of the current sentence;
a repeated-word type count, used to indicate the number of types of repeated words in the current sentence, where identical repeated words count as one type.
Optionally:
extracting the part-of-speech distribution of the current sentence includes:
counting the total number of words in the current sentence, and calculating the ratio of the number of words of each part of speech to the total number of words, so as to obtain the part-of-speech distribution of the current sentence;
extracting the average word frequency of the current sentence includes:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and calculating the average of these counts, so as to obtain the average word frequency of the current sentence;
extracting the maximum word frequency and the minimum word frequency of the current sentence includes:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and taking the maximum and the minimum of these counts as the maximum word frequency and the minimum word frequency of the current sentence, respectively;
extracting the non-repeated-word ratio of the current sentence includes:
finding the non-repeated words in the current sentence, where non-repeated words are words that differ in written form, counting the total number of non-repeated words, and taking the ratio of this total to the total number of words of the current sentence as the non-repeated-word ratio of the current sentence;
extracting the repeated-word type count of the current sentence includes:
finding the repeated words in the current sentence, where repeated words are words with identical written form, and taking the number of types of repeated words in the current sentence as the repeated-word type count, where identical repeated words count as one type.
Optionally, identifying the target sentences in the text according to the pre-built target sentence recognition model and the recognition features of each sentence in the text includes:
using the recognition features of the current sentence as the input of the target sentence recognition model;
receiving the output of the target sentence recognition model, where the output is the probability that the current sentence belongs to the target sentences;
when the probability is greater than a preset threshold, determining that the current sentence belongs to the target sentences.
Optionally, after the target sentences in the text are identified, the method further includes:
marking the target sentences in the text in a predetermined manner.
According to a second aspect of the embodiments of the present invention, a target sentence recognition device is provided, the device including:
an input module, configured to obtain a text to be processed, where the text contains one or more natural-language sentences;
a feature extraction module, configured to extract recognition features of each sentence, where the recognition features include a first feature and/or a second feature, the first feature being used to indicate semantic characteristics of the sentence, and the second feature being used to indicate word-level characteristics of the sentence;
a recognition module, configured to identify the target sentences in the text according to a pre-built target sentence recognition model and the recognition features of each sentence in the text.
Optionally, when the recognition features include the first feature, extracting the first feature of each sentence includes:
performing word segmentation on the current sentence;
obtaining the word vector of each word after segmentation;
obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and a pre-built first recognition model, where the first recognition model comprises, in order, an LSTM-RNN layer, a pA operation layer, a weighted-sum layer and an output layer.
Optionally, obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and the pre-built first recognition model includes:
inputting the word vectors of the words of the current sentence into the LSTM-RNN layer;
using the output of the LSTM-RNN layer as the input of the pA operation layer, and performing, in the pA operation layer, a dot-product operation between a pA vector and the value of each node so as to enhance the historical information kept by each node;
using the input and the output of the pA operation layer together as the input of the weighted-sum layer, and performing, in the weighted-sum layer, a weighted summation over the node values and the pA-enhanced node values;
inputting the result of the weighted summation into the output layer, obtaining, through a preset formula in the output layer, a preliminary probability that the current sentence belongs to the target sentences, and taking the preliminary probability as the first feature of the sentence.
Optionally, the second feature includes one or more of the following:
a part-of-speech distribution, used to indicate the proportion of words of each part of speech among the words of the current sentence;
an average word frequency, used to indicate the average number of occurrences, in all collected texts, of the words in the current sentence;
a maximum word frequency and a minimum word frequency, used to indicate the maximum and minimum numbers of occurrences, in all collected texts, of the words in the current sentence;
whether an idiom is included;
a non-repeated-word ratio, used to indicate the proportion of non-repeated words among the words of the current sentence;
a repeated-word type count, used to indicate the number of types of repeated words in the current sentence, where identical repeated words count as one type.
Optionally:
extracting the part-of-speech distribution of the current sentence includes:
counting the total number of words in the current sentence, and calculating the ratio of the number of words of each part of speech to the total number of words, so as to obtain the part-of-speech distribution of the current sentence;
extracting the average word frequency of the current sentence includes:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and calculating the average of these counts, so as to obtain the average word frequency of the current sentence;
extracting the maximum word frequency and the minimum word frequency of the current sentence includes:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and taking the maximum and the minimum of these counts as the maximum word frequency and the minimum word frequency of the current sentence, respectively;
extracting the non-repeated-word ratio of the current sentence includes:
finding the non-repeated words in the current sentence, where non-repeated words are words that differ in written form, counting the total number of non-repeated words, and taking the ratio of this total to the total number of words of the current sentence as the non-repeated-word ratio of the current sentence;
extracting the repeated-word type count of the current sentence includes:
finding the repeated words in the current sentence, where repeated words are words with identical written form, and taking the number of types of repeated words in the current sentence as the repeated-word type count, where identical repeated words count as one type.
Optionally, the recognition module is configured to:
use the recognition features of the current sentence as the input of the target sentence recognition model;
receive the output of the target sentence recognition model, where the output is the probability that the current sentence belongs to the target sentences;
when the probability is greater than a preset threshold, determine that the current sentence belongs to the target sentences.
Optionally, the device further includes:
a marking module, configured to mark the target sentences in the text in a predetermined manner.
The technical solutions provided by the embodiments of the present invention may have the following beneficial effects:
The present invention recognizes each natural-language sentence in the text according to its semantic and/or word-level features by means of a target sentence recognition model built in advance through training, so that sentences belonging to the target category (for example, elegant sentences) can be found automatically, which greatly improves the recognition efficiency; at the same time, the recognition criterion of the invention is based on the target's features and a model, so the recognition result is more objective, avoiding the subjectivity of manual recognition.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory and do not limit the present invention.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; obviously, those of ordinary skill in the art can obtain other drawings from these drawings without creative effort. In addition, these drawings do not limit the embodiments; elements with the same reference numerals in the drawings denote similar elements, and unless otherwise stated, the figures in the drawings are not drawn to scale.
Fig. 1 is a flowchart of a target sentence recognition method according to an exemplary embodiment of the present invention;
Fig. 2 is a flowchart of a target sentence recognition method according to an exemplary embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a first recognition model according to an exemplary embodiment of the present invention;
Fig. 4 is a flowchart of a target sentence recognition method according to an exemplary embodiment of the present invention;
Fig. 5 is a flowchart of a target sentence recognition method according to an exemplary embodiment of the present invention;
Fig. 6 is a schematic diagram of a target sentence recognition device according to an exemplary embodiment of the present invention;
Fig. 7 is a schematic diagram of a target sentence recognition device according to an exemplary embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments are described in detail here, examples of which are shown in the drawings. Where the following description refers to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present invention as detailed in the appended claims.
Fig. 1 is a flowchart of a target sentence recognition method according to an exemplary embodiment of the present invention. The method can be used, for example, in a terminal such as a mobile phone or a computer, or in a server.
Referring to Fig. 1, the method may include:
Step S101: obtain a text to be processed, where the text contains one or more natural-language sentences.
For example, a student composition or the like may be received as the text to be processed. In the present invention, a natural-language sentence may be referred to simply as a sentence. The text can be split into sentences according to the punctuation in the text, i.e. a run of content ending with a period, question mark, exclamation mark, ellipsis or the like is taken as one sentence.
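As a purely illustrative sketch (not part of the patented method itself; the function name and the exact punctuation set are assumptions), such punctuation-based sentence splitting might look like this in Python:

```python
import re

# Sentence-final punctuation mentioned above: period, question mark,
# exclamation mark, ellipsis (full-width Chinese forms).
_SENT_END = r"([。？！…]+)"

def split_sentences(text: str) -> list:
    """Split a text into sentences at sentence-final punctuation."""
    parts = re.split(_SENT_END, text)
    sentences = []
    # re.split keeps the captured punctuation as separate items;
    # glue each punctuation run back onto the content before it.
    for content, punct in zip(parts[0::2], parts[1::2] + [""]):
        s = (content + punct).strip()
        if s:
            sentences.append(s)
    return sentences

print(split_sentences("小草开始偷偷地从地里钻出来。你好，你好，欢迎！"))
```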
Step S102: extract recognition features of each sentence, where the recognition features include a first feature and/or a second feature, the first feature being used to indicate semantic characteristics of the sentence, and the second feature being used to indicate word-level characteristics of the sentence.
The first feature and the second feature describe the sentence from two different angles, semantics and words, respectively. In use, the recognition features of a sentence may include the first feature or the second feature, or a combination of the first feature and the second feature. This embodiment does not limit the specific contents of the first feature and the second feature; those skilled in the art can design them for different needs and different scenarios, and such designs do not depart from the spirit and scope of protection of the present invention.
Step S103: identify the target sentences in the text according to a pre-built target sentence recognition model and the recognition features of each sentence in the text.
For example, a large number of texts can be collected and labeled manually in advance to serve as training samples, and the target sentence recognition model is built in advance through training. In use, the recognition features of a sentence are input into the target sentence recognition model, and whether the sentence belongs to the target sentences is judged according to the output. For example, the output may be the probability that the sentence belongs to the target sentences; for the scenario of elegant sentences, this probability may be called the elegance degree of the sentence.
In this embodiment, each natural-language sentence in the text is recognized according to its semantic and/or word-level features by means of a target sentence recognition model built in advance through training, so that sentences belonging to the target category (for example, elegant sentences) can be found automatically, which greatly improves the recognition efficiency; at the same time, the recognition criterion of this embodiment is based on the target's features and a model, so the recognition result is more objective, avoiding the subjectivity of manual recognition.
Referring to Fig. 2, in this embodiment or some other embodiments of the present invention, when the recognition features include the first feature, extracting the first feature of each sentence may include:
Step S201: perform word segmentation on the current sentence.
This embodiment does not limit the specific segmentation technique; for example, a conditional-random-field-based method may be used to segment the text, etc.
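As a purely illustrative sketch (the embodiment does not name a particular toolkit; the jieba segmenter is used here only as a stand-in for a CRF-based segmenter):

```python
import jieba  # third-party Chinese word segmenter, used here only as an example

sentence = "小草开始偷偷地从地里钻出来"
words = list(jieba.cut(sentence))
print(words)  # e.g. ['小草', '开始', '偷偷', '地', '从', '地里', '钻', '出来']
```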
Step S202: obtain the word vector of each word after segmentation.
For example, the word vector of each word can be obtained by training with the word2vec method.
For a sentence, its word vectors can be written as (w1, w2, ..., wn).
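As a purely illustrative sketch (the embodiment names word2vec but no particular implementation; the gensim toolkit, the tiny corpus and the vector size are assumptions):

```python
from gensim.models import Word2Vec

# Assumed corpus: segmented sentences from the collected texts.
corpus = [
    ["小草", "开始", "偷偷", "地", "从", "地", "里", "钻", "出来"],
    ["你好", "你好", "欢迎"],
]

model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)

# Word vectors (w1, w2, ..., wn) of one sentence:
vectors = [model.wv[w] for w in corpus[0]]
print(len(vectors), vectors[0].shape)  # n vectors, each of dimension 100
```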
Step S203: obtain the first feature of the current sentence according to the word vectors of the words of the current sentence and a pre-built first recognition model, where the first recognition model comprises, in order, an LSTM-RNN layer, a pA operation layer, a weighted-sum layer and an output layer. Here RNN stands for recurrent neural network, and LSTM stands for Long Short-Term Memory.
As an example, reference can be made to Fig. 3, which shows an example structure of the first recognition model, comprising an LSTM-RNN layer, a pA (pseudo-Attention) operation layer, a weighted-sum layer and an output layer.
As an example, obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and the pre-built first recognition model may specifically include:
i) Input the word vectors of the words of the current sentence into the LSTM-RNN layer.
With the word vectors (w1, w2, ..., wn) of a sentence as the input of the LSTM-RNN layer, the LSTM-RNN layer encodes the current sentence and preserves the historical information of each word during encoding; the value of the t-th node of the LSTM-RNN layer is h_t = LSTM(w_t, h_{t-1}), where LSTM(·) is the function that encodes the input word vectors and h_{t-1} is the value of the (t-1)-th node, i.e. the historical information at the t-th node. LSTM-RNN itself belongs to the prior art and is not described further here.
ii) Use the output of the LSTM-RNN layer as the input of the pA operation layer, and in the pA operation layer perform a dot-product operation between the pA vector and the value of each node, so as to enhance the historical information kept by each node.
The output of the LSTM-RNN layer is the input of the pA operation layer. Because the node values are dot-multiplied with the pA vector, the layer is called the pA operation layer. Enhancing the historical information kept by each node prevents the historical information of a node from decaying over time. The value of the enhanced t-th node is α_t = dot(h_t, a), where dot(·) is the dot-product function and a is the pA vector; the pA vector is a model parameter whose specific value can be obtained by training on a large amount of text data. In addition, the notion of a node belongs to the prior art in fields such as neural networks and is not further elaborated in the present invention.
iii) Use the input and the output of the pA operation layer together as the input of the weighted-sum layer, and in the weighted-sum layer perform a weighted summation over the node values and the pA-enhanced node values.
Before the weighted summation, the pA-enhanced node values may first be normalized to obtain the normalized value β_t of the t-th node; the node values h_t are then summed with the weights β_t, i.e. h = Σ_t β_t·h_t.
iv) Input the result of the weighted summation into the output layer, obtain, through a preset formula in the output layer, a preliminary probability that the current sentence belongs to the target sentences, and take the preliminary probability as the first feature of the sentence.
As an example, the preset formula may be p = sigmoid(W·h + b), where p is the output, and W and b are model parameters whose specific values can be obtained by training on a large amount of text data.
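As a purely illustrative sketch of a first recognition model with this layer structure (the PyTorch framework, the hidden size and the use of softmax for the normalization step are assumptions not stated in the patent):

```python
import torch
import torch.nn as nn

class FirstRecognitionModel(nn.Module):
    """LSTM-RNN layer -> pA operation layer -> weighted-sum layer -> output layer."""

    def __init__(self, embed_dim: int = 100, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # LSTM-RNN layer
        self.pa_vector = nn.Parameter(torch.randn(hidden_dim))        # pA vector a
        self.out = nn.Linear(hidden_dim, 1)                           # parameters W and b

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        # word_vectors: (batch, n, embed_dim), i.e. the vectors (w1, ..., wn)
        h, _ = self.lstm(word_vectors)                 # h_t = LSTM(w_t, h_{t-1})
        alpha = h @ self.pa_vector                     # alpha_t = dot(h_t, a)
        beta = torch.softmax(alpha, dim=1)             # normalization (softmax assumed)
        sent = (beta.unsqueeze(-1) * h).sum(dim=1)     # h = sum_t beta_t * h_t
        p = torch.sigmoid(self.out(sent))              # p = sigmoid(W*h + b)
        return p.squeeze(-1)                           # preliminary probability per sentence

model = FirstRecognitionModel()
dummy = torch.randn(2, 7, 100)   # 2 sentences of 7 words, 100-dim word vectors
print(model(dummy))              # two values in (0, 1)
```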
Of course, in other embodiments of the present invention, the first recognition model may also be described using other models, such as a CNN (convolutional neural network) or an LSTM (Long Short-Term Memory) network. Several different neural network models may also be used to describe the first recognition model separately, a first feature of the current sentence obtained from each, and these first features then used together as the first feature of the current sentence.
In this embodiment or some other embodiments of the present invention, the second feature may include one or more of the following (a combined sketch of these word-level features is given after item 6) below):
1) A part-of-speech distribution, used to indicate the proportion of words of each part of speech among the words of the current sentence.
In specific implementation, extracting the part-of-speech distribution of the current sentence may include:
counting the total number of words in the current sentence, and calculating, for each part of speech (such as noun, verb, adjective, adverb, conjunction, etc.), the ratio of the number of words of that part of speech to the total number of words, so as to obtain the part-of-speech distribution of the current sentence.
For example, if the current sentence is "小草开始偷偷地从地里钻出来" ("the little grass begins to quietly drill its way out of the ground"), segmentation yields 10 words in total, of which 2 are nouns, 3 are verbs, 1 is an adjective, 1 is an adverb, 0 are conjunctions and 3 are other words; the part-of-speech distribution of this sentence over nouns, verbs, adjectives, adverbs, conjunctions and other words is therefore: 0.2, 0.3, 0.1, 0.1, 0.0, 0.3.
2) An average word frequency, used to indicate the average number of occurrences, in all collected texts, of the words in the current sentence.
In specific implementation, extracting the average word frequency of the current sentence may include:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and calculating the average of these counts, so as to obtain the average word frequency of the current sentence.
3) A maximum word frequency and a minimum word frequency, used to indicate the maximum and minimum numbers of occurrences, in all collected texts, of the words in the current sentence.
In specific implementation, extracting the maximum word frequency and the minimum word frequency of the current sentence may include:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and taking the maximum and the minimum of these counts as the maximum word frequency and the minimum word frequency of the current sentence, respectively.
4) Whether an idiom is included.
In specific implementation, whether each word in the current sentence is an idiom can be checked in turn against a pre-built idiom table; if any word in the current sentence is an idiom, the current sentence is considered to contain an idiom, otherwise it is considered not to contain an idiom. This can further be represented by 0 or 1, for example with 1 indicating that the current sentence contains an idiom and 0 indicating that it does not.
5) A non-repeated-word ratio, used to indicate the proportion of non-repeated words among the words of the current sentence.
In specific implementation, extracting the non-repeated-word ratio of the current sentence may include:
finding the non-repeated words in the current sentence, where non-repeated words are words that differ in written form, counting the total number of non-repeated words, and taking the ratio of this total to the total number of words of the current sentence as the non-repeated-word ratio of the current sentence.
For example, for the current sentence "小草开始偷偷地从地里钻出来", segmentation yields 10 words in total, among which 2 words are identical, namely the earlier "地" and the later "地", while the other 8 words are mutually different, so the non-repeated-word ratio of this sentence is 8/10 = 0.8.
6) A repeated-word type count, used to indicate the number of types of repeated words in the current sentence, where identical repeated words count as one type.
In specific implementation, extracting the repeated-word type count of the current sentence may include:
finding the repeated words in the current sentence, where repeated words are words with identical written form, and taking the number of types of repeated words in the current sentence as the repeated-word type count, where identical repeated words count as one type.
For example, if the current sentence is "你好，你好，欢迎" ("hello, hello, welcome"), the words "你" and "好" each occur twice and are therefore repeated words, and the two differ from each other in written form, so the repeated-word type count of the current sentence is 2.
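As a purely illustrative sketch of the word-level features 1)-6) above (the function name and tag labels are assumptions; the segmented words, part-of-speech tags, corpus word-frequency table and idiom table are assumed to come from earlier steps, and the example tags are illustrative only):

```python
from collections import Counter

def second_features(words, pos_tags, word_freq, idiom_table):
    """words/pos_tags: segmented current sentence and its part-of-speech tags;
    word_freq: occurrence counts over all collected texts; idiom_table: set of idioms."""
    n = len(words)

    # 1) part-of-speech distribution over noun, verb, adjective, adverb, conjunction, other
    pos_order = ["n", "v", "a", "d", "c", "other"]
    pos_counts = Counter(t if t in pos_order else "other" for t in pos_tags)
    pos_dist = [pos_counts[p] / n for p in pos_order]

    # 2)-3) average, maximum and minimum word frequency over all collected texts
    freqs = [word_freq.get(w, 0) for w in words]
    avg_freq, max_freq, min_freq = sum(freqs) / n, max(freqs), min(freqs)

    # 4) whether the sentence contains an idiom (1 = yes, 0 = no)
    has_idiom = int(any(w in idiom_table for w in words))

    # 5) non-repeated-word ratio: words occurring only once / total words (8/10 in the example)
    counts = Counter(words)
    non_repeated_ratio = sum(1 for c in counts.values() if c == 1) / n

    # 6) repeated-word type count: number of distinct words occurring more than once
    repeated_types = sum(1 for c in counts.values() if c > 1)

    return pos_dist + [avg_freq, max_freq, min_freq, has_idiom,
                       non_repeated_ratio, repeated_types]

words = ["小", "草", "开始", "偷偷", "地", "从", "地", "里", "钻", "出来"]
pos_tags = ["a", "n", "v", "d", "other", "other", "n", "other", "v", "v"]
print(second_features(words, pos_tags, {"小草": 3, "地": 50}, {"引经据典"}))
```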
Referring to Fig. 4, in this embodiment or some other embodiments of the present invention, identifying the target sentences in the text according to the pre-built target sentence recognition model and the recognition features of each sentence in the text may include:
Step S401: use the recognition features of the current sentence as the input of the target sentence recognition model.
Step S402: receive the output of the target sentence recognition model, where the output is the probability that the current sentence belongs to the target sentences.
Step S403: when the probability is greater than a preset threshold, determine that the current sentence belongs to the target sentences.
As an example, the target sentence recognition model may be a common classification model, such as a support vector machine model or a decision tree model.
The target sentence recognition model can be obtained by training in advance. For example, the recognition features of a sentence and a manually annotated label indicating whether the sentence belongs to the target sentences can be used as a training sample, and the parameters of the model are trained and updated accordingly.
The manual labels can be of two kinds, i.e. the current sentence is a target sentence or the current sentence is not a target sentence, represented for example by 1 or 0: a label of 1 means that the current sentence is a target sentence, and a label of 0 means that it is not. During annotation, the same sentence can be given to two annotators to label independently; if their annotations agree, the label is considered correct, otherwise the sentence can be handed to a domain expert for labeling, and the domain expert's annotation is taken as final. The parameters of the model are updated with the training samples, and after training ends the parameter values of the target sentence recognition model are obtained. The specific training process is not repeated here.
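As a purely illustrative sketch of training and applying such a classifier (the patent names support vector machines and decision trees but no toolkit; scikit-learn, the placeholder data and the 0.5 threshold are assumptions):

```python
import numpy as np
from sklearn.svm import SVC

# X: one row of recognition features (first feature and/or second features) per sentence;
# y: manual labels, 1 = target sentence, 0 = not a target sentence.
X = np.random.rand(200, 12)             # placeholder training features
y = np.random.randint(0, 2, size=200)   # placeholder labels

model = SVC(probability=True)           # support vector machine classifier
model.fit(X, y)

threshold = 0.5                         # preset threshold (assumed value)
features = np.random.rand(1, 12)        # recognition features of the current sentence
prob = model.predict_proba(features)[0, 1]   # probability of being a target sentence
print(prob, prob > threshold)
```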
In addition, referring to Fig. 5, in this embodiment or some other embodiments of the present invention, after the target sentences in the text have been identified, the method may further include:
Step S104: mark the target sentences in the text in a predetermined manner.
For example, taking elegant sentences as the target sentences, after the elegant sentences in an article have been identified, the corresponding elegant sentences can be marked in the article. The specific marking manner is not limited by the present invention; for example, the elegant sentences may be marked with a different font color, bold type, underlining, etc., or a framing manner may be used, i.e. each elegant sentence is placed in a box, etc.
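As a purely illustrative sketch of one possible marking manner (HTML bold tags; the patent does not limit the marking manner, so this choice is an assumption):

```python
def mark_targets(sentences, is_target):
    """Rebuild the text, wrapping each identified target sentence in <b> tags."""
    return "".join(f"<b>{s}</b>" if flag else s
                   for s, flag in zip(sentences, is_target))

print(mark_targets(["你好，你好，欢迎。", "小草开始偷偷地从地里钻出来。"], [False, True]))
```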
In this embodiment, each natural-language sentence in the text is recognized according to its semantic and/or word-level features by means of a target sentence recognition model built in advance through training, so that sentences belonging to the target category (for example, elegant sentences) can be found automatically, which greatly improves the recognition efficiency; at the same time, the recognition criterion of this embodiment is based on the target's features and a model, so the recognition result is more objective, avoiding the subjectivity of manual recognition.
The following are device embodiments of the present invention, which can be used to carry out the method embodiments of the present invention. For details not disclosed in the device embodiments, reference is made to the method embodiments of the present invention.
Fig. 6 is a schematic diagram of a target sentence recognition device according to an exemplary embodiment of the present invention. The device can be used, for example, in a terminal such as a mobile phone or a computer, or in a server.
Referring to Fig. 6, the device may include:
an input module 601, configured to obtain a text to be processed, where the text contains one or more natural-language sentences;
a feature extraction module 602, configured to extract recognition features of each sentence, where the recognition features include a first feature and/or a second feature, the first feature being used to indicate semantic characteristics of the sentence, and the second feature being used to indicate word-level characteristics of the sentence;
a recognition module 603, configured to identify the target sentences in the text according to a pre-built target sentence recognition model and the recognition features of each sentence in the text.
In this embodiment or some other embodiments of the present invention, when the recognition features include the first feature, extracting the first feature of each sentence may include:
performing word segmentation on the current sentence;
obtaining the word vector of each word after segmentation;
obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and a pre-built first recognition model, where the first recognition model comprises, in order, an LSTM-RNN layer, a pA operation layer, a weighted-sum layer and an output layer.
In this embodiment or some other embodiments of the present invention, obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and the pre-built first recognition model may include:
inputting the word vectors of the words of the current sentence into the LSTM-RNN layer;
using the output of the LSTM-RNN layer as the input of the pA operation layer, and performing, in the pA operation layer, a dot-product operation between a pA vector and the value of each node so as to enhance the historical information kept by each node;
using the input and the output of the pA operation layer together as the input of the weighted-sum layer, and performing, in the weighted-sum layer, a weighted summation over the node values and the pA-enhanced node values;
inputting the result of the weighted summation into the output layer, obtaining, through a preset formula in the output layer, a preliminary probability that the current sentence belongs to the target sentences, and taking the preliminary probability as the first feature of the sentence.
In this embodiment or some other embodiments of the present invention, the second feature may include one or more of the following:
a part-of-speech distribution, used to indicate the proportion of words of each part of speech among the words of the current sentence;
an average word frequency, used to indicate the average number of occurrences, in all collected texts, of the words in the current sentence;
a maximum word frequency and a minimum word frequency, used to indicate the maximum and minimum numbers of occurrences, in all collected texts, of the words in the current sentence;
whether an idiom is included;
a non-repeated-word ratio, used to indicate the proportion of non-repeated words among the words of the current sentence;
a repeated-word type count, used to indicate the number of types of repeated words in the current sentence, where identical repeated words count as one type.
In this embodiment or some other embodiments of the present invention:
extracting the part-of-speech distribution of the current sentence may include:
counting the total number of words in the current sentence, and calculating the ratio of the number of words of each part of speech to the total number of words, so as to obtain the part-of-speech distribution of the current sentence;
extracting the average word frequency of the current sentence may include:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and calculating the average of these counts, so as to obtain the average word frequency of the current sentence;
extracting the maximum word frequency and the minimum word frequency of the current sentence may include:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and taking the maximum and the minimum of these counts as the maximum word frequency and the minimum word frequency of the current sentence, respectively;
extracting the non-repeated-word ratio of the current sentence may include:
finding the non-repeated words in the current sentence, where non-repeated words are words that differ in written form, counting the total number of non-repeated words, and taking the ratio of this total to the total number of words of the current sentence as the non-repeated-word ratio of the current sentence;
extracting the repeated-word type count of the current sentence may include:
finding the repeated words in the current sentence, where repeated words are words with identical written form, and taking the number of types of repeated words in the current sentence as the repeated-word type count, where identical repeated words count as one type.
In this embodiment or some other embodiments of the present invention, the recognition module may be configured to:
use the recognition features of the current sentence as the input of the target sentence recognition model;
receive the output of the target sentence recognition model, where the output is the probability that the current sentence belongs to the target sentences;
when the probability is greater than a preset threshold, determine that the current sentence belongs to the target sentences.
Referring to Fig. 7, in this embodiment or some other embodiments of the present invention, the device may further include:
a marking module 604, configured to mark the target sentences in the text in a predetermined manner.
In this embodiment, each natural-language sentence in the text is recognized according to its semantic and/or word-level features by means of a target sentence recognition model built in advance through training, so that sentences belonging to the target category (for example, elegant sentences) can be found automatically, which greatly improves the recognition efficiency; at the same time, the recognition criterion of this embodiment is based on the target's features and a model, so the recognition result is more objective, avoiding the subjectivity of manual recognition.
With regard to the devices in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.
Other embodiments of the present invention will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of the present invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present invention being indicated by the appended claims.
It should be understood that the present invention is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (14)

1. A target sentence recognition method, characterized in that the method comprises:
obtaining a text to be processed, wherein the text contains one or more natural-language sentences;
extracting recognition features of each sentence, wherein the recognition features comprise a first feature and/or a second feature, the first feature being used to indicate semantic characteristics of the sentence, and the second feature being used to indicate word-level characteristics of the sentence;
identifying the target sentences in the text according to a pre-built target sentence recognition model and the recognition features of each sentence in the text.
2. The method according to claim 1, characterized in that, when the recognition features comprise the first feature, extracting the first feature of each sentence comprises:
performing word segmentation on the current sentence;
obtaining the word vector of each word after segmentation;
obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and a pre-built first recognition model, wherein the first recognition model comprises, in order, an LSTM-RNN layer, a pA operation layer, a weighted-sum layer and an output layer.
3. The method according to claim 2, characterized in that obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and the pre-built first recognition model comprises:
inputting the word vectors of the words of the current sentence into the LSTM-RNN layer;
using the output of the LSTM-RNN layer as the input of the pA operation layer, and performing, in the pA operation layer, a dot-product operation between a pA vector and the value of each node so as to enhance the historical information kept by each node;
using the input and the output of the pA operation layer together as the input of the weighted-sum layer, and performing, in the weighted-sum layer, a weighted summation over the node values and the pA-enhanced node values;
inputting the result of the weighted summation into the output layer, obtaining, through a preset formula in the output layer, a preliminary probability that the current sentence belongs to the target sentences, and taking the preliminary probability as the first feature of the sentence.
4. The method according to claim 1, characterized in that the second feature comprises one or more of the following:
a part-of-speech distribution, used to indicate the proportion of words of each part of speech among the words of the current sentence;
an average word frequency, used to indicate the average number of occurrences, in all collected texts, of the words in the current sentence;
a maximum word frequency and a minimum word frequency, used to indicate the maximum and minimum numbers of occurrences, in all collected texts, of the words in the current sentence;
whether an idiom is included;
a non-repeated-word ratio, used to indicate the proportion of non-repeated words among the words of the current sentence;
a repeated-word type count, used to indicate the number of types of repeated words in the current sentence, wherein identical repeated words count as one type.
5. The method according to claim 4, characterized in that:
extracting the part-of-speech distribution of the current sentence comprises:
counting the total number of words in the current sentence, and calculating the ratio of the number of words of each part of speech to the total number of words, so as to obtain the part-of-speech distribution of the current sentence;
extracting the average word frequency of the current sentence comprises:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and calculating the average of these counts, so as to obtain the average word frequency of the current sentence;
extracting the maximum word frequency and the minimum word frequency of the current sentence comprises:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and taking the maximum and the minimum of these counts as the maximum word frequency and the minimum word frequency of the current sentence, respectively;
extracting the non-repeated-word ratio of the current sentence comprises:
finding the non-repeated words in the current sentence, wherein non-repeated words are words that differ in written form, counting the total number of non-repeated words, and taking the ratio of this total to the total number of words of the current sentence as the non-repeated-word ratio of the current sentence;
extracting the repeated-word type count of the current sentence comprises:
finding the repeated words in the current sentence, wherein repeated words are words with identical written form, and taking the number of types of repeated words in the current sentence as the repeated-word type count, wherein identical repeated words count as one type.
6. The method according to claim 1, characterized in that identifying the target sentences in the text according to the pre-built target sentence recognition model and the recognition features of each sentence in the text comprises:
using the recognition features of the current sentence as the input of the target sentence recognition model;
receiving the output of the target sentence recognition model, wherein the output is the probability that the current sentence belongs to the target sentences;
when the probability is greater than a preset threshold, determining that the current sentence belongs to the target sentences.
7. The method according to claim 1, characterized in that, after the target sentences in the text are identified, the method further comprises:
marking the target sentences in the text in a predetermined manner.
8. A target sentence recognition device, characterized in that the device comprises:
an input module, configured to obtain a text to be processed, wherein the text contains one or more natural-language sentences;
a feature extraction module, configured to extract recognition features of each sentence, wherein the recognition features comprise a first feature and/or a second feature, the first feature being used to indicate semantic characteristics of the sentence, and the second feature being used to indicate word-level characteristics of the sentence;
a recognition module, configured to identify the target sentences in the text according to a pre-built target sentence recognition model and the recognition features of each sentence in the text.
9. The device according to claim 8, characterized in that, when the recognition features comprise the first feature, extracting the first feature of each sentence comprises:
performing word segmentation on the current sentence;
obtaining the word vector of each word after segmentation;
obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and a pre-built first recognition model, wherein the first recognition model comprises, in order, an LSTM-RNN layer, a pA operation layer, a weighted-sum layer and an output layer.
10. The device according to claim 9, characterized in that obtaining the first feature of the current sentence according to the word vectors of the words of the current sentence and the pre-built first recognition model comprises:
inputting the word vectors of the words of the current sentence into the LSTM-RNN layer;
using the output of the LSTM-RNN layer as the input of the pA operation layer, and performing, in the pA operation layer, a dot-product operation between a pA vector and the value of each node so as to enhance the historical information kept by each node;
using the input and the output of the pA operation layer together as the input of the weighted-sum layer, and performing, in the weighted-sum layer, a weighted summation over the node values and the pA-enhanced node values;
inputting the result of the weighted summation into the output layer, obtaining, through a preset formula in the output layer, a preliminary probability that the current sentence belongs to the target sentences, and taking the preliminary probability as the first feature of the sentence.
11. The device according to claim 8, characterized in that the second feature comprises one or more of the following:
a part-of-speech distribution, used to indicate the proportion of words of each part of speech among the words of the current sentence;
an average word frequency, used to indicate the average number of occurrences, in all collected texts, of the words in the current sentence;
a maximum word frequency and a minimum word frequency, used to indicate the maximum and minimum numbers of occurrences, in all collected texts, of the words in the current sentence;
whether an idiom is included;
a non-repeated-word ratio, used to indicate the proportion of non-repeated words among the words of the current sentence;
a repeated-word type count, used to indicate the number of types of repeated words in the current sentence, wherein identical repeated words count as one type.
12. The device according to claim 11, characterized in that:
extracting the part-of-speech distribution of the current sentence comprises:
counting the total number of words in the current sentence, and calculating the ratio of the number of words of each part of speech to the total number of words, so as to obtain the part-of-speech distribution of the current sentence;
extracting the average word frequency of the current sentence comprises:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and calculating the average of these counts, so as to obtain the average word frequency of the current sentence;
extracting the maximum word frequency and the minimum word frequency of the current sentence comprises:
counting, for each word in the current sentence, the number of its occurrences in all collected texts, and taking the maximum and the minimum of these counts as the maximum word frequency and the minimum word frequency of the current sentence, respectively;
extracting the non-repeated-word ratio of the current sentence comprises:
finding the non-repeated words in the current sentence, wherein non-repeated words are words that differ in written form, counting the total number of non-repeated words, and taking the ratio of this total to the total number of words of the current sentence as the non-repeated-word ratio of the current sentence;
extracting the repeated-word type count of the current sentence comprises:
finding the repeated words in the current sentence, wherein repeated words are words with identical written form, and taking the number of types of repeated words in the current sentence as the repeated-word type count, wherein identical repeated words count as one type.
13. The device according to claim 8, characterized in that the recognition module is configured to:
use the recognition features of the current sentence as the input of the target sentence recognition model;
receive the output of the target sentence recognition model, wherein the output is the probability that the current sentence belongs to the target sentences;
when the probability is greater than a preset threshold, determine that the current sentence belongs to the target sentences.
14. The device according to claim 8, characterized in that the device further comprises:
a marking module, configured to mark the target sentences in the text in a predetermined manner.
CN201610792978.5A 2016-08-31 2016-08-31 Target statement identification method and device Active CN107783958B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610792978.5A CN107783958B (en) 2016-08-31 2016-08-31 Target statement identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610792978.5A CN107783958B (en) 2016-08-31 2016-08-31 Target statement identification method and device

Publications (2)

Publication Number Publication Date
CN107783958A true CN107783958A (en) 2018-03-09
CN107783958B CN107783958B (en) 2021-07-02

Family

ID=61451435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610792978.5A Active CN107783958B (en) 2016-08-31 2016-08-31 Target statement identification method and device

Country Status (1)

Country Link
CN (1) CN107783958B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004764A (en) * 2010-11-04 2011-04-06 中国科学院计算机网络信息中心 Internet bad information detection method and system
US20150310862A1 (en) * 2014-04-24 2015-10-29 Microsoft Corporation Deep learning for semantic parsing including semantic utterance classification
WO2015165372A1 (en) * 2014-04-29 2015-11-05 Tencent Technology (Shenzhen) Company Limited Method and apparatus for classifying object based on social networking service, and storage medium
CN104391837A (en) * 2014-11-19 2015-03-04 熊玮 Intelligent grammatical analysis method based on case semantics
CN104850540A (en) * 2015-05-29 2015-08-19 北京京东尚科信息技术有限公司 Sentence recognizing method and sentence recognizing device
CN105427858A (en) * 2015-11-06 2016-03-23 科大讯飞股份有限公司 Method and system for achieving automatic voice classification
CN105550291A (en) * 2015-12-10 2016-05-04 百度在线网络技术(北京)有限公司 Text classification method and device
CN105808689A (en) * 2016-03-03 2016-07-27 中国地质大学(武汉) Drainage system entity semantic similarity measurement method based on artificial neural network
CN105787461A (en) * 2016-03-15 2016-07-20 浙江大学 Text-classification-and-condition-random-field-based adverse reaction entity identification method in traditional Chinese medicine literature

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
胡晓辉 等: "基于语义特征的自动文本分类方法" [Automatic text classification method based on semantic features], 《计算机与现代化》 [Computer and Modernization] *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325234A (en) * 2018-10-10 2019-02-12 深圳前海微众银行股份有限公司 Sentence processing method, equipment and computer readable storage medium
CN109325234B (en) * 2018-10-10 2023-06-20 深圳前海微众银行股份有限公司 Sentence processing method, sentence processing device and computer readable storage medium
CN111767709A (en) * 2019-03-27 2020-10-13 武汉慧人信息科技有限公司 Logic method for carrying out error correction and syntactic analysis on English text
CN110147542A (en) * 2019-05-23 2019-08-20 联想(北京)有限公司 A kind of information processing method and electronic equipment

Also Published As

Publication number Publication date
CN107783958B (en) 2021-07-02

Similar Documents

Publication Publication Date Title
CN108737406B (en) Method and system for detecting abnormal flow data
CN108984530A (en) A kind of detection method and detection system of network sensitive content
CN108763216A (en) A kind of text emotion analysis method based on Chinese data collection
CN107590134A (en) Text sentiment classification method, storage medium and computer
CN104598611B (en) The method and system being ranked up to search entry
CN104809103A (en) Man-machine interactive semantic analysis method and system
CN103577989B (en) A kind of information classification approach and information classifying system based on product identification
CN108549658A (en) A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN106547737A (en) Based on the sequence labelling method in the natural language processing of deep learning
CN107515873A (en) A kind of junk information recognition methods and equipment
CN108491386A (en) natural language understanding method and system
CN107862087A (en) Sentiment analysis method, apparatus and storage medium based on big data and deep learning
CN107870964A (en) A kind of sentence sort method and system applied to answer emerging system
CN110781663A (en) Training method and device of text analysis model and text analysis method and device
CN107247751B (en) LDA topic model-based content recommendation method
CN111581966A (en) Context feature fusion aspect level emotion classification method and device
CN108108349A (en) Long text error correction method, device and computer-readable medium based on artificial intelligence
CN107145573A (en) The problem of artificial intelligence customer service robot, answers method and system
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
CN107797981A (en) A kind of target text recognition methods and device
CN111475615A (en) Fine-grained emotion prediction method, device and system for emotion enhancement and storage medium
CN107783958A (en) A kind of object statement recognition methods and device
CN108364066B (en) Artificial neural network chip and its application method based on N-GRAM and WFST model
CN105243053B (en) Extract the method and device of document critical sentence
CN111191461B (en) Remote supervision relation extraction method based on course learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant