CN107783958B - Target statement identification method and device - Google Patents


Publication number
CN107783958B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610792978.5A
Other languages
Chinese (zh)
Other versions
CN107783958A (en)
Inventor
施亮亮
付瑞吉
胡国平
宋巍
秦兵
刘挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The embodiment of the invention provides a target statement identification method and a target statement identification device, wherein the method comprises the following steps: acquiring a text to be processed, wherein the text comprises one or more natural language sentences; extracting the identification features of each sentence, wherein the identification features comprise first features and/or second features, the first features are used for indicating the features of the sentences in semantic aspect, and the second features are used for indicating the features of the sentences in literal aspect; and identifying the target sentence in the text according to a pre-constructed target sentence identification model and the identification characteristics of each sentence in the text. The invention can automatically find the sentences belonging to the target sentences (such as graceful sentences), thereby greatly improving the recognition efficiency of the target sentences; meanwhile, the identification standard of the invention is based on objective characteristics and models, so that the identification result is objective, thereby avoiding the problem of subjectivity during manual identification.

Description

Target statement identification method and device
Technical Field
The invention relates to the field of natural language processing, in particular to a target sentence recognition method and device.
Background
When reading an article (e.g., a student composition or other text), people often look for certain target sentences, such as graceful sentences, for some purpose. Existing target sentence recognition generally relies on a person reading the article and then pointing out the target sentences in it. For example, when correcting a composition, a teacher may mark the graceful sentences and give corresponding comments, which helps students improve their writing. Graceful sentences generally refer to sentences with graceful expression, unique insight and the like, such as sentences that use many idioms, classical sentences and the like.
However, in the process of implementing the present invention, the inventors found that, with the rapid development of information technology, the education industry has also entered the information era. Numerous online education platforms have emerged, and more and more students are getting used to online education. On a single online education platform, a large number of students study and take examinations online, so a teacher no longer faces the traditional few dozen students of one class but tens of thousands of platform users. In this new situation, teachers' workload multiplies, and correcting compositions in particular is time-consuming and labor-intensive. Moreover, correction by teachers is often highly subjective: different teachers are likely to judge differently which sentences in the same composition are target sentences, that is, the result depends entirely on who reads the article, which does not help students improve their writing. Industries such as online education therefore urgently need a method for identifying target sentences efficiently and objectively.
Disclosure of Invention
The invention provides a target sentence recognition method and device, which are used for improving the efficiency of recognizing a target sentence in a text.
According to a first aspect of the embodiments of the present invention, there is provided a target sentence recognition method, including:
acquiring a text to be processed, wherein the text comprises one or more natural language sentences;
extracting the identification features of each sentence, wherein the identification features comprise first features and/or second features, the first features are used for indicating the features of the sentences in semantic aspect, and the second features are used for indicating the features of the sentences in literal aspect;
and identifying the target sentence in the text according to a pre-constructed target sentence identification model and the identification characteristics of each sentence in the text.
Optionally, when the identification feature includes a first feature, extracting the first feature of each sentence includes:
performing word segmentation on the current sentence;
obtaining a word vector of each word after word segmentation;
and acquiring a first characteristic of the current statement according to a word vector of each word of the current statement and a pre-constructed first identification model, wherein the first identification model sequentially comprises an LSTM-RNN layer, a pA operation layer, a weighted summation layer and an output layer.
Optionally, the obtaining the first feature of the current sentence according to the word vector of each word of the current sentence and the pre-constructed first recognition model includes:
inputting a word vector of each word of the current sentence into the LSTM-RNN layer;
taking the output of the LSTM-RNN layer as the input of the pA operation layer, and performing dot product operation on the pA operation layer by using pA vectors and the values of each node to enhance the historical information stored by each node;
then the input of the pA operation layer and the output of the pA operation layer are jointly used as the input of the weighted summation layer, and the weighted summation layer carries out weighted summation on the value of the node and the value of the node after the pA vector is enhanced;
and inputting the result of the weighted summation into the output layer, obtaining the initial probability of the current sentence belonging to the target sentence through a preset formula in the output layer, and taking the initial probability as the first characteristic of the sentence.
Optionally, the second feature comprises one or more of:
the part-of-speech distribution is used for indicating the number proportion of each part-of-speech word in the current sentence;
average word frequency, which is used for indicating the average value of the occurrence times of each word in the current sentence in all the collected texts;
the maximum word frequency and the minimum word frequency are used for indicating the maximum value and the minimum value of the occurrence times of each word in the current sentence in all the collected texts;
whether the idiom is contained;
the non-repeated word proportion is used for indicating the number proportion of the non-repeated words in the current sentence;
the repeated word type number is used for indicating the type number of repeated words in the current sentence, wherein the same type of repeated words is counted as one type.
Optionally:
extracting part-of-speech distribution of the current sentence, comprising:
counting the total word number in the current sentence, and calculating the ratio of the number of words of each part of speech in the current sentence to the total word number to obtain the part of speech distribution of the current sentence;
extracting the average word frequency of the current sentence, comprising:
respectively counting the occurrence times of each word in the current sentence in all the collected texts, and calculating the average value of the times to obtain the average word frequency of the current sentence;
extracting the maximum word frequency and the minimum word frequency of the current sentence, wherein the extracting comprises the following steps:
respectively counting the occurrence times of each word in the current sentence in all the collected texts, and selecting the maximum value and the minimum value of the times as the maximum word frequency and the minimum word frequency of the current sentence respectively;
extracting the proportion of non-repeated words of the current sentence, comprising the following steps:
respectively finding out non-repeated words in the current sentence, wherein the non-repeated words are words with different written forms, counting the total number of the non-repeated words, and taking the ratio of the total number of the non-repeated words to the total number of words of the current sentence as the non-repeated word proportion of the current sentence;
extracting the number of repeated word types of the current sentence, comprising the following steps:
and respectively finding repeated words in the current sentence, wherein the repeated words are words with the same written form, and the number of kinds of repeated words in the current sentence is used as the repeated word type number, wherein identical repeated words are counted as one type.
Optionally, the identifying the target sentence in the text according to the pre-established target sentence identification model and the identification feature of each sentence in the text includes:
taking the recognition features of the current sentence as the input of the target sentence recognition model;
receiving an output of the target sentence recognition model, wherein the output is a probability that the current sentence belongs to the target sentence;
and when the probability is greater than a preset threshold value, determining that the current statement belongs to the target statement.
Optionally, after the target sentence in the text is identified, the method further includes:
and marking the target sentence in the text by using a preset mode.
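The thresholding and marking steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `mark_target_sentences`, the default threshold of 0.5 and the `<<`/`>>` markers are all assumptions chosen for the example, and the probabilities are taken as already produced by some target sentence recognition model.

```python
def mark_target_sentences(sentences, probabilities, threshold=0.5,
                          open_mark="<<", close_mark=">>"):
    """Wrap every sentence whose probability exceeds the threshold in markers.

    sentences:     list of sentence strings from the text to be processed
    probabilities: model outputs, one per sentence (assumed to lie in [0, 1])
    threshold:     preset threshold (0.5 is a hypothetical value)
    """
    marked = []
    for sentence, p in zip(sentences, probabilities):
        if p > threshold:  # strictly greater, as in the description
            marked.append(f"{open_mark}{sentence}{close_mark}")
        else:
            marked.append(sentence)
    return marked


# Usage: only the first sentence exceeds the threshold.
print(mark_target_sentences(["A.", "B."], [0.9, 0.5]))  # ['<<A.>>', 'B.']
```

A sentence whose probability equals the threshold is not marked, because the description requires the probability to be greater than the preset threshold.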
According to a second aspect of the embodiments of the present invention, there is provided a target sentence recognition apparatus, the apparatus including:
the input module is used for acquiring a text to be processed, wherein the text comprises one or more natural language sentences;
the characteristic extraction module is used for extracting the identification characteristic of each statement, wherein the identification characteristic comprises a first characteristic and/or a second characteristic, the first characteristic is used for indicating the characteristic of the statement in the aspect of semantics, and the second characteristic is used for indicating the characteristic of the statement in the aspect of characters;
and the identification module is used for identifying the target sentence in the text according to a pre-constructed target sentence identification model and the identification characteristics of each sentence in the text.
Optionally, when the identification feature includes a first feature, extracting the first feature of each sentence includes:
performing word segmentation on the current sentence;
obtaining a word vector of each word after word segmentation;
and acquiring a first characteristic of the current statement according to a word vector of each word of the current statement and a pre-constructed first identification model, wherein the first identification model sequentially comprises an LSTM-RNN layer, a pA operation layer, a weighted summation layer and an output layer.
Optionally, when obtaining the first feature of the current sentence according to the word vector of each word of the current sentence and the pre-constructed first recognition model, the method includes:
inputting a word vector of each word of the current sentence into the LSTM-RNN layer;
taking the output of the LSTM-RNN layer as the input of the pA operation layer, and performing dot product operation on the pA operation layer by using pA vectors and the values of each node to enhance the historical information stored by each node;
then the input of the pA operation layer and the output of the pA operation layer are jointly used as the input of the weighted summation layer, and the weighted summation layer carries out weighted summation on the value of the node and the value of the node after the pA vector is enhanced;
and inputting the result of the weighted summation into the output layer, obtaining the initial probability of the current sentence belonging to the target sentence through a preset formula in the output layer, and taking the initial probability as the first characteristic of the sentence.
Optionally, the second feature comprises one or more of:
the part-of-speech distribution is used for indicating the number proportion of each part-of-speech word in the current sentence;
average word frequency, which is used for indicating the average value of the occurrence times of each word in the current sentence in all the collected texts;
the maximum word frequency and the minimum word frequency are used for indicating the maximum value and the minimum value of the occurrence times of each word in the current sentence in all the collected texts;
whether the idiom is contained;
the non-repeated word proportion is used for indicating the number proportion of the non-repeated words in the current sentence;
the repeated word type number is used for indicating the type number of repeated words in the current sentence, wherein the same type of repeated words is counted as one type.
Optionally:
extracting part-of-speech distribution of the current sentence, comprising:
counting the total word number in the current sentence, and calculating the ratio of the number of words of each part of speech in the current sentence to the total word number to obtain the part of speech distribution of the current sentence;
extracting the average word frequency of the current sentence, comprising:
respectively counting the occurrence times of each word in the current sentence in all the collected texts, and calculating the average value of the times to obtain the average word frequency of the current sentence;
extracting the maximum word frequency and the minimum word frequency of the current sentence, wherein the extracting comprises the following steps:
respectively counting the occurrence times of each word in the current sentence in all the collected texts, and selecting the maximum value and the minimum value of the times as the maximum word frequency and the minimum word frequency of the current sentence respectively;
extracting the proportion of non-repeated words of the current sentence, comprising the following steps:
respectively finding out non-repeated words in the current sentence, wherein the non-repeated words are words with different written forms, counting the total number of the non-repeated words, and taking the ratio of the total number of the non-repeated words to the total number of words of the current sentence as the non-repeated word proportion of the current sentence;
extracting the number of repeated word types of the current sentence, comprising the following steps:
and respectively finding repeated words in the current sentence, wherein the repeated words are words with the same written form, and the number of kinds of repeated words in the current sentence is used as the repeated word type number, wherein identical repeated words are counted as one type.
Optionally, the identification module is configured to:
taking the recognition features of the current sentence as the input of the target sentence recognition model;
receiving an output of the target sentence recognition model, wherein the output is a probability that the current sentence belongs to the target sentence;
and when the probability is greater than a preset threshold value, determining that the current statement belongs to the target statement.
Optionally, the apparatus further comprises:
and the marking module is used for marking the target sentence in the text in a preset mode.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
the method identifies each natural language sentence in the text according to the semantic features and/or the literal features of the sentence and the target sentence identification model constructed in advance through training, so that the sentences belonging to the target sentences (such as graceful sentences) can be automatically found, and the identification efficiency of the target sentences is greatly improved; meanwhile, the identification standard of the invention is based on objective characteristics and models, so that the identification result is objective, thereby avoiding the problem of subjectivity during manual identification.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise. Furthermore, these descriptions should not be construed as limiting the embodiments, wherein elements having the same reference number designation are identified as similar elements throughout the figures, and the drawings are not to scale unless otherwise specified.
FIG. 1 is a flowchart illustrating a target sentence recognition method according to an exemplary embodiment of the present invention;
FIG. 2 is a flowchart illustrating a target sentence recognition method according to an exemplary embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a first recognition model shown in accordance with an exemplary embodiment of the present invention;
FIG. 4 is a flowchart illustrating a target sentence recognition method according to an exemplary embodiment of the present invention;
FIG. 5 is a flowchart illustrating a target sentence recognition method according to an exemplary embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating a target sentence recognition apparatus according to an exemplary embodiment of the present invention;
FIG. 7 is a schematic diagram illustrating a target sentence recognition apparatus according to an exemplary embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating a target sentence recognition method according to an exemplary embodiment of the present invention. The method can be used for terminals such as mobile phones and computers, servers and the like.
Referring to fig. 1, the method may include:
step S101, a text to be processed is obtained, wherein the text comprises one or more natural language sentences.
For example, student compositions and the like may be received as the text to be processed. In the present invention, the terms "statement" and "sentence" are used interchangeably to refer to a natural language sentence. The text may be split into sentences according to its punctuation, that is, content ending with a period, question mark, exclamation mark, ellipsis, etc. is taken as one sentence.
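The punctuation-based splitting described above can be sketched as follows. The set of sentence-ending characters and the function name `split_sentences` are assumptions made for illustration; the text only names periods, question marks, exclamation marks and ellipses as examples.

```python
import re

# One sentence = a run of non-terminator characters followed by any run of
# terminators. The terminator set (Chinese and Western periods, question
# marks, exclamation marks, ellipsis) is an illustrative assumption.
_SENTENCE = re.compile(r"[^。！？!?.…]+[。！？!?.…]*")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping the terminating punctuation."""
    pieces = (m.group().strip() for m in _SENTENCE.finditer(text))
    return [p for p in pieces if p]


print(split_sentences("今天天气很好。你来吗？太好了！"))
# ['今天天气很好。', '你来吗？', '太好了！']
```

This simple rule also splits at decimal points and abbreviations, so a practical system would likely need additional handling for such cases.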
Step S102, extracting the identification characteristics of each statement, wherein the identification characteristics comprise a first characteristic and/or a second characteristic, the first characteristic is used for indicating the characteristic of the statement in the aspect of semantics, and the second characteristic is used for indicating the characteristic of the statement in the aspect of characters.
The first feature and the second feature describe the sentence from two different perspectives, semantic and literal. In use, the identification features of a sentence may comprise the first feature, the second feature, or a combination of both. This embodiment does not limit the specific content of the first feature and the second feature; those skilled in the art can design these features according to different needs and different scenarios, and such designs can be used herein without departing from the spirit and scope of the present invention.
Step S103, identifying the target sentence in the text according to a pre-constructed target sentence identification model and the identification characteristics of each sentence in the text.
For example, a large amount of text may be collected in advance and manually labeled, so as to serve as a training sample, and the target sentence recognition model may be constructed in advance through training. When the sentence recognition model is used, the recognition features of a sentence are input into the target sentence recognition model, so that whether the sentence belongs to the target sentence or not is judged according to the output. For example, the output may be a probability that the sentence belongs to the target sentence, and for a scenario of a graceful sentence, the probability may be referred to as a graceful degree of the sentence.
In the embodiment, each natural language sentence in the text is identified according to the semantic features and/or the literal features of the sentence and the target sentence identification model pre-constructed through training, so that the sentences belonging to the target sentence (for example, graceful sentence) can be automatically found, and the identification efficiency of the target sentence is greatly improved; meanwhile, the identification standard of the invention is based on objective characteristics and models, so that the identification result is objective, thereby avoiding the problem of subjectivity during manual identification.
Referring to fig. 2, in this embodiment or some other embodiments of the present invention, when the identification feature includes a first feature, extracting the first feature of each sentence may include:
step S201, performing word segmentation on the current sentence.
The embodiment is not limited to a specific word segmentation technique, and for example, a conditional random field method may be used to segment a text.
Step S202, obtaining word vectors of each word after word segmentation.
For example, word vectors for each word may be trained using the word2vec method.
For a sentence, its word vectors may be represented as (w_1, w_2, ..., w_n).
Step S203, acquiring a first characteristic of the current statement according to a word vector of each word of the current statement and a pre-constructed first identification model, wherein the first identification model sequentially comprises an LSTM-RNN layer, a pA operation layer, a weighted summation layer and an output layer. Here, RNN stands for recurrent neural network and LSTM for Long Short-Term Memory.
As an example, FIG. 3 shows an exemplary structure of the first recognition model, which may include an LSTM-RNN layer, a pA (pseudo-attention) operation layer, a weighted summation (weighted sum) layer, and an output layer.
As an example, the obtaining the first feature of the current sentence according to the word vector of each word of the current sentence and the pre-constructed first recognition model may specifically include:
i) the word vector for each word of the current sentence is input into the LSTM-RNN layer.
Taking the word vectors (w_1, w_2, ..., w_n) of a sentence as the input of the LSTM-RNN layer, the current sentence is encoded by the LSTM-RNN layer, and the history information of each word is stored during encoding. The value h_t of the t-th node of the LSTM-RNN layer is h_t = LSTM(w_t, h_{t-1}), where LSTM() is the function that encodes the input word vector and h_{t-1} is the value of the (t-1)-th node, i.e. the history information of the (t-1)-th node. LSTM-RNN belongs to the prior art and is not described in detail herein.
ii) taking the output of the LSTM-RNN layer as the input of the pA operation layer, and performing dot product operation on the pA operation layer by using pA vectors and the values of each node so as to enhance the historical information stored by each node.
The output of the LSTM-RNN layer is the input of the pA operation layer. The layer is called the pA operation layer because it performs a dot-product operation on the nodes using the pA vector. Enhancing the history information stored in each node prevents that history information from degrading over time. The value α_t of the enhanced t-th node is α_t = dot(h_t, a), where dot() is the dot-product function, a is an element of the pA vector, and the pA vector is a model parameter whose specific values can be obtained by training on a large amount of text data. Nodes belong to the prior art in fields such as neural networks and are not described further herein.
And iii) taking the input of the pA operation layer and the output of the pA operation layer as the input of the weighted summation layer, and carrying out weighted summation on the value of the node and the value of the node after the pA vector is enhanced by the weighted summation layer.
Before the weighted summation, the pA-enhanced node values can be normalized to obtain the normalized value β_t of the t-th node:
[Equation image BDA0001104872930000111: normalization formula for β_t]
β_t and the node values h_t are then weighted and summed to obtain h:
[Equation image BDA0001104872930000112: weighted-sum formula for h]
iv) inputting the result of the weighted summation into the output layer, obtaining the initial probability of the current statement belonging to the target statement through a preset formula in the output layer, and taking the initial probability as the first characteristic of the statement.
As an example, the preset formula may be p = sigmoid(W × h + b), where p is the output and W and b are model parameters whose specific values can be obtained by training on a large amount of text data.
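Steps ii) to iv) can be sketched in plain Python as below, taking the LSTM-RNN node values h_1..h_n as given. Several points are assumptions rather than statements of the source: the normalization and weighted-sum formulas appear only as equation images in this text, so the sketch assumes a softmax normalization and h = Σ_t β_t · h_t; it also treats a as a vector of the same dimension as h_t and W as a weight vector. The function names are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def pa_first_feature(hidden_states, pa_vector, w, b):
    """Return the initial probability p from LSTM-RNN node values.

    hidden_states: list of n vectors h_t (output of the LSTM-RNN layer)
    pa_vector:     trained pA vector a (assumed same dimension as h_t)
    w, b:          trained output-layer weight vector and bias
    """
    # pA operation layer: alpha_t = dot(h_t, a)
    alphas = [dot(h_t, pa_vector) for h_t in hidden_states]

    # Normalization (ASSUMED softmax; the maximum is subtracted for stability)
    m = max(alphas)
    exps = [math.exp(alpha - m) for alpha in alphas]
    total = sum(exps)
    betas = [e / total for e in exps]

    # Weighted summation layer (ASSUMED form): h = sum_t beta_t * h_t
    dim = len(hidden_states[0])
    h = [sum(beta * h_t[i] for beta, h_t in zip(betas, hidden_states))
         for i in range(dim)]

    # Output layer: p = sigmoid(W * h + b)
    z = dot(w, h) + b
    return 1.0 / (1.0 + math.exp(-z))


# Usage with toy values: equal attention on both nodes and a zero logit.
p = pa_first_feature([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [2.0, 2.0], -2.0)
print(p)  # 0.5
```

In a trained model, `hidden_states` would come from the LSTM-RNN layer and `pa_vector`, `w` and `b` from training; here they are hand-picked only to make the arithmetic easy to follow.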
Of course, in other embodiments of the present invention, the first recognition model may also be described by other models, such as a CNN (convolutional neural network) or an LSTM (Long Short-Term Memory) network. Alternatively, the first recognition model may be described by several different neural network models, each of which yields a first feature of the current sentence, and the resulting first features are then used together as the first feature of the current sentence.
In this embodiment or some other embodiments of the invention, the second feature may include one or more of the following:
1) the part-of-speech distribution is used for indicating the number proportion of each part-of-speech word in the current sentence;
in specific implementation, extracting part-of-speech distribution of the current sentence may include:
counting the total word number in the current sentence, and calculating the ratio of the number of words of each part of speech (such as nouns, verbs, adjectives, adverbs, conjunctions and the like) in the current sentence to the total word number to obtain the part of speech distribution of the current sentence.
For example, if the current sentence is "the grass starts to burrow from the ground surreptitiously", word segmentation and part-of-speech tagging yield 10 words in total: 2 nouns, 3 verbs, 1 adjective, 1 adverb, 0 conjunctions, and 3 words of other parts of speech. The part-of-speech distribution of the sentence over nouns, verbs, adjectives, adverbs, conjunctions, and other words is therefore: 0.2, 0.3, 0.1, 0.1, 0.0, 0.3.
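The calculation in the example above can be sketched as follows. The tag names and the function `pos_distribution` are assumptions for illustration; the input is taken to be the output of some segmenter and part-of-speech tagger, which the text does not specify.

```python
from collections import Counter

# Part-of-speech categories, in the order used by the example above.
POS_TAGS = ("noun", "verb", "adjective", "adverb", "conjunction", "other")


def pos_distribution(tagged_words):
    """Return the share of each part of speech in one sentence.

    tagged_words: list of (word, tag) pairs; unknown tags count as "other".
    """
    total = len(tagged_words)
    if total == 0:
        return [0.0] * len(POS_TAGS)
    counts = Counter(tag if tag in POS_TAGS else "other"
                     for _, tag in tagged_words)
    return [counts[tag] / total for tag in POS_TAGS]


# Usage: 10 words with the tag counts from the example (placeholder words).
tags = ["noun"] * 2 + ["verb"] * 3 + ["adjective", "adverb"] + ["other"] * 3
sentence = [(f"w{i}", tag) for i, tag in enumerate(tags)]
print(pos_distribution(sentence))  # [0.2, 0.3, 0.1, 0.1, 0.0, 0.3]
```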
2) Average word frequency, which is used for indicating the average value of the occurrence times of each word in the current sentence in all the collected texts;
in specific implementation, extracting the average word frequency of the current sentence may include:
and respectively counting the occurrence times of each word in the current sentence in all the collected texts, and calculating the average value of the times to obtain the average word frequency of the current sentence.
3) The maximum word frequency and the minimum word frequency are used for indicating the maximum value and the minimum value of the occurrence times of each word in the current sentence in all the collected texts;
in specific implementation, extracting the maximum word frequency and the minimum word frequency of the current sentence may include:
and respectively counting the occurrence times of each word in the current sentence in all the collected texts, and selecting the maximum value and the minimum value of the times as the maximum word frequency and the minimum word frequency of the current sentence.
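The average, maximum and minimum word frequencies described in items 2) and 3) can be sketched as follows. The function names are hypothetical, and the handling of words that never occur in the collected texts (counted as 0) is an assumption the text does not address.

```python
from collections import Counter


def build_corpus_counts(segmented_texts):
    """Count the occurrences of every word over all collected texts.

    segmented_texts: iterable of word lists, one list per collected text.
    """
    counts = Counter()
    for words in segmented_texts:
        counts.update(words)
    return counts


def word_frequency_features(sentence_words, corpus_counts):
    """Return (average, maximum, minimum) corpus frequency of the words."""
    if not sentence_words:
        return 0.0, 0, 0
    # A Counter returns 0 for unseen words (assumed behaviour for this sketch).
    freqs = [corpus_counts[word] for word in sentence_words]
    return sum(freqs) / len(freqs), max(freqs), min(freqs)


# Usage: "a" occurs 3 times in the corpus, "b" once, "d" never.
corpus = build_corpus_counts([["a", "b", "a"], ["a", "c"]])
print(word_frequency_features(["a", "b", "d"], corpus))
```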
4) Whether the idiom is contained;
In specific implementation, each word in the current sentence can be checked in turn against a pre-constructed idiom table; if any word in the current sentence is an idiom, the current sentence is considered to contain an idiom, and otherwise it is considered not to contain one. The result may be represented by 0 or 1, for example 1 indicating that the current sentence contains an idiom and 0 indicating that it does not.
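The idiom check can be sketched as a table lookup. The function name and the two-entry idiom table below are illustrative assumptions; the sketch also assumes the segmenter keeps each idiom as a single word.

```python
def contains_idiom(sentence_words, idiom_table):
    """Return 1 if any word of the sentence is in the idiom table, else 0."""
    return int(any(word in idiom_table for word in sentence_words))


# Usage with a tiny hypothetical idiom table (a set gives fast lookups).
idioms = {"画蛇添足", "守株待兔"}
print(contains_idiom(["他", "这样", "做", "是", "画蛇添足"], idioms))  # 1
print(contains_idiom(["今天", "天气", "很", "好"], idioms))  # 0
```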
5) The non-repeated word proportion is used for indicating the number proportion of the non-repeated words in the current sentence;
In a specific implementation, extracting the non-repeated word proportion of the current sentence may include:
finding the non-repeated words in the current sentence, where non-repeated words are words whose written forms differ from one another, counting the total number of non-repeated words, and taking the ratio of that number to the total number of words in the current sentence as the non-repeated word proportion of the current sentence.
For example, if the current sentence is "the grass starts to burrow from the ground surreptitiously", 10 words are obtained after word segmentation. These comprise 2 identical words, namely the former "ground" and the latter "ground", and 8 different words, so the non-repeated word proportion of the sentence is 8/10 = 0.8.
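A minimal sketch of this feature follows. It assumes that a "non-repeated word" is a word occurring exactly once in the sentence, which matches the count of 8 different words out of 10 in the example; counting distinct written forms instead (9 of 10 here) would be another possible reading. The function name and the placeholder words are hypothetical.

```python
from collections import Counter


def non_repeated_word_ratio(sentence_words):
    """Share of words in the sentence that occur exactly once."""
    total = len(sentence_words)
    if total == 0:
        return 0.0
    counts = Counter(sentence_words)
    singles = sum(1 for count in counts.values() if count == 1)
    return singles / total


# Ten placeholder words: eight occur once, one word occurs twice.
words = [f"w{i}" for i in range(8)] + ["ground", "ground"]
print(non_repeated_word_ratio(words))  # 0.8
```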
6) The repeated word type number is used for indicating the type number of repeated words in the current sentence, wherein the same type of repeated words is counted as one type.
In a specific implementation, extracting the number of repeated word types of the current sentence may include:
finding the repeated words in the current sentence, where repeated words are words with the same written form, and taking the number of distinct repeated words as the number of repeated word types, where all occurrences of the same repeated word are counted as one type.
For example, if in the current sentence (rendered here as "hello, welcome") the words "hello" and "good" each appear twice, both are repeated words; because the two have different written forms, the number of repeated word types of the current sentence is 2.
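Counting repeated word types reduces to counting the distinct words that occur more than once. The sketch below assumes the sentence is already segmented; the function name `repeated_word_type_count` is hypothetical.

```python
from collections import Counter


def repeated_word_type_count(sentence_words):
    """Number of distinct words that occur more than once in the sentence."""
    counts = Counter(sentence_words)
    return sum(1 for count in counts.values() if count > 1)


# Two different words, each appearing twice, give two repeated word types.
print(repeated_word_type_count(["hello", "hello", "good", "good"]))  # 2
```

A word that appears three or more times still contributes a single type, in line with the rule that the same repeated word is counted as one type.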
Referring to fig. 4, in this embodiment or some other embodiments of the present invention, recognizing the target sentence in the text according to the pre-constructed target sentence recognition model and the recognition features of each sentence in the text may include:
Step S401, using the recognition features of the current sentence as the input of the target sentence recognition model.
Step S402, receiving the output of the target sentence recognition model, wherein the output is the probability that the current sentence belongs to the target sentence.
Step S403, when the probability is greater than a preset threshold, determining that the current sentence belongs to the target sentence.
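Steps S401 to S403 can be sketched as below. The model is represented by any callable that maps a feature list to a probability; `toy_model` is a stand-in written for this illustration and is not a trained classifier, and the default threshold of 0.5 is an assumed value, since the text only speaks of a preset threshold.

```python
def is_target_sentence(features, model, threshold=0.5):
    """Decide whether a sentence is a target sentence.

    S401/S402: feed the recognition features to the model and read the
    probability it returns. S403: compare with the preset threshold.
    """
    probability = model(features)
    return probability > threshold


def toy_model(features):
    """Stand-in model: mean of the features, clamped to [0, 1]."""
    return min(1.0, max(0.0, sum(features) / len(features)))


print(is_target_sentence([0.9, 0.7], toy_model))  # True
print(is_target_sentence([0.2, 0.1], toy_model))  # False
```

The comparison is strict, following the wording "greater than a preset threshold", so a probability exactly equal to the threshold is not accepted.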
By way of example, the target sentence recognition model may be a common classification model, such as a support vector machine model, a decision tree model, or the like.
The target sentence recognition model can be obtained by pre-training. For example, the recognition features of a sentence, together with a manually annotated label indicating whether the sentence belongs to the target sentence, may be used as a training sample to train and update the parameters of the model.
The manual labels fall into two classes: the current sentence is the target sentence, or the current sentence is not the target sentence. If 0 and 1 are used, a label of 1 indicates that the current sentence is the target sentence and a label of 0 indicates that it is not. During labeling, the same sentence can be submitted to two annotators separately; if their labels agree, the label is considered correct; otherwise, the sentence can be submitted to a domain expert, whose label is taken as final. The parameters of the model are updated with the training samples, and the parameter values of the target sentence recognition model are obtained when training is finished. The specific training process is not described in detail.
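The double-annotation rule described above can be sketched as a small adjudication function. The function name `resolve_label` and the choice to raise an error when the annotators disagree and no expert label is available are assumptions of this illustration.

```python
def resolve_label(label_a, label_b, expert_label=None):
    """Combine two annotators' 0/1 labels into one training label.

    If both annotators agree, their label is used. Otherwise the
    domain expert's label is taken as final.
    """
    if label_a == label_b:
        return label_a
    if expert_label is None:
        raise ValueError("annotators disagree and no expert label was given")
    return expert_label


print(resolve_label(1, 1))                  # 1
print(resolve_label(1, 0, expert_label=0))  # 0
```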
In addition, referring to fig. 5, in this embodiment or some other embodiments of the present invention, after the target sentence in the text is identified, the method may further include:
Step S104, marking the target sentence in the text in a preset manner.
For example, taking the target sentence as an elegant sentence, after the elegant sentences in an article are identified, they may be marked in the article. The specific marking method is not limited in the present invention; for example, an elegant sentence may be marked with a different font color, bold type, or underlining, or may be enclosed in a box.
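Since the marking method is left open, the sketch below picks one simple option for plain text: wrapping each identified sentence in a pair of markers. The function name, the default markers `[[` and `]]`, and the space separator are all assumptions of this illustration; colors, bold type, or boxes would need a rich-text format instead.

```python
def mark_target_sentences(sentences, is_target, open_mark="[[", close_mark="]]",
                          separator=" "):
    """Rebuild the text, wrapping each target sentence in markers.

    sentences is the list of sentences of the text, and is_target holds
    one boolean per sentence as produced by the recognition step.
    """
    if len(sentences) != len(is_target):
        raise ValueError("sentences and is_target must have the same length")
    marked = [
        open_mark + sentence + close_mark if flag else sentence
        for sentence, flag in zip(sentences, is_target)
    ]
    return separator.join(marked)


text = ["Spring is near.", "The grass burrows out of the ground.", "It is warm."]
print(mark_target_sentences(text, [False, True, False]))
# Spring is near. [[The grass burrows out of the ground.]] It is warm.
```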
In this embodiment, each natural language sentence in the text is identified according to the semantic features and/or the literal features of the sentence and the target sentence recognition model pre-constructed through training, so that sentences belonging to the target sentence (for example, elegant sentences) can be found automatically, which greatly improves the efficiency of target sentence identification; meanwhile, because the identification standard of the invention is based on objective features and models, the identification result is objective, which avoids the subjectivity of manual identification.
The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.
Fig. 6 is a schematic diagram illustrating a target sentence recognition apparatus according to an exemplary embodiment of the present invention. The device can be used for terminals such as mobile phones and computers, servers and the like.
Referring to fig. 6, the apparatus may include:
an input module 601, configured to obtain a text to be processed, where the text includes one or more natural language sentences;
a feature extraction module 602, configured to extract an identification feature of each sentence, where the identification feature includes a first feature and/or a second feature, the first feature is used to indicate a feature of the sentence in a semantic aspect, and the second feature is used to indicate a feature of the sentence in a literal aspect;
the identifying module 603 is configured to identify a target sentence in the text according to a pre-constructed target sentence identifying model and the identifying feature of each sentence in the text.
In this embodiment or some other embodiments of the present invention, when the identification feature includes a first feature, extracting the first feature of each sentence may include:
performing word segmentation on the current sentence;
obtaining a word vector of each word after word segmentation;
acquiring a first feature of the current sentence according to the word vector of each word of the current sentence and a pre-constructed first recognition model, wherein the first recognition model sequentially comprises an LSTM-RNN layer, a pA operation layer, a weighted summation layer and an output layer.
In this embodiment or some other embodiments of the present invention, when obtaining the first feature of the current sentence according to the word vector of each word of the current sentence and the pre-constructed first recognition model, the obtaining may include:
inputting a word vector of each word of the current sentence into the LSTM-RNN layer;
taking the output of the LSTM-RNN layer as the input of the pA operation layer, and performing, in the pA operation layer, a dot product operation between the pA vector and the value of each node, so as to enhance the historical information stored by each node;
then taking the input of the pA operation layer and the output of the pA operation layer jointly as the input of the weighted summation layer, which performs a weighted summation of the value of each node and the value of that node after enhancement by the pA vector;
inputting the result of the weighted summation into the output layer, obtaining, through a preset formula in the output layer, the initial probability that the current sentence belongs to the target sentence, and taking the initial probability as the first feature of the sentence.
In this embodiment or some other embodiments of the invention, the second feature may include one or more of the following:
the part-of-speech distribution is used for indicating the number proportion of each part-of-speech word in the current sentence;
average word frequency, which is used for indicating the average value of the occurrence times of each word in the current sentence in all the collected texts;
the maximum word frequency and the minimum word frequency are used for indicating the maximum value and the minimum value of the occurrence times of each word in the current sentence in all the collected texts;
whether the idiom is contained;
the non-repeated word proportion is used for indicating the number proportion of the non-repeated words in the current sentence;
the repeated word type number is used for indicating the type number of repeated words in the current sentence, wherein the same type of repeated words is counted as one type.
In this embodiment or some other embodiment of the invention:
extracting part-of-speech distributions of the current sentence may include:
counting the total word number in the current sentence, and calculating the ratio of the number of words of each part of speech in the current sentence to the total word number to obtain the part of speech distribution of the current sentence;
extracting the average word frequency of the current sentence may include:
respectively counting the occurrence times of each word in the current sentence in all the collected texts, and calculating the average value of the times to obtain the average word frequency of the current sentence;
extracting the maximum word frequency and the minimum word frequency of the current sentence may include:
respectively counting the occurrence times of each word in the current sentence in all the collected texts, and selecting the maximum value and the minimum value of the times as the maximum word frequency and the minimum word frequency of the current sentence respectively;
extracting the non-repeated word proportion of the current sentence may include:
finding the non-repeated words in the current sentence, where non-repeated words are words whose written forms differ from one another, counting the total number of non-repeated words, and taking the ratio of that number to the total number of words in the current sentence as the non-repeated word proportion of the current sentence;
extracting the number of repeated word types of the current sentence may include:
finding the repeated words in the current sentence, where repeated words are words with the same written form, and taking the number of distinct repeated words as the number of repeated word types, where all occurrences of the same repeated word are counted as one type.
In this embodiment or some other embodiments of the present invention, the identification module may be configured to:
taking the recognition features of the current sentence as the input of the target sentence recognition model;
receiving an output of the target sentence recognition model, wherein the output is a probability that the current sentence belongs to the target sentence;
and when the probability is greater than a preset threshold, determining that the current sentence belongs to the target sentence.
Referring to fig. 7, in this embodiment or some other embodiments of the present invention, the apparatus may further include:
a marking module 604, configured to mark the target sentence in the text in a preset manner.
In this embodiment, each natural language sentence in the text is identified according to the semantic features and/or the literal features of the sentence and the target sentence recognition model pre-constructed through training, so that sentences belonging to the target sentence (for example, elegant sentences) can be found automatically, which greatly improves the efficiency of target sentence identification; meanwhile, because the identification standard of the invention is based on objective features and models, the identification result is objective, which avoids the subjectivity of manual identification.
The specific manner in which each unit/module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (12)

1. A target sentence recognition method, the method comprising:
acquiring a text to be processed, wherein the text comprises one or more natural language sentences;
extracting the identification features of each sentence, wherein the identification features comprise first features and/or second features, the first features are used for indicating the features of the sentences in the semantic aspect, and the second features are used for indicating the features of the sentences in the literal aspect, and wherein, when the identification features comprise the first features, extracting the first features of each sentence comprises:
performing word segmentation on the current sentence;
obtaining a word vector of each word after word segmentation;
acquiring a first feature of the current sentence according to the word vector of each word of the current sentence and a pre-constructed first recognition model, wherein the first recognition model sequentially comprises an LSTM-RNN layer, a pA operation layer, a weighted summation layer and an output layer, the LSTM-RNN layer is used for encoding the word vectors of the current sentence to obtain corresponding node values, the output of the LSTM-RNN layer is used as the input of the pA operation layer, the pA operation layer is a structural layer that performs a dot product operation between the pA vector and the value of each node, and the pA vector is a model parameter;
and identifying the target sentence in the text according to a pre-constructed target sentence identification model and the identification characteristics of each sentence in the text.
2. The method of claim 1, wherein obtaining the first feature of the current sentence according to the word vector of each word of the current sentence and the pre-constructed first recognition model comprises:
inputting a word vector of each word of the current sentence into the LSTM-RNN layer;
taking the output of the LSTM-RNN layer as the input of the pA operation layer, and performing, in the pA operation layer, a dot product operation between the pA vector and the value of each node, so as to enhance the historical information stored by each node;
then taking the input of the pA operation layer and the output of the pA operation layer jointly as the input of the weighted summation layer, which performs a weighted summation of the value of each node and the value of that node after enhancement by the pA vector;
and inputting the result of the weighted summation into the output layer, obtaining, through a preset formula in the output layer, the initial probability that the current sentence belongs to the target sentence, and taking the initial probability as the first feature of the sentence.
3. The method of claim 1, wherein the second feature comprises one or more of:
the part-of-speech distribution is used for indicating the number proportion of each part-of-speech word in the current sentence;
average word frequency, which is used for indicating the average value of the occurrence times of each word in the current sentence in all the collected texts;
the maximum word frequency and the minimum word frequency are used for indicating the maximum value and the minimum value of the occurrence times of each word in the current sentence in all the collected texts;
whether the idiom is contained;
the non-repeated word proportion is used for indicating the number proportion of the non-repeated words in the current sentence;
the repeated word type number is used for indicating the type number of repeated words in the current sentence, wherein the same type of repeated words is counted as one type.
4. The method of claim 3, wherein:
extracting part-of-speech distribution of the current sentence, comprising:
counting the total word number in the current sentence, and calculating the ratio of the number of words of each part of speech in the current sentence to the total word number to obtain the part of speech distribution of the current sentence;
extracting the average word frequency of the current sentence, comprising:
respectively counting the occurrence times of each word in the current sentence in all the collected texts, and calculating the average value of the times to obtain the average word frequency of the current sentence;
extracting the maximum word frequency and the minimum word frequency of the current sentence, wherein the extracting comprises the following steps:
respectively counting the occurrence times of each word in the current sentence in all the collected texts, and selecting the maximum value and the minimum value of the times as the maximum word frequency and the minimum word frequency of the current sentence respectively;
extracting the proportion of non-repeated words of the current sentence, comprising the following steps:
finding the non-repeated words in the current sentence, where non-repeated words are words whose written forms differ from one another, counting the total number of non-repeated words, and taking the ratio of that number to the total number of words in the current sentence as the non-repeated word proportion of the current sentence;
extracting the number of repeated word types of the current sentence, comprising:
finding the repeated words in the current sentence, where repeated words are words with the same written form, and taking the number of distinct repeated words as the number of repeated word types, where all occurrences of the same repeated word are counted as one type.
5. The method of claim 1, wherein the identifying the target sentence in the text according to the pre-constructed target sentence identification model and the identification feature of each sentence in the text comprises:
taking the recognition features of the current sentence as the input of the target sentence recognition model;
receiving an output of the target sentence recognition model, wherein the output is a probability that the current sentence belongs to the target sentence;
and when the probability is greater than a preset threshold, determining that the current sentence belongs to the target sentence.
6. The method of claim 1, wherein after the identifying the target sentence in the text, the method further comprises:
and marking the target sentence in the text by using a preset mode.
7. An apparatus for recognizing a target sentence, the apparatus comprising:
the input module is used for acquiring a text to be processed, wherein the text comprises one or more natural language sentences;
a feature extraction module, configured to extract an identification feature of each sentence, where the identification feature includes a first feature and/or a second feature, the first feature is used to indicate a feature of the sentence in a semantic aspect, and the second feature is used to indicate a feature of the sentence in a literal aspect, where, when the identification feature includes the first feature, extracting the first feature of each sentence, including:
performing word segmentation on the current sentence;
obtaining a word vector of each word after word segmentation;
acquiring a first feature of the current sentence according to the word vector of each word of the current sentence and a pre-constructed first recognition model, wherein the first recognition model sequentially comprises an LSTM-RNN layer, a pA operation layer, a weighted summation layer and an output layer, the LSTM-RNN layer is used for encoding the word vectors of the current sentence to obtain corresponding node values, the output of the LSTM-RNN layer is used as the input of the pA operation layer, the pA operation layer is a structural layer that performs a dot product operation between the pA vector and the value of each node, and the pA vector is a model parameter;
and the identification module is used for identifying the target sentence in the text according to a pre-constructed target sentence identification model and the identification characteristics of each sentence in the text.
8. The apparatus of claim 7, wherein the obtaining the first feature of the current sentence according to the word vector of each word of the current sentence and the pre-constructed first recognition model comprises:
inputting a word vector of each word of the current sentence into the LSTM-RNN layer;
taking the output of the LSTM-RNN layer as the input of the pA operation layer, and performing, in the pA operation layer, a dot product operation between the pA vector and the value of each node, so as to enhance the historical information stored by each node;
then taking the input of the pA operation layer and the output of the pA operation layer jointly as the input of the weighted summation layer, which performs a weighted summation of the value of each node and the value of that node after enhancement by the pA vector;
and inputting the result of the weighted summation into the output layer, obtaining, through a preset formula in the output layer, the initial probability that the current sentence belongs to the target sentence, and taking the initial probability as the first feature of the sentence.
9. The apparatus of claim 7, wherein the second feature comprises one or more of:
the part-of-speech distribution is used for indicating the number proportion of each part-of-speech word in the current sentence;
average word frequency, which is used for indicating the average value of the occurrence times of each word in the current sentence in all the collected texts;
the maximum word frequency and the minimum word frequency are used for indicating the maximum value and the minimum value of the occurrence times of each word in the current sentence in all the collected texts;
whether the idiom is contained;
the non-repeated word proportion is used for indicating the number proportion of the non-repeated words in the current sentence;
the repeated word type number is used for indicating the type number of repeated words in the current sentence, wherein the same type of repeated words is counted as one type.
10. The apparatus of claim 9, wherein:
extracting part-of-speech distribution of the current sentence, comprising:
counting the total word number in the current sentence, and calculating the ratio of the number of words of each part of speech in the current sentence to the total word number to obtain the part of speech distribution of the current sentence;
extracting the average word frequency of the current sentence, comprising:
respectively counting the occurrence times of each word in the current sentence in all the collected texts, and calculating the average value of the times to obtain the average word frequency of the current sentence;
extracting the maximum word frequency and the minimum word frequency of the current sentence, wherein the extracting comprises the following steps:
respectively counting the occurrence times of each word in the current sentence in all the collected texts, and selecting the maximum value and the minimum value of the times as the maximum word frequency and the minimum word frequency of the current sentence respectively;
extracting the proportion of non-repeated words of the current sentence, comprising the following steps:
finding the non-repeated words in the current sentence, where non-repeated words are words whose written forms differ from one another, counting the total number of non-repeated words, and taking the ratio of that number to the total number of words in the current sentence as the non-repeated word proportion of the current sentence;
extracting the number of repeated word types of the current sentence, comprising:
finding the repeated words in the current sentence, where repeated words are words with the same written form, and taking the number of distinct repeated words as the number of repeated word types, where all occurrences of the same repeated word are counted as one type.
11. The apparatus of claim 7, wherein the identification module is configured to:
taking the recognition features of the current sentence as the input of the target sentence recognition model;
receiving an output of the target sentence recognition model, wherein the output is a probability that the current sentence belongs to the target sentence;
and when the probability is greater than a preset threshold, determining that the current sentence belongs to the target sentence.
12. The apparatus of claim 7, further comprising:
and the marking module is used for marking the target sentence in the text in a preset mode.
CN201610792978.5A 2016-08-31 2016-08-31 Target statement identification method and device Active CN107783958B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610792978.5A CN107783958B (en) 2016-08-31 2016-08-31 Target statement identification method and device

Publications (2)

Publication Number Publication Date
CN107783958A CN107783958A (en) 2018-03-09
CN107783958B true CN107783958B (en) 2021-07-02

Family

ID=61451435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610792978.5A Active CN107783958B (en) 2016-08-31 2016-08-31 Target statement identification method and device

Country Status (1)

Country Link
CN (1) CN107783958B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325234B (en) * 2018-10-10 2023-06-20 深圳前海微众银行股份有限公司 Sentence processing method, sentence processing device and computer readable storage medium
CN111767709A (en) * 2019-03-27 2020-10-13 武汉慧人信息科技有限公司 Logic method for carrying out error correction and syntactic analysis on English text
CN110147542A (en) * 2019-05-23 2019-08-20 联想(北京)有限公司 A kind of information processing method and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391837A (en) * 2014-11-19 2015-03-04 熊玮 Intelligent grammatical analysis method based on case semantics
WO2015165372A1 (en) * 2014-04-29 2015-11-05 Tencent Technology (Shenzhen) Company Limited Method and apparatus for classifying object based on social networking service, and storage medium
CN105427858A (en) * 2015-11-06 2016-03-23 科大讯飞股份有限公司 Method and system for achieving automatic voice classification
CN105808689A (en) * 2016-03-03 2016-07-27 中国地质大学(武汉) Drainage system entity semantic similarity measurement method based on artificial neural network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004764A (en) * 2010-11-04 2011-04-06 中国科学院计算机网络信息中心 Internet bad information detection method and system
US20150310862A1 (en) * 2014-04-24 2015-10-29 Microsoft Corporation Deep learning for semantic parsing including semantic utterance classification
CN104850540A (en) * 2015-05-29 2015-08-19 北京京东尚科信息技术有限公司 Sentence recognizing method and sentence recognizing device
CN105550291B (en) * 2015-12-10 2019-05-31 百度在线网络技术(北京)有限公司 File classification method and device
CN105787461B (en) * 2016-03-15 2019-07-23 浙江大学 Document adverse reaction entity recognition method based on text classification and condition random field

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
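Automatic text classification method based on semantic features (基于语义特征的自动文本分类方法); Hu Xiaohui et al.; Computer and Modernization (《计算机与现代化》); 2010-11-30 (No. 183); pp. 9-11, 15 *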

Also Published As

Publication number Publication date
CN107783958A (en) 2018-03-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant