Summary of the Invention
The purpose of the embodiments of the present application is to propose an improved method and apparatus for determining the core sentence of a text, so as to solve the technical problems mentioned in the Background section above.
In a first aspect, an embodiment of the present application provides a method for determining the core sentence of a text. The method includes: obtaining a target text from a preset text set, where the text set includes multiple texts and each text includes multiple sentences divided by a preset symbol; calculating basic features of a first sentence in the target text, where the basic features include term frequency-inverse document frequency, information entropy, repetition rate, and similarity to the title of the target text, and the first sentence is any sentence in the target text; and determining, based on the basic features of the first sentence, whether the first sentence is a core sentence of the target text.
In some embodiments, calculating the basic features of the first sentence in the target text includes: performing word segmentation on the sentences of each text in the text set to obtain the segmented words, where the words obtained by segmenting the first sentence are first words; calculating the term frequency-inverse document frequency of each first word, and determining the term frequency-inverse document frequency of the first sentence according to the term frequency-inverse document frequency of each first word; calculating the word frequency of each first word in the target text, and determining the information entropy of each first word according to its word frequency in the target text; calculating the repetition rate of the first sentence in the target text; and calculating the similarity between the first sentence and the title of the target text.
In some embodiments, the above method further includes: performing data cleansing on each text in the text set to obtain the title and body of each text.
In some embodiments, calculating the term frequency-inverse document frequency of each first word and determining the term frequency-inverse document frequency of the first sentence according to the term frequency-inverse document frequency of each first word includes: obtaining the word frequency of each first word in the target text; obtaining the inverse document frequency of each first word in the text set; calculating the term frequency-inverse document frequency of each first word from its word frequency and inverse document frequency; and summing the term frequency-inverse document frequencies of the first words to determine the term frequency-inverse document frequency of the first sentence.
In some embodiments, calculating the word frequency of each first word in the target text and determining the information entropy of each first word according to its word frequency in the target text includes: obtaining the word frequency of each first word in the target text and calculating the information entropy of each first word; and summing the information entropies of the first words to determine the information entropy of the first sentence.
In some embodiments, calculating the similarity between the first sentence and the title of the target text includes: calculating the edit distance between the first sentence and the title of the target text; comparing the string length of the first sentence with the string length of the title, and taking the longer of the two as a first string length; and determining the similarity between the first sentence and the title of the target text according to the ratio of the edit distance to the first string length.
In some embodiments, determining, based on the basic features of the first sentence, whether the first sentence is a core sentence of the target text includes: computing a weighted sum of the term frequency-inverse document frequency, information entropy, repetition rate, and title similarity of the first sentence to determine a score of the first sentence; and, if the score of the first sentence is greater than a first preset threshold, determining that the first sentence is a core sentence of the target text.
In some embodiments, the preset symbol is a newline character.
In a second aspect, the present application provides an apparatus for determining the core sentence of a text. The apparatus includes: an obtaining unit, configured to obtain a target text from a preset text set, where the text set includes multiple texts and each text includes multiple sentences divided by a preset symbol; a computing unit, configured to calculate basic features of a first sentence in the target text, where the basic features include term frequency-inverse document frequency, information entropy, repetition rate, and similarity to the title of the target text, and the first sentence is any sentence in the target text; and a determining unit, configured to determine, based on the basic features of the first sentence, whether the first sentence is a core sentence of the target text.
In some embodiments, the computing unit includes: a word segmentation module, configured to perform word segmentation on the sentences of each text in the text set to obtain the segmented words, where the words obtained by segmenting the first sentence are first words; a term frequency-inverse document frequency computing module, configured to calculate the term frequency-inverse document frequency of each first word and to determine the term frequency-inverse document frequency of the first sentence according to the term frequency-inverse document frequency of each first word; an information entropy computing module, configured to calculate the word frequency of each first word in the target text and to determine the information entropy of each first word according to its word frequency in the target text; a repetition rate computing module, configured to calculate the repetition rate of the first sentence in the target text; and a similarity computing module, configured to calculate the similarity between the first sentence and the title of the target text.
In some embodiments, the apparatus further includes: a cleansing unit, configured to perform data cleansing on each text in the text set to obtain the title and body of each text.
In some embodiments, the term frequency-inverse document frequency computing module is further configured to: obtain the word frequency of each first word in the target text; obtain the inverse document frequency of each first word in the text set; calculate the term frequency-inverse document frequency of each first word from its word frequency and inverse document frequency; and sum the term frequency-inverse document frequencies of the first words to determine the term frequency-inverse document frequency of the first sentence.
In some embodiments, the information entropy computing module is further configured to: obtain the word frequency of each first word in the target text and calculate the information entropy of each first word; and sum the information entropies of the first words to determine the information entropy of the first sentence.
In some embodiments, the similarity computing module is further configured to: calculate the edit distance between the first sentence and the title of the target text; compare the string length of the first sentence with the string length of the title, and take the longer of the two as a first string length; and determine the similarity between the first sentence and the title of the target text according to the ratio of the edit distance to the first string length.
In some embodiments, the determining unit is further configured to: compute a weighted sum of the term frequency-inverse document frequency, information entropy, repetition rate, and title similarity of the first sentence to determine a score of the first sentence; and, if the score of the first sentence is greater than a first preset threshold, determine that the first sentence is a core sentence of the target text.
In some embodiments, the preset symbol is a newline character.
With the method and apparatus for determining the core sentence of a text provided by the embodiments of the present application, a target text is first obtained from a preset text set; the basic features of each first sentence in the target text, namely its term frequency-inverse document frequency, information entropy, repetition rate, and similarity to the title of the target text, are then calculated; and finally, whether the first sentence is a core sentence of the target text is determined based on the values of those basic features. Using the basic features of each first sentence in this way improves the accuracy of determining the core sentence of a text.
Detailed Description of the Embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the related invention and do not limit the invention. It should also be noted that, for ease of description, the drawings show only the parts related to the invention.
It should be noted that, provided there is no conflict, the embodiments of the present application and the features of those embodiments may be combined with one another. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method or the apparatus for determining the core sentence of a text of the present application can be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 serves as the medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 in order to receive or send texts and the like. For example, the user may use the terminal devices 101, 102, 103 to upload texts to the server 105 through the network 104, and may also receive texts sent by the server 105. Various client applications may be installed on the terminal devices 101, 102, 103, such as audio playback software and search applications.
The terminal devices 101, 102, 103 may be various electronic devices that support web page display or audio playback, including but not limited to smartphones, tablet computers, smart home devices, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, and desktop computers. The server 105 may be a server that provides various services, for example a background server that analyzes and processes the texts it obtains.
It should be noted that the method for determining the core sentence of a text provided by the embodiments of the present application is generally executed by the server 105, and accordingly the apparatus for determining the core sentence of a text is generally arranged in the server 105.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, as the implementation requires.
With continued reference to Fig. 2, a flow 200 of one embodiment of the method for determining the core sentence of a text according to the present application is shown. The method for determining the core sentence of a text includes the following steps:
Step 201: obtain a target text from a preset text set.
In this embodiment, the electronic device on which the method for determining the core sentence of a text runs (for example, the server in Fig. 1) may first obtain a preset text set. The electronic device may receive, through a wired or wireless connection, texts uploaded or input by users from their terminal devices, assemble the obtained texts into a text set, and store the text set in local memory or on an external storage device so that the electronic device can later obtain it. Alternatively, the text set may be stored directly on the user's terminal device, and the electronic device may obtain the text set from that terminal device through a wired or wireless connection. Here, the text set may include multiple texts, the target text may be a text in the text set whose core sentence needs to be determined, and each text may be divided into multiple sentences by a preset symbol. For example, each text may be a lyrics text and the preset symbol may be a newline character, so that each lyrics text can be divided into multiple lyric sentences at the newlines. Of course, the preset symbol here is not limited to the newline character; it may also be a space, "/", and so on.
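The division of a text into sentences by a preset symbol can be sketched as follows. This is a minimal illustration rather than the implementation described in the application; the function name split_sentences and the choice to discard empty pieces are assumptions made for the example.

```python
def split_sentences(text: str, delimiter: str = "\n") -> list[str]:
    """Split a text into sentences on a preset delimiter.

    Empty pieces (for example, blank lines between verses) are dropped;
    that choice is an assumption of this sketch.
    """
    return [part.strip() for part in text.split(delimiter) if part.strip()]


# A lyrics-like text whose sentences are separated by newlines.
lyrics = "line one\nline two\n\nline one\n"
sentences = split_sentences(lyrics)

# The same helper works with another preset symbol such as "/".
slash_sentences = split_sentences("a/b/c", delimiter="/")
```

Repeated sentences are kept in the result on purpose, because the repetition rate of a sentence is one of the basic features computed later.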
Step 202: calculate the basic features of the first sentence in the target text.
In this embodiment, the electronic device may divide the target text into multiple sentences using the preset symbol and take any one of them as the first sentence. The electronic device may then calculate the basic features of the first sentence in the target text in order to determine whether the first sentence is a core sentence of the target text. Here, the basic features of the first sentence may include the term frequency-inverse document frequency of the first sentence, the information entropy of the first sentence, the repetition rate of the first sentence, and the similarity between the first sentence and the title of the target text. It can be understood that the electronic device may also select only one or several of these four features as the basic features of the first sentence; such a configuration is equally applicable to the method provided by this embodiment.
The main idea of the term frequency-inverse document frequency method is that if a word or phrase appears with a high frequency (term frequency, TF) in one article and rarely appears in other articles, the word or phrase is considered to have good class discrimination ability and to be suitable for classification. Inverse document frequency (IDF) mainly means that the fewer documents contain a given word or phrase, the larger its IDF, which indicates that the word or phrase has good class discrimination ability. Thus, the term frequency-inverse document frequency (TF-IDF) method can be used to calculate the importance of the first sentence within the target text. The information entropy can be understood in terms of the probability of occurrence of specific information. Here, the information entropy of the first sentence can be understood as using the probability that the first sentence occurs to characterize the importance of the first sentence in the target text. The repetition rate of the first sentence can be understood as the repetition rate of the first sentence in the target text, or alternatively as its repetition rate in the text set. In general, the repetition rate of the first sentence can also characterize, to a certain extent, the importance of the first sentence in the target text. The similarity between the first sentence and the title of the target text characterizes how similar the two are. The title of a text usually occupies an important position in the text, so the level of similarity between the first sentence and the title can also characterize the importance of the first sentence in the target text. The electronic device may calculate the similarity between the first sentence and the title of the target text by various means, such as the cosine similarity formula.
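Cosine similarity, mentioned above as one possible way to compare the first sentence with the title, can be sketched over bag-of-words vectors. The application does not say how the vectors are built, so representing each side by raw word counts is an assumption of this example, as is the function name.

```python
import math
from collections import Counter


def cosine_similarity(words_a: list[str], words_b: list[str]) -> float:
    """Cosine similarity between two word lists, using raw word counts."""
    va, vb = Counter(words_a), Counter(words_b)
    # Dot product over the words the two lists share.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty side has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)
```

The inputs are already-segmented word lists, so a word segmentation step is assumed to have run before this function is called.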
Step 203: determine, based on the basic features of the first sentence, whether the first sentence is a core sentence of the target text.
In this embodiment, based on the basic features of the first sentence calculated in step 202, the electronic device may process those basic features by various means in order to determine whether the first sentence is a core sentence of the target text in which it appears. The electronic device may calculate the basic features of every sentence in the target text, and may therefore use the basic features of each sentence to determine at least one core sentence from the target text.
For example, the electronic device may set a threshold in advance and then calculate the sum and/or the average of the basic features of the first sentence. The electronic device may then compare the sum and/or average with the preset threshold, and finally take a first sentence whose sum and/or average exceeds the threshold as a core sentence of the target text. It can be understood that the electronic device may also process the basic features of the first sentence by other means; the example above is merely illustrative.
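The threshold comparison in this example can be sketched in a few lines. The function name, the mode parameter, and the sample feature values are hypothetical; the application leaves open how the features are scaled and what threshold is chosen.

```python
from statistics import mean


def is_core_sentence(features: list[float], threshold: float, mode: str = "sum") -> bool:
    """Compare the sum or the average of a sentence's basic features with a threshold."""
    value = sum(features) if mode == "sum" else mean(features)
    return value > threshold


# Hypothetical feature values: TF-IDF, information entropy, repetition rate, title similarity.
example_features = [0.2, 0.3, 0.5, 0.4]
```

With these values the sum is about 1.4 and the average about 0.35, so the same threshold can give different answers depending on the mode; in practice the threshold would be chosen together with the mode.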
As an example, the method for determining the core sentence of a text provided by this embodiment can be used to determine the core sentences in a lyrics text, so that an audio playback application can use the core sentence of a song to search for the corresponding song. The specific steps of this example are as follows: first, the electronic device may obtain, from a preset lyrics text set, the lyrics text whose core sentence needs to be determined as the target text; next, the lyrics text serving as the target text may be divided into multiple lyric sentences; then, each lyric sentence may be taken as the first sentence and its basic features calculated; finally, the core sentence of the lyrics text serving as the target text may be determined according to the basic features of each lyric sentence. It can be understood that, in response to a user entering a core sentence of a lyrics text to search for a song, the server may send the song containing that core sentence to the user so that the user can play the song on the terminal device.
With the method for determining the core sentence of a text provided by the above embodiment of the present application, a target text is first obtained from a preset text set; the basic features of each first sentence in the target text, namely its term frequency-inverse document frequency, information entropy, repetition rate, and similarity to the title of the target text, are then calculated; and finally, whether the first sentence is a core sentence of the target text is determined based on the values of those basic features. Using the basic features of each first sentence in this way improves the accuracy of determining the core sentence of a text.
Referring next to Fig. 3, a flow 300 of another embodiment of the method for determining the core sentence of a text is shown. The specific steps of the method of this embodiment may include:
Step 301: obtain a target text from a preset text set.
In this embodiment, the electronic device on which the method for determining the core sentence of a text runs (for example, the server in Fig. 1) may obtain a preset text set, which may consist of multiple texts. The method of this embodiment can be used to determine the core sentence of each text in the text set. The electronic device may obtain a target text from the text set; the target text may be a text in the text set whose core sentence needs to be determined. It should be noted that each text in the text set may include multiple sentences, with a preset symbol between different sentences. The electronic device can therefore use the preset symbol to divide each text into multiple sentences.
In some optional implementations of this embodiment, the preset symbol that separates different sentences in each text may be a newline character. Each text in the text set may thus be a text with a special format, such as a lyrics text or the Book of Songs. In general, sentences in texts such as lyrics or the Book of Songs are not divided by punctuation marks; when one sentence in such a text ends, a newline is used and the next sentence is shown on a new line. The newline character can therefore be used to divide the lyrics, Book of Songs, and similar texts in the text set into sentences. It can be understood that lyrics texts and Book of Songs texts are only examples of the form that the texts in the text set may take, and do not limit the form of each text in the text set.
In some optional implementations of this embodiment, after obtaining the text set, the electronic device may also perform data cleansing on each text in the text set and delete the dirty data in each text. Here, dirty data can be regarded as content in a text that cannot be a core sentence of that text. For example, if each text in the text set is a lyrics text, the lyricist, composer, and arranger credits and personal names in the lyrics text cannot be core sentences of that lyrics text, so they can be deleted as dirty data, thereby cleansing the lyrics text. Optionally, in individual versions of the texts in the text set, repeated sentences may have been omitted; in that case, the sentences of the text should be filled in according to the order of the sentences in the text.
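One way to picture the data cleansing step for a lyrics text is shown below. It is only a sketch under two assumptions that the application does not state: that the first non-empty line of a raw text is its title, and that credit lines begin with one of a few hypothetical labels such as "lyrics:" or "music:".

```python
# Hypothetical labels that mark credit lines; real data would need its own list.
METADATA_PREFIXES = ("lyrics:", "music:", "composer:", "arranger:")


def clean_lyrics(raw: str) -> tuple[str, list[str]]:
    """Return (title, body sentences) with credit lines removed from the body."""
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    if not lines:
        return "", []
    title, body = lines[0], lines[1:]  # assumption: first non-empty line is the title
    body = [line for line in body if not line.lower().startswith(METADATA_PREFIXES)]
    return title, body


raw_text = "My Song\nLyrics: A. Writer\nMusic: B. Composer\nfirst line\nsecond line\n"
title, body = clean_lyrics(raw_text)
```

Filling in repeated sentences that a particular version of a text omits is not covered here, since it depends on how the omission is marked in the source data.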
Step 302: perform word segmentation on the sentences of each text in the text set to obtain the segmented words.
In this embodiment, the electronic device may divide each text in the text set into multiple sentences using the preset symbol. Then, for each text in the text set, it may segment the sentences in the text by various means and obtain the words of each sentence after segmentation. As an example, the electronic device may use the full segmentation method to segment the sentences in each text. It can be understood that the target text obtained by the electronic device can likewise be divided into multiple words. It should be noted that, once the first sentence in the target text has been determined, the words obtained by segmenting the first sentence can be obtained, and these words are the first words.
In some optional implementations of this embodiment, the full segmentation method first cuts out all possible words that match a language dictionary and then uses a statistical language model to determine the optimal segmentation result. Taking the sentence "no matter the ends of the earth and cape" from the lyrics text of the song 《Unforgettable Tonight》 as an example, language dictionary matching is performed first to find all matching words: "no matter", "the ends of the earth", "with cape", "by day", and "the ends of the earth and cape". These words are represented in the form of a word lattice; a path search is then performed on the word lattice, and the optimal path is found based on a statistical language model (such as an N-gram model). If the result shows that the segmentation "no matter / the ends of the earth / and / cape" has the highest language model score, then that segmentation is the optimal segmentation of "no matter the ends of the earth and cape".
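The full segmentation idea can be illustrated with a toy sketch: enumerate every way of cutting a string into dictionary words (the paths through the word lattice), then pick the path with the best language model score. A unigram model stands in for the N-gram model here as a simplification, and the dictionary, the probabilities, the fallback value for unseen words, and the function names are all made up for the example.

```python
import math
from functools import lru_cache


def all_segmentations(sentence: str, dictionary: set[str]) -> list[list[str]]:
    """Enumerate every segmentation of `sentence` into dictionary words."""

    @lru_cache(maxsize=None)
    def from_pos(i: int) -> tuple[tuple[str, ...], ...]:
        if i == len(sentence):
            return ((),)  # one way to segment the empty remainder
        results = []
        for j in range(i + 1, len(sentence) + 1):
            word = sentence[i:j]
            if word in dictionary:
                for rest in from_pos(j):
                    results.append((word,) + rest)
        return tuple(results)

    return [list(seg) for seg in from_pos(0)]


def best_segmentation(sentence: str, dictionary: set[str],
                      unigram_prob: dict[str, float]) -> list[str]:
    """Pick the segmentation with the highest unigram log-probability."""
    candidates = all_segmentations(sentence, dictionary)
    if not candidates:
        return [sentence]  # no dictionary path; fall back to the whole string

    def score(seg: list[str]) -> float:
        # 1e-8 is an arbitrary floor for words missing from the model.
        return sum(math.log(unigram_prob.get(w, 1e-8)) for w in seg)

    return max(candidates, key=score)


toy_dictionary = {"a", "b", "c", "d", "ab", "cd", "abc"}
toy_probs = {"ab": 0.3, "cd": 0.3, "a": 0.1, "b": 0.1, "c": 0.1, "d": 0.05, "abc": 0.05}
best = best_segmentation("abcd", toy_dictionary, toy_probs)
```

Exhaustive enumeration grows quickly with sentence length, so a production segmenter would normally search the lattice with dynamic programming instead; the sketch keeps the two stages separate only to mirror the description.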
Step 303: calculate the term frequency-inverse document frequency of each first word, and determine the term frequency-inverse document frequency of the first sentence according to the term frequency-inverse document frequency of each first word.
In this embodiment, based on the first words included in the first sentence and the words included in the text set, both obtained in step 302, the electronic device may calculate the term frequency-inverse document frequency of each first word, and may then determine the term frequency-inverse document frequency of the first sentence according to the term frequency-inverse document frequency of each first word in the first sentence.
In some optional implementations of this embodiment, the electronic device may first calculate the word frequency TF of any first word in the target text. It may then calculate the inverse document frequency IDF of that first word in the text set. Next, the term frequency-inverse document frequency TF-IDF of that first word is calculated from its word frequency TF and inverse document frequency IDF. In this way, the electronic device can calculate the TF-IDF of each first word in the first sentence. Finally, the electronic device may sum the TF-IDF values of the first words in the first sentence to determine the term frequency-inverse document frequency TF-IDF of the first sentence.
As an example, the target text obtained by the electronic device from the text set may be the lyrics text of 《Unforgettable Tonight》. The text set may include m1 texts, and the number of words obtained by segmenting the target text is m2. Suppose it is to be judged whether "no matter the ends of the earth and cape" in the lyrics text is a core sentence of the lyrics text. Then "no matter the ends of the earth and cape" is the first sentence of the target text, and the first words included in the first sentence are "no matter", "ends of the earth", "and", and "cape", where "no matter" occurs n1 times in the lyrics text and appears in n2 texts in the text set. According to the TF-IDF formula, the term frequency-inverse document frequency of the first word "no matter" can be computed as TF-IDF = TF × IDF, for example with TF = n1 / m2 and IDF = log(m1 / n2). The TF-IDF of "ends of the earth", "and", and "cape" can be calculated in the same way, and the sum of the TF-IDF values of "no matter", "ends of the earth", "and", and "cape" is the term frequency-inverse document frequency TF-IDF' of the first sentence "no matter the ends of the earth and cape". The electronic device can use this method to calculate the TF-IDF' of each sentence in the target text.
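The sentence-level TF-IDF can be sketched as below, using the quantities named in the example: m1 texts in the text set, m2 words in the target text, n1 occurrences of a word in the target text, and n2 texts containing the word. The exact formula is not legible in this text, so the common form TF = n1 / m2 and IDF = log(m1 / n2), with no smoothing, is an assumption of the sketch, as is the function name.

```python
import math
from collections import Counter


def sentence_tfidf(sentence_words: list[str], target_words: list[str],
                   corpus: list[list[str]]) -> float:
    """Sum of TF-IDF over the words of one sentence of the target text.

    `corpus` holds the segmented words of every text in the text set,
    including the target text itself.
    """
    counts = Counter(target_words)
    m1 = len(corpus)        # number of texts in the text set
    m2 = len(target_words)  # number of words in the target text
    doc_sets = [set(doc) for doc in corpus]
    total = 0.0
    for word in sentence_words:
        n1 = counts[word]                                # occurrences in the target text
        n2 = sum(1 for doc in doc_sets if word in doc)   # texts containing the word
        if n1 == 0 or n2 == 0:
            continue  # word absent from the target text or the text set
        total += (n1 / m2) * math.log(m1 / n2)
    return total


toy_corpus = [["a", "b", "a"], ["b", "c"], ["c", "d"]]
toy_score = sentence_tfidf(["a", "b"], toy_corpus[0], toy_corpus)
```

Because the per-word values are summed, longer sentences tend to score higher; whether to normalize by sentence length is a design decision the description leaves open.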
Step 304: calculate the word frequency of each first word in the target text, and determine the information entropy of each first word according to its word frequency in the target text.
In this embodiment, based on the first words included in each first sentence and the words included in the target text, both obtained in step 302, the electronic device may calculate the word frequency of each first word in the target text, and may then determine the information entropy of the first sentence according to the word frequency of each first word in the first sentence.
In some optional implementations of this embodiment, the electronic device may first calculate the word frequency TF of any first word in the target text, and then calculate the information entropy H of that first word. In this way, the electronic device can calculate the information entropy H of each first word in the first sentence. Finally, the electronic device may sum the information entropies H of the first words in the first sentence to determine the information entropy H' of the first sentence.
Continuing with the example in step 303, the electronic device may calculate the word frequency of the first word "no matter", TF = n1 / m2, and then, according to the information entropy formula, the information entropy of this first word, for example H = -TF × log(TF). The information entropy H of "ends of the earth", "and", and "cape" can be calculated in the same way, and the sum of the information entropies H of "no matter", "ends of the earth", "and", and "cape" is the information entropy H' of the first sentence "no matter the ends of the earth and cape". The electronic device can use this method to calculate the information entropy H' of each sentence in the target text.
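The information entropy of a sentence, as a sum over its words, can be sketched as follows. The per-word formula is not legible in this text, so the usual term H = -TF × log(TF), with TF taken as the word's relative frequency in the target text and the natural logarithm, is an assumption of the example.

```python
import math
from collections import Counter


def sentence_entropy(sentence_words: list[str], target_words: list[str]) -> float:
    """Sum of -p * log(p) over the words of a sentence, p being word frequency in the target text."""
    counts = Counter(target_words)
    m2 = len(target_words)
    total = 0.0
    for word in sentence_words:
        p = counts[word] / m2 if m2 else 0.0
        if p > 0:
            total += -p * math.log(p)
    return total


# "a" has frequency 0.5 and "b" has frequency 0.25 in this toy target text.
toy_entropy = sentence_entropy(["a", "b"], ["a", "b", "a", "c"])
```

In the toy case each of the two words contributes 0.5 × ln 2, so the sentence entropy equals ln 2.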
Step 305: calculate the repetition rate of the first sentence in the target text.
In this embodiment, the electronic device may first count the total number a of sentences in the target text, then determine the number b of times the first sentence occurs in the target text, and finally calculate the repetition rate q of the first sentence in the target text. The repetition rate q may be the ratio of the number b of occurrences of the first sentence to the total number a of sentences in the target text, that is, q = b / a.
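The repetition rate described in this step, the number of occurrences b of the first sentence divided by the total number of sentences a, is simple enough to sketch directly. Exact string matching between sentences and the function name are assumptions of the example.

```python
def repetition_rate(sentence: str, sentences: list[str]) -> float:
    """Ratio of occurrences of `sentence` (b) to the total number of sentences (a)."""
    a = len(sentences)
    if a == 0:
        return 0.0
    b = sentences.count(sentence)  # exact-match count
    return b / a


chorus_rate = repetition_rate("x", ["x", "y", "x", "z"])
```

For lyrics, a chorus line that appears in two of four lines would get a repetition rate of 0.5 under this definition.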
Step 306: calculate the similarity between the first sentence and the title of the target text.
In this embodiment, the electronic device may first extract the body and the title of the target text from the target text. It may then use various means to calculate the similarity between the first sentence of the target text and the title of the target text.
In some optional implementations of this embodiment, the electronic device may first calculate the edit distance between the first sentence and the title of the target text. The edit distance, also known as the Levenshtein distance, is the minimum number of edit operations needed to change one string into another. To calculate the edit distance between the first sentence and the title, the character string of the first sentence and the character string of the title are obtained first, and the edit distance D between the two strings is then calculated. Next, the electronic device may obtain the length K1 of the character string of the first sentence and the length K2 of the character string of the title, compare K1 with K2, and take the length of the longer string as the first string length K, that is, K = max(K1, K2). Finally, the electronic device may determine the similarity p between the first sentence and the title of the target text according to the ratio of the edit distance D to the first string length K. Optionally, the similarity p = 1 - D / K. The electronic device can use this method to calculate the similarity p between each sentence in the target text and the title.
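The edit-distance similarity can be sketched with the standard dynamic-programming form of the Levenshtein distance. Reading the similarity as one minus the ratio of the edit distance D to the longer string length K is this sketch's interpretation of the description, and the function names are illustrative.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def title_similarity(sentence: str, title: str) -> float:
    """Similarity p = 1 - D / K, where K is the longer of the two string lengths."""
    k = max(len(sentence), len(title))
    if k == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - edit_distance(sentence, title) / k
```

Since D never exceeds K, the similarity stays between 0 and 1, with 1 meaning the sentence and the title are identical.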
Step 307, compute a weighted sum of the term frequency-inverse document frequency, the information entropy, the repetition rate, and the similarity to the title of the target text of the first sentence, and determine the score of the first sentence.
In the present embodiment, steps 303, 304, 305 and 306 respectively determine the term frequency-inverse document frequency of the first sentence, the information entropy of the first sentence, the repetition rate of the first sentence, and the similarity between the first sentence and the title of the target text. The electronic device may assign a weight to each of these four features and then sum the weighted features; the resulting sum may be taken as the score of the first sentence. It can be understood that the electronic device assigns the weights according to how important each of the four features is to the target text.
In some optional implementations of the present embodiment, the weights that the electronic device assigns to the term frequency-inverse document frequency TF-IDF' of the first sentence, the information entropy H' of the first sentence, the repetition rate q of the first sentence, and the similarity p between the first sentence and the title of the target text are x1, x2, x3 and x4, respectively. The score of the first sentence is therefore S = x1 × TF-IDF' + x2 × H' + x3 × q + x4 × p. The electronic device may use this scoring formula to score each sentence in the target text.
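The scoring formula can be written as a short Python sketch. The function name `sentence_score` and the equal default weights are assumptions for illustration only; the text does not give concrete weight values.

```python
def sentence_score(tfidf: float, entropy: float, q: float, p: float,
                   weights: tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25)) -> float:
    """Return S = x1*TF-IDF' + x2*H' + x3*q + x4*p for one sentence.

    The default weights are placeholders, not values taken from the text.
    """
    x1, x2, x3, x4 = weights
    return x1 * tfidf + x2 * entropy + x3 * q + x4 * p
```

Keeping the weights as a parameter lets the same function be reused when the relative importance of the four features is tuned for a different text set.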
Step 308, in response to the score of the first sentence being greater than a first preset threshold, determine that the first sentence is a core sentence of the target text.
In the present embodiment, the electronic device may compare the score of the first sentence determined in step 307 with the first preset threshold. If the score of the first sentence is greater than the first preset threshold, the first sentence can be determined to be a core sentence of the target text in which it appears. If the score of the first sentence is less than or equal to the first preset threshold, the first sentence can be determined not to be a core sentence of the target text.
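The threshold comparison can be sketched as follows. The names `is_core_sentence` and `core_sentences`, and the representation of scored sentences as a dictionary, are assumptions made for illustration.

```python
def is_core_sentence(score: float, first_threshold: float) -> bool:
    """A sentence is a core sentence only if its score is strictly greater than the threshold."""
    return score > first_threshold


def core_sentences(scores: dict[str, float], first_threshold: float) -> list[str]:
    """Return, in insertion order, the sentences whose score exceeds the threshold."""
    return [s for s, value in scores.items() if is_core_sentence(value, first_threshold)]
```

Note that a score exactly equal to the threshold does not qualify, matching the "less than or equal to" branch described above.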
As can be seen from Fig. 3, compared with the embodiment corresponding to Fig. 2, the flow 300 of the method for determining a text core sentence in the present embodiment highlights the steps of calculating the basic features of the first sentence and the step of scoring the first sentence. The scheme described in the present embodiment can therefore determine the score of the first sentence more accurately, which improves the accuracy of determining core sentences for the target text.
With further reference to Fig. 4, as an implementation of the methods shown in the above figures, the present application provides an embodiment of a device for determining a text core sentence. This device embodiment corresponds to the method embodiment shown in Fig. 2, and the device can be applied in various electronic devices.
As shown in Fig. 4, the device 400 for determining a text core sentence of the present embodiment includes an acquiring unit 401, a computing unit 402 and a determining unit 403. The acquiring unit 401 is configured to obtain a target text from a preset text set, where the text set includes multiple texts and each text includes multiple sentences divided by preset symbols. The computing unit 402 is configured to calculate the basic features of a first sentence in the target text, where the basic features include the term frequency-inverse document frequency, the information entropy, the repetition rate, and the similarity to the title of the target text, and the first sentence is any sentence in the target text. The determining unit 403 is configured to determine, based on the basic features of the first sentence, whether the first sentence is a core sentence of the target text.
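The division of the device into three units can be sketched in Python as one class with one method per unit. Everything named here is hypothetical: the class `CoreSentenceDevice`, the caller-supplied `score_fn` that stands in for the feature calculation, and the default set of sentence delimiters are assumptions for illustration, not part of the described device.

```python
import re
from typing import Callable


class CoreSentenceDevice:
    """Illustrative stand-in for device 400; one method per unit."""

    def __init__(self, score_fn: Callable[[str, str], float], threshold: float,
                 delimiters: str = "。！？.!?"):
        self.score_fn = score_fn    # hypothetical: maps (sentence, text) to a score
        self.threshold = threshold  # first preset threshold
        self._splitter = re.compile("[" + re.escape(delimiters) + "]")

    def acquire(self, text_set: list[str], index: int) -> str:
        """Acquiring unit: obtain a target text from the preset text set."""
        return text_set[index]

    def compute(self, text: str) -> list[tuple[str, float]]:
        """Computing unit: split the text on the preset symbols and score each sentence."""
        sentences = [s.strip() for s in self._splitter.split(text) if s.strip()]
        return [(s, self.score_fn(s, text)) for s in sentences]

    def determine(self, scored: list[tuple[str, float]]) -> list[str]:
        """Determining unit: keep sentences whose score exceeds the threshold."""
        return [s for s, value in scored if value > self.threshold]
```

Passing the scoring function in from outside keeps the sketch independent of how the four basic features are actually computed and weighted.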
In the present embodiment, the computing unit 402 of the device 400 for determining a text core sentence may include: a word segmentation module (not shown), configured to segment the sentences of each text in the text set to obtain the words after segmentation, where the words obtained by segmenting the first sentence are first words; a term frequency-inverse document frequency computing module (not shown), configured to calculate the term frequency-inverse document frequency of each first word, and determine the term frequency-inverse document frequency of the first sentence according to the term frequency-inverse document frequency of each first word; an information entropy computing module (not shown), configured to calculate the word frequency of each first word in the target text, and determine the information entropy of each first word according to that word frequency; a repetition rate computing module (not shown), configured to calculate the repetition rate of the first sentence in the target text; and a similarity computing module (not shown), configured to calculate the similarity between the first sentence and the title of the target text.
In the present embodiment, the device 400 for determining a text core sentence may also include a cleaning unit (not shown), configured to perform data cleansing on each text in the text set to obtain the title and the body text of each text.
In the present embodiment, the term frequency-inverse document frequency computing module is further configured to: obtain the word frequency of each first word in the target text; obtain the inverse document frequency of each first word in the text set; calculate the term frequency-inverse document frequency of each first word from its word frequency and inverse document frequency; and sum the term frequency-inverse document frequencies of the first words to determine the term frequency-inverse document frequency of the first sentence.
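The four operations of this module can be sketched in Python as follows. The function name `sentence_tfidf`, the use of relative frequency as the word frequency, and the inverse document frequency form log(N / df) are assumptions for illustration; this passage does not fix the exact formulas.

```python
import math
from collections import Counter


def sentence_tfidf(first_words: list[str], target_words: list[str],
                   text_set_words: list[list[str]]) -> float:
    """Sum TF-IDF over the first words of one sentence.

    first_words:    words of the first sentence after segmentation
    target_words:   all words of the target text after segmentation
    text_set_words: segmented words of every text in the text set
    """
    counts = Counter(target_words)
    total = len(target_words)
    n_texts = len(text_set_words)
    vocabularies = [set(words) for words in text_set_words]

    score = 0.0
    for word in first_words:
        tf = counts[word] / total if total else 0.0      # assumed: relative frequency
        df = sum(1 for vocab in vocabularies if word in vocab)
        idf = math.log(n_texts / df) if df else 0.0      # assumed: log(N / df)
        score += tf * idf
    return score
```

A word that appears in every text of the set contributes zero under this assumed form, so the sum is dominated by words that are frequent in the target text but rare elsewhere.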
In the present embodiment, the information entropy computing module is further configured to: obtain the word frequency of each first word in the target text and calculate the information entropy of each first word; and sum the information entropies of the first words to determine the information entropy of the first sentence.
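A minimal Python sketch of this module follows. The per-word entropy term -f × log2(f), where f is the relative frequency of the word in the target text, is an assumption, since this passage does not state the entropy formula; the function name `sentence_entropy` is likewise illustrative.

```python
import math
from collections import Counter


def sentence_entropy(first_words: list[str], target_words: list[str]) -> float:
    """Sum an assumed per-word entropy term -f * log2(f) over the first words."""
    counts = Counter(target_words)
    total = len(target_words)
    entropy = 0.0
    for word in first_words:
        f = counts[word] / total if total else 0.0  # word frequency in the target text
        if f > 0:
            entropy += -f * math.log2(f)            # words absent from the text add nothing
    return entropy
```

For a four-word target text "a a b c" and a first sentence containing "a" and "b", the two terms are 0.5 and 0.5, giving a sentence entropy of 1.0 under this assumption.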
In the present embodiment, the similarity computing module is further configured to: calculate the edit distance between the first sentence and the title of the target text; compare the string length of the first sentence with the string length of the title, and take the longer string length as the first string length; and determine the similarity between the first sentence and the title of the target text according to the ratio of the edit distance to the first string length.
In the present embodiment, the determining unit 403 is further configured to: compute a weighted sum of the term frequency-inverse document frequency, the information entropy, the repetition rate, and the similarity to the title of the target text of the first sentence to determine the score of the first sentence; and, in response to the score of the first sentence being greater than the first preset threshold, determine that the first sentence is a core sentence of the target text.
Referring now to Fig. 5, it shows a schematic structural diagram of a computer system 500 suitable for implementing a terminal device/server of the embodiments of the present application. The terminal device/server shown in Fig. 5 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in Fig. 5, the computer system 500 includes a central processing unit (CPU) 501, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage portion 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the system 500. The CPU 501, the ROM 502 and the RAM 503 are connected to one another through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse and the like; an output portion 507 including a cathode ray tube (CRT), a liquid crystal display (LCD) and the like, as well as a speaker and the like; a storage portion 508 including a hard disk and the like; and a communication portion 509 including a network interface card such as a LAN card or a modem. The communication portion 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disc, a magneto-optical disk or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read from it can be installed into the storage portion 508 as needed.
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flow chart may be implemented as a computer software program. For example, the embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 509, and/or installed from the removable medium 511. When the computer program is executed by the central processing unit (CPU) 501, the above-described functions defined in the method of the present application are performed. It should be noted that the computer-readable medium described in the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, the computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in connection with an instruction execution system, apparatus or device. Also in the present application, the computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and that computer-readable medium can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: wireless, wire, optical cable, RF and the like, or any suitable combination of the above.
The flow charts and block diagrams in the accompanying drawings illustrate the architectures, functions and operations that may be implemented by the systems, methods and computer program products according to the various embodiments of the present application. In this regard, each block in a flow chart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flow charts, and any combination of blocks in the block diagrams and/or flow charts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor; for example, this may be described as: a processor includes an acquiring unit, a computing unit and a determining unit. The names of these units do not, in some cases, limit the units themselves; for example, the acquiring unit may also be described as "a unit that obtains a target text from a preset text set".
In another aspect, the present application also provides a computer-readable medium. The computer-readable medium may be included in the device described in the above embodiments, or it may exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to: obtain a target text from a preset text set, where the text set includes multiple texts and each text includes multiple sentences divided by preset symbols; calculate the basic features of a first sentence in the target text, where the basic features include the term frequency-inverse document frequency, the information entropy, the repetition rate, and the similarity to the title of the target text, and the first sentence is any sentence in the target text; and determine, based on the basic features of the first sentence, whether the first sentence is a core sentence of the target text.
The above description is only a preferred embodiment of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present application.