CN106991181A - Method and device for extracting colloquial sentences - Google Patents

Method and device for extracting colloquial sentences

Info

Publication number
CN106991181A
Authority
CN
China
Prior art keywords
corpus
word
mixing
spoken
film
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710225009.6A
Other languages
Chinese (zh)
Other versions
CN106991181B (en
Inventor
李贤�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd filed Critical Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority to CN201710225009.6A priority Critical patent/CN106991181B/en
Publication of CN106991181A publication Critical patent/CN106991181A/en
Application granted granted Critical
Publication of CN106991181B publication Critical patent/CN106991181B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model

Abstract

An embodiment of the invention discloses a method and device for extracting colloquial sentences. The method comprises: counting the word frequencies of the words in a film corpus and in a mixed corpus separately, and ranking the words of each corpus by word frequency; calculating, from each word's frequency and rank, the difference degree of the word between the film corpus and the mixed corpus, and determining a spoken corpus according to the difference degree; and extracting the colloquial sentences in the mixed corpus based on the spoken corpus. By determining the spoken corpus from word frequency and rank information gathered from the two corpora, and then using that spoken corpus to extract colloquial sentences from the mixed corpus, the embodiment avoids the time and effort that user-defined spoken corpora require in the prior art, improves the efficiency of colloquial sentence extraction, and helps complete the overall corpus system.

Description

Method and device for extracting colloquial sentences
Technical field
Embodiments of the present invention relate to the field of information technology, and in particular to a method and device for extracting colloquial sentences.
Background
With the development of science and technology, the large storage capacity of computers has been applied to the storage of language, and corpora have developed as a result.
A spoken corpus is likewise a basic resource that carries linguistic knowledge on electronic computers. A complete spoken corpus can be used for language model construction, lexicography, text classification and so on. In the prior art, however, resources based on spoken corpora are scarce, and where they exist, users have built them by picking out colloquial sentences word by word.
Building a user-defined spoken corpus in this way is time-consuming and laborious, is subject to personal bias and lacks authority. The resulting lack of a systematic spoken corpus hinders completion of the overall corpus system.
Summary of the invention
Embodiments of the present invention provide a method and device for extracting colloquial sentences that avoid the time-consuming and laborious user-defined spoken corpus, so as to improve the efficiency and reliability of colloquial sentence extraction.
In a first aspect, an embodiment of the present invention provides a method for extracting colloquial sentences, comprising:
counting the word frequencies of the words in a film corpus and in a mixed corpus separately, and ranking the words in the film corpus and in the mixed corpus by word frequency;
calculating, from the word frequency and rank information of a word, the difference degree of the word between the film corpus and the mixed corpus, and determining a spoken corpus according to the difference degree; and
extracting the colloquial sentences in the mixed corpus based on the spoken corpus.
In a second aspect, an embodiment of the present invention further provides a device for extracting colloquial sentences, comprising:
a word frequency statistics module, configured to count the word frequencies of the words in a film corpus and in a mixed corpus separately, and to rank the words in the film corpus and in the mixed corpus by word frequency;
a spoken corpus determination module, configured to calculate, from the word frequency and rank information of a word, the difference degree of the word between the film corpus and the mixed corpus, and to determine a spoken corpus according to the difference degree; and
a colloquial sentence extraction module, configured to extract the colloquial sentences in the mixed corpus based on the spoken corpus.
Embodiments of the present invention provide a method and device for extracting colloquial sentences. The spoken corpus is determined from the word frequency and rank information of the words in the film corpus and in the mixed corpus, and the spoken corpus is then used to extract the colloquial sentences in the mixed corpus. This avoids the time and effort that user-defined spoken corpora require in the prior art, improves the efficiency of colloquial sentence extraction, and helps complete the overall corpus system.
Brief description of the drawings
Fig. 1A is a flow chart of colloquial sentence extraction in Embodiment one of the present invention;
Fig. 1B is a schematic diagram of a colloquial sentence extraction process in Embodiment one of the present invention;
Fig. 2A is a flow chart of colloquial sentence extraction in Embodiment two of the present invention;
Fig. 2B is a flow chart of colloquial sentence extraction in Embodiment two of the present invention;
Fig. 3 is a structural diagram of a colloquial sentence extraction device in Embodiment three of the present invention;
Fig. 4 is a structural diagram of a colloquial sentence extraction device in Embodiment four of the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.
Embodiment one
Fig. 1A is a flow chart of a colloquial sentence extraction method provided by Embodiment one of the present invention. This embodiment is applicable to a variety of colloquial sentence extraction scenarios. The method may be performed by the colloquial sentence extraction device provided by the embodiments of the invention; the device may be implemented in software and/or hardware and may be integrated into any equipment that provides a colloquial sentence extraction function, for example a computer. As shown in Fig. 1A, the method specifically includes:
S110: count the word frequencies of the words in the film corpus and in the mixed corpus separately, and rank the words in each corpus by word frequency.
Specifically, the film corpus and the mixed corpus are obtained from the Internet. The film corpus comes from dialogue in films and may specifically consist of subtitle files. Because what these record is conversation between people, most of the film corpus can be regarded as spoken language material. The film corpus contains not only everyday dialogue but also timestamps and speaker names, so it must first be processed so that only the dialogue content is kept. The mixed corpus is a corpus that contains both written language and colloquial expressions. Word frequency is the number of times a given word occurs in a document; the word frequencies of the words in the film corpus and in the mixed corpus are counted separately.
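The cleaning step described above, keeping only the dialogue, might be sketched as follows. This is a minimal illustration under assumptions the source does not state: the subtitles are taken to be in an SRT-like layout (a cue number, a time-range line, then dialogue), speaker names are taken to appear as a short "NAME:" prefix, and the names `clean_subtitles`, `TIMESTAMP` and `SPEAKER` are hypothetical.

```python
import re

# Assumed SRT-style time range, e.g. "00:00:01,000 --> 00:00:02,500".
TIMESTAMP = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)
# Assumed speaker prefix: a short name without spaces, then a colon.
SPEAKER = re.compile(r"^[^\s:：]{1,20}[:：]\s*")


def clean_subtitles(text):
    """Return only the dialogue lines of an SRT-like subtitle text."""
    dialogue = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue  # skip blank lines, cue numbers and time ranges
        line = SPEAKER.sub("", line)  # drop a leading speaker name, if any
        if line:
            dialogue.append(line)
    return dialogue
```

The speaker pattern is deliberately crude: a dialogue line that itself begins with a short token and a colon would lose that token, so a real subtitle source would need a pattern fitted to its own format.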
First, the downloaded film corpus and mixed corpus are stored in separate documents, which may be in Word format or in txt format. Then a word segmentation tool and a dictionary are used to segment the sentences in the film corpus document and the mixed corpus document, that is, all the words contained in each sentence are separated out and saved as a document in txt format. As shown in Table one and Table two, Table one is part of the segmented film corpus and Table two is part of the segmented mixed corpus.
Table one
Before six months
It is daylong in court's consumption to feel
This is to catch outman
One of with the benefit of genetic variation people
The chance that success is prosecuted is few
Car door has been opened
What
Car door is not locked
That's queer indeed
I determines that I locks
It must be clever strange happening part
Table two
Finally, the word frequencies of the segmented words in the film corpus and in the mixed corpus are counted separately, the words are ranked from highest to lowest word frequency, and the result is saved as a document in Excel format. Table three shows part of the word frequency ranking information used for colloquial sentence extraction. The higher a word's frequency, the more times the word occurs in the document. For example, if a word occurs more often than any other word in the document, its word frequency is ranked first.
Table three
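The counting and ranking of step S110 can be illustrated with a short sketch. It assumes the corpus has already been segmented into lists of words; the function name `rank_words` and the shape of the returned table are hypothetical choices made for illustration, and the source itself saves the ranking to an Excel document instead.

```python
from collections import Counter


def rank_words(segmented_sentences):
    """Count word frequencies and rank words from most to least frequent.

    segmented_sentences: iterable of lists of words (already segmented).
    Returns {word: (sequence_number, count, percentage)}, numbering from 1.
    """
    counts = Counter(
        word for sentence in segmented_sentences for word in sentence
    )
    total = sum(counts.values())
    table = {}
    # most_common() sorts by descending count; ties keep first-seen order.
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        table[word] = (rank, count, 100.0 * count / total)
    return table
```

The percentage column is the share of all word occurrences in the corpus, which is the quantity the difference degree calculation uses as the word frequency percentage.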
S120: calculate, from the word frequency and rank information of a word, the difference degree of the word between the film corpus and the mixed corpus, and determine the spoken corpus according to the difference degree.
Specifically, multiple candidate words whose word frequency rank falls within a preset range are obtained from the film corpus and the mixed corpus. The preset range may be a value set dynamically by the user, such as the top 20%, 30% or 40% of the ranking. The candidate words within the preset range are selected, and the difference degree of each candidate word between the film corpus and the mixed corpus is calculated. More specifically, the words whose rank falls within the preset range may be extracted from the film corpus and from the mixed corpus and the intersection of the two sets taken as the candidate words; alternatively, the words that the two extracted sets have in common may be taken as the candidate words.
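Candidate selection can be sketched as below, assuming each corpus is represented by a non-empty mapping from word to sequence number and that the preset range is expressed as a fraction of the ranking. The names `top_ranked` and `candidate_words` are hypothetical, and the intersection variant is the one shown.

```python
def top_ranked(rank_table, fraction):
    """Words whose sequence number lies in the top `fraction` of the ranking.

    rank_table: non-empty {word: sequence_number}, numbering from 1.
    """
    cutoff = max(rank_table.values()) * fraction
    return {word for word, rank in rank_table.items() if rank <= cutoff}


def candidate_words(film_ranks, mixed_ranks, fraction=0.2):
    """Intersection of the top-ranked words of the two corpora."""
    return top_ranked(film_ranks, fraction) & top_ranked(mixed_ranks, fraction)
```

Taking the intersection guarantees that every candidate has a sequence number and a word frequency percentage in both corpora, which the difference degree calculation needs.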
The difference degree is calculated as follows:
$D = \frac{S_m}{S_{m\,\max}} - \frac{S_f}{S_{f\,\max}} + (P_f - P_m)$
where $D$ is the difference degree;
$S_m$ is the sequence number of the current word in the mixed corpus;
$S_{m\,\max}$ is the maximum word sequence number in the mixed corpus;
$S_f$ is the sequence number of the current word in the film corpus;
$S_{f\,\max}$ is the maximum word sequence number in the film corpus;
$P_f$ is the word frequency percentage of the current word in the film corpus;
$P_m$ is the word frequency percentage of the current word in the mixed corpus.
Here, the sequence number of the current word in the mixed corpus is the position of the candidate word in the mixed corpus after the words have been ranked by word frequency, and the maximum word sequence number in the mixed corpus is the total number of sequence numbers after ranking. Likewise, the sequence number of the current word in the film corpus is the position of the candidate word in the film corpus after ranking by word frequency, and the maximum word sequence number in the film corpus is the total number of sequence numbers after ranking. Because some words differ greatly in word frequency between the two corpora while their sequence numbers differ only slightly, the formula also includes the difference between the candidate word's word frequency percentage in the film corpus and its word frequency percentage in the mixed corpus, which improves the accuracy of the formula. The larger the calculated difference degree of a candidate word, the more likely the candidate word is spoken language material. The word frequency percentage of the current word in the film corpus is the number of times the candidate word occurs in the film corpus as a proportion of the total number of words in the film corpus; the word frequency percentage of the current word in the mixed corpus is the number of times the candidate word occurs in the mixed corpus as a proportion of the total number of words in the mixed corpus.
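The formula above translates directly into code. The sketch below is only an illustration; the function name `difference_degree` and its argument names are hypothetical, chosen to mirror the symbols of the formula.

```python
def difference_degree(s_m, s_m_max, s_f, s_f_max, p_f, p_m):
    """D = S_m / S_m_max - S_f / S_f_max + (P_f - P_m).

    s_m, s_f:         sequence number of the word in the mixed / film corpus
    s_m_max, s_f_max: maximum sequence number in the mixed / film corpus
    p_f, p_m:         word frequency percentage in the film / mixed corpus
    """
    return s_m / s_m_max - s_f / s_f_max + (p_f - p_m)


# Values for the word "I" as listed in Table four of this description.
d = difference_degree(4, 100, 2, 100, 4.561598, 1.028217)
```

A word that ranks higher (smaller sequence number) and takes a larger share of occurrences in the film corpus than in the mixed corpus gets a positive value, which matches the statement that a larger difference degree indicates a more likely spoken word.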
Finally, the words whose difference degree satisfies a preset threshold are taken as the spoken corpus. The preset threshold may be a value set dynamically by the user, such as 20%, 30% or 40%. If the preset threshold is set to 20%, the difference degrees calculated with the formula above are sorted from high to low and the top 20% of words are extracted as the spoken corpus.
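Selecting the top share of candidates by difference degree might look like the sketch below. The name `select_spoken_words` is hypothetical, and rounding the cut-off up to a whole number of words is an assumption; the source does not say how fractional counts are handled.

```python
import math


def select_spoken_words(degrees, fraction=0.2):
    """Keep the top `fraction` of candidate words by difference degree.

    degrees: {word: difference_degree}.
    Returns the kept words, highest difference degree first.
    """
    ranked = sorted(degrees, key=degrees.get, reverse=True)
    keep = math.ceil(len(ranked) * fraction)  # assumption: round up
    return ranked[:keep]
```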
For example, with reference to Table three, assume that the maximum word sequence number is 100 in both the film corpus and the mixed corpus. The information for the word "I" in the film corpus and in the mixed corpus is extracted, as shown in Table four.
Table four
                                     Film corpus    Mixed corpus
Current word sequence number         2              4
Current word frequency percentage    4.561598       1.028217
Substituting the data of Table four into the formula above gives the difference degree of the word "I":
D = (4/100 - 2/100) + (4.561598 - 1.028217) = 3.553381
The difference degree of the word "I" is therefore 3.553381. The difference degree of every candidate word is calculated in the same way, and the words whose difference degree satisfies the preset threshold are taken as the spoken corpus. Table five shows part of the spoken corpus extracted in this way.
Table five
Spoken corpus
Laugh a great ho-ho
Valency is excellent
It is not a problem
Later
Appearance
Ask many questions
OK
Welcome
Whether
In addition, a word vector training model is established, and the words whose difference degree satisfies the preset threshold are input into the word vector training model to obtain expansion words that enlarge the spoken corpus, as shown in Fig. 1B. The word vector training model is implemented with the word2vec software, with the following parameter settings during training: ./word2vec -train result_cropus.txt -output vectors.bin -cbow 0 -size 50 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 4 -binary 1 -min-count 3. The parameters have the following meanings:
train is the training file; cbow selects whether the continuous bag-of-words model is used (0 means not used, in which case word2vec trains the skip-gram model; 1 means used); size is the dimension of the word vectors; window is the length of the context window; negative selects whether negative sampling is used (0 means not used, 1 means used); hs selects whether the hierarchical softmax (HS) method is used (0 means not used, 1 means used); sample 1e-3 is the sampling threshold: the more frequently a word occurs in the training data, the more heavily it is down-sampled; threads is the number of threads to start; binary selects whether the output is a binary file (0 means no, 1 means yes); min-count sets the minimum frequency and defaults to 5: a word that occurs in the document fewer times than this threshold is discarded.
Then the word vector training model uses the ./distance vectors.bin command to produce expansion words for the words whose difference degree satisfies the preset threshold. For each word, the top 10 expansion words are taken, and these together with the words whose difference degree satisfies the preset threshold form the spoken corpus.
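The expansion step can be sketched in plain Python as a nearest-neighbour lookup by cosine similarity, which is the kind of query the distance tool answers. This is an illustration, not the word2vec implementation: the vectors are assumed to be already trained and loaded into a dictionary, and the names `cosine` and `expand_spoken_corpus` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_spoken_corpus(seed_words, vectors, topn=10):
    """Add the `topn` nearest neighbours of every seed word to the corpus.

    seed_words: words whose difference degree satisfied the threshold.
    vectors:    {word: list of floats}, assumed loaded from a trained model.
    """
    expanded = set(seed_words)
    for seed in seed_words:
        if seed not in vectors:
            continue  # e.g. the word was too rare to be kept in training
        scored = [
            (cosine(vectors[seed], vec), word)
            for word, vec in vectors.items()
            if word != seed
        ]
        scored.sort(reverse=True)  # most similar first
        expanded.update(word for _, word in scored[:topn])
    return expanded
```

With the default `topn=10` this mirrors taking the top 10 expansion words per seed word and merging them with the seed words.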
S130: extract the colloquial sentences in the mixed corpus based on the spoken corpus.
Specifically, the spoken rate of the current sentence in the mixed corpus is calculated from the number of words in the current sentence that occur in the spoken corpus and the total number of words in the current sentence, using the following formula:
$k = n / l$
where $k$ is the spoken rate, $n$ is the number of words in the current sentence that occur in the spoken corpus, and $l$ is the total number of words in the current sentence.
A current sentence whose spoken rate satisfies a preset threshold is extracted as a colloquial sentence. The preset threshold may be a value set dynamically by the user or a fixed system default, such as 0.5. If the threshold is the system default, every sentence whose spoken rate is greater than 0.5 is extracted as a colloquial sentence. The spoken rate is calculated, based on the spoken corpus, for every sentence in the mixed corpus, and the sentences whose spoken rate satisfies the preset threshold are extracted as colloquial sentences, as shown in Table six.
Table six
Colloquial style sentence
Which belongs to reduction food
The very first time focuses on Zhejiang grave news event
Serviced wholeheartedly for you
It can all update daily
There is no account number
The women that side loses marriage gives love for change
Sound thinking that oneself is very sad suddenly at me
Do you know for earthquakes in Sichuan
Find old school fellow and get to know new friend
Why others earned than you it is many
I am distracted just to react for several seconds
As an example, the colloquial sentences in the mixed corpus are extracted based on the spoken corpus, with a preset threshold of 0.5 for the spoken rate. To decide whether the sentence "which belongs to reduction food" in the mixed corpus is a colloquial sentence, the sentence is first segmented into words. Four of the resulting words, including "which", "belongs to" and "food", are words in the spoken corpus, so n is 4; the total number of words in the current sentence is 5, so l is 5. The spoken rate given by the formula is:
$k = 4/5 = 0.8$
Because the calculated spoken rate of 0.8 is greater than the preset threshold of 0.5, the sentence "which belongs to reduction food" is extracted from the mixed corpus as a colloquial sentence.
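The spoken rate calculation and the threshold test can be sketched as follows, assuming sentences are already segmented into word lists and the spoken corpus is held as a set. The function names are hypothetical, and the English tokens in the usage lines are stand-ins for the segmented Chinese words of the example.

```python
def spoken_rate(words, spoken_corpus):
    """k = n / l for one segmented sentence (0.0 for an empty sentence)."""
    if not words:
        return 0.0
    n = sum(1 for word in words if word in spoken_corpus)
    return n / len(words)


def extract_colloquial(sentences, spoken_corpus, threshold=0.5):
    """Return the segmented sentences whose spoken rate exceeds the threshold."""
    return [s for s in sentences if spoken_rate(s, spoken_corpus) > threshold]


# Stand-in tokens: four of the five words are in the spoken corpus.
spoken = {"which", "kind", "belongs to", "food"}
sentence = ["which", "kind", "belongs to", "reduction", "food"]
k = spoken_rate(sentence, spoken)
```

Holding the spoken corpus as a set keeps each membership test constant-time, which matters when every sentence of a large mixed corpus is scored.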
In this embodiment, the spoken corpus is determined from the word frequency and rank information of the words in the film corpus and in the mixed corpus, and the spoken corpus is then used to extract the colloquial sentences in the mixed corpus. This avoids the time and effort that user-defined spoken corpora require in the prior art, improves the efficiency of colloquial sentence extraction, yields a more comprehensive spoken corpus, and helps complete the overall corpus system.
Embodiment two
Fig. 2A is a flow chart of a colloquial sentence extraction method provided by Embodiment two of the present invention. This embodiment builds on the embodiment above and refines the step of counting the word frequencies of the words in the film corpus and the mixed corpus separately and ranking the words in each corpus by word frequency. Specifically: the sentences in the film corpus and in the mixed corpus are segmented using a reference dictionary together with the jieba word segmentation component, to obtain the words in the film corpus and in the mixed corpus; the word frequencies of the words in the film corpus and in the mixed corpus are counted separately; and the words in each corpus are ranked from highest to lowest word frequency.
Accordingly, the method of this embodiment includes:
S210: segment the sentences in the film corpus and in the mixed corpus using a reference dictionary together with the jieba word segmentation component, to obtain the words in the film corpus and in the mixed corpus.
The dictionary is a user-defined dictionary, generally a word list; jieba is a word segmentation tool. Specifically, the user may write a program on the PyCharm platform to segment the sentences in the film corpus and the mixed corpus.
The dictionary is loaded with the code jieba.load_userdict(file_name), where file_name is the path of the custom dictionary. The following code is then entered:
    file_object = open(read_path)
    try:
        all_the_text = file_object.read()  # whole corpus as one string
    finally:
        file_object.close()  # always release the file handle
This reads the file at the path read_path into the object all_the_text. Accurate-mode segmentation is then performed with the following function:
    cut_txt = jieba.cut(all_the_text, cut_all=False)  # needs "import jieba"; yields the words one by one
Here all_the_text is the whole text to be segmented and cut_txt is the whole text after segmentation. cut_all=False selects accurate mode, in which the text is segmented as precisely as possible according to the dictionary and the segmentation algorithm; cut_all=True selects full mode, in which every possible segmentation is output. For example:
Full mode: I / came to / Beijing / Tsinghua / Tsinghua University / Hua Da / University
Accurate mode: I / came to / Beijing / Tsinghua University
Finally, the segmented sentences of the film corpus and the mixed corpus are saved to files at the corresponding paths with the following code, that is, the segmented text cut_txt is saved to save_path:
    file_object = open(save_path, 'w')
    file_object.write(' '.join(cut_txt))  # jieba.cut returns a generator, so join the words first
    file_object.close()
S220: count the word frequencies of the words in the film corpus and in the mixed corpus separately.
S230: rank the words in the film corpus and in the mixed corpus separately, from highest to lowest word frequency.
S240: calculate, from the word frequency and rank information of a word, the difference degree of the word between the film corpus and the mixed corpus, and determine the spoken corpus according to the difference degree.
S250: extract the colloquial sentences in the mixed corpus based on the spoken corpus.
To calculate the difference degree of the same word between the film corpus and the mixed corpus, the word frequency of every word in each corpus must be calculated, and all the words then ranked from highest to lowest word frequency. The spoken corpus is determined from the difference degree of each word between the two corpora, and finally the colloquial sentences in the mixed corpus are extracted using the spoken corpus. The detailed process is shown in Fig. 2B, in which the subtitle material denotes the words in the film corpus, the mixed material denotes the words in the mixed corpus, and the bag of words is the spoken corpus.
This embodiment segments the sentences in the film corpus and in the mixed corpus using a reference dictionary together with the jieba word segmentation component, and determines the spoken corpus from the resulting words. Because the jieba component is intelligent and easy to use and can process database data in excess of the ten-billion scale, colloquial sentences can be extracted more quickly and conveniently, which improves extraction efficiency.
Embodiment three
Fig. 3 is a structural diagram of a colloquial sentence extraction device provided by Embodiment three of the present invention. This embodiment is applicable to a variety of colloquial sentence extraction scenarios. The device may be implemented in software and/or hardware and may be integrated into any equipment that provides a colloquial sentence extraction function, for example a computer. As shown in Fig. 3, the device specifically includes a word frequency statistics module 31, a spoken corpus determination module 32 and a colloquial sentence extraction module 33.
The word frequency statistics module 31 is configured to count the word frequencies of the words in the film corpus and in the mixed corpus separately, and to rank the words in the film corpus and in the mixed corpus by word frequency.
The spoken corpus determination module 32 is configured to calculate, from the word frequency and rank information of a word, the difference degree of the word between the film corpus and the mixed corpus, and to determine the spoken corpus according to the difference degree.
The colloquial sentence extraction module 33 is configured to extract the colloquial sentences in the mixed corpus based on the spoken corpus.
The colloquial sentence extraction device of this embodiment is used to perform the colloquial sentence extraction method of the embodiments above. Its technical principle and technical effect are similar and are not repeated here.
Embodiment four
Fig. 4 is a structural diagram of a colloquial sentence extraction device provided by Embodiment four of the present invention. As shown in Fig. 4:
On the basis of the embodiment above, the word frequency statistics module is specifically configured to: segment the sentences in the film corpus and in the mixed corpus using a reference dictionary together with the jieba word segmentation component, to obtain the words in the film corpus and in the mixed corpus; count the word frequencies of the words in the film corpus and in the mixed corpus separately; and rank the words in the film corpus and in the mixed corpus separately, from highest to lowest word frequency.
On the basis of the embodiment above, the spoken corpus determination module is specifically configured to:
obtain multiple candidate words whose word frequency rank falls within a preset range in the film corpus and the mixed corpus;
calculate, from the current word sequence number, the maximum word sequence number and the current word frequency percentage, the difference degree of each candidate word between the film corpus and the mixed corpus, where the difference degree is calculated as follows:
$D = \frac{S_m}{S_{m\,\max}} - \frac{S_f}{S_{f\,\max}} + (P_f - P_m)$
where $D$ is the difference degree;
$S_m$ is the sequence number of the current word in the mixed corpus;
$S_{m\,\max}$ is the maximum word sequence number in the mixed corpus;
$S_f$ is the sequence number of the current word in the film corpus;
$S_{f\,\max}$ is the maximum word sequence number in the film corpus;
$P_f$ is the word frequency percentage of the current word in the film corpus;
$P_m$ is the word frequency percentage of the current word in the mixed corpus;
and take the words whose difference degree satisfies a preset threshold as the spoken corpus.
On the basis of the embodiment above, the colloquial sentence extraction module specifically includes a spoken rate calculation unit 41 and a colloquial sentence extraction unit 42.
The spoken rate calculation unit 41 is configured to calculate the spoken rate of the current sentence in the mixed corpus from the number of words in the current sentence that occur in the spoken corpus and the total number of words in the current sentence, using the following formula:
$k = n / l$
where $k$ is the spoken rate, $n$ is the number of words in the current sentence that occur in the spoken corpus, and $l$ is the total number of words in the current sentence.
The colloquial sentence extraction unit 42 is configured to extract, as a colloquial sentence, a current sentence whose spoken rate satisfies a preset threshold.
On the basis of the embodiment above, the colloquial sentence extraction unit is specifically configured to extract, as a colloquial sentence, a current sentence whose spoken rate is greater than 0.5.
On the basis of the embodiment above, the device further includes a spoken corpus expansion module 43.
The spoken corpus expansion module 43 is configured to establish a word vector training model, input the words in the spoken corpus into the word vector training model to obtain expansion words, and add the expansion words that satisfy a preset threshold to the spoken corpus.
The colloquial sentence extraction device of this embodiment is used to perform the colloquial sentence extraction method of the embodiments above. Its technical principle and technical effect are similar and are not repeated here.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the scope of protection of the invention. Therefore, although the invention has been described in some detail through the embodiments above, it is not limited to those embodiments and may include other equivalent embodiments without departing from the inventive concept; the scope of the invention is determined by the scope of the appended claims.

Claims (10)

1. A method for extracting colloquial sentences, characterized by comprising:
separately counting the word frequencies of the words in a film corpus and in a mixed corpus, and sorting the words in the film corpus and in the mixed corpus according to the word frequencies;
calculating, according to the word frequency and ranking information of each word, the difference degree of the word between the film corpus and the mixed corpus, and determining a spoken corpus according to the difference degree;
extracting colloquial sentences from the mixed corpus based on the spoken corpus.
2. The method according to claim 1, characterized in that separately counting the word frequencies of the words in the film corpus and in the mixed corpus, and sorting the words in the film corpus and in the mixed corpus according to the word frequencies, comprises:
performing word segmentation on the sentences in the film corpus and in the mixed corpus, respectively, using a reference dictionary and the jieba word segmentation component, to obtain the words in the film corpus and in the mixed corpus;
separately counting the word frequencies of the words in the film corpus and in the mixed corpus;
sorting the words in the film corpus and in the mixed corpus, respectively, from high to low according to their word frequencies.
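A minimal Python sketch of the counting and sorting steps follows. The tokenizer is passed in as a parameter; whitespace splitting is used here only so that the example has no dependencies, and a segmenter such as jieba's `jieba.lcut` could be supplied instead. The function name `rank_words`, the alphabetical tie-break between equally frequent words, and the returned (rank, frequency share) layout are assumptions of this sketch.

```python
from collections import Counter


def rank_words(sentences, tokenize=str.split):
    """Count word frequencies in a corpus and rank words from high to low.

    Returns a dict mapping word -> (rank, share), where rank starts at 1
    for the most frequent word and share is the word's count divided by
    the total number of tokens in the corpus.
    """
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    total = sum(counts.values())
    # Sort by descending count; ties are broken alphabetically (assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {
        word: (rank, count / total)
        for rank, (word, count) in enumerate(ranked, start=1)
    }
```

Calling the function once per corpus yields two tables that carry the rank and frequency share of each word, which is the information the later comparison between the two corpora relies on.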
3. The method according to claim 1 or 2, characterized in that calculating, according to the word frequency and ranking information of each word, the difference degree of the word between the film corpus and the mixed corpus, and determining the spoken corpus according to the difference degree, comprises:
obtaining a plurality of candidate words whose word-frequency ranks in the film corpus and in the mixed corpus fall within a preset range;
calculating the difference degree of each candidate word between the film corpus and the mixed corpus according to the rank of the current word, the maximum word rank, and the word-frequency percentage of the current word, wherein the difference degree is calculated as follows:
D = S_m / S_{m,max} - S_f / S_{f,max} + (P_f - P_m)
where D is the difference degree;
S_m is the rank of the current word in the mixed corpus;
S_{m,max} is the maximum word rank in the mixed corpus;
S_f is the rank of the current word in the film corpus;
S_{f,max} is the maximum word rank in the film corpus;
P_f is the word-frequency percentage of the current word in the film corpus;
P_m is the word-frequency percentage of the current word in the mixed corpus;
taking the words whose difference degree meets a preset threshold as the spoken corpus.
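The formula for D can be written out as a short Python sketch. Several details are assumptions made here because the claim leaves them open: candidate words are taken to be those present in both corpora with a rank no greater than an optional `top_n`, word-frequency percentages are expressed as fractions between 0 and 1, and "meets a preset threshold" is read as D being strictly greater than the threshold. The function names and the word -> (rank, share) table layout are hypothetical.

```python
def diversity(s_m, s_m_max, s_f, s_f_max, p_f, p_m):
    """D = S_m / S_m,max - S_f / S_f,max + (P_f - P_m)."""
    return s_m / s_m_max - s_f / s_f_max + (p_f - p_m)


def select_spoken_words(film, mixed, threshold, top_n=None):
    """Pick words that rank higher and occur more often in the film corpus.

    film and mixed map word -> (rank, share), with rank 1 being the most
    frequent word. Returns the set of words with D > threshold.
    """
    if not film or not mixed:
        return set()
    s_f_max = max(rank for rank, _ in film.values())
    s_m_max = max(rank for rank, _ in mixed.values())
    selected = set()
    for word in film.keys() & mixed.keys():
        s_f, p_f = film[word]
        s_m, p_m = mixed[word]
        if top_n is not None and (s_f > top_n or s_m > top_n):
            continue  # outside the assumed preset rank range
        if diversity(s_m, s_m_max, s_f, s_f_max, p_f, p_m) > threshold:
            selected.add(word)
    return selected
```

A word that sits near the top of the film ranking (small S_f) but far down the mixed ranking (large S_m), and that has a larger frequency share in the film corpus, receives a large positive D, which is why a high D is treated as a sign of spoken usage in this sketch.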
4. The method according to claim 1, characterized in that extracting the colloquial sentences from the mixed corpus based on the spoken corpus comprises:
calculating the spoken rate of a current sentence in the mixed corpus according to the number of words in the current sentence that appear in the spoken corpus and the total number of words in the current sentence, wherein the spoken rate is calculated as follows:
k = n / l
where k is the spoken rate, n is the number of words in the current sentence that appear in the spoken corpus, and l is the total number of words in the current sentence;
extracting, as the colloquial sentence, the current sentence whose spoken rate meets a preset threshold.
5. The method according to claim 4, characterized in that extracting, as the colloquial sentence, the current sentence whose spoken rate meets the preset threshold comprises:
extracting, as the colloquial sentence, the current sentence whose spoken rate is greater than 0.5.
6. The method according to claim 1, characterized in that, before extracting the colloquial sentences from the mixed corpus based on the spoken corpus, the method further comprises:
building a word vector training model, inputting the words of the spoken corpus into the word vector training model to obtain extension words, and adding the extension words that meet a preset threshold to the spoken corpus.
7. A device for extracting colloquial sentences, characterized by comprising:
a word frequency statistics module, configured to separately count the word frequencies of the words in a film corpus and in a mixed corpus, and to sort the words in the film corpus and in the mixed corpus according to the word frequencies;
a spoken corpus determination module, configured to calculate, according to the word frequency and ranking information of each word, the difference degree of the word between the film corpus and the mixed corpus, and to determine a spoken corpus according to the difference degree;
a colloquial sentence extraction module, configured to extract colloquial sentences from the mixed corpus based on the spoken corpus.
8. The device according to claim 7, characterized in that the word frequency statistics module is specifically configured to:
perform word segmentation on the sentences in the film corpus and in the mixed corpus, respectively, using a reference dictionary and the jieba word segmentation component, to obtain the words in the film corpus and in the mixed corpus;
separately count the word frequencies of the words in the film corpus and in the mixed corpus;
sort the words in the film corpus and in the mixed corpus, respectively, from high to low according to their word frequencies.
9. The device according to claim 7 or 8, characterized in that the spoken corpus determination module is specifically configured to:
obtain a plurality of candidate words whose word-frequency ranks in the film corpus and in the mixed corpus fall within a preset range;
calculate the difference degree of each candidate word between the film corpus and the mixed corpus according to the rank of the current word, the maximum word rank, and the word-frequency percentage of the current word, wherein the difference degree is calculated as follows:
D = S_m / S_{m,max} - S_f / S_{f,max} + (P_f - P_m)
where D is the difference degree;
S_m is the rank of the current word in the mixed corpus;
S_{m,max} is the maximum word rank in the mixed corpus;
S_f is the rank of the current word in the film corpus;
S_{f,max} is the maximum word rank in the film corpus;
P_f is the word-frequency percentage of the current word in the film corpus;
P_m is the word-frequency percentage of the current word in the mixed corpus;
take the words whose difference degree meets a preset threshold as the spoken corpus.
10. The device according to claim 7, characterized in that the colloquial sentence extraction module specifically comprises:
a spoken rate calculation unit, configured to calculate the spoken rate of a current sentence in the mixed corpus according to the number of words in the current sentence that appear in the spoken corpus and the total number of words in the current sentence, wherein the spoken rate is calculated as follows:
k = n / l
where k is the spoken rate, n is the number of words in the current sentence that appear in the spoken corpus, and l is the total number of words in the current sentence;
a colloquial sentence extraction unit, configured to extract, as the colloquial sentence, the current sentence whose spoken rate meets a preset threshold.
CN201710225009.6A 2017-04-07 2017-04-07 Method and device for extracting spoken sentences Active CN106991181B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710225009.6A CN106991181B (en) 2017-04-07 2017-04-07 Method and device for extracting spoken sentences

Publications (2)

Publication Number Publication Date
CN106991181A (en) 2017-07-28
CN106991181B (en) 2020-04-21

Family

ID=59415480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710225009.6A Active CN106991181B (en) 2017-04-07 2017-04-07 Method and device for extracting spoken sentences

Country Status (1)

Country Link
CN (1) CN106991181B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101114298A (en) * 2007-08-31 2008-01-30 北京搜狗科技发展有限公司 Method for gaining oral vocabulary entry, device and input method system thereof
CN101286161A (en) * 2008-05-28 2008-10-15 华中科技大学 Intelligent Chinese request-answering system based on concept
CN101464856A (en) * 2007-12-20 2009-06-24 株式会社东芝 Alignment method and apparatus for parallel spoken language materials
US20120008821A1 (en) * 2010-05-10 2012-01-12 Videosurf, Inc Video visual and audio query
CN103034627A (en) * 2011-10-09 2013-04-10 北京百度网讯科技有限公司 Method and device for calculating sentence similarity and method and device for machine translation
CN105247517A (en) * 2013-04-23 2016-01-13 谷歌公司 Ranking signals in mixed corpora environments
CN105741831A (en) * 2016-01-27 2016-07-06 广东外语外贸大学 Spoken language evaluation method based on grammatical analysis and spoken language evaluation system
CN106164889A (en) * 2013-12-02 2016-11-23 丘贝斯有限责任公司 System and method for internal storage data library searching
US20170061017A1 (en) * 2015-09-01 2017-03-02 Google Inc. Providing native application search results with web search results
CN106528726A (en) * 2016-11-02 2017-03-22 四川用联信息技术有限公司 Keyword optimization-based search engine optimization realization technology

Also Published As

Publication number Publication date
CN106991181B (en) 2020-04-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant