CN106991181B

CN106991181B - Method and device for extracting spoken sentences

Info

Publication number: CN106991181B
Application number: CN201710225009.6A
Authority: CN
Inventors: 李贤�
Original assignee: Guangzhou Shiyuan Electronics Technology Co Ltd
Current assignee: Guangzhou Shiyuan Electronics Technology Co Ltd
Priority date: 2017-04-07
Filing date: 2017-04-07
Publication date: 2020-04-21
Anticipated expiration: 2037-04-07
Also published as: CN106991181A

Abstract

The embodiments of the invention disclose a method and a device for extracting spoken sentences. The method comprises: counting the word frequencies of the words in a film corpus and in a mixed corpus separately, and sorting the words in each corpus by word frequency; calculating, from the word frequency and ranking information of the words, the degree of difference of each word between the film corpus and the mixed corpus, and determining a spoken language corpus according to the degree of difference; and extracting the spoken sentences in the mixed corpus based on the spoken language corpus. Because the spoken language corpus is determined from word frequency and ranking statistics rather than defined by hand, the embodiments avoid the time-consuming and labor-intensive user-defined spoken language corpus of the prior art, effectively improve the efficiency of spoken sentence extraction, and make the overall corpus system more complete.

Description

Method and device for extracting spoken sentences

Technical Field

The embodiment of the invention relates to the technical field of information, in particular to a method and a device for extracting spoken sentences.

Background

With the advancement of technology, the large storage capacity of computers has been applied to language storage, which has led to the development of corpora.

A spoken language corpus is likewise a basic resource that carries language knowledge with the computer as its medium; a complete spoken language corpus is used for language model construction, dictionary compilation, text classification and the like.

A user-defined spoken language corpus is time-consuming and labor-intensive to build, is subject to personal bias and lacks authority, and the absence of a system-level spoken language corpus hinders the completion of the overall corpus system.

Disclosure of Invention

The embodiment of the invention provides a method and a device for extracting spoken sentences, which can avoid the time-consuming and labor-consuming mode of customizing a spoken language corpus by a user, and improve the efficiency and the reliability of the extraction of the spoken sentences.

In a first aspect, an embodiment of the present invention provides a method for extracting spoken sentences, including:

respectively counting word frequencies of words in a film corpus and a mixed corpus, and sequencing the words in the film corpus and the mixed corpus according to the word frequencies;

calculating the difference degree of the words in the film corpus and the mixed corpus according to the word frequency and the sequencing information of the words, and confirming a spoken language corpus according to the difference degree;

and extracting the spoken sentences in the mixed corpus based on the spoken corpus.

In a second aspect, an embodiment of the present invention further provides a device for extracting spoken sentences, where the device includes:

the word frequency counting module is used for respectively counting the word frequencies of the words in the film corpus and the mixed corpus and sequencing the words in the film corpus and the mixed corpus according to the word frequencies;

the spoken language corpus confirming module is used for calculating the difference degree of the words in the film corpus and the mixed corpus according to the word frequency and the sequencing information of the words and confirming the spoken language corpus according to the difference degree;

and the spoken sentence extraction module is used for extracting the spoken sentences in the mixed corpus based on the spoken corpus.

The embodiment of the invention provides a method and a device for extracting spoken sentences.

Drawings

FIG. 1A is a flowchart of spoken sentence extraction according to a first embodiment of the present invention;

FIG. 1B is a diagram illustrating a spoken sentence extraction process according to a first embodiment of the present invention;

FIG. 2A is a flowchart of spoken sentence extraction according to a second embodiment of the present invention;

FIG. 2B is a flowchart of spoken sentence extraction according to a second embodiment of the present invention;

FIG. 3 is a block diagram of an apparatus for extracting spoken sentences according to a third embodiment of the present invention;

FIG. 4 is a block diagram of a spoken sentence extraction apparatus according to a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Example one

Fig. 1A is a flowchart of a spoken sentence extraction method according to the first embodiment of the present invention. This embodiment is applicable to various spoken sentence extraction scenarios. The method may be executed by the spoken sentence extraction apparatus provided by the embodiments of the present invention; the apparatus may be implemented in software and/or hardware and may be integrated into any device that provides a spoken sentence extraction function, for example a computer. As shown in Fig. 1A, the method specifically includes:

S110, counting the word frequencies of the words in the film corpus and the mixed corpus separately, and sorting the words in the film corpus and the mixed corpus according to the word frequencies.

Specifically, both the film corpus and the mixed corpus are obtained from the internet. The film corpus is sourced from the dialogue in films; specifically, it may be a subtitle file. Because a subtitle file records conversations between people, most of its content can be regarded as spoken language material. Besides the everyday dialogue, however, a subtitle file also contains timestamps and the names of the speakers, so the film corpus must first be processed to retain only the dialogue content. The mixed corpus is a corpus that contains both written and spoken language. Word frequency refers to the number of times a given word appears in a file; the word frequencies of the words in the film corpus and in the mixed corpus are counted separately.
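The clean-up of the film corpus described above can be sketched as follows. This is only an illustration under stated assumptions: the patent does not name a subtitle format, so the sketch assumes SubRip-style (.srt) cues made of a numeric index line, a `start --> end` timing line and one or more text lines, and it assumes that a speaker name, if present, is a single token followed by a colon. The function name `extract_dialogue` is hypothetical.

```python
import re

# Assumed SubRip timing line, e.g. "00:01:02,500 --> 00:01:04,000".
TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)
# Assumed speaker prefix: one token followed by a colon, e.g. "JOHN: ".
SPEAKER = re.compile(r"^[^\s:：]{1,20}[:：]\s*")


def extract_dialogue(subtitle_text):
    """Return only the dialogue lines of a subtitle file as a list of strings."""
    dialogue = []
    for raw in subtitle_text.splitlines():
        line = raw.strip()
        if not line or line.isdigit() or TIMING.match(line):
            continue  # skip blank lines, cue indices and timing lines
        line = SPEAKER.sub("", line)  # drop a leading speaker name, if any
        if line:
            dialogue.append(line)
    return dialogue
```

A real subtitle collection would likely need format-specific handling (for example multi-word speaker names or styling tags), which this sketch does not attempt.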

First, the downloaded film corpus and mixed corpus are stored in separate documents, which may be in Word format or txt format. Then a word segmentation tool and a word bank are used to segment the sentences in the film corpus document and in the mixed corpus document, that is, every sentence is split into the words it contains, and the results are stored as txt documents. Table 1 shows part of the film corpus after word segmentation, and Table 2 shows part of the mixed corpus after word segmentation.

Table 1

Six months ago
	Feeling of consuming a whole day in court
It is a man catching extraterrestrial objects
	And one of the benefits of genetically variant humans
There is little chance and little chance of successful prosecution
	Vehicle door is opened
What is
	Vehicle door without lock
Strange thing
	I determine that I lock
Is definitely a flexible event

Table 2

Finally, the word frequencies of the words segmented from the film corpus and from the mixed corpus are counted separately, the words are sorted from high to low by word frequency, and the result is stored as a document in Excel format. Table 3 shows part of the word frequency ranking information used in spoken sentence extraction. As Table 3 shows, the higher the word frequency of a word, the more times the word appears in the document. For example, the word "of" appears more often than any other word in the document, so "of" is ranked first by word frequency.

Table 3
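The counting and ranking step can be sketched with the Python standard library as below. The function name `rank_by_frequency` and the tuple layout are illustrative assumptions, not part of the patent. The percentage is expressed in percent on the assumption that this matches the magnitude of the word frequency percentages quoted later in the description, and words with equal counts simply receive consecutive ranks.

```python
from collections import Counter


def rank_by_frequency(words):
    """Return (word, count, rank, percentage) tuples sorted by descending count.

    `rank` starts at 1 for the most frequent word; `percentage` is the word's
    share of all tokens, expressed in percent.
    """
    counts = Counter(words)
    total = sum(counts.values())
    ranked = []
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        ranked.append((word, count, rank, 100.0 * count / total))
    return ranked
```

Applied to the segmented words of each corpus in turn, this yields the sequence number and word frequency percentage that the difference degree calculation in S120 relies on.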

S120, calculating the degree of difference of the words in the film corpus and the mixed corpus according to the word frequency and the ranking information of the words, and confirming the spoken language corpus according to the degree of difference.

Specifically, a plurality of candidate words whose word frequency rank falls within a preset range in the film corpus and the mixed corpus are obtained. The preset range may be a value set dynamically by the user, such as the top 20%, 30% or 40% of the ranking. The candidate words that fall within the preset range are selected, and their degree of difference between the film corpus and the mixed corpus is calculated. More specifically, the union of the words ranked within the preset range in the film corpus and the words ranked within the preset range in the mixed corpus may be taken as the candidate words; alternatively, the intersection of these two sets of words may be taken as the candidate words.
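A minimal sketch of the candidate selection follows. It reads the first option in the paragraph above as the union of the two top-ranked word sets, which is an interpretation of the translated text, and the function names `top_fraction` and `candidate_words` are hypothetical. Each input is assumed to be a list of words already sorted from highest to lowest word frequency.

```python
def top_fraction(ranked_words, fraction):
    """Return the set of words whose rank falls within the top `fraction`."""
    keep = max(1, int(len(ranked_words) * fraction))
    return set(ranked_words[:keep])


def candidate_words(film_ranked, mixed_ranked, fraction=0.2, mode="union"):
    """Select candidate words from two frequency-sorted word lists."""
    film_top = top_fraction(film_ranked, fraction)
    mixed_top = top_fraction(mixed_ranked, fraction)
    if mode == "union":
        return film_top | mixed_top
    if mode == "intersection":
        return film_top & mixed_top
    raise ValueError("mode must be 'union' or 'intersection'")
```

The intersection guarantees that every candidate has a sequence number in both corpora; with the union, a word missing from one corpus would need a fallback rank, which the description does not specify.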

Wherein, the calculation formula of the difference degree is as follows:

D = S_m / S_{m max} - S_f / S_{f max} + (P_f - P_m)

wherein D is the degree of difference;

S_m is the sequence number of the current word in the mixed corpus;

S_{m max} is the maximum word sequence number in the mixed corpus;

S_f is the sequence number of the current word in the film corpus;

S_{f max} is the maximum word sequence number in the film corpus;

P_f is the word frequency percentage of the current word in the film corpus;

P_m is the word frequency percentage of the current word in the mixed corpus;

Here, the sequence number of the current word in the mixed corpus is the position of the candidate word in the mixed corpus after sorting by word frequency, and the maximum word sequence number in the mixed corpus is the total number of positions in the sorted mixed corpus. Similarly, the sequence number of the current word in the film corpus is the position of the candidate word in the film corpus after sorting by word frequency, and the maximum word sequence number in the film corpus is the total number of positions in the sorted film corpus. Because some words differ greatly in word frequency but only slightly in sequence number, the difference between the candidate word's word frequency percentage in the film corpus and its word frequency percentage in the mixed corpus is added to the formula to improve its accuracy. The greater the calculated degree of difference of a candidate word, the greater the probability that the candidate word is spoken language material. The word frequency percentage of the current word in the film corpus is the ratio of the number of occurrences of the candidate word in the film corpus to the total number of words in the film corpus; the word frequency percentage of the current word in the mixed corpus is the ratio of the number of occurrences of the candidate word in the mixed corpus to the total number of words in the mixed corpus.

Finally, the words whose degree of difference meets a preset threshold are taken as the spoken language corpus. The preset threshold may be a value set dynamically by the user, such as 20%, 30% or 40%. If the preset threshold is set to 20%, the candidate words are sorted from high to low by the degree of difference calculated with the formula, and the top 20% of the words are extracted as the spoken language corpus.
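The difference degree and the threshold selection can be sketched as below. The function names and the layout of the `stats` dictionary are assumptions made for illustration, and the code relies on Python 3 true division.

```python
def difference_degree(s_m, s_m_max, s_f, s_f_max, p_f, p_m):
    """D = S_m / S_{m max} - S_f / S_{f max} + (P_f - P_m)."""
    return s_m / s_m_max - s_f / s_f_max + (p_f - p_m)


def select_spoken_words(stats, fraction=0.2):
    """Keep the top `fraction` of candidate words by descending difference degree.

    `stats` maps each candidate word to a tuple
    (s_m, s_m_max, s_f, s_f_max, p_f, p_m).
    """
    scored = sorted(
        stats, key=lambda word: difference_degree(*stats[word]), reverse=True
    )
    keep = max(1, int(len(scored) * fraction))
    return scored[:keep]
```

With the values the description gives for the word "I" (sequence numbers 4 and 2 in the mixed and film corpora, both maxima 100, and word frequency percentages 4.561598 and 1.028217), `difference_degree(4, 100, 2, 100, 4.561598, 1.028217)` evaluates to approximately 3.553381.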

For example, referring to Table 3, assume that the maximum word sequence number in both the film corpus and the mixed corpus is 100. The information for the word "I" in the film corpus and in the mixed corpus is extracted, as shown in Table 4.

Table 4

	Film corpus	Mixed corpus
Current word sequence number	2	4
Word frequency percentage of current word	4.561598	1.028217

Substituting the data in Table 4 into the formula gives the degree of difference of the word "I":

D = (4/100 - 2/100) + (4.561598 - 1.028217) = 3.553381

Thus the degree of difference of the word "I" is 3.553381. The degree of difference of every candidate word is calculated in the same way, and the words whose degree of difference meets the preset threshold are taken as the spoken language corpus. Table 5 shows part of the spoken language corpus obtained in this way.

Table 5

Spoken language corpus
	He
High price
	No problem
Later on, the
	Sample
Question of this question
	Ai-hen
That good
	Welcome
Whether or not

In addition, a word vector training model is established, and the words whose degree of difference meets the preset threshold are input into the word vector training model to obtain extension words with which the spoken language corpus is expanded, as shown in Fig. 1B. The word vector training model is implemented with the word2vec software, and the parameters are set as follows during training: word2vec -train result_crop.txt -output vectors.bin -cbow 0 -size 50 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 4 -binary 1 -min_count 3. The specific meanings of the parameters are as follows:

train is the training file; cbow selects the model, where 0 uses the skip-gram model and 1 uses the continuous bag-of-words model; size is the dimension of the word vectors; window is the length of the context window; negative indicates whether negative sampling is used, where 0 means not used and 1 means used; hs indicates whether hierarchical softmax (the HS method) is used, where 0 means not used and 1 means used; sample is the sampling threshold: the more frequently a word occurs in the training sample, the more likely it is to be down-sampled; threads is the number of threads started; binary indicates whether the output is a binary file, where 0 means no and 1 means yes; min_count sets the minimum frequency, which defaults to 5, and a word is discarded if it occurs in the document fewer times than this threshold.

Then, for the words whose degree of difference meets the preset threshold, the word vector training model generates extension words through the ./distance vectors.bin command. The first 10 extension words of each word, together with the words whose degree of difference meets the preset threshold, are used as the spoken language corpus.
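The expansion step can be illustrated without the word2vec binaries. The sketch below assumes the trained vectors have already been loaded into a plain dictionary mapping words to lists of floats, and it ranks neighbours by cosine similarity, which is the measure the word2vec distance tool reports. The function names `cosine` and `expand_corpus` are hypothetical, and this is a stand-in for the command-line workflow, not a reproduction of it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_corpus(seed_words, vectors, top_n=10):
    """Add each seed word's `top_n` nearest neighbours to the seed set."""
    expanded = set(seed_words)
    for seed in seed_words:
        if seed not in vectors:
            continue  # no vector, e.g. the word fell below the minimum count
        neighbours = sorted(
            (word for word in vectors if word != seed),
            key=lambda word: cosine(vectors[seed], vectors[word]),
            reverse=True,
        )
        expanded.update(neighbours[:top_n])
    return expanded
```

This brute-force search is linear in the vocabulary size per seed word, which is adequate for a sketch but would be slow for a large vocabulary.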

S130, extracting the spoken sentences in the mixed corpus based on the spoken language corpus.

Specifically, according to the number of the words appearing in the spoken language corpus and the total number of the words in the current sentence, the spoken language conversion rate of the current sentence in the mixed corpus is calculated by adopting the following formula:

k＝n/l

wherein k is the spoken language conversion rate, n is the number of the words appearing in the spoken language corpus in the current sentence, and l is the total number of the words in the current sentence.

The current sentences whose spoken language conversion rate meets a preset threshold are extracted as spoken sentences. The preset threshold may be a value set dynamically by the user or a fixed system default, such as 0.5. If the system default is used, the sentences whose spoken language conversion rate is greater than 0.5 are extracted as spoken sentences. The spoken language conversion rate of each sentence in the mixed corpus is calculated based on the spoken language corpus, and the sentences whose rate meets the preset threshold, namely the spoken sentences, are extracted, as shown in Table 6.

Table 6

Spoken language sentence
	Which belong to the restored food wool
Focusing Zhejiang great news events at first time
	Exhaust honesty for you
Update o every day
	Has no account number
Help women who lose marriage to find back love
	Suddenly feel sad to oneself
Notice of Sichuan earthquake
	Find new friends of old classmates
Why others earn more than you
	I am corrugated in seconds to respond

Illustratively, the spoken sentences in the mixed corpus are extracted based on the spoken language corpus, and the sentences whose spoken language conversion rate is greater than 0.5 are extracted as spoken sentences. To judge whether the sentence "Which belong to the restored food wool" in the mixed corpus is a spoken sentence, the sentence is first segmented into words. Since "which", "belong to", "food" and "wool" are words in the spoken language corpus, the value of n is 4, and since the total number of words in the current sentence is 5, the value of l is 5. The spoken language conversion rate is then calculated with the formula:

k = 4/5 = 0.8

Since the calculated spoken language conversion rate, 0.8, is greater than the preset threshold of 0.5, the sentence "Which belong to the restored food wool" in the mixed corpus is extracted as a spoken sentence.
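The extraction rule k = n / l can be sketched as follows. The function names are hypothetical, each sentence is assumed to be a list of already segmented words, and the sketch counts every occurrence of a corpus word toward n, which the description does not state explicitly.

```python
def spoken_rate(sentence_words, spoken_corpus):
    """k = n / l for one segmented sentence; returns 0.0 for an empty sentence."""
    if not sentence_words:
        return 0.0
    n = sum(1 for word in sentence_words if word in spoken_corpus)
    return n / len(sentence_words)


def extract_spoken_sentences(sentences, spoken_corpus, threshold=0.5):
    """Return the segmented sentences whose rate k is greater than `threshold`."""
    return [s for s in sentences if spoken_rate(s, spoken_corpus) > threshold]
```

For a five-word sentence in which four words belong to the spoken language corpus, `spoken_rate` returns 0.8, so the sentence passes the default threshold of 0.5. Storing the corpus as a Python set keeps each membership test constant-time.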

In this embodiment, the spoken language corpus is determined by counting the word frequency and ranking information of the words in the film corpus and the mixed corpus separately, and the spoken language corpus is then used to extract the spoken sentences in the mixed corpus. This solves the prior-art problem that a user-defined spoken language corpus is time-consuming and labor-intensive to build, effectively improves the efficiency of spoken sentence extraction, yields a relatively comprehensive spoken language corpus, and makes the overall corpus system more complete.

Example two

Fig. 2A is a flowchart of a spoken sentence extraction method according to the second embodiment of the present invention, which is optimized on the basis of the first embodiment. This embodiment provides a specific way of counting the word frequencies of the words in the film corpus and the mixed corpus and sorting the words in each corpus by word frequency, namely: performing word segmentation on the sentences in the film corpus and the mixed corpus respectively according to a reference word bank and the jieba word segmentation component to obtain the words in the film corpus and the mixed corpus; counting the word frequencies of the words in the film corpus and the mixed corpus separately; and sorting the words in the film corpus and in the mixed corpus from high to low by word frequency.

Correspondingly, the method of the embodiment includes:

S210, performing word segmentation on the sentences in the film corpus and the mixed corpus respectively according to the reference word bank and the jieba word segmentation component to obtain the words in the film corpus and the mixed corpus.

The reference word bank is a user-defined word bank, generally a dictionary; the jieba word segmentation component is a word segmentation tool. Specifically, the user can write a program on the PyCharm platform to segment the sentences in the film corpus and the mixed corpus.

The word bank is loaded with the code jieba.load_userdict(file_name), where file_name is the path of the user-defined dictionary. By entering the code:

# Read the whole file at read_path into all_the_text
file_object = open(read_path)
try:
    all_the_text = file_object.read()
finally:
    file_object.close()

the file at read_path is read into the all_the_text object, and precise-mode word segmentation is then performed with the following function:

import jieba
cut_txt = jieba.cut(all_the_text, cut_all=False)

Here all_the_text is the whole text to be segmented, cut_txt is the segmented text, and cut_all=False selects the precise word segmentation mode. In precise mode the whole text is segmented accurately according to the dictionary and a dedicated algorithm, without full-mode segmentation; full mode outputs all possible segmentations, for example:

The full-mode segmentation is: I / come to / Beijing / Tsinghua University / Huada / University

The precise-mode segmentation is: I / come to / Beijing / Tsinghua University

Finally, the segmented sentences of the film corpus and the mixed corpus are stored in files at the corresponding paths by entering the following code, that is, the segmented text cut_txt is saved to save_path:

file_object = open(save_path, 'w')
# jieba.cut returns a generator, so join the words into one string first
file_object.write(' '.join(cut_txt))
file_object.close()
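The read, segment and save steps above can be combined into one function, sketched below with context managers so the files are closed even if an error occurs. The function name `segment_file` and the pluggable `segment` parameter are assumptions made for illustration. The default whitespace split is only a stand-in so that the sketch runs without jieba installed; for Chinese text one would pass `jieba.lcut`, which returns a list of words in precise mode by default.

```python
def segment_file(read_path, save_path, segment=str.split):
    """Segment the text in `read_path` and write the words to `save_path`.

    `segment` is any callable mapping a string to a list of words. The
    default (whitespace splitting) is a placeholder, not a real segmenter.
    """
    with open(read_path, encoding="utf-8") as source:
        text = source.read()
    # Drop whitespace-only tokens such as the newlines a segmenter may emit.
    words = [word for word in segment(text) if word.strip()]
    with open(save_path, "w", encoding="utf-8") as target:
        target.write(" ".join(words))
    return words
```

The explicit UTF-8 encoding is also an assumption; the description does not state how the corpus files are encoded.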

S220, counting the word frequencies of the words in the film corpus and the mixed corpus separately.

S230, sorting the words in the film corpus and in the mixed corpus from high to low by word frequency.

S240, calculating the degree of difference of the words in the film corpus and the mixed corpus according to the word frequency and the ranking information of the words, and confirming the spoken language corpus according to the degree of difference.

S250, extracting the spoken sentences in the mixed corpus based on the spoken language corpus.

To calculate the degree of difference of the same word between the film corpus and the mixed corpus, the word frequency of each word in the film corpus and in the mixed corpus must be calculated, and all the words must be sorted from high to low by word frequency. The spoken language corpus is determined from the degree of difference of the words between the film corpus and the mixed corpus, and the spoken sentences in the mixed corpus are finally extracted with the spoken language corpus; the specific process is shown in Fig. 2B. In Fig. 2B, the film subtitle corpus denotes the words in the film corpus, the mixed corpus denotes the words in the mixed corpus, and the bag of words denotes the spoken language corpus.

In this embodiment, the sentences in the film corpus and the mixed corpus are segmented by combining the reference word bank with the jieba word segmentation component, yielding the words from which the spoken language corpus is determined. Because the jieba word segmentation component is intelligent and simple to use and can process database data on the scale of billions, spoken sentences are extracted more quickly and conveniently, which improves the extraction efficiency.

Example three

Fig. 3 is a schematic structural diagram of a spoken sentence extraction apparatus according to the third embodiment of the present invention. This embodiment is applicable to various spoken sentence extraction scenarios, and the spoken sentence extraction method may be executed by this apparatus. The apparatus may be implemented in software and/or hardware and may be integrated into any device that provides a spoken sentence extraction function, such as a computer. As shown in Fig. 3, the apparatus specifically includes: a word frequency statistics module 31, a spoken language corpus confirming module 32 and a spoken sentence extraction module 33.

The word frequency counting module 31 is configured to count word frequencies of words in the film corpus and the mixed corpus respectively, and sort the words in the film corpus and the mixed corpus according to the word frequencies;

a spoken language corpus confirming module 32, configured to calculate a degree of difference between the film corpus and the mixed corpus according to the word frequency and the sequence information of the words, and confirm the spoken language corpus according to the degree of difference;

and a spoken sentence extraction module 33, configured to extract spoken sentences in the mixed corpus based on the spoken corpus.

The spoken sentence extraction apparatus of this embodiment is configured to execute the spoken sentence extraction method of each embodiment, and the technical principle and the generated technical effect are similar, which are not described herein again.
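One way to picture the three-module structure of this embodiment in software is the sketch below, in which each module is a callable supplied by the caller. The class name, the field names and the type signatures are illustrative assumptions; the patent describes the modules functionally and does not prescribe any interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Sentence = List[str]  # one segmented sentence


@dataclass
class SpokenSentenceExtractor:
    """Wires modules 31, 32 and 33 together as plain callables."""

    # Module 31: word frequency statistics for one corpus.
    count_and_rank: Callable[[List[Sentence]], Dict[str, int]]
    # Module 32: spoken language corpus confirmation from both statistics.
    confirm_corpus: Callable[[Dict[str, int], Dict[str, int]], Set[str]]
    # Module 33: spoken sentence extraction from the mixed corpus.
    extract: Callable[[List[Sentence], Set[str]], List[Sentence]]

    def run(self, film: List[Sentence], mixed: List[Sentence]) -> List[Sentence]:
        film_stats = self.count_and_rank(film)
        mixed_stats = self.count_and_rank(mixed)
        spoken_corpus = self.confirm_corpus(film_stats, mixed_stats)
        return self.extract(mixed, spoken_corpus)
```

Keeping the modules as injected callables lets each one be replaced or tested independently, which mirrors the statement that the apparatus may be implemented in software and/or hardware.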

Example four

Fig. 4 is a schematic structural diagram of a spoken sentence extraction apparatus according to a fourth embodiment of the present invention. As shown in fig. 4:

on the basis of the above embodiment, the word frequency statistics module is specifically configured to: performing word segmentation operation on the sentences in the film corpus and the mixed corpus respectively according to a reference word bank and a jieba word segmentation component to obtain words in the film corpus and the mixed corpus; respectively counting word frequencies of words in the film corpus and the mixed corpus; and respectively sequencing the words in the film corpus and the mixed corpus from high to low according to the word frequency of the words.

On the basis of the foregoing embodiment, the spoken language corpus confirming module is specifically configured to:

acquiring a plurality of alternative words with the word frequency ordered in a preset range in the film corpus and the mixed corpus;

calculating the difference degree of the alternative words in the film corpus and the mixed corpus according to the sequence number of the current word, the maximum sequence number of the word and the word frequency percentage of the current word, wherein the calculation formula of the difference degree is as follows:

D = S_m / S_{m max} - S_f / S_{f max} + (P_f - P_m)

wherein D is the degree of difference;

S_f is the sequence number of the current word in the film corpus;

P_f is the word frequency percentage of the current word in the film corpus;

P_m is the word frequency percentage of the current word in the mixed corpus;

and taking the words with the difference degree meeting a preset threshold value as the spoken language corpus.

On the basis of the above embodiment, the spoken sentence extraction module specifically includes: a spoken language rate calculation unit 41 and a spoken language sentence extraction unit 42.

A spoken language conversion rate calculating unit 41, configured to calculate a spoken language conversion rate of the current sentence in the mixed corpus according to the number of the words appearing in the spoken language corpus in the current sentence and the total number of the words in the current sentence, where the spoken language conversion rate is calculated by the following formula:

k＝n/l

wherein k is a spoken language conversion rate, n is the number of the words appearing in the spoken language corpus in the current sentence, and l is the total number of the words in the current sentence;

a spoken sentence extracting unit 42, configured to extract, as the spoken sentence, the current sentence whose spoken language conversion rate meets a preset threshold.

On the basis of the foregoing embodiment, the spoken sentence extraction unit is specifically configured to: extract, as the spoken sentence, the current sentence whose spoken language conversion rate is greater than 0.5.

On the basis of the above embodiment, the apparatus further includes: spoken language corpus expansion module 43.

A spoken language corpus expansion module 43, configured to establish a word vector training model, and input the words in the spoken language corpus into the word vector training model to obtain expanded words; and adding the expanded words meeting a preset threshold to the spoken language corpus.

The apparatus for extracting spoken language statements according to this embodiment is configured to execute the method for extracting spoken language statements according to the foregoing embodiments, and the technical principle and the generated technical effect are similar, which are not described herein again.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method for spoken sentence extraction, comprising:

extracting spoken sentences in the mixed corpus based on the spoken language corpus;

the calculating the difference degree of the words in the film corpus and the mixed corpus according to the word frequency and the sequencing information of the words and confirming the spoken language corpus according to the difference degree comprises the following steps:

D = S_m / S_{m max} - S_f / S_{f max} + (P_f - P_m)

wherein D is the degree of difference;

S_{m max} is the maximum word sequence number in the mixed corpus;

S_f is the sequence number of the current word in the film corpus;

S_{f max} is the maximum word sequence number in the film corpus;

P_f is the word frequency percentage of the current word in the film corpus;

P_m is the word frequency percentage of the current word in the mixed corpus;

2. The method according to claim 1, wherein the separately counting word frequencies of words in the film corpus and the mixed corpus and ordering the words in the film corpus and the mixed corpus according to the word frequencies comprises:

performing word segmentation operation on the sentences in the film corpus and the mixed corpus respectively according to a reference word bank and a jieba word segmentation component to obtain words in the film corpus and the mixed corpus;

respectively counting word frequencies of words in the film corpus and the mixed corpus;

and respectively sequencing the words in the film corpus and the mixed corpus from high to low according to the word frequency of the words.

3. The method of claim 1, wherein the extracting spoken sentences in the mixed corpus based on the spoken corpus comprises:

calculating the spoken language conversion rate of the current sentence in the mixed corpus according to the number of the words appearing in the spoken language corpus and the total number of the words in the current sentence, wherein the spoken language conversion rate formula is calculated as follows:

k＝n/l

and extracting, as the spoken sentence, the current sentence whose spoken language conversion rate meets a preset threshold.

4. The method of claim 3, wherein extracting, as the spoken sentence, the current sentence whose spoken language conversion rate meets a preset threshold comprises:

extracting, as the spoken sentence, the current sentence whose spoken language conversion rate is greater than 0.5.

5. The method of claim 1, wherein before extracting spoken sentences in the mixed corpus based on the spoken corpus, further comprising:

establishing a word vector training model, and inputting the words in the spoken language corpus into the word vector training model to obtain expanded words; and adding the expanded words meeting a preset threshold to the spoken language corpus.

6. An apparatus for spoken sentence extraction, comprising:

the spoken sentence extraction module is used for extracting spoken sentences in the mixed corpus based on the spoken corpus;

the spoken language corpus confirmation module is specifically configured to:

D = S_m / S_{m max} - S_f / S_{f max} + (P_f - P_m)

wherein D is the degree of difference;

S_{m max} is the maximum word sequence number in the mixed corpus;

S_f is the sequence number of the current word in the film corpus;

P_f is the word frequency percentage of the current word in the film corpus;

P_m is the word frequency percentage of the current word in the mixed corpus;

7. The apparatus of claim 6, wherein the word frequency statistics module is specifically configured to:

8. The apparatus according to claim 6, wherein the spoken sentence extraction module specifically comprises:

a spoken language conversion rate calculating unit, configured to calculate a spoken language conversion rate of the current sentence in the mixed corpus according to the number of the words appearing in the spoken language corpus in the current sentence and the total number of the words in the current sentence, where the spoken language conversion rate is calculated according to the following formula:

k＝n/l

and the spoken sentence extraction unit is used for extracting the current sentence of which the spoken language conversion rate meets a preset threshold as the spoken sentence.