CN106991181B - Method and device for extracting spoken sentences - Google Patents

Info

Publication number
CN106991181B
Authority
CN
China
Prior art keywords
corpus
words
word
film
mixed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710225009.6A
Other languages
Chinese (zh)
Other versions
CN106991181A (en
Inventor
李贤�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd filed Critical Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority to CN201710225009.6A priority Critical patent/CN106991181B/en
Publication of CN106991181A publication Critical patent/CN106991181A/en
Application granted granted Critical
Publication of CN106991181B publication Critical patent/CN106991181B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The embodiment of the invention discloses a method and a device for extracting spoken sentences. The method comprises: separately counting the word frequencies of words in a film corpus and a mixed corpus, and sorting the words in each corpus by word frequency; calculating the difference degree of the words between the film corpus and the mixed corpus according to their word frequency and ranking information, and determining a spoken language corpus according to the difference degree; and extracting the spoken sentences in the mixed corpus based on the spoken language corpus. By determining the spoken language corpus from the word frequency and ranking information of the words in the two corpora and using it to extract the spoken sentences in the mixed corpus, the embodiment solves the prior-art problem that a user-defined spoken language corpus is time-consuming and labor-intensive to build, effectively improves the efficiency of spoken sentence extraction, and completes the overall corpus system.

Description

Method and device for extracting spoken sentences
Technical Field
The embodiment of the invention relates to the technical field of information, in particular to a method and a device for extracting spoken sentences.
Background
With the advancement of technology, the large storage capacity of computers has been applied to language storage, giving rise to corpora.
A spoken language corpus is likewise a basic resource that carries language knowledge with the electronic computer as its carrier; a complete spoken language corpus is used for language model construction, dictionary compilation, text classification, and the like.
A user-defined spoken language corpus is time-consuming and labor-intensive to build, reflects personal factors, and lacks authority, and the absence of a spoken language corpus from a system hinders completion of the overall corpus system.
Disclosure of Invention
The embodiment of the invention provides a method and a device for extracting spoken sentences, which can avoid the time-consuming and labor-consuming mode of customizing a spoken language corpus by a user, and improve the efficiency and the reliability of the extraction of the spoken sentences.
In a first aspect, an embodiment of the present invention provides a method for extracting spoken statements, including:
respectively counting word frequencies of words in a film corpus and a mixed corpus, and sequencing the words in the film corpus and the mixed corpus according to the word frequencies;
calculating the difference degree of the words in the film corpus and the mixed corpus according to the word frequency and the sequencing information of the words, and confirming a spoken language corpus according to the difference degree;
and extracting the spoken sentences in the mixed corpus based on the spoken corpus.
In a second aspect, an embodiment of the present invention further provides a device for extracting spoken statements, where the device includes:
the word frequency counting module is used for respectively counting the word frequencies of the words in the film corpus and the mixed corpus and sequencing the words in the film corpus and the mixed corpus according to the word frequencies;
the spoken language corpus confirming module is used for calculating the difference degree of the words in the film corpus and the mixed corpus according to the word frequency and the sequencing information of the words and confirming the spoken language corpus according to the difference degree;
and the spoken sentence extraction module is used for extracting the spoken sentences in the mixed corpus based on the spoken corpus.
The embodiment of the invention provides a method and a device for extracting spoken sentences.
Drawings
FIG. 1A is a flowchart of spoken sentence extraction according to a first embodiment of the present invention;
FIG. 1B is a diagram illustrating a spoken sentence extraction process according to a first embodiment of the present invention;
FIG. 2A is a flowchart of spoken sentence extraction according to a second embodiment of the present invention;
FIG. 2B is a flowchart of spoken sentence extraction according to a second embodiment of the present invention;
FIG. 3 is a block diagram of an apparatus for extracting spoken sentences according to a third embodiment of the present invention;
FIG. 4 is a block diagram of a spoken sentence extraction apparatus according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1A is a flowchart of a spoken sentence extraction method according to a first embodiment of the present invention. This embodiment is applicable to various spoken sentence extraction scenarios. The method may be executed by the spoken sentence extraction apparatus according to an embodiment of the present invention; the apparatus may be implemented in software and/or hardware and may be integrated into any device that provides a spoken sentence extraction function, for example a computer. As shown in fig. 1A, the method specifically includes:
S110, separately counting the word frequencies of words in the film corpus and the mixed corpus, and sorting the words in the film corpus and the mixed corpus by word frequency.
Specifically, both the film corpus and the mixed corpus are obtained from the Internet. The film corpus is sourced from the dialogue in films; specifically, it may be a subtitle file that records conversations between people, so most of it can be regarded as spoken language material. Besides everyday dialogue, a subtitle file also contains timestamps and speaker names, so the film corpus must first be processed to retain only the dialogue content. The mixed corpus is a corpus that mixes written and spoken language. Word frequency is the number of times a given word appears in a file; the word frequencies of the words in the film corpus and in the mixed corpus are counted separately.
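The cleaning step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the patent does not name a subtitle format, so the sketch assumes SubRip (.srt) style cue counters and timing lines, and a hypothetical "Name: line" convention for speaker names. The function name `extract_dialogue` is likewise made up for this example.

```python
import re

# Assumption: SubRip-style timing lines such as "00:00:01,000 --> 00:00:02,500".
TIMESTAMP = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)
# Assumption: a speaker name appears as a short "Name:" prefix (ASCII or full-width colon).
SPEAKER = re.compile(r"^[^::]{1,20}[::]\s*")


def extract_dialogue(subtitle_text):
    """Return only the dialogue lines of an SRT-style subtitle text."""
    dialogue = []
    for raw in subtitle_text.splitlines():
        line = raw.strip()
        # Skip blank lines, numeric cue counters, and timing lines.
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        # Drop a leading speaker name if one is present.
        line = SPEAKER.sub("", line)
        if line:
            dialogue.append(line)
    return dialogue
```

A real subtitle collection would likely need extra rules, for example for formatting tags or for dialogue lines that legitimately contain a colon near the start.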
First, the downloaded film corpus and mixed corpus are stored in their own documents, which may be Word documents or txt documents. Then a word segmentation tool and a word bank are used to segment the sentences in the film corpus document and in the mixed corpus document, that is, every sentence is split into the words it contains, and the results are stored as txt documents. Table 1 shows part of the film corpus after word segmentation, and Table 2 shows part of the mixed corpus after word segmentation.
Table 1
Six months ago
Feeling of consuming a whole day in court
It is a man catching extraterrestrial objects
And one of the benefits of genetically variant humans
There is little chance and little chance of successful prosecution
Vehicle door is opened
What is
Vehicle door without lock
Strange thing
I determine that I lock
Is definitely a flexible event
Table 2
[Table 2 appears as an image in the original publication: part of the mixed corpus after word segmentation.]
Finally, the word frequencies of the words segmented from the film corpus and the mixed corpus are counted separately, the words are sorted from high to low by word frequency, and the result is stored as an Excel document. Table 3 shows part of the word frequency ranking information used in spoken sentence extraction. As Table 3 shows, the higher a word's frequency, the more often the word appears in the document. For example, the word "of" occurs more often than any other word in the document, so "of" ranks first by word frequency.
Table 3
[Table 3 appears as an image in the original publication: partial word frequency ranking information.]
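The counting and ranking in step S110 can be sketched with the Python standard library. This is a minimal sketch, assuming the corpus has already been segmented into a flat list of words; the function name `rank_by_frequency` and the tuple layout are illustrative choices, not part of the patent.

```python
from collections import Counter


def rank_by_frequency(words):
    """Map each word to (sequence number, count, word frequency percentage).

    Sequence number 1 is the most frequent word. Words with equal counts keep
    the order in which they were first seen, which is an arbitrary tie-break.
    """
    counts = Counter(words)
    total = sum(counts.values())
    ranked = {}
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        ranked[word] = (rank, count, 100.0 * count / total)
    return ranked
```

The percentage is expressed on a 0 to 100 scale here because the worked example in the description uses values such as 4.561598 for a single word.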
And S120, calculating the difference degree of the words in the film corpus and the mixed corpus according to the word frequency and the sequencing information of the words, and confirming the spoken language corpus according to the difference degree.
Specifically, a plurality of candidate words whose word frequency rank falls within a preset range in the film corpus and the mixed corpus are obtained. The preset range may be a dynamic value set by the user, such as the top 20%, 30%, or 40% of the ranking; the candidate words that fall within the preset range are selected, and their difference degree between the film corpus and the mixed corpus is calculated. More specifically, the words that rank within the preset range in the film corpus and in the mixed corpus may each be extracted and collected together as the candidate words; alternatively, the intersection of the words that rank within the preset range in the film corpus and those that rank within the preset range in the mixed corpus may be taken as the candidate words.
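The two ways of choosing candidate words can be sketched as follows. The sketch assumes each corpus is available as a list of words already sorted from highest to lowest frequency, and it reads the first option as a set union, which is an interpretation of the translated text rather than something the patent states explicitly. The names `top_words`, `candidate_words`, and the `mode` parameter are hypothetical.

```python
import math


def top_words(ranked_words, fraction):
    """Return the leading `fraction` of a list sorted by descending word frequency."""
    keep = math.ceil(len(ranked_words) * fraction)
    return set(ranked_words[:keep])


def candidate_words(film_ranked, mixed_ranked, fraction=0.2, mode="intersection"):
    """Select candidate words from the top-ranked words of both corpora."""
    film_top = top_words(film_ranked, fraction)
    mixed_top = top_words(mixed_ranked, fraction)
    if mode == "union":
        # Assumed reading of "collected together": words in either top range.
        return film_top | mixed_top
    # Words that rank within the preset range in both corpora.
    return film_top & mixed_top
```

With the union reading, a candidate may be missing from one corpus altogether, so a caller would need to decide how to rank such a word before computing its difference degree; the intersection avoids that case.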
Wherein, the calculation formula of the difference degree is as follows:
$$D = \frac{S_m}{S_{m,\max}} - \frac{S_f}{S_{f,\max}} + (P_f - P_m)$$
wherein $D$ is the degree of difference;
$S_m$ is the sequence number of the current word in the mixed corpus;
$S_{m,\max}$ is the maximum word sequence number in the mixed corpus;
$S_f$ is the sequence number of the current word in the film corpus;
$S_{f,\max}$ is the maximum word sequence number in the film corpus;
$P_f$ is the word frequency percentage of the current word in the film corpus;
$P_m$ is the word frequency percentage of the current word in the mixed corpus;
The sequence number of the current word in the mixed corpus is the position of the candidate word in the mixed corpus after sorting by word frequency; the maximum word sequence number in the mixed corpus is the total number of sequence numbers in the mixed corpus after sorting. Similarly, the sequence number of the current word in the film corpus is the position of the candidate word in the film corpus after sorting by word frequency, and the maximum word sequence number in the film corpus is the total number of sequence numbers in the film corpus after sorting. Because some words differ greatly in word frequency but only slightly in sequence number, the difference between the candidate word's word frequency percentage in the film corpus and its word frequency percentage in the mixed corpus is added to the formula to improve its accuracy. The greater the calculated difference degree of a candidate word, the greater the probability that it is spoken language material. The word frequency percentage of the current word in the film corpus is the ratio of the number of occurrences of the candidate word in the film corpus to the total number of words in the film corpus; the word frequency percentage of the current word in the mixed corpus is the ratio of the number of occurrences of the candidate word in the mixed corpus to the total number of words in the mixed corpus.
And finally, taking the words with the difference degree meeting a preset threshold value as the spoken language corpus. The preset threshold may be a dynamic value set by a user, such as 20%, 30%, 40%, and the like. If the preset threshold value is set to be 20%, 20% of words are extracted as a spoken language corpus according to the sequence from high to low of the difference degree calculated by the formula.
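The difference degree and the threshold selection can be sketched in Python as follows. This is a minimal sketch, assuming the percentages are given on a 0 to 100 scale and that the preset threshold is the fraction of top-scoring words to keep; the function and parameter names are illustrative only.

```python
import math


def difference_degree(s_m, s_m_max, s_f, s_f_max, p_f, p_m):
    """D = S_m / S_m_max - S_f / S_f_max + (P_f - P_m)."""
    return s_m / s_m_max - s_f / s_f_max + (p_f - p_m)


def select_spoken_words(degrees, fraction=0.2):
    """Keep the top `fraction` of words, ordered by descending difference degree.

    `degrees` maps each candidate word to its difference degree.
    """
    ordered = sorted(degrees, key=degrees.get, reverse=True)
    keep = math.ceil(len(ordered) * fraction)
    return ordered[:keep]
```

A word that ranks higher and occurs proportionally more often in the film corpus than in the mixed corpus receives a larger D, so sorting in descending order puts the most speech-like candidates first.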
For example, as shown in Table 3, assume that the maximum word sequence number in both the film corpus and the mixed corpus is 100. The information for the word "I" in the film corpus and in the mixed corpus is extracted, as shown in Table 4.
Table 4
                                            Film corpus    Mixed corpus
Current word sequence number                2              4
Word frequency percentage of current word   4.561598       1.028217
Substituting the data in Table 4 into the formula, the difference degree of the word "I" is calculated as follows:
D = (4/100 - 2/100) + (4.561598 - 1.028217) = 3.553381
Thus the difference degree of the word "I" is 3.553381. The difference degree of every candidate word is calculated in the same way, and the words whose difference degree meets the preset threshold are then taken as the spoken language corpus. Table 5 shows part of the spoken language corpus used in spoken sentence extraction.
Table 5
Spoken language corpus
He
High price
No problem
Later on, the
Sample
Question of this question
Ai-hen
That good
Welcome
Whether or not
In addition, a word vector training model is established, and the words whose difference degree meets the preset threshold are input into it to obtain extended words that expand the spoken language corpus, as shown in fig. 1B. The word vector training model is implemented with the word2vec software, and the training parameters are set as follows: word2vec -train result_crop.txt -output vectors.bin -cbow 0 -size 50 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 4 -binary 1 -min-count 3. The parameters have the following meanings:
train is the training file; cbow selects the model, where 0 uses the skip-gram model and 1 uses the continuous bag-of-words model; size is the dimension of the word vectors; window is the length of the context window; negative indicates whether negative sampling is used, where 0 means not used and a nonzero value means used; hs indicates whether hierarchical softmax is used, where 0 means not used and 1 means used; sample is the subsampling threshold (1e-3 here), and the more frequently a word occurs in the training data, the more heavily it is down-sampled; threads is the number of threads; binary indicates whether the output is a binary file, where 0 means no and 1 means yes; min-count sets the minimum frequency, which defaults to 5, and a word is discarded if it occurs in the document fewer times than this threshold.
Then, for each word whose difference degree meets the preset threshold, extended words are generated from the trained model with the ./distance vectors.bin command; the first 10 extended words of each word are taken and, together with the words whose difference degree meets the preset threshold, form the spoken language corpus.
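The expansion step can be illustrated without the word2vec binaries. The sketch below assumes the trained vectors have already been loaded into a plain dictionary that maps each word to a list of floats, and it ranks neighbours by cosine similarity, which is the measure the `distance` tool in the word2vec distribution reports. The function names and the dictionary layout are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_corpus(seed_words, vectors, top_n=10):
    """Return the seed words plus the top_n nearest neighbours of each seed."""
    expanded = set(seed_words)
    for seed in seed_words:
        if seed not in vectors:
            continue  # no vector was trained for this word (e.g. below min-count)
        neighbours = sorted(
            (w for w in vectors if w != seed),
            key=lambda w: cosine(vectors[seed], vectors[w]),
            reverse=True,
        )
        expanded.update(neighbours[:top_n])
    return expanded
```

This brute-force search compares each seed with every vocabulary word, which is adequate for a sketch but would be slow on a large vocabulary.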
And S130, extracting the spoken sentences in the mixed corpus based on the spoken language corpus.
Specifically, according to the number of the words appearing in the spoken language corpus and the total number of the words in the current sentence, the spoken language conversion rate of the current sentence in the mixed corpus is calculated by adopting the following formula:
k=n/l
wherein k is the spoken language conversion rate, n is the number of the words appearing in the spoken language corpus in the current sentence, and l is the total number of the words in the current sentence.
A current sentence whose spoken language conversion rate meets a preset threshold is extracted as a spoken sentence. The preset threshold may be a dynamic value set by the user or a fixed system default, such as 0.5. If the system default is used, a current sentence whose spoken language conversion rate is greater than 0.5 is extracted as a spoken sentence. The spoken language conversion rate of each sentence in the mixed corpus is calculated based on the spoken language corpus, and the sentences whose rate meets the preset threshold, namely the spoken sentences, are extracted, as shown in Table 6.
Table 6
Spoken language sentence
Which belong to the restored food wool
Focusing Zhejiang great news events at first time
Exhaust honesty for you
Update o every day
Has no account number
Help women who lose marriage to find back love
Suddenly feel sad to oneself
Notice of Sichuan earthquake
Find new friends of old classmates
Why others earn more than you
I am corrugated in seconds to respond
Illustratively, the spoken sentences in the mixed corpus are extracted based on the spoken language corpus, and a current sentence whose spoken language conversion rate is greater than 0.5 is extracted as a spoken sentence. To judge whether the sentence "Which belong to the restored food wool" in the mixed corpus is a spoken sentence, the sentence is first segmented into words. Since "which", "belong to", "food" and "wool" are words in the spoken language corpus, n is 4; since the current sentence contains 5 words in total, l is 5. The spoken language conversion rate formula then gives:
k=4/5=0.8
Since the calculated spoken language conversion rate of 0.8 is greater than the preset threshold of 0.5, the sentence "Which belong to the restored food wool" in the mixed corpus is extracted as a spoken sentence.
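Step S130 can be sketched in a few lines of Python. The sketch assumes each sentence has already been segmented into a list of words and that the spoken language corpus is a set of words; the function names are illustrative and not taken from the patent.

```python
def spoken_rate(sentence_words, spoken_corpus):
    """k = n / l: share of the sentence's words that appear in the spoken corpus."""
    if not sentence_words:
        return 0.0
    hits = sum(1 for word in sentence_words if word in spoken_corpus)
    return hits / len(sentence_words)


def extract_spoken(sentences, spoken_corpus, threshold=0.5):
    """Keep the sentences whose spoken language conversion rate exceeds the threshold."""
    return [s for s in sentences if spoken_rate(s, spoken_corpus) > threshold]
```

Repeated words are counted once per occurrence in both n and l, which matches counting "the number of words" in the sentence rather than the number of distinct words; the patent text does not say which reading is intended.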
This embodiment determines the spoken language corpus by separately counting the word frequency and ranking information of the words in the film corpus and the mixed corpus, and then uses the spoken language corpus to extract the spoken sentences in the mixed corpus. This solves the prior-art problem that a user-defined spoken language corpus is time-consuming and labor-intensive to build, effectively improves the efficiency of spoken sentence extraction, allows a comparatively comprehensive spoken language corpus to be extracted, and completes the overall corpus system.
Example two
Fig. 2A is a flowchart of a spoken sentence extraction method according to a second embodiment of the present invention. This embodiment is optimized on the basis of the first embodiment and provides a specific way of counting the word frequencies of the words in the film corpus and the mixed corpus and sorting the words by word frequency, namely: performing word segmentation on the sentences in the film corpus and the mixed corpus according to a reference word bank and the jieba word segmentation component to obtain the words in the film corpus and the mixed corpus; separately counting the word frequencies of the words in the film corpus and the mixed corpus; and sorting the words in the film corpus and the mixed corpus from high to low by word frequency.
Correspondingly, the method of the embodiment includes:
and S210, performing word segmentation operation on the sentences in the film corpus and the mixed corpus respectively according to the reference word bank and the jieba word segmentation component to obtain words in the film corpus and the mixed corpus.
The word bank is a user-defined word bank, generally a dictionary; the jieba word segmentation component is a word segmentation tool. Specifically, a user can write a program on the PyCharm platform to segment the sentences in the film corpus and the mixed corpus.
The word bank is loaded with the code jieba.load_userdict(file_name), where file_name is the path of the user-defined dictionary. By entering the code:
file_object = open(read_path)
try:
    all_the_text = file_object.read()
finally:
    file_object.close()
the file at read_path is read into the all_the_text object, and accurate mode word segmentation is then performed with the following function:
cut_txt = jieba.cut(all_the_text, cut_all=False)
Here all_the_text is the whole text to be segmented, cut_txt is the segmented result, and cut_all=False selects the accurate word segmentation mode. The accurate mode segments the whole text precisely according to the dictionary and a dedicated algorithm, without performing full mode segmentation; full mode lists all possible segmentations, for example:
Full mode segmentation: I / come to / Beijing / Tsinghua / Tsinghua University / Huada / University
Accurate mode segmentation: I / come to / Beijing / Tsinghua University
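Finally, the segmented sentences of the film corpus and the mixed corpus are stored in files at the corresponding paths with the following code, that is, the segmented text cut_txt is saved to save_path.
file_object = open(save_path, 'w')
# jieba.cut returns a generator of words, so join them into one string before writing;
# the space separator is an assumption, the source does not state one.
file_object.write(' '.join(cut_txt))
file_object.close()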
And S220, respectively counting the word frequency of the words in the film corpus and the mixed corpus.
And S230, respectively sequencing the words in the film corpus and the mixed corpus from high to low according to the word frequency of the words.
S240, calculating the difference degree of the words in the film corpus and the mixed corpus according to the word frequency and the sequencing information of the words, and confirming the spoken language corpus according to the difference degree.
And S250, extracting the spoken sentences in the mixed corpus based on the spoken language corpus.
To calculate the difference degree of the same word between the film corpus and the mixed corpus, the word frequency of each word in the film corpus and in the mixed corpus must be calculated, and all the words sorted from high to low by word frequency. The spoken language corpus is determined from the difference degree of the words between the film corpus and the mixed corpus, and the spoken sentences in the mixed corpus are finally extracted with the spoken language corpus; the specific process is shown in fig. 2B. In fig. 2B, the film subtitle corpus denotes the words in the film corpus, the mixed corpus data denotes the words in the mixed corpus, and the bag of words denotes the spoken language corpus.
This embodiment segments the sentences in the film corpus and the mixed corpus with a reference word bank combined with the jieba word segmentation component, obtains the words in the film corpus and the mixed corpus, and determines the spoken language corpus. Because the jieba word segmentation component is intelligent and simple to use and can handle database data on the order of billions, spoken sentences are extracted more quickly and conveniently, which improves extraction efficiency.
EXAMPLE III
Fig. 3 is a schematic structural diagram of a spoken sentence extraction apparatus according to a third embodiment of the present invention, where this embodiment is applicable to various situations of spoken sentence extraction, and the method may be executed by the spoken sentence extraction apparatus according to the third embodiment of the present invention, and the apparatus may be implemented in a software and/or hardware manner, and the apparatus may be integrated into any device providing a spoken sentence extraction function, such as a computer, and specifically includes, as shown in fig. 3: a word frequency statistic module 31, a spoken language corpus confirming module 32 and a spoken sentence extracting module 33.
The word frequency counting module 31 is configured to count word frequencies of words in the film corpus and the mixed corpus respectively, and sort the words in the film corpus and the mixed corpus according to the word frequencies;
a spoken language corpus confirming module 32, configured to calculate a degree of difference between the film corpus and the mixed corpus according to the word frequency and the sequence information of the words, and confirm the spoken language corpus according to the degree of difference;
and a spoken sentence extraction module 33, configured to extract spoken sentences in the mixed corpus based on the spoken corpus.
The spoken sentence extraction apparatus of this embodiment is configured to execute the spoken sentence extraction method of each embodiment, and the technical principle and the generated technical effect are similar, which are not described herein again.
Example four
Fig. 4 is a schematic structural diagram of a spoken sentence extraction apparatus according to a fourth embodiment of the present invention. As shown in fig. 4:
on the basis of the above embodiment, the word frequency statistics module is specifically configured to: performing word segmentation operation on the sentences in the film corpus and the mixed corpus respectively according to a reference word bank and a jieba word segmentation component to obtain words in the film corpus and the mixed corpus; respectively counting word frequencies of words in the film corpus and the mixed corpus; and respectively sequencing the words in the film corpus and the mixed corpus from high to low according to the word frequency of the words.
On the basis of the foregoing embodiment, the spoken language corpus confirming module is specifically configured to:
acquiring a plurality of alternative words with the word frequency ordered in a preset range in the film corpus and the mixed corpus;
calculating the difference degree of the alternative words in the film corpus and the mixed corpus according to the sequence number of the current word, the maximum sequence number of the word and the word frequency percentage of the current word, wherein the calculation formula of the difference degree is as follows:
$$D = \frac{S_m}{S_{m,\max}} - \frac{S_f}{S_{f,\max}} + (P_f - P_m)$$
wherein $D$ is the degree of difference;
$S_m$ is the sequence number of the current word in the mixed corpus;
$S_{m,\max}$ is the maximum word sequence number in the mixed corpus;
$S_f$ is the sequence number of the current word in the film corpus;
$S_{f,\max}$ is the maximum word sequence number in the film corpus;
$P_f$ is the word frequency percentage of the current word in the film corpus;
$P_m$ is the word frequency percentage of the current word in the mixed corpus;
and taking the words with the difference degree meeting a preset threshold value as the spoken language corpus.
On the basis of the above embodiment, the spoken sentence extraction module specifically includes: a spoken language conversion rate calculating unit 41 and a spoken sentence extracting unit 42.
A spoken language conversion rate calculating unit 41, configured to calculate a spoken language conversion rate of the current sentence in the mixed corpus according to the number of the words appearing in the spoken language corpus in the current sentence and the total number of the words in the current sentence, where the spoken language conversion rate is calculated by the following formula:
k=n/l
wherein k is a spoken language conversion rate, n is the number of the words appearing in the spoken language corpus in the current sentence, and l is the total number of the words in the current sentence;
a spoken sentence extracting unit 42, configured to extract the current sentence whose spoken language conversion rate meets a preset threshold as the spoken sentence.
On the basis of the foregoing embodiment, the spoken sentence extracting unit is specifically configured to: extract the current sentence whose spoken language conversion rate is greater than 0.5 as the spoken sentence.
On the basis of the above embodiment, the apparatus further includes: spoken language corpus expansion module 43.
A spoken language corpus expansion module 43, configured to establish a word vector training model, input the words in the spoken language corpus into the word vector training model to obtain expanded words, and add the expanded words that meet a preset threshold to the spoken language corpus.
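The expansion step can be sketched as follows, assuming the word vectors have already been produced by some trained word vector model (training is not shown). The use of cosine similarity as the measure compared against the preset threshold is an assumption, as are the names `cosine` and `expand_lexicon` and the `vectors` mapping.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_lexicon(lexicon, vectors, threshold):
    """Add words whose similarity to any seed word reaches the threshold.

    lexicon:   set of seed spoken-language words.
    vectors:   hypothetical mapping word -> embedding vector, assumed to
               come from a trained word vector model.
    threshold: preset similarity threshold (assumed to be cosine similarity).
    """
    expanded = set(lexicon)
    for seed in lexicon:
        if seed not in vectors:
            continue  # seed word has no vector, so it cannot be expanded
        for word, vec in vectors.items():
            if word not in expanded and cosine(vectors[seed], vec) >= threshold:
                expanded.add(word)
    return expanded
```

Only the original seed words are used for lookup, so newly added words do not trigger further expansion in the same pass.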
The apparatus for extracting spoken sentences according to this embodiment is configured to execute the method for extracting spoken sentences according to the foregoing embodiments; its technical principle and technical effects are similar and are not described again here.
It is to be noted that the foregoing describes only the preferred embodiments of the present invention and the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made without departing from the scope of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to those embodiments and may include other equivalent embodiments without departing from its spirit; the scope of the present invention is determined by the appended claims.

Claims (8)

1. A method for spoken sentence extraction, comprising:
respectively counting word frequencies of words in a film corpus and a mixed corpus, and sequencing the words in the film corpus and the mixed corpus according to the word frequencies;
calculating the difference degree of the words in the film corpus and the mixed corpus according to the word frequency and the sequencing information of the words, and confirming a spoken language corpus according to the difference degree;
extracting spoken sentences in the mixed corpus based on the spoken language corpus;
the calculating the difference degree of the words in the film corpus and the mixed corpus according to the word frequency and the sequencing information of the words and confirming the spoken language corpus according to the difference degree comprises the following steps:
acquiring a plurality of alternative words with the word frequency ordered in a preset range in the film corpus and the mixed corpus;
calculating the difference degree of the alternative words in the film corpus and the mixed corpus according to the sequence number of the current word, the maximum sequence number of the word and the word frequency percentage of the current word, wherein the calculation formula of the difference degree is as follows:
$D = \frac{S_m}{S_{m,\max}} - \frac{S_f}{S_{f,\max}} + (P_f - P_m)$
wherein D is the degree of difference;
$S_m$ is the sequence number of the current word in the mixed corpus;
$S_{m,\max}$ is the maximum word sequence number in the mixed corpus;
$S_f$ is the sequence number of the current word in the film corpus;
$S_{f,\max}$ is the maximum word sequence number in the film corpus;
$P_f$ is the word frequency percentage of the current word in the film corpus;
$P_m$ is the word frequency percentage of the current word in the mixed corpus;
and taking the words with the difference degree meeting a preset threshold value as the spoken language corpus.
2. The method according to claim 1, wherein the separately counting word frequencies of words in the film corpus and the mixed corpus and ordering the words in the film corpus and the mixed corpus according to the word frequencies comprises:
performing word segmentation operation on the sentences in the film corpus and the mixed corpus respectively according to a reference word bank and a jieba word segmentation component to obtain words in the film corpus and the mixed corpus;
respectively counting word frequencies of words in the film corpus and the mixed corpus;
and respectively sequencing the words in the film corpus and the mixed corpus from high to low according to the word frequency of the words.
3. The method of claim 1, wherein the extracting spoken sentences in the mixed corpus based on the spoken language corpus comprises:
calculating the spoken language conversion rate of the current sentence in the mixed corpus according to the number of words in the current sentence that appear in the spoken language corpus and the total number of words in the current sentence, wherein the spoken language conversion rate is calculated by the following formula:
k=n/l
wherein k is a spoken language conversion rate, n is the number of the words appearing in the spoken language corpus in the current sentence, and l is the total number of the words in the current sentence;
and extracting, as the spoken sentence, the current sentence whose spoken language conversion rate meets a preset threshold.
4. The method of claim 3, wherein the extracting, as the spoken sentence, the current sentence whose spoken language conversion rate meets a preset threshold comprises:
extracting, as the spoken sentence, the current sentence whose spoken language conversion rate is greater than 0.5.
5. The method of claim 1, further comprising, before the extracting spoken sentences in the mixed corpus based on the spoken language corpus:
establishing a word vector training model, and inputting the words in the spoken language corpus into the word vector training model to obtain expanded words; and adding the expanded words meeting a preset threshold to the spoken language corpus.
6. An apparatus for spoken sentence extraction, comprising:
the word frequency counting module is used for respectively counting the word frequencies of the words in the film corpus and the mixed corpus and sequencing the words in the film corpus and the mixed corpus according to the word frequencies;
the spoken language corpus confirming module is used for calculating the difference degree of the words in the film corpus and the mixed corpus according to the word frequency and the sequencing information of the words and confirming the spoken language corpus according to the difference degree;
the spoken sentence extraction module is used for extracting spoken sentences in the mixed corpus based on the spoken language corpus;
the spoken language corpus confirmation module is specifically configured to:
acquiring a plurality of alternative words with the word frequency ordered in a preset range in the film corpus and the mixed corpus;
calculating the difference degree of the alternative words in the film corpus and the mixed corpus according to the sequence number of the current word, the maximum sequence number of the word and the word frequency percentage of the current word, wherein the calculation formula of the difference degree is as follows:
$D = \frac{S_m}{S_{m,\max}} - \frac{S_f}{S_{f,\max}} + (P_f - P_m)$
wherein D is the degree of difference;
$S_m$ is the sequence number of the current word in the mixed corpus;
$S_{m,\max}$ is the maximum word sequence number in the mixed corpus;
$S_f$ is the sequence number of the current word in the film corpus;
$S_{f,\max}$ is the maximum word sequence number in the film corpus;
$P_f$ is the word frequency percentage of the current word in the film corpus;
$P_m$ is the word frequency percentage of the current word in the mixed corpus;
and taking the words with the difference degree meeting a preset threshold value as the spoken language corpus.
7. The apparatus of claim 6, wherein the word frequency statistics module is specifically configured to:
performing word segmentation operation on the sentences in the film corpus and the mixed corpus respectively according to a reference word bank and a jieba word segmentation component to obtain words in the film corpus and the mixed corpus;
respectively counting word frequencies of words in the film corpus and the mixed corpus;
and respectively sequencing the words in the film corpus and the mixed corpus from high to low according to the word frequency of the words.
8. The apparatus according to claim 6, wherein the spoken sentence extraction module specifically comprises:
a spoken language conversion rate calculating unit, configured to calculate a spoken language conversion rate of the current sentence in the mixed corpus according to the number of the words appearing in the spoken language corpus in the current sentence and the total number of the words in the current sentence, where the spoken language conversion rate is calculated according to the following formula:
k=n/l
wherein k is a spoken language conversion rate, n is the number of the words appearing in the spoken language corpus in the current sentence, and l is the total number of the words in the current sentence;
and the spoken sentence extraction unit is used for extracting the current sentence of which the spoken language conversion rate meets a preset threshold as the spoken sentence.
CN201710225009.6A 2017-04-07 2017-04-07 Method and device for extracting spoken sentences Active CN106991181B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710225009.6A CN106991181B (en) 2017-04-07 2017-04-07 Method and device for extracting spoken sentences

Publications (2)

Publication Number Publication Date
CN106991181A (en) 2017-07-28
CN106991181B (en) 2020-04-21

Family

ID=59415480

Country Status (1)

Country Link
CN (1) CN106991181B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101114298A (en) * 2007-08-31 2008-01-30 北京搜狗科技发展有限公司 Method for gaining oral vocabulary entry, device and input method system thereof
CN101286161A (en) * 2008-05-28 2008-10-15 华中科技大学 Intelligent Chinese request-answering system based on concept
CN101464856A (en) * 2007-12-20 2009-06-24 株式会社东芝 Alignment method and apparatus for parallel spoken language materials
CN103034627A (en) * 2011-10-09 2013-04-10 北京百度网讯科技有限公司 Method and device for calculating sentence similarity and method and device for machine translation
CN105247517A (en) * 2013-04-23 2016-01-13 谷歌公司 Ranking signals in mixed corpora environments
CN105741831A (en) * 2016-01-27 2016-07-06 广东外语外贸大学 Spoken language evaluation method based on grammatical analysis and spoken language evaluation system
CN106164889A (en) * 2013-12-02 2016-11-23 丘贝斯有限责任公司 System and method for internal storage data library searching
CN106528726A (en) * 2016-11-02 2017-03-22 四川用联信息技术有限公司 Keyword optimization-based search engine optimization realization technology

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9508011B2 (en) * 2010-05-10 2016-11-29 Videosurf, Inc. Video visual and audio query
US10579687B2 (en) * 2015-09-01 2020-03-03 Google Llc Providing native application search results with web search results

Also Published As

Publication number Publication date
CN106991181A (en) 2017-07-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant