CN109858427A - Corpus extraction method, device and terminal device - Google Patents

Corpus extraction method, device and terminal device

Info

Publication number
CN109858427A
CN109858427A (application CN201910077238.7A)
Authority
CN
China
Prior art keywords
data
text
image
corpus
subtitling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910077238.7A
Other languages
Chinese (zh)
Inventor
周发升
何伟宝
詹逸
陈渤
杨敬慈
皮樾
李锦韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN201910077238.7A priority Critical patent/CN109858427A/en
Publication of CN109858427A publication Critical patent/CN109858427A/en
Pending legal-status Critical Current

Landscapes

  • Television Signal Processing For Recording (AREA)

Abstract

This application discloses a corpus extraction method, device and terminal device. The method includes: acquiring audio-video data, and after obtaining the caption-region speech images of the audio-video data that does not contain caption text data, intercepting the caption-region speech images at a preset frame interval to obtain multiple speech-image data items; converting the caption images in the multiple speech-image data items into multiple texts, computing the pairwise cosine similarity of the texts, and merging the texts whose cosine similarity reaches a threshold; and segmenting the first speech data corresponding to the caption images according to the merged texts to obtain the corpus of each first text unit. Compared with the prior art, this application extracts corpora by converting the caption images of audio-video material that has no subtitle file into text files and matching them with the speech data, which removes the need to collect corpora in multiple recording environments and thereby reduces the cost of corpus extraction.

Description

Corpus extraction method, device and terminal device
Technical field
This application relates to the technical field of audio-video speech information retrieval, and in particular to a corpus extraction method, device and terminal device.
Background technique
In an automatic speech recognition system, the performance and robustness of the system depend to a great extent on whether sufficiently rich corpus data is available when the recognition model is built; the corpus data resource bank is thus a key foundation of intelligent speech technology. The scale and quality of the corpora in the resource bank largely determine the breadth and depth of intelligent speech applications, and also strongly influence the user experience.
In the prior art, corpora are extracted by recording speech in order to build the corpus data resource bank. However, because the purpose of building and collecting a corpus is to provide training and test sets for a speech recognition system, the selection of speakers must cover different regions, ages, genders and education levels across the country, and recordings must be made in multiple recording environments to ensure a good match with subsequent speech recognition. As a result, the cost of corpus extraction is very high.
Summary of the invention
The technical problem to be solved by the embodiments of this application is how to reduce the cost of corpus extraction.
To solve the above problem, an embodiment of this application provides a corpus extraction method, suitable for execution on a computing device and comprising at least the following steps:
acquiring the audio-video data of audio-visual material;
taking the audio-video data that does not contain caption text data as first processing data, obtaining the caption-region speech images of the first processing data through edge detection and gray-level difference statistics, and intercepting the caption-region speech images at a preset frame interval to obtain N speech-image data items, where each speech-image data item comprises one caption image and the first speech data corresponding to that caption image, and N is a positive integer;
converting the N caption images into M texts through OCR, computing the pairwise cosine similarity of the M texts, and judging any two texts whose cosine similarity reaches a preset threshold as belonging to the same caption image, where M ≥ N and M is a positive integer;
merging the texts judged to belong to the same caption image to obtain N merged texts in one-to-one correspondence with the N caption images, and segmenting the first speech data corresponding to each caption image in the speech-image data according to the N merged texts to obtain the first text-speech data of each first text unit in the N merged texts, i.e. the corpus of each first text unit.
Further, the method also includes:
taking the audio-video data that contains the caption text data as second processing data, parsing the caption text data with regular expressions, segmenting the second speech data of the second processing data along the time axis to obtain multiple second text-speech data items, and then annotating each second text unit of the caption text data one by one according to each second text-speech data item, obtaining the corpus of each second text unit.
Further, obtaining the caption-region speech images of the first processing data through edge detection and gray-level difference statistics and intercepting the caption-region speech images at the preset frame interval to obtain the N speech-image data items is specifically:
converting the frame images of the first processing data to grayscale, performing edge detection on the converted frame images with the Sobel operator, locating the caption region of the edge-detected frame images through gray-level difference statistics to obtain the caption-region speech images, and then intercepting the caption-region speech images at the preset frame interval.
Further, converting the N caption images into M texts through OCR and computing the pairwise cosine similarity of the M texts is specifically:
after converting the N caption images into M texts through OCR, forming the M texts into pairwise comparison groups; obtaining multiple keywords of each comparison group through TF-IDF; generating, from the occurrence frequencies of the keywords in the comparison group, the two word-frequency vectors corresponding to the two texts of the comparison group; and computing the cosine similarity of the comparison group from the two word-frequency vectors.
Further, segmenting the first speech data corresponding to the caption image in the speech-image data according to the N merged texts is specifically:
processing the first speech data with voice activity detection (VAD), and segmenting the processed first speech data according to the N merged texts.
Further, a corpus extraction device is also provided, comprising:
a data acquisition module, for acquiring the audio-video data of audio-visual material;
a data interception module, for taking the audio-video data that does not contain caption text data as first processing data, obtaining the caption-region speech images of the first processing data through edge detection and gray-level difference statistics, and intercepting the caption-region speech images at a preset frame interval to obtain N speech-image data items, where each speech-image data item comprises one caption image and the first speech data corresponding to that caption image, and N is a positive integer;
a data judgment module, for converting the N caption images into M texts through OCR, computing the pairwise cosine similarity of the M texts, and judging any two texts whose cosine similarity reaches a preset threshold as belonging to the same caption image, where M ≥ N and M is a positive integer;
a first data matching module, for merging the texts judged to belong to the same caption image to obtain N merged texts in one-to-one correspondence with the N caption images, and segmenting the first speech data corresponding to each caption image in the speech-image data according to the N merged texts to obtain the first text-speech data of each first text unit in the N merged texts, i.e. the corpus of each first text unit.
Further, the device also includes:
a second data matching module, for taking the audio-video data that contains the caption text data as second processing data, parsing the caption text data with regular expressions, segmenting the second speech data of the second processing data along the time axis to obtain multiple second text-speech data items, and annotating each second text unit of the caption text data one by one according to each second text-speech data item, obtaining the corpus of each second text unit.
Further, a corpus extraction terminal device is also provided, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the corpus extraction method of any of the above embodiments when executing the computer program.
Implementing the embodiments of this application has the following beneficial effects:
The embodiments of this application provide a corpus extraction method, device and terminal device. The method includes: acquiring audio-video data; applying edge detection and gray-level difference statistics to the audio-video data that does not contain caption text data to obtain the caption-region speech images, and intercepting them at a preset frame interval to obtain multiple speech-image data items; converting the caption images in the speech-image data into multiple texts, judging from the pairwise cosine similarity whether two texts belong to the same caption image, and merging the texts that do; and segmenting the first speech data corresponding to the caption images according to the merged texts to obtain the corpus of each first text unit. Compared with the prior art, this application extracts corpora by converting the caption images of audio-video material that has no subtitle file into text files and matching them with the speech data, which removes the need to collect corpora in multiple recording environments and thereby reduces the cost of corpus extraction.
Detailed description of the invention
Fig. 1 is a flow diagram of the corpus extraction method provided by an embodiment of this application;
Fig. 2 is a flow diagram of the corpus extraction method provided by another embodiment of this application;
Fig. 3 is a flow diagram of the corpus extraction method provided by a further embodiment of this application;
Fig. 4 is a TF-IDF flow chart provided by an embodiment of this application;
Fig. 5 is a structural diagram of the corpus extraction device provided by an embodiment of this application;
Fig. 6 is a structural diagram of the corpus extraction device provided by another embodiment of this application;
Fig. 7 is an edge-detection result image provided by an embodiment of this application;
Fig. 8 is a caption-region image extraction result provided by an embodiment of this application.
Specific embodiment
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
Referring to Fig. 1, a flow diagram of the corpus extraction method provided by an embodiment of this application: as shown in Fig. 1, the corpus extraction method includes steps S11 to S14, as follows.
Step S11: acquire the audio-video data of audio-visual material.
Step S12: take the audio-video data that does not contain caption text data as first processing data; obtain the caption-region speech images of the first processing data through edge detection and gray-level difference statistics; and intercept the caption-region speech images at a preset frame interval to obtain N speech-image data items.
Here each speech-image data item comprises one caption image and the first speech data corresponding to that caption image; N is a positive integer.
Step S13: convert the N caption images into M texts through OCR, compute the pairwise cosine similarity of the M texts, and judge any two texts whose cosine similarity reaches a preset threshold as belonging to the same caption image.
Here M ≥ N and M is a positive integer.
Step S14: merge the texts judged to belong to the same caption image to obtain N merged texts in one-to-one correspondence with the N caption images; then segment the first speech data corresponding to each caption image in the speech-image data according to the N merged texts to obtain the first text-speech data of each first text unit in the N merged texts, i.e. the corpus of each first text unit.
For step S11, specifically, the audio-video data of the audio-visual material to be processed is selected and divided according to whether it contains caption text data.
For step S12, specifically, the audio-video data that does not contain caption text data is taken as first processing data; the frame images of the first processing data are converted to grayscale; edge detection is performed on the converted frame images with the Sobel operator; the caption region of the edge-detected frame images is located through gray-level difference statistics to obtain the caption-region speech images; and the caption-region speech images are intercepted at the preset frame interval to obtain N speech-image data items.
Since the edge features of the caption region are relatively salient, the position where captions appear is relatively fixed, the same caption usually stays at the same position for some time, and the caption color usually differs markedly from the surrounding background color, in this embodiment the frame images of the first processing data are loaded in RGB color space and converted to grayscale images, with the conversion formula:
Y(x, y) = 0.299 × R(x, y) + 0.587 × G(x, y) + 0.114 × B(x, y)
where Y(x, y) is the gray value of pixel (x, y), and R(x, y), G(x, y) and B(x, y) are the red, green and blue components of the RGB color at pixel (x, y).
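As an illustration only (not part of the patent), the conversion above can be sketched in Python; note that OpenCV decodes frames in BGR channel order, and the function name is illustrative:

import numpy as np

def to_gray(frame_bgr: np.ndarray) -> np.ndarray:
    # Y = 0.299 R + 0.587 G + 0.114 B, applied per pixel
    b, g, r = frame_bgr[..., 0], frame_bgr[..., 1], frame_bgr[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return y.astype(np.uint8)

# cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY) applies the same weights.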
In this embodiment, the converted grayscale image is edge-detected with the Sobel operator, specifically:
Let the grayscale image be I. I is first convolved in the horizontal direction with a kernel of odd size; for a 3 × 3 kernel, the horizontal result is
Gx = [ −1 0 +1 ; −2 0 +2 ; −1 0 +1 ] ∗ I
After the horizontal convolution is completed, I is convolved in the vertical direction with a kernel of odd size; for a 3 × 3 kernel, the vertical result is
Gy = [ −1 −2 −1 ; 0 0 0 ; +1 +2 +1 ] ∗ I
From the horizontal and vertical convolutions, the approximate gradient magnitude at every point of I is obtained:
G = √(Gx² + Gy²)
The processing result of this embodiment is shown in Fig. 7.
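A minimal sketch of the Sobel step, assuming OpenCV is used; cv2.Sobel with ksize=3 applies exactly the 3 × 3 kernels above:

import cv2
import numpy as np

def sobel_edges(gray: np.ndarray) -> np.ndarray:
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)  # horizontal convolution Gx
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)  # vertical convolution Gy
    g = np.sqrt(gx ** 2 + gy ** 2)                   # approximate gradient magnitude
    return np.clip(g, 0, 255).astype(np.uint8)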
The caption region of the frame image is then located by applying gray-level difference statistics to the edge-detected image, giving the caption-region speech image. Specifically, the gray-level difference along each row is accumulated as
E(x) = Σ_y |f(x, y) − f(x, y+1)|
where E(x) is the accumulated absolute gray-level difference of adjacent pixels along row x, and f(x, y) and f(x, y+1) are the gray values of the corresponding pixels.
The processing result of this embodiment is shown in Fig. 8.
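The localization step can be sketched as follows, under the reading of E(x) given above (row-wise accumulation of adjacent-pixel gray differences); the thresholding heuristic ratio is an assumption, not from the patent:

import numpy as np

def caption_band(edges: np.ndarray, ratio: float = 2.0):
    # E(x): accumulate |f(x, y) - f(x, y+1)| along each row x
    e = np.abs(np.diff(edges.astype(np.int32), axis=1)).sum(axis=1)
    rows = np.flatnonzero(e > ratio * e.mean())   # rows with unusually dense strokes
    if rows.size == 0:
        return None                               # no caption band found
    return int(rows.min()), int(rows.max())       # top and bottom of the caption region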
In this embodiment, after the caption-region speech images are obtained, they are intercepted every 7 frames to obtain multiple speech-image data items, each comprising one caption image and the first speech data corresponding to that caption image.
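A sketch of this interception, assuming OpenCV decoding and a caption band (top, bottom) located as above; pairing each crop with its timestamp, so that the corresponding first speech data can be cut later, is an illustrative choice rather than a detail fixed by the patent:

import cv2

def intercept_captions(video_path: str, top: int, bottom: int, step: int = 7):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    samples, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                                 # one sample every step frames
            samples.append((idx / fps, frame[top:bottom]))  # (timestamp, caption crop)
        idx += 1
    cap.release()
    return samples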
For step S13, specifically, the N caption images are converted into M texts through OCR; the M texts are formed into pairwise comparison groups; multiple keywords of each comparison group are obtained through TF-IDF; the two word-frequency vectors corresponding to the two texts of the comparison group are generated from the occurrence frequencies of the keywords in the comparison group; the cosine similarity of the comparison group is computed from the two word-frequency vectors; and any two texts whose cosine similarity reaches the preset threshold are judged as belonging to the same caption image.
In this embodiment, as shown in Fig. 4, after the texts are generated through OCR, the keywords in the caption texts are obtained through TF-IDF, specifically:
TF-IDF = TF(i,j) × IDF(i)
TF(i,j) = n(i,j) / Σ_k n(k,j)
where TF(i,j) measures the importance of text unit t(i) in text d(j): n(i,j) is the number of occurrences of the text unit in text d(j), and the denominator is the total number of occurrences of all text units in d(j).
IDF(i) = log( |D| / |{ j : t(i) ∈ d(j) }| )
where |D| is the total number of documents in the corpus and |{ j : t(i) ∈ d(j) }| is the number of documents containing word t(i) (i.e. documents with n(i,j) ≠ 0). If the word does not appear in the corpus, the denominator would be zero, so 1 + |{ j : t(i) ∈ d(j) }| is normally used instead.
It should be noted that in this embodiment, a Siamese LSTM can be used instead of TF-IDF to obtain the keywords.
In this embodiment, after the keywords in the caption texts are obtained by the above TF-IDF algorithm, the two word-frequency vectors A and B corresponding to the two texts of a comparison group are generated from the frequencies with which the keywords occur in the comparison group. The cosine similarity θ is given by the dot product and the vector lengths, specifically:
cos(θ) = (A · B) / (‖A‖ × ‖B‖)
In this embodiment, when the cosine value cos(θ) of two word-frequency vectors reaches the preset threshold of 0.67, the two texts corresponding to the two word-frequency vectors are judged to have been converted from the same caption image.
It should be noted that the preset threshold can be any value between 0.65 and 0.7, which guarantees a reliable judgment of the similarity of two texts.
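A sketch of the comparison in step S13, with pytesseract standing in for the unspecified OCR engine and jieba for Chinese word segmentation (both are assumptions); for brevity it vectorizes all texts over one shared TF-IDF vocabulary instead of building word-frequency vectors per comparison group:

import itertools
import jieba
import pytesseract
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ocr_texts(caption_crops):
    # convert each caption image into a text
    return [pytesseract.image_to_string(c, lang="chi_sim").strip() for c in caption_crops]

def same_caption_pairs(texts, threshold: float = 0.67):
    vec = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None)
    sims = cosine_similarity(vec.fit_transform(texts))
    # pairs whose cosine value reaches the threshold are judged to share one caption image
    return [(i, j) for i, j in itertools.combinations(range(len(texts)), 2)
            if sims[i, j] >= threshold]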
For step S14, specifically, the multiple texts are merged to obtain N merged texts in one-to-one correspondence with the N caption images; the first speech data corresponding to the caption images is processed with VAD; and the processed first speech data is segmented according to the N merged texts to obtain the corpus of each first text unit in the N merged texts.
In this embodiment, VAD is used to remove long silent periods from the speech signal stream of the first speech data, which greatly reduces the amount of data to be processed in later stages such as speech recognition.
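The patent names only VAD; as one possible realization, the webrtcvad package can drop silent frames from 16 kHz, 16-bit mono PCM (a sketch under those assumptions; frame lengths must be 10, 20 or 30 ms for this library):

import webrtcvad

def drop_silence(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30) -> bytes:
    vad = webrtcvad.Vad(2)                      # aggressiveness 0 (lenient) to 3 (strict)
    n = sample_rate * frame_ms // 1000 * 2      # bytes per frame (2 bytes per sample)
    voiced = bytearray()
    for off in range(0, len(pcm) - n + 1, n):
        frame = pcm[off:off + n]
        if vad.is_speech(frame, sample_rate):   # keep only voiced frames
            voiced.extend(frame)
    return bytes(voiced)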
The embodiment of this application provides a corpus extraction method: audio-video data is acquired; the audio-video data that does not contain caption text data is processed with edge detection and gray-level difference statistics to obtain the caption-region speech images, which are intercepted at a preset frame interval to obtain multiple speech-image data items; the caption images in the speech-image data are converted into multiple texts, the pairwise cosine similarity of the texts is used to judge whether two texts belong to the same caption image, and the texts that do are merged; the first speech data corresponding to the caption images is then segmented according to the merged texts to obtain the corpus of each first text unit. Compared with the prior art, this application extracts corpora by converting the caption images of audio-video material that has no subtitle file into text files and matching them with the speech data, which removes the need to collect corpora in multiple recording environments and thereby reduces the cost of corpus extraction.
Please refer to Figs. 2 and 3.
Referring to Fig. 2, a flow diagram of a corpus extraction method provided by another embodiment of this application: in addition to the steps shown in Fig. 1, it also includes:
Step S15: take the audio-video data that contains caption text data as second processing data; parse the caption text data with regular expressions; segment the second speech data of the second processing data along the time axis to obtain multiple second text-speech data items; and, according to each second text-speech data item, annotate each second text unit of the caption text data one by one to obtain the corpus of each second text unit.
In this embodiment, when the acquired audio-video data contains caption text data, the subtitle file is parsed directly with regular expressions to obtain multiple second text units, and the second speech data is cut along the time axis; the second speech data is then processed with VAD, and each second text unit is annotated one by one with the processed second speech data to obtain the corpus of each second text unit.
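The patent does not fix a subtitle format; assuming the common SRT format, the regular-expression parsing and the time-axis cutting points can be sketched as:

import re

SRT_BLOCK = re.compile(
    r"(\d+)\s*\n"                                                        # cue index
    r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})\s*\n"  # time axis
    r"(.+?)(?:\n\n|\Z)",                                                 # caption text
    re.S)

def parse_srt(srt_text: str):
    # yield (start_seconds, end_seconds, text): the times cut the second speech data,
    # the text gives the second text units to be annotated
    def secs(hms, ms):
        h, m, s = map(int, hms.split(":"))
        return h * 3600 + m * 60 + s + int(ms) / 1000.0
    for _idx, st, st_ms, en, en_ms, text in SRT_BLOCK.findall(srt_text):
        yield secs(st, st_ms), secs(en, en_ms), text.replace("\n", " ").strip()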
The embodiment of this application provides a corpus extraction method: audio-video data is acquired and divided, according to whether a subtitle file exists, into first processing data without a subtitle file and second processing data with a subtitle file. The caption-region speech images of the first processing data are intercepted at a preset frame interval; the caption images of the caption-region speech images are converted into multiple texts; the pairwise cosine similarity of the texts is used to judge whether two texts belong to the same caption image; the texts that belong to the same caption image are merged; and the first speech data corresponding to the caption images is segmented according to the merged texts to obtain the corpus of each first text unit. For the second processing data, the subtitle file is parsed with regular expressions to obtain multiple second text units, the second speech data is cut along the time axis, and each second text unit is annotated with the second speech data to obtain the corpus of each second text unit. Compared with the prior art, this application extracts corpora by converting the caption images of audio-video material that has no subtitle file into text files and matching them with the speech data, which removes the need to collect corpora in multiple recording environments and thereby reduces the cost of corpus extraction.
In addition, the text corpus can be obtained conveniently and efficiently from the caption text, further reducing the cost of corpus extraction.
Please refer to Fig. 5.
Referring to Fig. 5, a structural diagram of the corpus extraction device provided by an embodiment of this application, comprising:
a data acquisition module 101, for acquiring the audio-video data of audio-visual material.
In this embodiment, the data acquisition module 101 is specifically used to select the audio-video data of the audio-visual material to be processed and divide it according to whether it contains caption text data.
a data interception module 102, for taking the audio-video data that does not contain caption text data as first processing data, obtaining the caption-region speech images of the first processing data through edge detection and gray-level difference statistics, and intercepting the caption-region speech images at a preset frame interval to obtain N speech-image data items.
Here each speech-image data item comprises one caption image and the first speech data corresponding to that caption image; N is a positive integer.
In this embodiment, the data interception module 102 is specifically used to take the audio-video data that does not contain caption text data as first processing data, convert the frame images of the first processing data to grayscale, perform edge detection on the converted frame images with the Sobel operator, locate the caption region of the edge-detected frame images through gray-level difference statistics to obtain the caption-region speech images, and intercept the caption-region speech images at the preset frame interval to obtain the N speech-image data items.
a data judgment module 103, for converting the N caption images into M texts through OCR, computing the pairwise cosine similarity of the M texts, and judging any two texts whose cosine similarity reaches a preset threshold as belonging to the same caption image.
Here M ≥ N and M is a positive integer.
In this embodiment, the data judgment module 103 is specifically used to convert the N caption images into M texts through OCR, form the M texts into pairwise comparison groups, obtain multiple keywords of each comparison group through TF-IDF, generate the two word-frequency vectors corresponding to the two texts of the comparison group from the occurrence frequencies of the keywords in the comparison group, compute the cosine similarity of the comparison group from the two word-frequency vectors, and judge any two texts whose cosine similarity reaches the preset threshold as belonging to the same caption image.
a first data matching module 104, for merging the texts judged to belong to the same caption image to obtain N merged texts in one-to-one correspondence with the N caption images, and segmenting the first speech data corresponding to each caption image in the speech-image data according to the N merged texts to obtain the first text-speech data of each first text unit in the N merged texts, i.e. the corpus of each first text unit.
In this embodiment, the first data matching module 104 is specifically used to merge the multiple texts to obtain the N merged texts in one-to-one correspondence with the N caption images, process the first speech data corresponding to the caption images with VAD, and segment the processed first speech data according to the N merged texts to obtain the corpus of each first text unit in the N merged texts.
The embodiment of this application provides a corpus extraction method and device. The method includes: acquiring audio-video data; applying edge detection and gray-level difference statistics to the audio-video data that does not contain caption text data to obtain the caption-region speech images, and intercepting them at a preset frame interval to obtain multiple speech-image data items; converting the caption images in the speech-image data into multiple texts, judging from the pairwise cosine similarity whether two texts belong to the same caption image, and merging the texts that do; and segmenting the first speech data corresponding to the caption images according to the merged texts to obtain the corpus of each first text unit. Compared with the prior art, this application extracts corpora by converting the caption images of audio-video material that has no subtitle file into text files and matching them with the speech data, which removes the need to collect corpora in multiple recording environments and thereby reduces the cost of corpus extraction.
Please refer to Fig. 6.
Referring to Fig. 6, a structural diagram of the corpus extraction device provided by another embodiment of this application, which, in addition to the structure shown in Fig. 5, also comprises:
a second data matching module 105, for taking the audio-video data that contains caption text data as second processing data, parsing the caption text data with regular expressions, segmenting the second speech data of the second processing data along the time axis to obtain multiple second text-speech data items, and annotating each second text unit of the caption text data one by one according to each second text-speech data item, obtaining the corpus of each second text unit.
In this embodiment, the second data matching module 105 is specifically used, when the acquired audio-video data contains caption text data, to parse the subtitle file directly with regular expressions to obtain multiple second text units, cut the second speech data along the time axis, process the second speech data with VAD, and annotate each second text unit one by one with the processed second speech data to obtain the corpus of each second text unit.
The embodiment of this application provides a corpus extraction method and device. The method includes: acquiring audio-video data and dividing it, according to whether a subtitle file exists, into first processing data without a subtitle file and second processing data with a subtitle file; intercepting the caption-region speech images of the first processing data at a preset frame interval, converting the caption images of the caption-region speech images into multiple texts, judging from the pairwise cosine similarity whether two texts belong to the same caption image, merging the texts that belong to the same caption image, and segmenting the first speech data corresponding to the caption images according to the merged texts to obtain the corpus of each first text unit; and parsing the subtitle file of the second processing data with regular expressions to obtain multiple second text units, cutting the second speech data along the time axis, and annotating each second text unit with the second speech data to obtain the corpus of each second text unit. Compared with the prior art, this application extracts corpora by converting the caption images of audio-video material that has no subtitle file into text files and matching them with the speech data, which removes the need to collect corpora in multiple recording environments and thereby reduces the cost of corpus extraction.
In addition, the text corpus can be obtained conveniently and efficiently from the caption text, further reducing the cost of corpus extraction.
A further embodiment of this application also provides a corpus extraction terminal device, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the corpus extraction method described in the above embodiments when executing the computer program.
The above are preferred embodiments of this application. It should be noted that a person skilled in the art can make several improvements and modifications without departing from the principles of this application, and these improvements and modifications are also regarded as falling within the protection scope of this application.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and when executed it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

Claims (8)

1. A corpus extraction method, characterized by comprising at least the following steps:
acquiring the audio-video data of audio-visual material;
taking the audio-video data that does not contain caption text data as first processing data, obtaining the caption-region speech images of the first processing data through edge detection and gray-level difference statistics, and intercepting the caption-region speech images at a preset frame interval to obtain N speech-image data items, wherein each speech-image data item comprises one caption image and the first speech data corresponding to that caption image, and N is a positive integer;
converting the N caption images into M texts through OCR, computing the pairwise cosine similarity of the M texts, and judging any two texts whose cosine similarity reaches a preset threshold as belonging to the same caption image, wherein M ≥ N and M is a positive integer;
merging the texts judged to belong to the same caption image to obtain N merged texts in one-to-one correspondence with the N caption images, and segmenting the first speech data corresponding to each caption image in the speech-image data according to the N merged texts to obtain the first text-speech data of each first text unit in the N merged texts, i.e. the corpus of each first text unit.
2. The corpus extraction method according to claim 1, characterized by further comprising:
taking the audio-video data that contains the caption text data as second processing data, parsing the caption text data with regular expressions, segmenting the second speech data of the second processing data along the time axis to obtain multiple second text-speech data items, and annotating each second text unit of the caption text data one by one according to each second text-speech data item, obtaining the corpus of each second text unit.
3. The corpus extraction method according to claim 1, characterized in that obtaining the caption-region speech images of the first processing data through edge detection and gray-level difference statistics and intercepting the caption-region speech images at the preset frame interval to obtain the N speech-image data items is specifically:
converting the frame images of the first processing data to grayscale, performing edge detection on the converted frame images with the Sobel operator, locating the caption region of the edge-detected frame images through gray-level difference statistics to obtain the caption-region speech images, and intercepting the caption-region speech images at the preset frame interval.
4. The corpus extraction method according to claim 1, characterized in that converting the N caption images into M texts through OCR and computing the pairwise cosine similarity of the M texts is specifically:
after converting the N caption images into M texts through OCR, forming the M texts into pairwise comparison groups, obtaining multiple keywords of each comparison group through TF-IDF, generating the two word-frequency vectors corresponding to the two texts of the comparison group from the occurrence frequencies of the keywords in the comparison group, and computing the cosine similarity of the comparison group from the two word-frequency vectors.
5. The corpus extraction method according to claim 1, characterized in that segmenting the first speech data corresponding to the caption image in the speech-image data according to the N merged texts is specifically:
processing the first speech data with VAD, and segmenting the processed first speech data according to the N merged texts.
6. A corpus extraction device, characterized by comprising:
a data acquisition module, for acquiring the audio-video data of audio-visual material;
a data interception module, for taking the audio-video data that does not contain caption text data as first processing data, obtaining the caption-region speech images of the first processing data through edge detection and gray-level difference statistics, and intercepting the caption-region speech images at a preset frame interval to obtain N speech-image data items, wherein each speech-image data item comprises one caption image and the first speech data corresponding to that caption image, and N is a positive integer;
a data judgment module, for converting the N caption images into M texts through OCR, computing the pairwise cosine similarity of the M texts, and judging any two texts whose cosine similarity reaches a preset threshold as belonging to the same caption image, wherein M ≥ N and M is a positive integer;
a first data matching module, for merging the texts judged to belong to the same caption image to obtain N merged texts in one-to-one correspondence with the N caption images, and segmenting the first speech data corresponding to each caption image in the speech-image data according to the N merged texts to obtain the first text-speech data of each first text unit in the N merged texts, i.e. the corpus of each first text unit.
7. The corpus extraction device according to claim 6, characterized by further comprising:
a second data matching module, for taking the audio-video data that contains the caption text data as second processing data, parsing the caption text data with regular expressions, segmenting the second speech data of the second processing data along the time axis to obtain multiple second text-speech data items, and annotating each second text unit of the caption text data one by one according to each second text-speech data item, obtaining the corpus of each second text unit.
8. A corpus extraction terminal device, characterized by comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the corpus extraction method of any one of claims 1 to 5 when executing the computer program.
CN201910077238.7A 2019-01-24 2019-01-24 Corpus extraction method, device and terminal device Pending CN109858427A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910077238.7A CN109858427A (en) 2019-01-24 2019-01-24 Corpus extraction method, device and terminal device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910077238.7A CN109858427A (en) 2019-01-24 2019-01-24 Corpus extraction method, device and terminal device

Publications (1)

Publication Number Publication Date
CN109858427A true CN109858427A (en) 2019-06-07

Family

ID=66896298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910077238.7A Pending CN109858427A (en) 2019-01-24 2019-01-24 Corpus extraction method, device and terminal device

Country Status (1)

Country Link
CN (1) CN109858427A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110730389A (en) * 2019-12-19 2020-01-24 恒信东方文化股份有限公司 Method and device for automatically generating interactive question and answer for video program
CN111445902A (en) * 2020-03-27 2020-07-24 北京字节跳动网络技术有限公司 Data collection method and device, storage medium and electronic equipment
CN112925905A (en) * 2021-01-28 2021-06-08 北京达佳互联信息技术有限公司 Method, apparatus, electronic device and storage medium for extracting video subtitles
CN114495128A (en) * 2022-04-06 2022-05-13 腾讯科技(深圳)有限公司 Subtitle information detection method, device, equipment and storage medium
WO2022228235A1 (en) * 2021-04-29 2022-11-03 华为云计算技术有限公司 Method and apparatus for generating video corpus, and related device
CN116468054A (en) * 2023-04-26 2023-07-21 中央民族大学 Method and system for aided construction of Tibetan transliteration data set based on OCR technology

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101115151A (en) * 2007-07-10 2008-01-30 北京大学 Method for extracting video subtitling
CN100562074C (en) * 2007-07-10 2009-11-18 北京大学 The method that a kind of video caption extracts
CN101453575A (en) * 2007-12-05 2009-06-10 中国科学院计算技术研究所 Video subtitle information extracting method
CN102262644A (en) * 2010-05-25 2011-11-30 索尼公司 Search Apparatus, Search Method, And Program
CN103607635A (en) * 2013-10-08 2014-02-26 十分(北京)信息科技有限公司 Method, device and terminal for caption identification
CN103761261A (en) * 2013-12-31 2014-04-30 北京紫冬锐意语音科技有限公司 Voice recognition based media search method and device
JP2017045027A (en) * 2015-08-24 2017-03-02 日本放送協会 Speech language corpus generation device and its program
CN106971010A (en) * 2017-05-12 2017-07-21 深圳市唯特视科技有限公司 A kind of video abstraction generating method suitable for text query

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
BRECHT DESPLANQUES et al.: "Adaptive speaker diarization of broadcast news based on", ScienceDirect *
EKATERINA PRONOZA et al.: "A New Corpus of the Russian Social Network", Springer Nature Switzerland AG 2018 *
PATRICIA SOTELO DIOS et al.: "Extraction of Indonesian and English Parallel Sentences from Movie Subtitles", IEEE *
YOONA CHOI et al.: "Pansori: ASR Corpus Generation from Open Online Video Contents", ResearchGate *
LIU Jian: "Construction and Application of a Multimodal Interpreting Corpus", Foreign Languages in China *
ZHANG Wangshu: "Research on Text Recognition and Retrieval Technology in TV Video", China Master's Theses Full-text Database (Information Science and Technology) *
LI Xianwu: "Practical Research on Digital Video Technology in Corpus Construction", China Modern Educational Equipment *
FAN Chongjun et al.: "Big Data Analysis and Application", 31 January 2016 *
CHEN Shuyue et al.: "Detection of News Video Title Captions Based on Gray-Level Difference", Computer and Digital Engineering *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110730389A (en) * 2019-12-19 2020-01-24 恒信东方文化股份有限公司 Method and device for automatically generating interactive question and answer for video program
CN111445902A (en) * 2020-03-27 2020-07-24 北京字节跳动网络技术有限公司 Data collection method and device, storage medium and electronic equipment
CN111445902B (en) * 2020-03-27 2023-05-30 北京字节跳动网络技术有限公司 Data collection method, device, storage medium and electronic equipment
CN112925905A (en) * 2021-01-28 2021-06-08 北京达佳互联信息技术有限公司 Method, apparatus, electronic device and storage medium for extracting video subtitles
CN112925905B (en) * 2021-01-28 2024-02-27 北京达佳互联信息技术有限公司 Method, device, electronic equipment and storage medium for extracting video subtitles
WO2022228235A1 (en) * 2021-04-29 2022-11-03 华为云计算技术有限公司 Method and apparatus for generating video corpus, and related device
CN114495128A (en) * 2022-04-06 2022-05-13 腾讯科技(深圳)有限公司 Subtitle information detection method, device, equipment and storage medium
CN116468054A (en) * 2023-04-26 2023-07-21 中央民族大学 Method and system for aided construction of Tibetan transliteration data set based on OCR technology
CN116468054B (en) * 2023-04-26 2023-11-07 中央民族大学 Method and system for aided construction of Tibetan transliteration data set based on OCR technology

Similar Documents

Publication Publication Date Title
CN109858427A (en) Corpus extraction method, device and terminal device
Harwath et al. Deep multimodal semantic embeddings for speech and images
CN111968649B (en) Subtitle correction method, subtitle display method, device, equipment and medium
CN109145152B (en) Method for adaptively and intelligently generating image-text video thumbnail based on query word
US10304458B1 (en) Systems and methods for transcribing videos using speaker identification
CN108648746A (en) Open-domain video natural language description generation method based on multi-modal feature fusion
CN111723791A (en) Character error correction method, device, equipment and storage medium
CN110866958A (en) Method for text to image
CN109993040A (en) Text recognition method and device
US20080095442A1 (en) Detection and Modification of Text in a Image
CN106708949A (en) Identification method of harmful content of video
CN114465737B (en) Data processing method and device, computer equipment and storage medium
CN110796140B (en) Subtitle detection method and device
WO2022089170A1 (en) Caption area identification method and apparatus, and device and storage medium
WO2021129466A1 (en) Watermark detection method, device, terminal and storage medium
CN113221890A (en) OCR-based cloud mobile phone text content supervision method, system and system
CN106161873A (en) Video information extraction and push method and system
CN108921032A (en) New video semantics extraction method based on a deep learning model
CN110072140A (en) Video information prompting method, apparatus, device and storage medium
CN112989098B (en) Automatic retrieval method and device for image infringement entity and electronic equipment
KR20210047467A (en) Method and System for Auto Multiple Image Captioning
CN114548274A (en) Multi-modal interaction-based rumor detection method and system
CN116708055B (en) Intelligent multimedia audiovisual image processing method, system and storage medium
CN113191787A (en) Telecommunication data processing method, device electronic equipment and storage medium
CN106162328A (en) Video synchronization information display method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190607)