CN109858427A - A kind of corpus extraction method, device and terminal device - Google Patents
A kind of corpus extraction method, device and terminal device
- Publication number
- CN109858427A CN109858427A CN201910077238.7A CN201910077238A CN109858427A CN 109858427 A CN109858427 A CN 109858427A CN 201910077238 A CN201910077238 A CN 201910077238A CN 109858427 A CN109858427 A CN 109858427A
- Authority
- CN
- China
- Prior art keywords
- data
- text
- image
- corpus
- subtitling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Television Signal Processing For Recording (AREA)
Abstract
This application discloses a corpus extraction method, device and terminal device. The method includes: acquiring audio-video data; for audio-video data that contains no caption text data, obtaining the caption-region speech images and intercepting them at a preset frame number to obtain multiple speech-image data; converting the subtitle images in the multiple speech-image data into multiple texts, computing the pairwise cosine values of the texts, and merging texts whose cosine value reaches a threshold; and segmenting the first speech data corresponding to the subtitle images according to the merged texts, to obtain the corpus of each first text unit. Compared with the prior art, the application extracts corpora by converting the subtitle images of audio-video without subtitle files into text files and matching them with the speech data, which overcomes the need to perform corpus extraction in multiple recording environments and thereby reduces the cost of corpus extraction.
Description
Technical field
This application relates to the technical field of audio-video speech information retrieval, and in particular to a corpus extraction method, device and terminal device.
Background technique
In an automatic speech recognition system, the performance and robustness of the system depend to a great extent on whether the recognition model has sufficiently rich corpus data during modeling; that is, the corpus data resource bank is a key foundation of intelligent speech technology. The scale and quality of the corpora in the resource bank largely determine the breadth and depth of the various intelligent speech applications, and also strongly influence the user experience.
In the prior art, corpora are extracted by recording in order to build the corpus data resource bank. However, because the purpose of collecting corpora is to provide training and test sets for the speech recognition system, the selected speakers must cover different regions, ages, genders and education levels nationwide, and corpus extraction must be carried out in multiple recording environments to ensure a good match with subsequent speech recognition, which makes the extraction cost of corpora too high.
Summary of the invention
The technical problem to be solved by the embodiments of the present application is how to reduce the cost of corpus extraction.
To solve the above problem, an embodiment of the present application provides a corpus extraction method, suitable for execution on a computing device and comprising at least the following steps:
Acquiring the audio-video data of audio-visual material;
Taking the audio-video data that does not contain caption text data as first processing data; obtaining the caption-region speech images of the first processing data by edge detection and gray-difference statistics; and intercepting the caption-region speech images at a preset frame number to obtain N speech-image data; wherein one speech-image datum comprises one subtitle image and the first speech data corresponding to that subtitle image, and N is a positive integer;
Converting the N subtitle images into M texts by OCR; computing the pairwise cosine values of the M texts; and judging two texts whose cosine value reaches a preset threshold as belonging to the same subtitle image; wherein M ≥ N and M is a positive integer;
Merging the texts judged to belong to the same subtitle image to obtain N merged texts in one-to-one correspondence with the N subtitle images; and segmenting the first speech data corresponding to each subtitle image according to the N merged texts, obtaining the first text-speech data of each first text unit in the N merged texts, i.e. the corpus of each first text unit.
Further, the method also includes:
Taking the audio-video data that contains the caption text data as second processing data; parsing the caption text data by a regular-expression technique; segmenting the second speech data of the second processing data along the time axis to obtain multiple second text-speech data; and labeling each second text unit of the caption text data one by one according to each second text-speech datum, obtaining the corpus of each second text unit.
Further, obtaining the caption-region speech images of the first processing data by edge detection and gray-difference statistics and intercepting them at the preset frame number to obtain the N speech-image data is specifically:
Converting the frame images of the first processing data to grayscale; performing edge detection on the converted frame images with the Sobel operator; locating the caption region of the edge-detected frame images by gray-difference statistics to obtain the caption-region speech images; and intercepting the caption-region speech images at the preset frame number.
Further, converting the N subtitle images into M texts by OCR and computing the pairwise cosine values of the M texts is specifically:
After converting the N subtitle images into M texts by OCR, forming the M texts into pairwise contrast groups; obtaining multiple keywords of each contrast group by TF-IDF; generating, from the occurrence frequencies of the keywords in the contrast group, the two word-frequency vectors corresponding to the two texts of the contrast group; and obtaining the cosine value of the contrast group from the two word-frequency vectors.
Further, segmenting the first speech data corresponding to the subtitle image in the speech-image data according to the N merged texts is specifically:
Processing the first speech data by VAD (voice activity detection), and segmenting the processed first speech data according to the N merged texts.
Further, a corpus extraction device is also provided, comprising:
A data acquisition module for acquiring the audio-video data of audio-visual material;
A data interception module for taking the audio-video data that does not contain caption text data as first processing data, obtaining the caption-region speech images of the first processing data by edge detection and gray-difference statistics, and intercepting the caption-region speech images at a preset frame number to obtain N speech-image data; wherein one speech-image datum comprises one subtitle image and the first speech data corresponding to that subtitle image, and N is a positive integer;
A data judgment module for converting the N subtitle images into M texts by OCR, computing the pairwise cosine values of the M texts, and judging two texts whose cosine value reaches a preset threshold as belonging to the same subtitle image; wherein M ≥ N and M is a positive integer;
A first data matching module for merging the texts judged to belong to the same subtitle image to obtain N merged texts in one-to-one correspondence with the N subtitle images, and segmenting the first speech data corresponding to each subtitle image according to the N merged texts, obtaining the first text-speech data of each first text unit in the N merged texts, i.e. the corpus of each first text unit.
Further, the device also includes:
A second data matching module for taking the audio-video data that contains the caption text data as second processing data, parsing the caption text data by the regular-expression technique, segmenting the second speech data of the second processing data along the time axis to obtain multiple second text-speech data, and labeling each second text unit of the caption text data one by one according to each second text-speech datum, obtaining the corpus of each second text unit.
Further, a corpus extraction terminal device is also provided, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the corpus extraction method of any of the above embodiments when executing the computer program.
Implementing the embodiments of the present application has the following beneficial effects:
The embodiments of the present application provide a corpus extraction method, device and terminal device. The method acquires audio-video data; obtains the caption-region speech images of the audio-video data that contains no caption text data by edge detection and gray-difference statistics, and intercepts them at a preset frame number to obtain multiple speech-image data; converts the subtitle images in the multiple speech-image data into multiple texts, judges by the pairwise cosine values whether two texts belong to the same subtitle image, and merges the texts that do; and segments the first speech data corresponding to each subtitle image according to the merged texts, obtaining the corpus of each first text unit. Compared with the prior art, the application extracts corpora by converting the subtitle images of audio-video without subtitle files into text files and matching them with the speech data, which overcomes the need to perform corpus extraction in multiple recording environments and thereby reduces the cost of corpus extraction.
Detailed description of the invention
Fig. 1 is a flow diagram of the corpus extraction method provided by one embodiment of the application;
Fig. 2 is a flow diagram of the corpus extraction method provided by another embodiment of the application;
Fig. 3 is a flow diagram of the corpus extraction method provided by a further embodiment of the application;
Fig. 4 is the TF-IDF flow chart provided by one embodiment of the application;
Fig. 5 is a structural schematic diagram of the corpus extraction device provided by one embodiment of the application;
Fig. 6 is a structural schematic diagram of the corpus extraction device provided by another embodiment of the application;
Fig. 7 is the edge-detection effect picture provided by one embodiment of the application;
Fig. 8 is the effect picture of caption-region image acquisition provided by one embodiment of the application.
Specific embodiment
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the application without creative effort shall fall within the protection scope of this application.
Referring to Fig. 1, which is a flow diagram of the corpus extraction method provided by one embodiment of the application; as shown in Fig. 1, the corpus extraction method includes steps S11 to S14, as follows:
Step S11: acquire the audio-video data of audio-visual material.
Step S12: take the audio-video data that does not contain caption text data as first processing data; obtain the caption-region speech images of the first processing data by edge detection and gray-difference statistics; and intercept the caption-region speech images at a preset frame number to obtain N speech-image data.
Here, one speech-image datum comprises one subtitle image and the first speech data corresponding to that subtitle image; N is a positive integer.
Step S13: convert the N subtitle images into M texts by OCR, compute the pairwise cosine values of the M texts, and judge two texts whose cosine value reaches a preset threshold as belonging to the same subtitle image.
Here, M ≥ N and M is a positive integer.
Step S14: merge the texts judged to belong to the same subtitle image to obtain N merged texts in one-to-one correspondence with the N subtitle images; then segment the first speech data corresponding to each subtitle image in the speech-image data according to the N merged texts, obtaining the first text-speech data of each first text unit in the N merged texts, i.e. the corpus of each first text unit.
For step S11, specifically, the audio-video data of the audio-visual material to be processed is selected and divided according to whether it contains caption text data.
For step S12, specifically, the audio-video data that does not contain caption text data is taken as the first processing data; the frame images of the first processing data are converted to grayscale; edge detection is performed on the converted frame images with the Sobel operator; the caption region of the edge-detected frame images is located by gray-difference statistics to obtain the caption-region speech images; and the caption-region speech images are intercepted at the preset frame number, obtaining N speech-image data.
Since the edge features of the caption region are relatively obvious, the position where subtitles appear is relatively fixed, the same piece of subtitle usually stays at the same position for a while, and the subtitle color usually differs considerably from the surrounding background color, in the present embodiment the frame images of the first processing data are loaded into RGB color-image space and converted to grayscale images, with the conversion formula:
Y(x, y) = 0.299 × R(x, y) + 0.587 × G(x, y) + 0.114 × B(x, y)
where Y(x, y) is the gray value of pixel (x, y), and R(x, y), G(x, y) and B(x, y) are the red, green and blue components of the RGB color at pixel (x, y).
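The weighted-sum conversion above can be sketched as follows. This is a minimal illustration of the formula, not code from the patent; the function name is hypothetical.

```python
import numpy as np

def rgb_to_gray(rgb):
    # Y(x, y) = 0.299 R + 0.587 G + 0.114 B, applied per pixel
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b

frame = np.array([[[255.0, 255.0, 255.0], [0.0, 0.0, 0.0]]])  # one white, one black pixel
gray = rgb_to_gray(frame)  # white maps to 255.0, black to 0.0
```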
In the present embodiment, edge detection is performed on the converted grayscale image by the Sobel operator, specifically:
Let the grayscale image be I. I is convolved in the horizontal direction with an odd-sized kernel G_x; for example, when the kernel size is 3, G_x is:
[ -1  0  +1 ]
[ -2  0  +2 ]
[ -1  0  +1 ]
After the horizontal convolution is completed, I is convolved in the vertical direction with an odd-sized kernel G_y; for example, when the kernel size is 3, G_y is:
[ -1  -2  -1 ]
[  0   0   0 ]
[ +1  +2  +1 ]
From the horizontal and vertical convolutions of I, the approximate gradient at every point of I is obtained as G = √(G_x² + G_y²).
The specific processing result of the present embodiment can be as shown in Fig. 7.
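The Sobel step can be illustrated with the standard 3×3 kernels above; the following is a minimal sketch with a naive loop rather than an optimized implementation, and the names are illustrative assumptions.

```python
import numpy as np

GX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal kernel
GY = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)  # vertical kernel

def sobel_magnitude(img):
    """Approximate gradient magnitude G = sqrt(Gx^2 + Gy^2) over the 'valid' region."""
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = (patch * GX).sum()
            gy[i, j] = (patch * GY).sum()
    return np.sqrt(gx ** 2 + gy ** 2)

step = np.zeros((5, 6))
step[:, 3:] = 255.0          # vertical step edge between columns 2 and 3
mag = sobel_magnitude(step)  # strong response only where a patch crosses the edge
```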
The caption region of the frame image is then located on the edge-detected image by gray-difference statistics, obtaining the caption-region speech image, specifically:
E(x) = Σ_y | f(x, y+1) − f(x, y) |
where E(x) is the accumulated absolute gray difference of adjacent pixels in row x of the frame image, and f(x, y) and f(x, y+1) are the gray values of the respective pixels.
The specific processing result of the present embodiment can be as shown in Fig. 8.
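A minimal sketch of the gray-difference statistic as a row projection used to locate caption rows; the helper names and the threshold value are illustrative assumptions, not from the patent.

```python
import numpy as np

def row_activity(gray):
    # E(x): accumulate |f(x, y+1) - f(x, y)| over each row x
    return np.abs(np.diff(gray.astype(float), axis=1)).sum(axis=1)

def locate_caption_rows(gray, thresh):
    # candidate caption rows: accumulated gray difference exceeds the threshold
    return np.where(row_activity(gray) > thresh)[0]

edge_map = np.zeros((4, 8))
edge_map[2, ::2] = 255.0  # a high-contrast "caption" stripe in row 2
rows = locate_caption_rows(edge_map, 100.0)
```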
In the present embodiment, after the caption-region speech images are obtained, they are intercepted every 7 frames, obtaining multiple speech-image data; each speech-image datum comprises one subtitle image and the first speech data corresponding to that subtitle image.
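The interception at a preset frame number (7 in this embodiment) amounts to sampling the caption-region sequence at a fixed stride and pairing each sampled image with its speech segment; a minimal sketch, with illustrative function names.

```python
def sample_subtitle_frames(frames, step=7):
    # keep one caption-region image every `step` frames (the preset frame number)
    return frames[::step]

def pair_with_speech(images, speech_segments):
    # one speech-image datum = (subtitle image, corresponding first speech data)
    return list(zip(images, speech_segments))

sampled = sample_subtitle_frames(list(range(20)), step=7)  # frames 0, 7, 14
```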
For step S13, specifically, the N subtitle images are converted into M texts by OCR and the M texts are formed into pairwise contrast groups; multiple keywords of each contrast group are obtained by TF-IDF; the two word-frequency vectors corresponding to the two texts of the contrast group are generated from the occurrence frequencies of the keywords in the contrast group; the cosine value of the contrast group is obtained from the two word-frequency vectors; and two texts whose cosine value reaches the preset threshold are judged as belonging to the same subtitle image.
In the present embodiment, as shown in Fig. 4, after the texts are generated by OCR, the keywords in the caption text are obtained by TF-IDF, specifically:
TF-IDF = TF_{i,j} × IDF_i
where TF_{i,j} = n_{i,j} / Σ_k n_{k,j} expresses the importance of text unit t_i in text d_j: n_{i,j} is the number of times the text unit appears in text d_j, and the denominator is the total number of occurrences of all text units in d_j.
IDF_i = log( |D| / (1 + |{ j : t_i ∈ d_j }|) )
where |D| is the total number of documents in the corpus, and |{ j : t_i ∈ d_j }| is the number of documents containing the word t_i (i.e. documents with n_{i,j} ≠ 0). If the word is not in the corpus the denominator would be zero, so 1 + |{ j : t_i ∈ d_j }| is generally used.
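The TF-IDF weighting above, including the 1 + |{j : t_i ∈ d_j}| smoothing of the denominator, can be sketched as follows. This is a minimal illustration; the tokenization and function names are assumptions.

```python
import math
from collections import Counter

def tf(term, doc):
    # n_{i,j} over the total number of occurrences of all units in d_j
    counts = Counter(doc)
    return counts[term] / len(doc)

def idf(term, corpus):
    # log(|D| / (1 + |{j : t_i in d_j}|)) — the +1 avoids a zero denominator
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + df))

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

docs = [["a", "b", "a"], ["b", "c"], ["c", "d"]]
weight = tf_idf("a", docs[0], docs)  # (2/3) * log(3/2)
```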
It should be noted that in the present embodiment, a Siamese LSTM may be used in place of TF-IDF for keyword acquisition.
In the present embodiment, after the keywords in the caption text are obtained by the above TF-IDF algorithm, the two word-frequency vectors A and B corresponding to the two texts of a contrast group are generated from the occurrence frequencies of the keywords in the contrast group. The cosine similarity θ is given by the dot product and the vector lengths, specifically:
cos(θ) = (A · B) / (‖A‖ ‖B‖)
In the present embodiment, when the cosine value cos(θ) of the two word-frequency vectors reaches the preset threshold 0.67, the two texts corresponding to the two word-frequency vectors are judged to have been converted from the same subtitle image.
It should be noted that the preset threshold can be any value between 0.65 and 0.7, in order to guarantee the similarity judgment for the two texts.
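The cosine judgment can be sketched as follows, using the 0.67 threshold of this embodiment; building the word-frequency vectors over the union vocabulary of the pair is an assumption about the exact construction.

```python
import math
from collections import Counter

def word_freq_vectors(text_a, text_b):
    # two aligned word-frequency vectors over the union vocabulary of the pair
    ca, cb = Counter(text_a.split()), Counter(text_b.split())
    vocab = sorted(set(ca) | set(cb))
    return [ca[w] for w in vocab], [cb[w] for w in vocab]

def cosine(u, v):
    # cos(theta) = (A . B) / (||A|| ||B||)
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def same_subtitle(text_a, text_b, threshold=0.67):
    return cosine(*word_freq_vectors(text_a, text_b)) >= threshold

dup = same_subtitle("hello world again", "hello world again")    # identical OCR outputs
diff = same_subtitle("hello world again", "completely other text")
```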
For step S14, specifically, the texts are merged to obtain N merged texts in one-to-one correspondence with the N subtitle images; the first speech data corresponding to each subtitle image is processed by VAD; and the processed first speech data is segmented according to the N merged texts, obtaining the corpus of each first text unit in the N merged texts.
In the present embodiment, VAD is used to remove long silent periods from the signal stream of the first speech data, which greatly reduces the amount of data to be processed in subsequent speech recognition.
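The patent does not specify the VAD algorithm further; a minimal energy-based stand-in that drops silent frames, consistent with the stated purpose of removing long silences, might look as follows (frame length and threshold are illustrative assumptions).

```python
def energy_vad(samples, frame_len=160, threshold=0.01):
    # keep only frames whose mean squared energy reaches the threshold
    kept = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= threshold:
            kept.extend(frame)
    return kept

silence = [0.0] * 320
speech = [0.5, -0.5] * 160              # 320 samples of a loud square wave
trimmed = energy_vad(silence + speech)  # the silent frames are removed
```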
The embodiment of the present application provides a corpus extraction method: audio-video data is acquired; the caption-region speech images of the audio-video data that contains no caption text data are obtained by edge detection and gray-difference statistics and intercepted at a preset frame number, obtaining multiple speech-image data; the subtitle images in the multiple speech-image data are converted into multiple texts, the pairwise cosine values of the texts are computed to judge whether two texts belong to the same subtitle image, and the texts belonging to the same subtitle image are merged; the first speech data corresponding to each subtitle image is then segmented according to the merged texts, obtaining the corpus of each first text unit. Compared with the prior art, the application extracts corpora by converting the subtitle images of audio-video without subtitle files into text files and matching them with the speech data, which overcomes the need to perform corpus extraction in multiple recording environments and thereby reduces the cost of corpus extraction.
Please refer to Figs. 2-3.
Referring to Fig. 2, which is a flow diagram of a corpus extraction method provided by another embodiment of the application; in addition to the steps shown in Fig. 1, it also includes:
Step S15: take the audio-video data that contains caption text data as second processing data; parse the caption text data by the regular-expression technique; segment the second speech data of the second processing data along the time axis to obtain multiple second text-speech data; and label each second text unit of the caption text data one by one according to each second text-speech datum, obtaining the corpus of each second text unit.
In the present embodiment, when the acquired audio-video data contains caption text data, the subtitle file is parsed directly with regular expressions, obtaining multiple second text units; the second speech data is segmented along the time axis and processed by VAD; and each second text unit is then labeled one by one with the processed second speech data, obtaining the corpus of each second text unit.
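Parsing a subtitle file "by the regular-expression technique" might look like the following for SRT-style captions; the patent does not name a subtitle format, so the SRT layout and all names here are assumptions.

```python
import re

SRT_BLOCK = re.compile(
    r"(\d+)\s*\n"
    r"(\d{2}:\d{2}:\d{2}),(\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}),(\d{3})\s*\n"
    r"(.+?)(?:\n\n|\Z)",
    re.S,
)

def parse_srt(text):
    # each cue yields a second text unit plus its slice of the time axis
    return [
        {
            "index": int(m.group(1)),
            "start": m.group(2) + "." + m.group(3),
            "end": m.group(4) + "." + m.group(5),
            "text": m.group(6).strip(),
        }
        for m in SRT_BLOCK.finditer(text)
    ]

srt = "1\n00:00:01,000 --> 00:00:02,500\nhello\n\n2\n00:00:03,000 --> 00:00:04,000\nworld\n"
cues = parse_srt(srt)
```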
The embodiment of the present application provides a corpus extraction method: audio-video data is acquired and, according to whether a subtitle file exists, divided into first processing data without a subtitle file and second processing data with a subtitle file. The caption-region speech images of the first processing data are intercepted at a preset frame number; after the subtitle images of the caption-region speech images are converted into multiple texts, the pairwise cosine values of the texts are computed to judge whether two texts belong to the same subtitle image; texts belonging to the same subtitle image are merged, and the first speech data corresponding to each subtitle image is segmented according to the merged texts, obtaining the corpus of each first text unit. For the second processing data, the subtitle file is parsed by the regular-expression technique to obtain multiple second text units, the second speech data is segmented along the time axis, and each second text unit is labeled with the second speech data, obtaining the corpus of each second text unit. Compared with the prior art, the application extracts corpora by converting the subtitle images of audio-video without subtitle files into text files and matching them with the speech data, which overcomes the need to perform corpus extraction in multiple recording environments and thereby reduces the cost of corpus extraction.
In addition, the corpus of a text can be obtained conveniently and efficiently through the caption text, which further reduces the cost of corpus extraction.
Please refer to Fig. 5.
Referring to Fig. 5, which is a structural schematic diagram of the corpus extraction device provided by one embodiment of the application, comprising:
A data acquisition module 101 for acquiring the audio-video data of audio-visual material.
In the present embodiment, the data acquisition module 101 is specifically used to select the audio-video data of the audio-visual material to be processed and divide it according to whether it contains caption text data.
A data interception module 102 for taking the audio-video data that does not contain caption text data as first processing data, obtaining the caption-region speech images of the first processing data by edge detection and gray-difference statistics, and intercepting the caption-region speech images at a preset frame number to obtain N speech-image data.
Here, one speech-image datum comprises one subtitle image and the first speech data corresponding to that subtitle image; N is a positive integer.
In the present embodiment, the data interception module 102 is specifically used to take the audio-video data that does not contain caption text data as the first processing data, convert the frame images of the first processing data to grayscale, perform edge detection on the converted frame images with the Sobel operator, locate the caption region of the edge-detected frame images by gray-difference statistics to obtain the caption-region speech images, and intercept the caption-region speech images at the preset frame number, obtaining N speech-image data.
A data judgment module 103 for converting the N subtitle images into M texts by OCR, computing the pairwise cosine values of the M texts, and judging two texts whose cosine value reaches a preset threshold as belonging to the same subtitle image.
Here, M ≥ N and M is a positive integer.
In the present embodiment, the data judgment module 103 is specifically used to convert the N subtitle images into M texts by OCR, form the M texts into pairwise contrast groups, obtain multiple keywords of each contrast group by TF-IDF, generate the two word-frequency vectors of the two texts of the contrast group from the occurrence frequencies of the keywords in the contrast group, obtain the cosine value of the contrast group from the two word-frequency vectors, and judge two texts whose cosine value reaches the preset threshold as belonging to the same subtitle image.
A first data matching module 104 for merging the texts judged to belong to the same subtitle image to obtain N merged texts in one-to-one correspondence with the N subtitle images, and segmenting the first speech data corresponding to each subtitle image in the speech-image data according to the N merged texts, obtaining the first text-speech data of each first text unit in the N merged texts, i.e. the corpus of each first text unit.
In the present embodiment, the first data matching module 104 is specifically used to merge the texts to obtain the N merged texts in one-to-one correspondence with the N subtitle images, process the first speech data corresponding to each subtitle image by VAD, and segment the processed first speech data according to the N merged texts, obtaining the corpus of each first text unit in the N merged texts.
The embodiment of the present application provides a corpus extraction method and device. The method acquires audio-video data; obtains the caption-region speech images of the audio-video data that contains no caption text data by edge detection and gray-difference statistics, and intercepts them at a preset frame number, obtaining multiple speech-image data; converts the subtitle images in the multiple speech-image data into multiple texts, judges by the pairwise cosine values whether two texts belong to the same subtitle image, and merges the texts that do; and segments the first speech data corresponding to each subtitle image according to the merged texts, obtaining the corpus of each first text unit. Compared with the prior art, the application extracts corpora by converting the subtitle images of audio-video without subtitle files into text files and matching them with the speech data, which overcomes the need to perform corpus extraction in multiple recording environments and thereby reduces the cost of corpus extraction.
Please refer to Fig. 6.
Referring to Fig. 6, which is a structural schematic diagram of the corpus extraction device provided by another embodiment of the application; in addition to the structure shown in Fig. 5, it also includes:
A second data matching module 105 for taking the audio-video data that contains caption text data as second processing data, parsing the caption text data by the regular-expression technique, segmenting the second speech data of the second processing data along the time axis to obtain multiple second text-speech data, and labeling each second text unit of the caption text data one by one according to each second text-speech datum, obtaining the corpus of each second text unit.
In the present embodiment, the second data matching module 105 is specifically used to parse the subtitle file directly with regular expressions when the acquired audio-video data contains caption text data, obtaining multiple second text units; segment the second speech data along the time axis; process the second speech data by VAD; and label each second text unit one by one with the processed second speech data, obtaining the corpus of each second text unit.
The embodiment of the present application provides a kind of corpus extraction method and device, which comprises passes through acquisition audio-video number
According to, and according to whether audio, video data is divided into the first processing data of no subtitle file and has subtitle literary there are subtitle file
Part second processing data;By the caption area phonetic image of default frame number interception the first processing data, and by caption area voice
After the subtitling image of image is converted to multiple texts, by calculating the cosine value of multiple texts between any two, two texts are judged
Whether same subtitling image is belonged to;The text for belonging to same subtitling image is merged, and by with subtitling image to drink
One voice data carries out cutting according to the text after merging, obtains the corpus of each first text unit;By second processing data
Subtitle file is parsed using Regularization Technique, after obtaining multiple second text units, the second voice number is parsed using time shaft
According to;Second speech data is labeled each second text unit, obtains the corpus of each second text unit.With it is existing
Technology is compared, present invention employs by by the audio-video subtitling image of no subtitle file be converted to after text file with voice number
According to being matched, so that the method for extracting corpus, overcomes the problem of need to carrying out corpus extraction by multiple playback environ-ments, in turn
Achieve the purpose that reduce the cost that corpus extracts.
In addition, the corpus of a text can be obtained conveniently and efficiently from the caption text, further reducing the cost of corpus extraction.
Another embodiment of the present application further provides a corpus extraction terminal device, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements the corpus extraction method described in the above embodiments when executing the computer program.
The above are preferred embodiments of the present application. It should be noted that those skilled in the art may make several improvements and modifications without departing from the principles of the present application, and such improvements and modifications shall also fall within the protection scope of the present application.
Those of ordinary skill in the art will appreciate that all or part of the processes of the above method embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
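The caption-area localization described in this application (grayscale conversion, Sobel edge detection, grayscale difference statistics) might be sketched in NumPy roughly as follows. The toy frame, the row-energy criterion, and the 0.5 ratio are simplifying assumptions rather than the application's exact statistics:

```python
import numpy as np

def sobel_edges(gray: np.ndarray) -> np.ndarray:
    """Absolute horizontal-gradient response of the Sobel operator."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    h, w = gray.shape
    out = np.zeros((h, w), dtype=float)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y, x] = abs((gray[y - 1:y + 2, x - 1:x + 2] * kx).sum())
    return out

def locate_caption_rows(gray: np.ndarray, ratio=0.5) -> np.ndarray:
    """Rows whose summed edge energy reaches a fraction of the maximum -
    a crude stand-in for the grayscale difference statistics step."""
    energy = sobel_edges(gray).sum(axis=1)
    return np.where(energy >= ratio * energy.max())[0]

# Toy 8x8 grayscale frame: dark background with a bright 'subtitle' band
# in rows 5-6; the located rows bound the caption area to be intercepted.
frame = np.zeros((8, 8))
frame[5:7, 2:6] = 255.0
rows = locate_caption_rows(frame)
```

Subtitle text produces dense vertical edges, so its rows accumulate far more horizontal-gradient energy than the background, which is why a row-energy threshold isolates the caption band.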
Claims (8)
1. A corpus extraction method, characterized by comprising at least the following steps:
acquiring audio and video data of audio-visual material;
taking audio and video data that does not contain caption text data as first processing data; obtaining the caption-area speech images of the first processing data by edge detection and grayscale difference statistics, and intercepting the caption-area speech images at a preset frame interval to obtain N speech-image data, wherein one speech-image datum comprises one subtitle image and the first voice data corresponding to that subtitle image, and N is a positive integer;
converting the N subtitle images into M texts by OCR technology, calculating the pairwise cosine values of the M texts, and judging two texts whose cosine value reaches a preset threshold as belonging to the same subtitle image, wherein M ≥ N and M is a positive integer;
merging the texts judged to belong to the same subtitle image to obtain N merged texts in one-to-one correspondence with the N subtitle images, and then segmenting the first voice data corresponding to the subtitle images in the speech-image data according to the N merged texts, to obtain the first text speech data of each first text unit in the N merged texts, i.e., the corpus of each first text unit.
2. The corpus extraction method according to claim 1, characterized by further comprising:
taking audio and video data containing the caption text data as second processing data; parsing the caption text data by regular-expression techniques, and segmenting the second voice data of the second processing data along the time axis to obtain multiple second text speech data; and labeling each second text unit of the caption text data one by one according to each second text speech data, to obtain the corpus of each second text unit.
3. The corpus extraction method according to claim 1, wherein obtaining the caption-area speech images of the first processing data by edge detection and grayscale difference statistics and intercepting the caption-area speech images at a preset frame interval to obtain N speech-image data specifically comprises:
performing grayscale conversion on the frame images of the first processing data; performing edge detection on the grayscale-converted frame images with the Sobel operator; locating the caption areas of the edge-detected frame images by grayscale difference statistics to obtain the caption-area speech images; and intercepting the caption-area speech images at the preset frame interval.
4. The corpus extraction method according to claim 1, wherein converting the N subtitle images into M texts by OCR technology and calculating the pairwise cosine values of the M texts specifically comprises:
converting the N subtitle images into M texts by OCR technology; forming the M texts into pairwise contrast groups; obtaining multiple keywords of each contrast group by TF-IDF; generating, according to the frequency of occurrence of the keywords in the contrast group, two word-frequency vectors corresponding to the two texts of the contrast group; and obtaining the cosine value of the contrast group from the two word-frequency vectors.
5. The corpus extraction method according to claim 1, wherein segmenting the first voice data corresponding to the subtitle images in the speech-image data according to the N merged texts specifically comprises:
processing the first voice data by VAD technology, and segmenting the processed first voice data according to the N merged texts.
6. A corpus extraction device, characterized by comprising:
a data acquisition module for acquiring audio and video data of audio-visual material;
a data interception module for taking audio and video data that does not contain caption text data as first processing data, obtaining the caption-area speech images of the first processing data by edge detection and grayscale difference statistics, and intercepting the caption-area speech images at a preset frame interval to obtain N speech-image data, wherein one speech-image datum comprises one subtitle image and the first voice data corresponding to that subtitle image, and N is a positive integer;
a data judgment module for converting the N subtitle images into M texts by OCR technology, calculating the pairwise cosine values of the M texts, and judging two texts whose cosine value reaches a preset threshold as belonging to the same subtitle image, wherein M ≥ N and M is a positive integer;
a first data match module for merging the texts judged to belong to the same subtitle image to obtain N merged texts in one-to-one correspondence with the N subtitle images, and segmenting the first voice data corresponding to the subtitle images in the speech-image data according to the N merged texts, to obtain the first text speech data of each first text unit in the N merged texts, i.e., the corpus of each first text unit.
7. The corpus extraction device according to claim 6, characterized by further comprising:
a second data match module for taking audio and video data containing the caption text data as second processing data, parsing the caption text data by regular-expression techniques, segmenting the second voice data of the second processing data along the time axis to obtain multiple second text speech data, and labeling each second text unit of the caption text data one by one according to each second text speech data, to obtain the corpus of each second text unit.
8. A corpus extraction terminal device, characterized by comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements the corpus extraction method according to any one of claims 1 to 5 when executing the computer program.
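As a rough illustration of the contrast-group computation in claim 4 (TF-IDF keywords, word-frequency vectors, cosine value), the sketch below treats the two texts of a pair as the entire document collection; the smoothed IDF and the example sentences are simplifying assumptions for illustration, not the application's exact formulation:

```python
from collections import Counter
import math

def tfidf_cosine(text_a: str, text_b: str) -> float:
    """Cosine value of the TF-IDF word-frequency vectors of one contrast
    group, treating the pair itself as the whole document collection."""
    docs = [text_a.split(), text_b.split()]
    vocab = sorted(set(docs[0]) | set(docs[1]))

    def tfidf(doc):
        tf = Counter(doc)
        vec = []
        for term in vocab:
            df = sum(term in d for d in docs)     # document frequency
            idf = math.log(len(docs) / df) + 1.0  # smoothed IDF
            vec.append(tf[term] / len(doc) * idf)
        return vec

    va, vb = tfidf(docs[0]), tfidf(docs[1])
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(x * x for x in vb))
    return dot / (na * nb) if na and nb else 0.0

same = tfidf_cosine("the quick brown fox", "the quick brown fox")
diff = tfidf_cosine("the quick brown fox", "an entirely different line")
```

Identical texts score a cosine value of 1, fully disjoint texts score 0, and a preset threshold between the two decides whether a contrast group belongs to the same subtitle image.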
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910077238.7A CN109858427A (en) | 2019-01-24 | 2019-01-24 | A kind of corpus extraction method, device and terminal device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109858427A true CN109858427A (en) | 2019-06-07 |
Family
ID=66896298
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910077238.7A Pending CN109858427A (en) | 2019-01-24 | 2019-01-24 | A kind of corpus extraction method, device and terminal device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109858427A (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101115151A (en) * | 2007-07-10 | 2008-01-30 | 北京大学 | Method for extracting video subtitling |
CN100562074C (en) * | 2007-07-10 | 2009-11-18 | 北京大学 | The method that a kind of video caption extracts |
CN101453575A (en) * | 2007-12-05 | 2009-06-10 | 中国科学院计算技术研究所 | Video subtitle information extracting method |
CN102262644A (en) * | 2010-05-25 | 2011-11-30 | 索尼公司 | Search Apparatus, Search Method, And Program |
CN103607635A (en) * | 2013-10-08 | 2014-02-26 | 十分(北京)信息科技有限公司 | Method, device and terminal for caption identification |
CN103761261A (en) * | 2013-12-31 | 2014-04-30 | 北京紫冬锐意语音科技有限公司 | Voice recognition based media search method and device |
JP2017045027A (en) * | 2015-08-24 | 2017-03-02 | 日本放送協会 | Speech language corpus generation device and its program |
CN106971010A (en) * | 2017-05-12 | 2017-07-21 | 深圳市唯特视科技有限公司 | A kind of video abstraction generating method suitable for text query |
Non-Patent Citations (9)
Title |
---|
BRECHT DESPLANQUES et al.: "Adaptive speaker diarization of broadcast news based on", ScienceDirect * |
EKATERINA PRONOZA et al.: "A New Corpus of the Russian Social Network", Springer Nature Switzerland AG 2018 * |
PATRICIA SOTELO DIOS et al.: "Extraction of Indonesian and English Parallel Sentences from Movie Subtitles", IEEE * |
YOONA CHOI et al.: "Pansori: ASR Corpus Generation from Open Online Video Contents", ResearchGate * |
LIU JIAN: "Research on the Construction and Application of a Multimodal Interpreting Corpus", Foreign Languages in China * |
ZHANG WANGSHU: "Research on Text Recognition and Retrieval Technology in Television Video", China Master's Theses Full-text Database (Information Science and Technology) * |
LI XIANWU: "Practical Research on Digital Video Technology in Corpus Construction", China Modern Educational Equipment * |
FAN CHONGJUN et al.: "Big Data Analysis and Application", 31 January 2016 * |
CHEN SHUYUE et al.: "Detection of Title Captions in News Video Based on Grayscale Difference", Computer and Digital Engineering * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110730389A (en) * | 2019-12-19 | 2020-01-24 | 恒信东方文化股份有限公司 | Method and device for automatically generating interactive question and answer for video program |
CN111445902A (en) * | 2020-03-27 | 2020-07-24 | 北京字节跳动网络技术有限公司 | Data collection method and device, storage medium and electronic equipment |
CN111445902B (en) * | 2020-03-27 | 2023-05-30 | 北京字节跳动网络技术有限公司 | Data collection method, device, storage medium and electronic equipment |
CN112925905A (en) * | 2021-01-28 | 2021-06-08 | 北京达佳互联信息技术有限公司 | Method, apparatus, electronic device and storage medium for extracting video subtitles |
CN112925905B (en) * | 2021-01-28 | 2024-02-27 | 北京达佳互联信息技术有限公司 | Method, device, electronic equipment and storage medium for extracting video subtitles |
WO2022228235A1 (en) * | 2021-04-29 | 2022-11-03 | 华为云计算技术有限公司 | Method and apparatus for generating video corpus, and related device |
CN114495128A (en) * | 2022-04-06 | 2022-05-13 | 腾讯科技(深圳)有限公司 | Subtitle information detection method, device, equipment and storage medium |
CN116468054A (en) * | 2023-04-26 | 2023-07-21 | 中央民族大学 | Method and system for aided construction of Tibetan transliteration data set based on OCR technology |
CN116468054B (en) * | 2023-04-26 | 2023-11-07 | 中央民族大学 | Method and system for aided construction of Tibetan transliteration data set based on OCR technology |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109858427A (en) | A kind of corpus extraction method, device and terminal device | |
Harwath et al. | Deep multimodal semantic embeddings for speech and images | |
CN111968649B (en) | Subtitle correction method, subtitle display method, device, equipment and medium | |
CN109145152B (en) | Method for adaptively and intelligently generating image-text video thumbnail based on query word | |
US10304458B1 (en) | Systems and methods for transcribing videos using speaker identification | |
CN108648746A (en) | A kind of open field video natural language description generation method based on multi-modal Fusion Features | |
CN111723791A (en) | Character error correction method, device, equipment and storage medium | |
CN110866958A (en) | Method for text to image | |
CN109993040A (en) | Text recognition method and device | |
US20080095442A1 (en) | Detection and Modification of Text in a Image | |
CN106708949A (en) | Identification method of harmful content of video | |
CN114465737B (en) | Data processing method and device, computer equipment and storage medium | |
CN110796140B (en) | Subtitle detection method and device | |
WO2022089170A1 (en) | Caption area identification method and apparatus, and device and storage medium | |
WO2021129466A1 (en) | Watermark detection method, device, terminal and storage medium | |
CN113221890A (en) | OCR-based cloud mobile phone text content supervision method, system and system | |
CN106161873A (en) | A kind of video information extracts method for pushing and system | |
CN108921032A (en) | A kind of new video semanteme extracting method based on deep learning model | |
CN110072140A (en) | A kind of video information reminding method, device, equipment and storage medium | |
CN112989098B (en) | Automatic retrieval method and device for image infringement entity and electronic equipment | |
KR20210047467A (en) | Method and System for Auto Multiple Image Captioning | |
CN114548274A (en) | Multi-modal interaction-based rumor detection method and system | |
CN116708055B (en) | Intelligent multimedia audiovisual image processing method, system and storage medium | |
CN113191787A (en) | Telecommunication data processing method, device electronic equipment and storage medium | |
CN106162328A (en) | A kind of video synchronizing information methods of exhibiting and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190607 |