CN109858427A - Corpus extraction method, device and terminal device - Google Patents

Corpus extraction method, device and terminal device

Info

Publication number
CN109858427A
CN109858427A (application CN201910077238.7A)
Authority
CN
China
Prior art keywords
data
text
image
corpus
subtitling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910077238.7A
Other languages
Chinese (zh)
Inventor
周发升
何伟宝
詹逸
陈渤
杨敬慈
皮樾
李锦韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN201910077238.7A priority Critical patent/CN109858427A/en
Publication of CN109858427A publication Critical patent/CN109858427A/en
Pending legal-status Critical Current

Landscapes

  • Television Signal Processing For Recording (AREA)

Abstract

This application discloses a corpus extraction method, device and terminal device. The method includes: acquiring audio-video data, and after obtaining the caption-region speech images of the audio-video data that does not contain caption text data, intercepting the caption-region speech images at a preset frame interval to obtain multiple speech-image data items; converting the caption images in the multiple speech-image data items into multiple texts, computing the pairwise cosine similarity of the texts, and merging the texts whose cosine similarity reaches a threshold; and segmenting the first speech data corresponding to the caption images according to the merged texts to obtain the corpus of each first text unit. Compared with the prior art, this application extracts corpora by converting the caption images of audio-video material that has no subtitle file into text files and matching them with the speech data, which removes the need to collect corpora in multiple recording environments and thereby reduces the cost of corpus extraction.

Description

Corpus extraction method, device and terminal device
Technical field
This application relates to the technical field of audio-video speech information retrieval, and in particular to a corpus extraction method, device and terminal device.
Background technique
In an automatic speech recognition system, the performance and robustness of the system depend to a great extent on whether sufficiently rich corpus data is available when the recognition model is built; the corpus data resource bank is thus a key foundation of intelligent speech technology. The scale and quality of the corpora in the resource bank largely determine the breadth and depth of intelligent speech applications, and also strongly influence the user experience.
In the prior art, corpora are extracted by recording speech in order to build the corpus data resource bank. However, because the purpose of building and collecting a corpus is to provide training and test sets for a speech recognition system, the selection of speakers must cover different regions, ages, genders and education levels across the country, and recordings must be made in multiple recording environments to ensure a good match with subsequent speech recognition. As a result, the cost of corpus extraction is very high.
Summary of the invention
The technical problem to be solved by the embodiments of this application is how to reduce the cost of corpus extraction.
To solve the above problem, an embodiment of this application provides a corpus extraction method, suitable for execution on a computing device and comprising at least the following steps:
acquiring the audio-video data of audio-visual material;
taking the audio-video data that does not contain caption text data as first processing data, obtaining the caption-region speech images of the first processing data through edge detection and gray-level difference statistics, and intercepting the caption-region speech images at a preset frame interval to obtain N speech-image data items, where each speech-image data item comprises one caption image and the first speech data corresponding to that caption image, and N is a positive integer;
converting the N caption images into M texts through OCR, computing the pairwise cosine similarity of the M texts, and judging any two texts whose cosine similarity reaches a preset threshold as belonging to the same caption image, where M ≥ N and M is a positive integer;
merging the texts judged to belong to the same caption image to obtain N merged texts in one-to-one correspondence with the N caption images, and segmenting the first speech data corresponding to each caption image in the speech-image data according to the N merged texts to obtain the first text-speech data of each first text unit in the N merged texts, i.e. the corpus of each first text unit.
Further, the method also includes:
taking the audio-video data that contains the caption text data as second processing data, parsing the caption text data with regular expressions, segmenting the second speech data of the second processing data along the time axis to obtain multiple second text-speech data items, and then annotating each second text unit of the caption text data one by one according to each second text-speech data item, obtaining the corpus of each second text unit.
Further, obtaining the caption-region speech images of the first processing data through edge detection and gray-level difference statistics and intercepting the caption-region speech images at the preset frame interval to obtain the N speech-image data items is specifically:
converting the frame images of the first processing data to grayscale, performing edge detection on the converted frame images with the Sobel operator, locating the caption region of the edge-detected frame images through gray-level difference statistics to obtain the caption-region speech images, and then intercepting the caption-region speech images at the preset frame interval.
Further, converting the N caption images into M texts through OCR and computing the pairwise cosine similarity of the M texts is specifically:
after converting the N caption images into M texts through OCR, forming the M texts into pairwise comparison groups; obtaining multiple keywords of each comparison group through TF-IDF; generating, from the occurrence frequencies of the keywords in the comparison group, the two word-frequency vectors corresponding to the two texts of the comparison group; and computing the cosine similarity of the comparison group from the two word-frequency vectors.
Further, segmenting the first speech data corresponding to the caption image in the speech-image data according to the N merged texts is specifically:
processing the first speech data with voice activity detection (VAD), and segmenting the processed first speech data according to the N merged texts.
Further, a corpus extraction device is also provided, comprising:
a data acquisition module, for acquiring the audio-video data of audio-visual material;
a data interception module, for taking the audio-video data that does not contain caption text data as first processing data, obtaining the caption-region speech images of the first processing data through edge detection and gray-level difference statistics, and intercepting the caption-region speech images at a preset frame interval to obtain N speech-image data items, where each speech-image data item comprises one caption image and the first speech data corresponding to that caption image, and N is a positive integer;
a data judgment module, for converting the N caption images into M texts through OCR, computing the pairwise cosine similarity of the M texts, and judging any two texts whose cosine similarity reaches a preset threshold as belonging to the same caption image, where M ≥ N and M is a positive integer;
a first data matching module, for merging the texts judged to belong to the same caption image to obtain N merged texts in one-to-one correspondence with the N caption images, and segmenting the first speech data corresponding to each caption image in the speech-image data according to the N merged texts to obtain the first text-speech data of each first text unit in the N merged texts, i.e. the corpus of each first text unit.
Further, the device also includes:
a second data matching module, for taking the audio-video data that contains the caption text data as second processing data, parsing the caption text data with regular expressions, segmenting the second speech data of the second processing data along the time axis to obtain multiple second text-speech data items, and annotating each second text unit of the caption text data one by one according to each second text-speech data item, obtaining the corpus of each second text unit.
Further, a corpus extraction terminal device is also provided, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the corpus extraction method of any of the above embodiments when executing the computer program.
Implementing the embodiments of this application has the following beneficial effects:
The embodiments of this application provide a corpus extraction method, device and terminal device. The method includes: acquiring audio-video data; applying edge detection and gray-level difference statistics to the audio-video data that does not contain caption text data to obtain the caption-region speech images, and intercepting them at a preset frame interval to obtain multiple speech-image data items; converting the caption images in the speech-image data into multiple texts, judging from the pairwise cosine similarity whether two texts belong to the same caption image, and merging the texts that do; and segmenting the first speech data corresponding to the caption images according to the merged texts to obtain the corpus of each first text unit. Compared with the prior art, this application extracts corpora by converting the caption images of audio-video material that has no subtitle file into text files and matching them with the speech data, which removes the need to collect corpora in multiple recording environments and thereby reduces the cost of corpus extraction.
Detailed description of the invention
Fig. 1 is a flow diagram of the corpus extraction method provided by an embodiment of this application;
Fig. 2 is a flow diagram of the corpus extraction method provided by another embodiment of this application;
Fig. 3 is a flow diagram of the corpus extraction method provided by a further embodiment of this application;
Fig. 4 is a TF-IDF flow chart provided by an embodiment of this application;
Fig. 5 is a structural diagram of the corpus extraction device provided by an embodiment of this application;
Fig. 6 is a structural diagram of the corpus extraction device provided by another embodiment of this application;
Fig. 7 is an edge-detection result image provided by an embodiment of this application;
Fig. 8 is a caption-region image extraction result provided by an embodiment of this application.
Specific embodiment
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
Referring to Fig. 1, a flow diagram of the corpus extraction method provided by an embodiment of this application: as shown in Fig. 1, the corpus extraction method includes steps S11 to S14, as follows.
Step S11: acquire the audio-video data of audio-visual material.
Step S12: take the audio-video data that does not contain caption text data as first processing data; obtain the caption-region speech images of the first processing data through edge detection and gray-level difference statistics; and intercept the caption-region speech images at a preset frame interval to obtain N speech-image data items.
Here each speech-image data item comprises one caption image and the first speech data corresponding to that caption image; N is a positive integer.
Step S13: convert the N caption images into M texts through OCR, compute the pairwise cosine similarity of the M texts, and judge any two texts whose cosine similarity reaches a preset threshold as belonging to the same caption image.
Here M ≥ N and M is a positive integer.
Step S14: merge the texts judged to belong to the same caption image to obtain N merged texts in one-to-one correspondence with the N caption images; then segment the first speech data corresponding to each caption image in the speech-image data according to the N merged texts to obtain the first text-speech data of each first text unit in the N merged texts, i.e. the corpus of each first text unit.
For step S11, specifically, the audio-video data of the audio-visual material to be processed is selected and divided according to whether it contains caption text data.
For step S12, specifically, the audio-video data that does not contain caption text data is taken as first processing data; the frame images of the first processing data are converted to grayscale; edge detection is performed on the converted frame images with the Sobel operator; the caption region of the edge-detected frame images is located through gray-level difference statistics to obtain the caption-region speech images; and the caption-region speech images are intercepted at the preset frame interval to obtain N speech-image data items.
Since the edge features of the caption region are relatively salient, the position where captions appear is relatively fixed, the same caption usually stays at the same position for some time, and the caption color usually differs markedly from the surrounding background color, in this embodiment the frame images of the first processing data are loaded in RGB color space and converted to grayscale images, with the conversion formula:
Y(x, y) = 0.299 × R(x, y) + 0.587 × G(x, y) + 0.114 × B(x, y)
where Y(x, y) is the gray value of pixel (x, y), and R(x, y), G(x, y) and B(x, y) are the red, green and blue components of the RGB color at pixel (x, y).
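As an illustration only (not part of the patent), the conversion above can be sketched in Python; note that OpenCV decodes frames in BGR channel order, and the function name is illustrative:

import numpy as np

def to_gray(frame_bgr: np.ndarray) -> np.ndarray:
    # Y = 0.299 R + 0.587 G + 0.114 B, applied per pixel
    b, g, r = frame_bgr[..., 0], frame_bgr[..., 1], frame_bgr[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return y.astype(np.uint8)

# cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY) applies the same weights.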
In this embodiment, the converted grayscale image is edge-detected with the Sobel operator, specifically:
Let the grayscale image be I. I is first convolved in the horizontal direction with a kernel of odd size; for a 3 × 3 kernel, the horizontal result is
Gx = [ −1 0 +1 ; −2 0 +2 ; −1 0 +1 ] ∗ I
After the horizontal convolution is completed, I is convolved in the vertical direction with a kernel of odd size; for a 3 × 3 kernel, the vertical result is
Gy = [ −1 −2 −1 ; 0 0 0 ; +1 +2 +1 ] ∗ I
From the horizontal and vertical convolutions, the approximate gradient magnitude at every point of I is obtained:
G = √(Gx² + Gy²)
The processing result of this embodiment is shown in Fig. 7.
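A minimal sketch of the Sobel step, assuming OpenCV is used; cv2.Sobel with ksize=3 applies exactly the 3 × 3 kernels above:

import cv2
import numpy as np

def sobel_edges(gray: np.ndarray) -> np.ndarray:
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)  # horizontal convolution Gx
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)  # vertical convolution Gy
    g = np.sqrt(gx ** 2 + gy ** 2)                   # approximate gradient magnitude
    return np.clip(g, 0, 255).astype(np.uint8)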
The caption region of the frame image is then located by applying gray-level difference statistics to the edge-detected image, giving the caption-region speech image. Specifically, the gray-level difference along each row is accumulated as
E(x) = Σ_y |f(x, y) − f(x, y+1)|
where E(x) is the accumulated absolute gray-level difference of adjacent pixels along row x, and f(x, y) and f(x, y+1) are the gray values of the corresponding pixels.
The processing result of this embodiment is shown in Fig. 8.
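The localization step can be sketched as follows, under the reading of E(x) given above (row-wise accumulation of adjacent-pixel gray differences); the thresholding heuristic ratio is an assumption, not from the patent:

import numpy as np

def caption_band(edges: np.ndarray, ratio: float = 2.0):
    # E(x): accumulate |f(x, y) - f(x, y+1)| along each row x
    e = np.abs(np.diff(edges.astype(np.int32), axis=1)).sum(axis=1)
    rows = np.flatnonzero(e > ratio * e.mean())   # rows with unusually dense strokes
    if rows.size == 0:
        return None                               # no caption band found
    return int(rows.min()), int(rows.max())       # top and bottom of the caption region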
In this embodiment, after the caption-region speech images are obtained, they are intercepted every 7 frames to obtain multiple speech-image data items, each comprising one caption image and the first speech data corresponding to that caption image.
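A sketch of this interception, assuming OpenCV decoding and a caption band (top, bottom) located as above; pairing each crop with its timestamp, so that the corresponding first speech data can be cut later, is an illustrative choice rather than a detail fixed by the patent:

import cv2

def intercept_captions(video_path: str, top: int, bottom: int, step: int = 7):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    samples, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                                 # one sample every step frames
            samples.append((idx / fps, frame[top:bottom]))  # (timestamp, caption crop)
        idx += 1
    cap.release()
    return samples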
For step S13, specifically, the N caption images are converted into M texts through OCR; the M texts are formed into pairwise comparison groups; multiple keywords of each comparison group are obtained through TF-IDF; the two word-frequency vectors corresponding to the two texts of the comparison group are generated from the occurrence frequencies of the keywords in the comparison group; the cosine similarity of the comparison group is computed from the two word-frequency vectors; and any two texts whose cosine similarity reaches the preset threshold are judged as belonging to the same caption image.
In this embodiment, as shown in Fig. 4, after the texts are generated through OCR, the keywords in the caption texts are obtained through TF-IDF, specifically:
TF-IDF = TF(i,j) × IDF(i)
TF(i,j) = n(i,j) / Σ_k n(k,j)
where TF(i,j) measures the importance of text unit t(i) in text d(j): n(i,j) is the number of occurrences of the text unit in text d(j), and the denominator is the total number of occurrences of all text units in d(j).
IDF(i) = log( |D| / |{ j : t(i) ∈ d(j) }| )
where |D| is the total number of documents in the corpus and |{ j : t(i) ∈ d(j) }| is the number of documents containing word t(i) (i.e. documents with n(i,j) ≠ 0). If the word does not appear in the corpus, the denominator would be zero, so 1 + |{ j : t(i) ∈ d(j) }| is normally used instead.
It should be noted that in this embodiment, a Siamese LSTM can be used instead of TF-IDF to obtain the keywords.
In this embodiment, after the keywords in the caption texts are obtained by the above TF-IDF algorithm, the two word-frequency vectors A and B corresponding to the two texts of a comparison group are generated from the frequencies with which the keywords occur in the comparison group. The cosine similarity θ is given by the dot product and the vector lengths, specifically:
cos(θ) = (A · B) / (‖A‖ × ‖B‖)
In this embodiment, when the cosine value cos(θ) of two word-frequency vectors reaches the preset threshold of 0.67, the two texts corresponding to the two word-frequency vectors are judged to have been converted from the same caption image.
It should be noted that the preset threshold can be any value between 0.65 and 0.7, which guarantees a reliable judgment of the similarity of two texts.
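A sketch of the comparison in step S13, with pytesseract standing in for the unspecified OCR engine and jieba for Chinese word segmentation (both are assumptions); for brevity it vectorizes all texts over one shared TF-IDF vocabulary instead of building word-frequency vectors per comparison group:

import itertools
import jieba
import pytesseract
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ocr_texts(caption_crops):
    # convert each caption image into a text
    return [pytesseract.image_to_string(c, lang="chi_sim").strip() for c in caption_crops]

def same_caption_pairs(texts, threshold: float = 0.67):
    vec = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None)
    sims = cosine_similarity(vec.fit_transform(texts))
    # pairs whose cosine value reaches the threshold are judged to share one caption image
    return [(i, j) for i, j in itertools.combinations(range(len(texts)), 2)
            if sims[i, j] >= threshold]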
For step S14, specifically, the multiple texts are merged to obtain N merged texts in one-to-one correspondence with the N caption images; the first speech data corresponding to the caption images is processed with VAD; and the processed first speech data is segmented according to the N merged texts to obtain the corpus of each first text unit in the N merged texts.
In this embodiment, VAD is used to remove long silent periods from the speech signal stream of the first speech data, which greatly reduces the amount of data to be processed in later stages such as speech recognition.
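The patent names only VAD; as one possible realization, the webrtcvad package can drop silent frames from 16 kHz, 16-bit mono PCM (a sketch under those assumptions; frame lengths must be 10, 20 or 30 ms for this library):

import webrtcvad

def drop_silence(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30) -> bytes:
    vad = webrtcvad.Vad(2)                      # aggressiveness 0 (lenient) to 3 (strict)
    n = sample_rate * frame_ms // 1000 * 2      # bytes per frame (2 bytes per sample)
    voiced = bytearray()
    for off in range(0, len(pcm) - n + 1, n):
        frame = pcm[off:off + n]
        if vad.is_speech(frame, sample_rate):   # keep only voiced frames
            voiced.extend(frame)
    return bytes(voiced)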
The embodiment of this application provides a corpus extraction method: audio-video data is acquired; the audio-video data that does not contain caption text data is processed with edge detection and gray-level difference statistics to obtain the caption-region speech images, which are intercepted at a preset frame interval to obtain multiple speech-image data items; the caption images in the speech-image data are converted into multiple texts, the pairwise cosine similarity of the texts is used to judge whether two texts belong to the same caption image, and the texts that do are merged; the first speech data corresponding to the caption images is then segmented according to the merged texts to obtain the corpus of each first text unit. Compared with the prior art, this application extracts corpora by converting the caption images of audio-video material that has no subtitle file into text files and matching them with the speech data, which removes the need to collect corpora in multiple recording environments and thereby reduces the cost of corpus extraction.
Please refer to Figs. 2 and 3.
Referring to Fig. 2, a flow diagram of a corpus extraction method provided by another embodiment of this application: in addition to the steps shown in Fig. 1, it also includes:
Step S15: take the audio-video data that contains caption text data as second processing data; parse the caption text data with regular expressions; segment the second speech data of the second processing data along the time axis to obtain multiple second text-speech data items; and, according to each second text-speech data item, annotate each second text unit of the caption text data one by one to obtain the corpus of each second text unit.
In this embodiment, when the acquired audio-video data contains caption text data, the subtitle file is parsed directly with regular expressions to obtain multiple second text units, and the second speech data is cut along the time axis; the second speech data is then processed with VAD, and each second text unit is annotated one by one with the processed second speech data to obtain the corpus of each second text unit.
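The patent does not fix a subtitle format; assuming the common SRT format, the regular-expression parsing and the time-axis cutting points can be sketched as:

import re

SRT_BLOCK = re.compile(
    r"(\d+)\s*\n"                                                        # cue index
    r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})\s*\n"  # time axis
    r"(.+?)(?:\n\n|\Z)",                                                 # caption text
    re.S)

def parse_srt(srt_text: str):
    # yield (start_seconds, end_seconds, text): the times cut the second speech data,
    # the text gives the second text units to be annotated
    def secs(hms, ms):
        h, m, s = map(int, hms.split(":"))
        return h * 3600 + m * 60 + s + int(ms) / 1000.0
    for _idx, st, st_ms, en, en_ms, text in SRT_BLOCK.findall(srt_text):
        yield secs(st, st_ms), secs(en, en_ms), text.replace("\n", " ").strip()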
The embodiment of this application provides a corpus extraction method: audio-video data is acquired and divided, according to whether a subtitle file exists, into first processing data without a subtitle file and second processing data with a subtitle file. The caption-region speech images of the first processing data are intercepted at a preset frame interval; the caption images of the caption-region speech images are converted into multiple texts; the pairwise cosine similarity of the texts is used to judge whether two texts belong to the same caption image; the texts that belong to the same caption image are merged; and the first speech data corresponding to the caption images is segmented according to the merged texts to obtain the corpus of each first text unit. For the second processing data, the subtitle file is parsed with regular expressions to obtain multiple second text units, the second speech data is cut along the time axis, and each second text unit is annotated with the second speech data to obtain the corpus of each second text unit. Compared with the prior art, this application extracts corpora by converting the caption images of audio-video material that has no subtitle file into text files and matching them with the speech data, which removes the need to collect corpora in multiple recording environments and thereby reduces the cost of corpus extraction.
In addition, the text corpus can be obtained conveniently and efficiently from the caption text, further reducing the cost of corpus extraction.
Please refer to Fig. 5.
Referring to Fig. 5, a structural diagram of the corpus extraction device provided by an embodiment of this application, comprising:
a data acquisition module 101, for acquiring the audio-video data of audio-visual material.
In this embodiment, the data acquisition module 101 is specifically used to select the audio-video data of the audio-visual material to be processed and divide it according to whether it contains caption text data.
a data interception module 102, for taking the audio-video data that does not contain caption text data as first processing data, obtaining the caption-region speech images of the first processing data through edge detection and gray-level difference statistics, and intercepting the caption-region speech images at a preset frame interval to obtain N speech-image data items.
Here each speech-image data item comprises one caption image and the first speech data corresponding to that caption image; N is a positive integer.
In this embodiment, the data interception module 102 is specifically used to take the audio-video data that does not contain caption text data as first processing data, convert the frame images of the first processing data to grayscale, perform edge detection on the converted frame images with the Sobel operator, locate the caption region of the edge-detected frame images through gray-level difference statistics to obtain the caption-region speech images, and intercept the caption-region speech images at the preset frame interval to obtain the N speech-image data items.
a data judgment module 103, for converting the N caption images into M texts through OCR, computing the pairwise cosine similarity of the M texts, and judging any two texts whose cosine similarity reaches a preset threshold as belonging to the same caption image.
Here M ≥ N and M is a positive integer.
In this embodiment, the data judgment module 103 is specifically used to convert the N caption images into M texts through OCR, form the M texts into pairwise comparison groups, obtain multiple keywords of each comparison group through TF-IDF, generate the two word-frequency vectors corresponding to the two texts of the comparison group from the occurrence frequencies of the keywords in the comparison group, compute the cosine similarity of the comparison group from the two word-frequency vectors, and judge any two texts whose cosine similarity reaches the preset threshold as belonging to the same caption image.
a first data matching module 104, for merging the texts judged to belong to the same caption image to obtain N merged texts in one-to-one correspondence with the N caption images, and segmenting the first speech data corresponding to each caption image in the speech-image data according to the N merged texts to obtain the first text-speech data of each first text unit in the N merged texts, i.e. the corpus of each first text unit.
In this embodiment, the first data matching module 104 is specifically used to merge the multiple texts to obtain the N merged texts in one-to-one correspondence with the N caption images, process the first speech data corresponding to the caption images with VAD, and segment the processed first speech data according to the N merged texts to obtain the corpus of each first text unit in the N merged texts.
The embodiment of this application provides a corpus extraction method and device. The method includes: acquiring audio-video data; applying edge detection and gray-level difference statistics to the audio-video data that does not contain caption text data to obtain the caption-region speech images, and intercepting them at a preset frame interval to obtain multiple speech-image data items; converting the caption images in the speech-image data into multiple texts, judging from the pairwise cosine similarity whether two texts belong to the same caption image, and merging the texts that do; and segmenting the first speech data corresponding to the caption images according to the merged texts to obtain the corpus of each first text unit. Compared with the prior art, this application extracts corpora by converting the caption images of audio-video material that has no subtitle file into text files and matching them with the speech data, which removes the need to collect corpora in multiple recording environments and thereby reduces the cost of corpus extraction.
Please refer to Fig. 6.
Referring to Fig. 6, a structural diagram of the corpus extraction device provided by another embodiment of this application, which, in addition to the structure shown in Fig. 5, also comprises:
a second data matching module 105, for taking the audio-video data that contains caption text data as second processing data, parsing the caption text data with regular expressions, segmenting the second speech data of the second processing data along the time axis to obtain multiple second text-speech data items, and annotating each second text unit of the caption text data one by one according to each second text-speech data item, obtaining the corpus of each second text unit.
In this embodiment, the second data matching module 105 is specifically used, when the acquired audio-video data contains caption text data, to parse the subtitle file directly with regular expressions to obtain multiple second text units, cut the second speech data along the time axis, process the second speech data with VAD, and annotate each second text unit one by one with the processed second speech data to obtain the corpus of each second text unit.
The embodiment of this application provides a corpus extraction method and device. The method includes: acquiring audio-video data and dividing it, according to whether a subtitle file exists, into first processing data without a subtitle file and second processing data with a subtitle file; intercepting the caption-region speech images of the first processing data at a preset frame interval, converting the caption images of the caption-region speech images into multiple texts, judging from the pairwise cosine similarity whether two texts belong to the same caption image, merging the texts that belong to the same caption image, and segmenting the first speech data corresponding to the caption images according to the merged texts to obtain the corpus of each first text unit; and parsing the subtitle file of the second processing data with regular expressions to obtain multiple second text units, cutting the second speech data along the time axis, and annotating each second text unit with the second speech data to obtain the corpus of each second text unit. Compared with the prior art, this application extracts corpora by converting the caption images of audio-video material that has no subtitle file into text files and matching them with the speech data, which removes the need to collect corpora in multiple recording environments and thereby reduces the cost of corpus extraction.
In addition, the text corpus can be obtained conveniently and efficiently from the caption text, further reducing the cost of corpus extraction.
A further embodiment of this application also provides a corpus extraction terminal device, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the corpus extraction method described in the above embodiments when executing the computer program.
The above are preferred embodiments of this application. It should be noted that a person skilled in the art can make several improvements and modifications without departing from the principles of this application, and these improvements and modifications are also regarded as falling within the protection scope of this application.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and when executed it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

Claims (8)

1. A corpus extraction method, characterized by comprising at least the following steps:
acquiring the audio-video data of audio-visual material;
taking the audio-video data that does not contain caption text data as first processing data, obtaining the caption-region speech images of the first processing data through edge detection and gray-level difference statistics, and intercepting the caption-region speech images at a preset frame interval to obtain N speech-image data items, wherein each speech-image data item comprises one caption image and the first speech data corresponding to that caption image, and N is a positive integer;
converting the N caption images into M texts through OCR, computing the pairwise cosine similarity of the M texts, and judging any two texts whose cosine similarity reaches a preset threshold as belonging to the same caption image, wherein M ≥ N and M is a positive integer;
merging the texts judged to belong to the same caption image to obtain N merged texts in one-to-one correspondence with the N caption images, and segmenting the first speech data corresponding to each caption image in the speech-image data according to the N merged texts to obtain the first text-speech data of each first text unit in the N merged texts, i.e. the corpus of each first text unit.
2. The corpus extraction method according to claim 1, characterized by further comprising:
taking the audio-video data that contains the caption text data as second processing data, parsing the caption text data with regular expressions, segmenting the second speech data of the second processing data along the time axis to obtain multiple second text-speech data items, and annotating each second text unit of the caption text data one by one according to each second text-speech data item, obtaining the corpus of each second text unit.
3. The corpus extraction method according to claim 1, characterized in that obtaining the caption-region speech images of the first processing data through edge detection and gray-level difference statistics and intercepting the caption-region speech images at the preset frame interval to obtain the N speech-image data items is specifically:
converting the frame images of the first processing data to grayscale, performing edge detection on the converted frame images with the Sobel operator, locating the caption region of the edge-detected frame images through gray-level difference statistics to obtain the caption-region speech images, and intercepting the caption-region speech images at the preset frame interval.
4. The corpus extraction method according to claim 1, characterized in that converting the N caption images into M texts through OCR and computing the pairwise cosine similarity of the M texts is specifically:
after converting the N caption images into M texts through OCR, forming the M texts into pairwise comparison groups, obtaining multiple keywords of each comparison group through TF-IDF, generating the two word-frequency vectors corresponding to the two texts of the comparison group from the occurrence frequencies of the keywords in the comparison group, and computing the cosine similarity of the comparison group from the two word-frequency vectors.
5. The corpus extraction method according to claim 1, characterized in that segmenting the first speech data corresponding to the caption image in the speech-image data according to the N merged texts is specifically:
processing the first speech data with VAD, and segmenting the processed first speech data according to the N merged texts.
6. A corpus extraction device, characterized by comprising:
a data acquisition module, for acquiring the audio-video data of audio-visual material;
a data interception module, for taking the audio-video data that does not contain caption text data as first processing data, obtaining the caption-region speech images of the first processing data through edge detection and gray-level difference statistics, and intercepting the caption-region speech images at a preset frame interval to obtain N speech-image data items, wherein each speech-image data item comprises one caption image and the first speech data corresponding to that caption image, and N is a positive integer;
a data judgment module, for converting the N caption images into M texts through OCR, computing the pairwise cosine similarity of the M texts, and judging any two texts whose cosine similarity reaches a preset threshold as belonging to the same caption image, wherein M ≥ N and M is a positive integer;
a first data matching module, for merging the texts judged to belong to the same caption image to obtain N merged texts in one-to-one correspondence with the N caption images, and segmenting the first speech data corresponding to each caption image in the speech-image data according to the N merged texts to obtain the first text-speech data of each first text unit in the N merged texts, i.e. the corpus of each first text unit.
7. The corpus extraction device according to claim 6, characterized by further comprising:
a second data matching module, for taking the audio-video data that contains the caption text data as second processing data, parsing the caption text data with regular expressions, segmenting the second speech data of the second processing data along the time axis to obtain multiple second text-speech data items, and annotating each second text unit of the caption text data one by one according to each second text-speech data item, obtaining the corpus of each second text unit.
8. A corpus extraction terminal device, characterized by comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the corpus extraction method of any one of claims 1 to 5 when executing the computer program.
CN201910077238.7A 2019-01-24 2019-01-24 Corpus extraction method, device and terminal device Pending CN109858427A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910077238.7A CN109858427A (en) 2019-01-24 2019-01-24 Corpus extraction method, device and terminal device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910077238.7A CN109858427A (en) 2019-01-24 2019-01-24 Corpus extraction method, device and terminal device

Publications (1)

Publication Number Publication Date
CN109858427A true CN109858427A (en) 2019-06-07

Family

ID=66896298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910077238.7A Pending CN109858427A (en) 2019-01-24 2019-01-24 Corpus extraction method, device and terminal device

Country Status (1)

Country Link
CN (1) CN109858427A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110730389A (en) * 2019-12-19 2020-01-24 恒信东方文化股份有限公司 Method and device for automatically generating interactive question and answer for video program
CN111445902A (en) * 2020-03-27 2020-07-24 北京字节跳动网络技术有限公司 Data collection method and device, storage medium and electronic equipment
CN112925905A (en) * 2021-01-28 2021-06-08 北京达佳互联信息技术有限公司 Method, apparatus, electronic device and storage medium for extracting video subtitles
CN114495128A (en) * 2022-04-06 2022-05-13 腾讯科技(深圳)有限公司 Subtitle information detection method, device, equipment and storage medium
WO2022228235A1 (en) * 2021-04-29 2022-11-03 华为云计算技术有限公司 Method and apparatus for generating video corpus, and related device
CN116468054A (en) * 2023-04-26 2023-07-21 中央民族大学 Method and system for aided construction of Tibetan transliteration data set based on OCR technology

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101115151A (en) * 2007-07-10 2008-01-30 北京大学 Method for extracting video subtitling
CN100562074C (en) * 2007-07-10 2009-11-18 北京大学 The method that a kind of video caption extracts
CN101453575A (en) * 2007-12-05 2009-06-10 中国科学院计算技术研究所 Video subtitle information extracting method
CN102262644A (en) * 2010-05-25 2011-11-30 索尼公司 Search Apparatus, Search Method, And Program
CN103607635A (en) * 2013-10-08 2014-02-26 十分(北京)信息科技有限公司 Method, device and terminal for caption identification
CN103761261A (en) * 2013-12-31 2014-04-30 北京紫冬锐意语音科技有限公司 Voice recognition based media search method and device
JP2017045027A (en) * 2015-08-24 2017-03-02 日本放送協会 Speech language corpus generation device and its program
CN106971010A (en) * 2017-05-12 2017-07-21 深圳市唯特视科技有限公司 A kind of video abstraction generating method suitable for text query

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
BRECHT DESPLANQUES et al.: "Adaptive speaker diarization of broadcast news based on", ScienceDirect *
EKATERINA PRONOZA et al.: "A New Corpus of the Russian Social Network", Springer Nature Switzerland AG 2018 *
PATRICIA SOTELO DIOS et al.: "Extraction of Indonesian and English Parallel Sentences from Movie Subtitles", IEEE *
YOONA CHOI et al.: "Pansori: ASR Corpus Generation from Open Online Video Contents", ResearchGate *
LIU Jian: "Construction and Application of a Multimodal Interpreting Corpus", Foreign Languages in China *
ZHANG Wangshu: "Research on Text Recognition and Retrieval Technology in TV Video", China Master's Theses Full-text Database (Information Science and Technology) *
LI Xianwu: "Practical Research on Digital Video Technology in Corpus Construction", China Modern Educational Equipment *
FAN Chongjun et al.: "Big Data Analysis and Application", 31 January 2016 *
CHEN Shuyue et al.: "Detection of News Video Title Captions Based on Gray-Level Difference", Computer and Digital Engineering *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110730389A (en) * 2019-12-19 2020-01-24 恒信东方文化股份有限公司 Method and device for automatically generating interactive question and answer for video program
CN111445902A (en) * 2020-03-27 2020-07-24 北京字节跳动网络技术有限公司 Data collection method and device, storage medium and electronic equipment
CN111445902B (en) * 2020-03-27 2023-05-30 北京字节跳动网络技术有限公司 Data collection method, device, storage medium and electronic equipment
CN112925905A (en) * 2021-01-28 2021-06-08 北京达佳互联信息技术有限公司 Method, apparatus, electronic device and storage medium for extracting video subtitles
CN112925905B (en) * 2021-01-28 2024-02-27 北京达佳互联信息技术有限公司 Method, device, electronic equipment and storage medium for extracting video subtitles
WO2022228235A1 (en) * 2021-04-29 2022-11-03 华为云计算技术有限公司 Method and apparatus for generating video corpus, and related device
CN114495128A (en) * 2022-04-06 2022-05-13 腾讯科技(深圳)有限公司 Subtitle information detection method, device, equipment and storage medium
CN116468054A (en) * 2023-04-26 2023-07-21 中央民族大学 Method and system for aided construction of Tibetan transliteration data set based on OCR technology
CN116468054B (en) * 2023-04-26 2023-11-07 中央民族大学 Method and system for aided construction of Tibetan transliteration data set based on OCR technology

Similar Documents

Publication Publication Date Title
CN109858427A (en) Corpus extraction method, device and terminal device
Harwath et al. Deep multimodal semantic embeddings for speech and images
CN111968649B (en) Subtitle correction method, subtitle display method, device, equipment and medium
CN109145152B (en) Method for adaptively and intelligently generating image-text video thumbnail based on query word
US10304458B1 (en) Systems and methods for transcribing videos using speaker identification
CN108648746A (en) Open-domain video natural language description generation method based on multi-modal feature fusion
CN111723791A (en) Character error correction method, device, equipment and storage medium
CN110866958A (en) Method for text to image
CN109993040A (en) Text recognition method and device
US20080095442A1 (en) Detection and Modification of Text in a Image
CN106708949A (en) Identification method of harmful content of video
CN114465737B (en) Data processing method and device, computer equipment and storage medium
CN110796140B (en) Subtitle detection method and device
WO2022089170A1 (en) Caption area identification method and apparatus, and device and storage medium
WO2021129466A1 (en) Watermark detection method, device, terminal and storage medium
CN113221890A (en) OCR-based cloud mobile phone text content supervision method, system and system
CN106161873A (en) Video information extraction and push method and system
CN108921032A (en) New video semantics extraction method based on a deep learning model
CN110072140A (en) Video information prompting method, apparatus, device and storage medium
CN112989098B (en) Automatic retrieval method and device for image infringement entity and electronic equipment
KR20210047467A (en) Method and System for Auto Multiple Image Captioning
CN114548274A (en) Multi-modal interaction-based rumor detection method and system
CN116708055B (en) Intelligent multimedia audiovisual image processing method, system and storage medium
CN113191787A (en) Telecommunication data processing method, device electronic equipment and storage medium
CN106162328A (en) Video synchronization information display method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190607)