CN114996506A - Corpus generation method and device, electronic equipment and computer-readable storage medium - Google Patents


Info

Publication number
CN114996506A
Authority
CN
China
Prior art keywords
video
text
content
audio
candidate
Prior art date
Legal status
Pending
Application number
CN202210572357.1A
Other languages
Chinese (zh)
Inventor
王书培
刘攀
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210572357.1A
Publication of CN114996506A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70: Information retrieval of video data
    • G06F16/73: Querying
    • G06F16/735: Filtering based on additional data, e.g. user or group profiles
    • G06F16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783: Retrieval using metadata automatically derived from the content
    • G06F16/7834: Retrieval using metadata automatically derived from the content, using audio features
    • G06F16/7844: Retrieval using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements using pattern recognition or machine learning
    • G06V10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761: Proximity, similarity or dissimilarity measures
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The embodiment of the invention discloses a corpus generation method, a corpus generation device, an electronic device and a computer-readable storage medium. The method comprises: acquiring at least one candidate video and performing text recognition on video frames of the candidate video to obtain subtitle content of the candidate video; extracting audio content from the candidate video and converting the audio content into text content; calculating the similarity between the subtitle content and the text content to obtain the text similarity of the candidate video; screening out at least one target video of a target language from the candidate videos according to the text similarity; and generating a corpus corresponding to the target language based on the audio content and the subtitle content of the target video. The scheme can greatly improve the accuracy of corpus generation for speech recognition.

Description

Corpus generation method and device, electronic equipment and computer-readable storage medium
Technical Field
The invention relates to the field of communication technologies, and in particular to a corpus generation method and device, an electronic device, and a computer-readable storage medium.
Background
In recent years, with the rapid development of Internet technology, corpora have become increasingly important in the field of language recognition, and the accuracy of a corpus often determines the accuracy of language recognition. It is therefore necessary to generate accurate corpora. Existing corpus generation methods perform speech recognition first and then rely on manual annotation to correct the results.
In the process of research and practice of the prior art, the inventor of the present invention found that the manual approach often requires a large amount of human resources and is prone to errors; in addition, for some special languages spoken only within a small region, the accuracy of speech recognition is often low, so the accuracy of corpus generation is also low.
Disclosure of Invention
The embodiment of the invention provides a corpus generating method and device, electronic equipment and a computer-readable storage medium, which can improve the accuracy of corpus generation.
A corpus generation method includes:
acquiring at least one candidate video, and performing text recognition on video frames of the candidate video to obtain subtitle content of the candidate video;
extracting audio content from the candidate video, and converting the audio content into text content;
calculating the similarity between the subtitle content and the text content to obtain the text similarity of the candidate video;
screening at least one target video of a target language from the candidate videos according to the text similarity;
and generating the corpus corresponding to the target language based on the audio content and the subtitle content of the target video.
Correspondingly, an embodiment of the present invention provides a corpus generating device, including:
the acquisition unit is used for acquiring at least one candidate video and performing text recognition on video frames of the candidate video to obtain subtitle content of the candidate video;
the conversion unit is used for extracting audio content from the candidate videos and converting the audio content into text content;
the calculation unit is used for calculating the similarity between the subtitle content and the text content to obtain the text similarity of the candidate video;
the screening unit is used for screening at least one target video of a target language from the candidate videos according to the text similarity;
and the generating unit is used for generating the corpus corresponding to the target language based on the audio content and the subtitle content of the target video.
Optionally, in some embodiments, the computing unit may be specifically configured to identify a subtitle character string in the subtitle content and identify a text character string in the text content; calculating the conversion operation times between the caption character strings and the text character strings to obtain the class editing distance between the caption character strings and the text character strings; and determining the text similarity of the candidate videos based on the subtitle character strings, the text character strings and the class editing distance.
Optionally, in some embodiments, the calculation unit may be specifically configured to fuse the subtitle character string and the text character string to obtain a character string distance; calculating a distance difference between the class editing distance and the character string distance; and calculating the ratio of the distance difference to the character string distance to obtain the text similarity of the candidate video.
Optionally, in some embodiments, the obtaining unit may be specifically configured to frame the candidate video, and screen out a key video frame from the framed video frames; positioning a target position area in the key video frame to obtain a subtitle area of the candidate video; and identifying a text corresponding to the subtitle area in the video frame to obtain the subtitle content of the candidate video.
Optionally, in some embodiments, the obtaining unit may be specifically configured to perform text recognition on the framed video frame to obtain a video frame text of the video frame; classifying the video frames based on the video frame texts to obtain a video frame set corresponding to each video frame text; and sequencing the video frames in the video frame set according to the playing time corresponding to the video frames, and screening out key video frames in the video frame set based on a sequencing result.
Optionally, in some embodiments, the obtaining unit may be specifically configured to screen at least one key video frame text of the key video frames from the video frame texts, and identify text position information of each key video frame text in the key video frame; screening target position information from the text position information based on the key video frame text; and positioning a position area corresponding to the target position information in the key video frame to obtain a subtitle area of the candidate video.
Optionally, in some embodiments, the obtaining unit may be specifically configured to obtain a basic video set of a target language according to a preset keyword; identifying a video type and a confidence level of the video type of each video in the base video set; and screening at least one candidate video in the base video set based on the video type and the confidence coefficient.
Optionally, in some embodiments, the obtaining unit may be specifically configured to perform audio detection on an audio frame of each video in the base video set to obtain an audio type of the audio frame; performing silence detection on the video, and performing audio cutting on the video based on a detection result to obtain at least one audio clip; and extracting the characteristics of the audio segments, and determining the video type of the video and the confidence coefficient of the video type based on the extracted audio characteristics and the audio type.
Optionally, in some embodiments, the obtaining unit may be specifically configured to determine a voice type of the audio segment and classification information of the voice type according to the audio type and the audio feature; acquiring the audio time length of the audio clip, and determining the classification weight of the voice type based on the audio time length; and according to the classification weight and the classification information, fusing the voice types corresponding to the audio segments of the videos to obtain the video types of the videos and the confidence degrees of the video types.
Optionally, in some embodiments, the generating unit may be specifically configured to screen target subtitle content of the target video from the subtitle content; extracting a time axis corresponding to the target subtitle content from the target video; and taking the audio content, the target subtitle content and the time axis of the target video as initial linguistic data, and sending the initial linguistic data to a verification server for verification to obtain the linguistic data of the target language.
In addition, an embodiment of the present invention further provides an electronic device, which includes a processor and a memory, where the memory stores an application program, and the processor is configured to run the application program in the memory to implement the corpus generating method provided in the embodiment of the present invention.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, where multiple instructions are stored in the computer-readable storage medium, and the instructions are suitable for being loaded by a processor to perform steps in any corpus generating method provided in the embodiment of the present invention.
The method comprises: obtaining at least one candidate video and performing text recognition on video frames of the candidate video to obtain subtitle content of the candidate video; extracting audio content from the candidate video and converting the audio content into text content; calculating the similarity between the subtitle content and the text content to obtain the text similarity of the candidate video; screening out at least one target video of a target language from the candidate videos according to the text similarity; and generating a corpus corresponding to the target language based on the audio content and the subtitle content of the target video. According to the scheme, the subtitle content can be identified in the candidate video and the audio content of the candidate video can be converted into text content; the target video of the target language is then accurately screened out according to the similarity between the subtitle content and the text content, and the subtitle content of the target video can serve as a reference for manual annotation, so the accuracy of corpus generation can be greatly improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of a scenario of a corpus generating method according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a corpus generating method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a search for dialect videos provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of the voice type of an audio clip provided by an embodiment of the invention;
FIG. 5 is a diagram of a method for filtering key video frames according to an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating a flow of recognizing dialect videos according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating dialect corpus identification according to an embodiment of the present invention;
FIG. 8 is a schematic overall flow chart of corpus generation according to an embodiment of the present invention;
FIG. 9 is a flowchart illustrating dialect corpus generation according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of another process of corpus generation according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a corpus generating device according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a corpus generating method and device, electronic equipment and a computer-readable storage medium. The corpus generating device may be integrated in an electronic device, and the electronic device may be a server or a terminal.
The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), big data, and artificial intelligence platforms. The terminal includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, a smart appliance, a vehicle-mounted terminal, an aircraft, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
For example, referring to fig. 1, taking an example that a corpus generating device is integrated in an electronic device, the electronic device obtains at least one candidate video, performs text recognition on a video frame of the candidate video to obtain subtitle content of the candidate video, extracts audio content from the candidate video, converts the audio content into text content, calculates similarity between the subtitle content and the text content to obtain text similarity of the candidate video, then screens out at least one target video in a target language from the candidate video according to the text similarity, generates a corpus corresponding to the target language based on the audio content and the subtitle content of the target video, and further improves the accuracy of corpus generation.
The corpus can be labeled audio content, mainly comprising an audio file and an annotation text corresponding to the audio file, where the annotation text corresponds one-to-one to the audio content in the audio file through a time axis or a similar mechanism. Corpora are the basic units that make up a corpus library, i.e., a large-scale electronic text library that has been scientifically sampled and processed and stores the linguistic material that actually appears in real language use. A corpus can generally be used for training an acoustic model or an audio recognition model, and can also be used in scenarios such as question-and-answer search.
The corpus generation method provided by the embodiment of the application relates to the speech technology and natural language processing (NLP) directions in the field of artificial intelligence. For example, the embodiment of the application can perform text recognition on the video frames of a candidate video, extract audio content from the candidate video, and convert the audio content into text content.
Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the implementation method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject, and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Among the key technologies of speech technology are automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech is expected to become one of the preferred modes of human-computer interaction.
Among them, natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science, and mathematics; research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
It is to be understood that, when the following embodiments of the present application are applied to specific products or technologies, the collection of subject-related data such as candidate videos requires the subject's permission or consent, and the collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
The following are detailed below. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
This embodiment will be described in terms of a corpus generating device, which may be specifically integrated in an electronic device, where the electronic device may be a server or a terminal; the terminal may include a tablet Computer, a notebook Computer, a Personal Computer (PC), a wearable device, a virtual reality device, or other intelligent devices capable of generating corpora.
A corpus generation method includes:
the method comprises the steps of obtaining at least one candidate video, carrying out text recognition on video frames of the candidate video to obtain subtitle content of the candidate video, extracting audio content from the candidate video, converting the audio content into text content, calculating the similarity between the subtitle content and the text content to obtain the text similarity of the candidate video, screening at least one target video of a target language from the candidate video according to the text similarity, and generating corpus corresponding to the target language based on the audio content and the subtitle content of the target video.
As shown in fig. 2, the specific process of the corpus generating method is as follows:
101. and acquiring at least one candidate video, and performing text recognition on video frames of the candidate video to obtain subtitle content of the candidate video.
The subtitle content is the textual information of the subtitles in the video frames. Subtitles present non-image content such as dialogue in television, film, and stage works in text form, and also generally refer to text added to audiovisual works in post-production. Commentary and other text appearing below a movie or television screen, such as the title, credits, lyrics, dialogue, captions, and explanatory text introducing people, place names, or dates, are all called subtitles. Dialogue subtitles in film and television works typically appear at the bottom of the screen, while subtitles in theatrical works may appear on either side of or above the stage.
The method for acquiring at least one candidate video may be various, and specifically may be as follows:
for example, a base video set of the target language may be obtained according to preset keywords, a video type and a confidence level of the video type of each video are identified in the base video set, and at least one candidate video is screened out from the base video set based on the video type and the confidence level.
The method for acquiring the basic video set of the target language according to the preset keywords may be various, for example, the preset keywords may be acquired, the target keywords of the target language are screened out from the preset keywords, and the original video is acquired on a network or a video platform based on the target keywords, so as to obtain the basic video set.
For example, if the target language is a dialect, the target keyword may be Sichuan dialect, Chongqing dialect, Northeastern dialect, Shanghai dialect, or the like. Based on the target keywords, videos that are likely to contain the dialect can be searched, so that the base video set is obtained. Taking Sichuan dialect as the target keyword as an example, original videos containing Sichuan dialect can be searched on the video platform, and the search process may be as shown in FIG. 3.
After the base video set is obtained, the video type of each video and the confidence of the video type can be identified in the base video set. The video type can be understood as a scene tag of the audio data in the video, mainly used to determine the audio scene of the audio data in the video; the audio scene may be of various kinds, for example, speech, songs, crowds, and so on. The method for identifying the video type of each video in the base video set may be various; for example, audio detection is performed on the audio frames of each video in the base video set to obtain the audio type of each audio frame, silence detection is performed on the video, audio cutting is performed on the video based on the detection result to obtain at least one audio segment, feature extraction is performed on the audio segments, and the video type of the video and the confidence of the video type are determined based on the extracted audio features and the audio types.
The audio type is used to indicate whether an audio frame is speech; thus, the audio type may include a speech tag and a non-speech tag. For example, audio information is extracted from the video, the audio information is framed to obtain at least one audio frame, and voice activity detection (VAD) is used to perform audio detection on the audio frames to obtain the audio type of each audio frame.
For example, based on the detection result, silence intervals consisting of silent audio frames are identified in the audio information of the video, and the audio corresponding to the silence intervals is deleted from the audio information of the video, so that at least one audio segment can be obtained.
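By way of illustration only (the patent does not specify a particular VAD implementation), the frame-level audio detection and silence-based cutting described above can be sketched as follows. The webrtcvad package, the assumption of 16 kHz 16-bit mono PCM audio, and all function and parameter names are assumptions of this sketch rather than part of the original disclosure.

    # Sketch: frame-level voice activity detection and silence-based audio cutting.
    # Assumes 16 kHz, 16-bit mono PCM; webrtcvad is one possible VAD backend.
    import webrtcvad

    def cut_speech_segments(pcm_bytes: bytes, sample_rate: int = 16000,
                            frame_ms: int = 30, min_silence_frames: int = 10):
        """Label each frame as speech/non-speech, then split on long silence runs."""
        vad = webrtcvad.Vad(2)                                    # aggressiveness 0-3
        bytes_per_frame = int(sample_rate * frame_ms / 1000) * 2  # 16-bit samples
        flags = []
        for start in range(0, len(pcm_bytes) - bytes_per_frame + 1, bytes_per_frame):
            frame = pcm_bytes[start:start + bytes_per_frame]
            flags.append(vad.is_speech(frame, sample_rate))       # audio type per frame

        segments, seg_start, silence_run = [], None, 0
        for i, is_speech in enumerate(flags):
            if is_speech:
                if seg_start is None:
                    seg_start = i
                silence_run = 0
            elif seg_start is not None:
                silence_run += 1
                if silence_run >= min_silence_frames:             # silence interval found
                    segments.append((seg_start * frame_ms,
                                     (i - silence_run + 1) * frame_ms))
                    seg_start, silence_run = None, 0
        if seg_start is not None:
            segments.append((seg_start * frame_ms, len(flags) * frame_ms))
        return segments                                           # (start_ms, end_ms) audio segments

Frames marked non-speech for a sufficiently long run are treated as a silence interval, and only the speech spans between such intervals are kept as audio segments.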
After audio cutting is performed on the video, feature extraction can be performed on the audio segments. There are various ways of feature extraction; for example, an x-vector embedding model (an audio feature extraction model) can be used as the main system to perform feature extraction on each audio segment, and audio features (embeddings) representing the audio content information are obtained through a time-delay neural network (TDNN) and a statistics pooling layer.
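As an illustrative sketch only, an x-vector-style extractor of the kind referred to above can be written in PyTorch as TDNN (1-D convolution) layers followed by a statistics pooling layer. The layer widths, kernel sizes, and the 30-dimensional MFCC-like input are assumptions of this sketch and are not taken from the patent.

    # Simplified x-vector-style extractor: TDNN layers + statistics pooling.
    import torch
    import torch.nn as nn

    class XVectorSketch(nn.Module):
        def __init__(self, feat_dim: int = 30, embed_dim: int = 512):
            super().__init__()
            self.tdnn = nn.Sequential(                   # 1-D convolutions over time (TDNN)
                nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
                nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
                nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
                nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
            )
            self.embedding = nn.Linear(2 * 1500, embed_dim)   # after mean + std pooling

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            # feats: (batch, frames, feat_dim) -> (batch, feat_dim, frames)
            h = self.tdnn(feats.transpose(1, 2))
            mean, std = h.mean(dim=2), h.std(dim=2)           # statistics pooling over time
            return self.embedding(torch.cat([mean, std], dim=1))  # segment-level embedding

    # Usage: one embedding per audio segment, e.g. from 200 frames of 30-dim features.
    # emb = XVectorSketch()(torch.randn(1, 200, 30))          # -> shape (1, 512)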
After the audio features are extracted, the video type and the confidence of the video type can be determined based on the extracted audio features and the audio types. There are various ways to do so; for example, the voice type of each audio segment and the classification information of the voice type are determined according to the audio type and the audio features, the audio duration of each audio segment is obtained, the classification weight of the voice type is determined based on the audio duration, and the voice types corresponding to the audio segments of the video are fused according to the classification weights and the classification information, so as to obtain the video type of the video and the confidence of the video type.
The voice type is used to indicate the sub-scene information of an audio segment in a speech or non-speech scene. For example, when the scene is speech, the voice type may include Chinese or other languages; when the scene is a song, the voice type may include song types, for example, singing or pure music, as shown in FIG. 4. The method for determining the voice type of an audio segment and the classification information of the voice type according to the audio type and the audio features may be various; for example, the audio types of the audio frames in the audio segment are fused to obtain a basic voice type of the audio segment, and a back-end classifier is used to classify the audio features under the basic voice type, so as to obtain the voice type of each audio segment and the classification score of the voice type, and the classification score is used as the classification information.
After the classification information and the classification weight of the voice type are determined, the voice types corresponding to the audio segments of the video can be fused, and the fusion mode can be various, for example, the classification information can be weighted based on the classification weight to obtain weighted classification information, a target voice type is screened from the voice types according to the weighted classification information, the target voice type is used as the video type, and the confidence coefficient corresponding to the target voice type is used as the confidence coefficient of the video type.
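The duration-based weighting described above might look like the following sketch; the patent does not give the exact weighting formula, so the duration-proportional weights and the dictionary layout of the segment data are illustrative assumptions.

    # Sketch: fuse segment-level voice-type scores into a video-level type and confidence,
    # weighting each segment's classification information by its audio duration.
    from collections import defaultdict

    def fuse_segment_types(segments):
        """segments: list of dicts such as
           {"duration": 12.4, "scores": {"sichuan_dialect": 0.7, "mandarin": 0.2}}"""
        total = sum(seg["duration"] for seg in segments) or 1.0
        weighted = defaultdict(float)
        for seg in segments:
            weight = seg["duration"] / total                  # classification weight
            for voice_type, score in seg["scores"].items():   # classification information
                weighted[voice_type] += weight * score
        if not weighted:
            return None, 0.0
        video_type = max(weighted, key=weighted.get)          # target voice type
        confidence = weighted[video_type] / (sum(weighted.values()) or 1.0)
        return video_type, confidence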
After the video type of the video and the confidence coefficient of the video type are determined, at least one candidate video can be screened from the base video set based on the video type and the confidence coefficient, and there are various ways for screening the candidate video, for example, a video with the video type being the target video type can be screened from the base video set to obtain a candidate video set, and a video with the confidence coefficient exceeding a preset confidence coefficient threshold value is screened from the candidate video set to obtain at least one candidate video.
After at least one candidate video is obtained, text recognition can be performed on video frames of the candidate video to obtain subtitle content of the candidate video, and the text recognition modes can be various, for example, the candidate video can be framed, key video frames are screened out from the framed video frames, a target position area is located in the key video frames to obtain subtitle areas of the candidate video, and texts corresponding to the subtitle areas are recognized in the video frames to obtain subtitle content of the candidate video.
The method for screening out the key video frames from the framed video frames may be various, for example, text recognition is performed on the framed video frames to obtain video frame texts of the video frames, the video frames are classified based on the video frame texts to obtain a video frame set corresponding to each video frame text, the video frames in the video frame set are sorted according to the playing time corresponding to the video frames, and the key video frames are screened out from the video frame set based on the sorting result.
For example, the video frame with the earliest playing time may be selected from each video frame set based on the sorting result to obtain a key video frame. A key video frame can therefore be understood as a video frame whose video frame text has changed relative to the previous frame, as shown in FIG. 5.
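A minimal sketch of this key-frame selection is given below, assuming each framed video frame has already been reduced by OCR to a (playing time, video frame text) pair; the function and field names are illustrative.

    # Sketch: keep, for each distinct video frame text, the frame where it first appears.
    def select_key_frames(frames):
        """frames: iterable of (play_time_seconds, frame_text) pairs."""
        earliest = {}                                  # frame text -> earliest playing time
        for play_time, text in frames:
            if text not in earliest or play_time < earliest[text]:
                earliest[text] = play_time
        # a key frame is one whose text differs from that of the preceding frame,
        # i.e. the first frame of each text group after sorting by playing time
        return sorted((t, text) for text, t in earliest.items())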
After the key video frames are screened out, a target position area can be positioned in the key video frames to obtain a subtitle area of a candidate video, the subtitle area can be understood as the position area of a subtitle in the video frames, and the method for positioning the subtitle area can be various, for example, at least one key video frame text of the key video frames can be screened out from the video frame text, text position information of each key video frame text is identified from the key video frames, target position information is screened out from the text position information based on the key video frame text, and a position area corresponding to the target position information is positioned from the key video frames to obtain the subtitle area of the candidate video.
For example, text position information whose key video frame text changes is screened out from the text position information to obtain candidate position information, and position information whose ordinate remains unchanged is screened out from the candidate position information to obtain the target position information. Besides subtitles, the key video frames may also contain information such as station logos or advertisements; the subtitle text changes while its ordinate does not change, whereas neither the text nor the horizontal and vertical coordinates of the other content change, so the position information of the subtitles can be screened out.
After the target position information is screened out, a position region corresponding to the target position information can be located in the key video frame, and there are various locating manners, for example, an initial position region corresponding to each target position information can be located in the key video frame, and the initial position regions are fused to obtain a subtitle region of a candidate video, or an initial position region corresponding to each target position information can be located in the key video frame, and a position region with the maximum abscissa or the maximum length is screened out from the initial position region as the subtitle region.
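A rough sketch of this subtitle-region localization is given below, assuming OCR returns one box per text line with its text and coordinates. The box format and the exact-match grouping by ordinate are assumptions; in practice the ordinate would be bucketed with some tolerance rather than matched exactly.

    # Sketch: locate the subtitle region from OCR boxes detected in the key video frames.
    # Each box is assumed to be {"text": str, "x": int, "y": int, "w": int, "h": int}.
    from collections import defaultdict

    def locate_subtitle_region(key_frame_boxes):
        """key_frame_boxes: list (one entry per key frame) of lists of OCR box dicts."""
        by_y = defaultdict(list)                       # candidate rows grouped by ordinate
        for frame in key_frame_boxes:
            for box in frame:
                by_y[box["y"]].append(box)

        best_y, best_texts = None, set()
        for y, boxes in by_y.items():
            texts = {b["text"] for b in boxes}
            # subtitles: the text changes across frames while the ordinate stays fixed;
            # station logos or advertisements keep the same text, so they have few variants
            if len(texts) > len(best_texts):
                best_y, best_texts = y, texts

        if best_y is None:
            return None
        boxes = by_y[best_y]                           # fuse the initial position regions
        x0 = min(b["x"] for b in boxes)
        x1 = max(b["x"] + b["w"] for b in boxes)
        return {"x": x0, "y": best_y, "w": x1 - x0, "h": max(b["h"] for b in boxes)}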
102. And extracting audio content from the candidate videos and converting the audio content into text content.
The audio content may be extracted from the candidate video in various ways, which may specifically be as follows:
for example, the audio data may be directly separated from the candidate video to obtain the audio content, or the audio data may be extracted from the candidate video to obtain the initial audio content, the initial audio content is subjected to silence detection, and based on the detection result, the silence content is screened from the initial audio content to obtain the audio content of the candidate video.
After the audio content is extracted, the audio content may be converted into text content, and the text content may be converted in various ways, for example, the audio content of the video may be converted into the text content by using an Automatic Speech Recognition (ASR) service, or the audio content may be converted into the text content by using other Speech conversion technologies.
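Purely as an illustration of this conversion step (the patent does not name a specific ASR service), the open-source SpeechRecognition package can stand in for the ASR backend; the language code and audio file format below are assumptions of the sketch.

    # Sketch: convert the extracted audio content into text content via an ASR service.
    import speech_recognition as sr

    def audio_to_text(wav_path: str, language: str = "zh-CN") -> str:
        recognizer = sr.Recognizer()
        with sr.AudioFile(wav_path) as source:         # expects WAV/AIFF/FLAC input
            audio = recognizer.record(source)
        try:
            return recognizer.recognize_google(audio, language=language)
        except sr.UnknownValueError:                   # no intelligible speech recognized
            return ""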
103. And calculating the similarity between the subtitle content and the text content to obtain the text similarity of the candidate videos.
The text similarity is used for indicating the similarity information of the text between the subtitle content and the text content.
The method for calculating the similarity between the subtitle content and the text content may be various, and specifically may be as follows:
for example, a caption character string may be recognized in the caption content, a text character string may be recognized in the text content, the number of conversion operations between the caption character string and the text character string is calculated, a class editing distance between the caption character string and the text character string is obtained, and the text similarity of the candidate video is determined based on the caption character string, the text character string, and the class editing distance.
For example, the subtitle character string may be converted into the text character string (or the text character string into the subtitle character string) through operations such as insertion, deletion, and replacement, where each insertion or deletion adds 1 to the operation count and each replacement adds 2. The number of conversion operations can thus be calculated, and the minimum number of operations is screened out to obtain the class edit distance between the subtitle character string and the text character string.
After the class edit distance is calculated, the text similarity of the candidate video may be determined based on the caption character string, the text character string and the class edit distance, and the text similarity may be determined in various manners, for example, the caption character string and the text character string may be fused to obtain a character string distance, a distance difference between the class edit distance and the character string distance is calculated, and a ratio between the distance difference and the character string distance is calculated to obtain the text similarity of the candidate video, which may be specifically shown in formula (1):
r=(sum-ldist)/sum (1)
where r is the text similarity, which may also be called the Levenshtein ratio, sum is the string distance, and ldist is the class edit distance.
The string distance may be understood as the total length of the subtitle character string and the text character string; for example, for str1 = 'abc' and str2 = 'cde', sum = 3 + 3 = 6. The code for calculating the text similarity can be as follows:
(The original publication embeds this code as an image: Figure BDA0003659585340000121.)
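Because the code is not reproduced in the text, the following is a reconstruction based on the description above: insertions and deletions each count 1, replacements count 2, and the similarity is r = (sum - ldist) / sum with sum being the combined length of the two strings. The function names are illustrative.

    # Sketch: class edit distance (replacement cost 2) and the Levenshtein-ratio-style similarity.
    def class_edit_distance(str1: str, str2: str) -> int:
        m, n = len(str1), len(str2)
        dist = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dist[i][0] = i                             # delete all of str1[:i]
        for j in range(n + 1):
            dist[0][j] = j                             # insert all of str2[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if str1[i - 1] == str2[j - 1] else 2   # replacement counts as 2
                dist[i][j] = min(dist[i - 1][j] + 1,            # deletion
                                 dist[i][j - 1] + 1,            # insertion
                                 dist[i - 1][j - 1] + cost)     # match / replacement
        return dist[m][n]

    def text_similarity(subtitle: str, asr_text: str) -> float:
        total = len(subtitle) + len(asr_text)          # "sum", the character string distance
        if total == 0:
            return 1.0
        return (total - class_edit_distance(subtitle, asr_text)) / total   # formula (1)

For the example above, text_similarity('abc', 'cde') gives (6 - 4) / 6, i.e. approximately 0.33.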
104. and screening at least one target video of the target language from the candidate videos according to the text similarity.
For example, a preset text similarity threshold set may be obtained, and a target text similarity threshold corresponding to the target language is screened out from the preset text similarity threshold set. And comparing the target text similarity threshold with the text similarity of the candidate videos, and screening out videos of which the text similarity does not exceed the target text similarity threshold from the candidate videos based on the comparison result, so as to obtain the target videos corresponding to the target language.
The text similarity threshold can be set according to the actual application. Taking a dialect as the target language as an example, the text similarity threshold may be 50%: videos whose text similarity does not exceed 50% are screened out from the candidate videos as dialect videos, and videos whose text similarity exceeds the target text similarity threshold can be regarded as Mandarin videos. The recognition of dialect videos can be as shown in FIG. 6: the text content of the video data is recognized through ASR, the subtitle content is recognized through OCR, the text similarity between the text content and the subtitle content is calculated, the text similarity is then compared with the text similarity threshold, dialect videos below the threshold are retained, and Mandarin videos above the threshold are stripped out, so that the target videos can be obtained.
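A minimal sketch of this threshold-based screening is given below, assuming each candidate video record already carries its computed text similarity; the 0.5 threshold follows the dialect example above, and the field names are illustrative.

    # Sketch: retain candidate videos whose subtitle/ASR text similarity does not exceed
    # the target-language threshold; the remainder are stripped out as Mandarin videos.
    def screen_target_videos(candidates, threshold: float = 0.5):
        """candidates: list of dicts such as {"video_id": "...", "text_similarity": 0.42}"""
        target, stripped = [], []
        for video in candidates:
            if video["text_similarity"] <= threshold:
                target.append(video)                   # retained as target-language video
            else:
                stripped.append(video)                 # stripped out (e.g. Mandarin video)
        return target, stripped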
105. And generating a corpus corresponding to the target language based on the audio content and the subtitle content of the target video.
For example, the target subtitle content of the target video may be screened out from the subtitle content, the time axis corresponding to the target subtitle content is extracted from the target video, the audio content, the target subtitle content and the time axis of the target video are used as the initial corpus, and the initial corpus is sent to a verification server for verification, so as to obtain the corpus of the target language.
For example, the initial corpus is sent to the verification server so that it can be manually checked and corrected, and part of the time axis is then adjusted, so that the ASR corpus of the target language can be obtained.
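The assembly of the initial corpus before manual verification might be sketched as follows; the JSON layout, the field names, and the assumption that each target video carries a (start, end, subtitle) time axis are illustrative and not taken from the patent.

    # Sketch: bundle audio, target subtitle content and the time axis as the initial corpus.
    import json

    def build_initial_corpus(target_videos, output_path: str):
        entries = []
        for video in target_videos:
            for start_ms, end_ms, subtitle in video["subtitle_timeline"]:
                entries.append({
                    "audio_file": video["audio_path"],
                    "start_ms": start_ms,              # time axis of the subtitle line
                    "end_ms": end_ms,
                    "text": subtitle,                  # OCR subtitle as annotation reference
                })
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(entries, f, ensure_ascii=False, indent=2)
        return entries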
In this scheme, for the case where the corpus of a dialect training set is difficult to annotate (ordinary annotators can typically understand the dialects of only one or two regions), the OCR recognition results are combined with a multi-modal annotation mode in which the video is presented visually to assist manual annotation, which effectively alleviates the difficulty of manual annotation; the scheme can be specifically shown in FIG. 7.
Optionally, after the corpus of the target language is generated, the language recognition model may be trained based on the corpus to obtain a trained language recognition model, and the speech to be recognized is recognized based on the trained language recognition model to obtain text content corresponding to the speech to be recognized.
In the overall process of corpus generation, ASR and OCR techniques are used to screen out the target videos and to assist manual annotation, respectively, so as to obtain the ASR corpus; the specific flow may be as shown in FIG. 8.
Taking a dialect corpus as an example, the overall process of generating the dialect corpus may be as shown in FIG. 9: the videos to be screened are input, and audio type detection is performed on them to obtain audio with scene tags and scores, where the scene tags may include speech, songs, crowds, interfering sounds, and the like. Videos whose score exceeds 80 and whose scene is speech are screened out, so as to obtain the candidate videos. The subtitle information of the candidate videos is obtained through a subtitle extraction service, the subtitle content of the candidate videos is recognized in the video frames through OCR, the audio content is converted into text content through an audio ASR service, and the text similarity between the subtitle content and the text content is calculated. When the text similarity exceeds 50%, the candidate video is judged to be a Mandarin video and stripped out; when the text similarity does not exceed 50%, it is judged to be a dialect video and retained. The target subtitle content of the dialect video is then extracted, the audio, the time stamps (time axis) and the corresponding subtitle content are used as the initial corpus, and the initial corpus is manually checked and modified to obtain the ASR corpus.
As can be seen from the above, in the embodiment of the application, at least one candidate video is obtained, text recognition is performed on the video frames of the candidate video to obtain the subtitle content of the candidate video, audio content is extracted from the candidate video and converted into text content, the similarity between the subtitle content and the text content is calculated to obtain the text similarity of the candidate video, at least one target video in the target language is then screened out from the candidate videos according to the text similarity, and the corpus corresponding to the target language is generated based on the audio content and the subtitle content of the target video. According to the scheme, the subtitle content can be identified in the candidate video and the audio content of the candidate video can be converted into text content; the target video of the target language is then accurately screened out according to the similarity between the subtitle content and the text content, and the subtitle content of the target video can serve as a reference for manual annotation, so the accuracy of corpus generation can be greatly improved.
The method described in the above examples is further illustrated in detail below by way of example.
In this embodiment, the corpus generating apparatus is specifically integrated in an electronic device, the electronic device is a server, and the target language is a dialect.
As shown in fig. 10, a corpus generating method specifically includes the following steps:
201. the server obtains at least one candidate dialect video.
For example, the server obtains preset keywords, screens out the target keywords of the dialect from the preset keywords, and obtains original videos from a network or video platform based on the target keywords, so as to obtain a base dialect video set. Audio information is extracted from each video in the base dialect video set, the audio information is framed to obtain at least one audio frame, and voice activity detection (VAD) is used to perform audio detection on the audio frames to obtain the audio type of each audio frame. Silence detection is performed on the video, silence intervals consisting of silent audio frames are identified in the audio information of the video based on the detection result, and the audio corresponding to the silence intervals is deleted from the audio information of the video to obtain at least one audio segment.
The server uses an x-vector embedding model as the main system to perform feature extraction on each audio segment, and obtains audio features (embeddings) representing the audio content information through a time-delay neural network (TDNN) and a statistics pooling layer. The audio types of the audio frames in each audio segment are fused to obtain the basic voice type of the audio segment, a back-end classifier is used to classify the audio features under the basic voice type to obtain the voice type of each audio segment and the classification score of the voice type, and the classification score is used as the classification information.
The server obtains the audio duration of each audio segment, determines the classification weight of the voice type based on the audio duration, weights the classification information based on the classification weight to obtain weighted classification information, screens out a target voice type from the voice types according to the weighted classification information, takes the target voice type as the video type, and takes the confidence corresponding to the target voice type as the confidence of the video type. Videos whose video type is speech are screened from the base dialect video set to obtain a candidate dialect video set, and videos whose confidence exceeds a preset confidence threshold are screened from the candidate dialect video set to obtain at least one candidate dialect video.
202. And the server performs text recognition on the video frames of the candidate dialect videos to obtain the subtitle contents of the candidate dialect videos.
For example, the server frames the candidate dialect video and performs text recognition on the framed video frames to obtain the video frame text of each video frame. The video frames are classified based on the video frame text to obtain a video frame set corresponding to each video frame text, the video frames in each video frame set are sorted according to their playing time, and the video frame with the earliest playing time is screened out from each video frame set based on the sorting result to obtain the key video frames.
The server screens out at least one key video frame text of the key video frames from the video frame texts, and identifies the text position information of each key video frame text in the key video frames. Text position information whose key video frame text changes is screened out from the text position information to obtain candidate position information, and position information whose ordinate remains unchanged is screened out from the candidate position information to obtain the target position information. Besides subtitles, the key video frames may also contain information such as station logos or advertisements; the subtitle text changes while its ordinate does not change, whereas neither the text nor the horizontal and vertical coordinates of the other content change, so the position information of the subtitles can be screened out. An initial position region corresponding to each piece of target position information is located in the key video frames and the initial position regions are fused to obtain the subtitle region of the candidate dialect video; alternatively, an initial position region corresponding to each piece of target position information is located in the key video frames, and the position region with the maximum abscissa or the maximum length is screened out from the initial position regions as the subtitle region.
203. The server extracts the audio content from the candidate dialect video.
For example, the server may directly separate the audio data from the candidate dialect video to obtain the audio content, or may extract the audio data from the candidate dialect video to obtain the initial audio content, perform silence detection on the initial audio content, and screen out the silence content from the initial audio content based on the detection result to obtain the audio content of the candidate dialect video.
204. The server converts the audio content to text content.
For example, the server converts audio content of the candidate dialect video to textual content using the ASR service, or may also convert audio content to textual content using other speech conversion techniques.
205. And the server calculates the similarity between the subtitle content and the text content to obtain the text similarity of the candidate videos.
For example, the server may identify the subtitle character string in the subtitle content and the text character string in the text content. The subtitle character string is converted into the text character string (or the text character string into the subtitle character string) through operations such as insertion, deletion, and replacement, where each insertion or deletion adds 1 to the operation count and each replacement adds 2; the number of conversion operations can thus be calculated, and the minimum number of operations is screened out to obtain the class edit distance between the subtitle character string and the text character string. The subtitle character string and the text character string are fused to obtain the character string distance, the distance difference between the class edit distance and the character string distance is calculated, and the ratio of the distance difference to the character string distance is calculated to obtain the text similarity of the candidate video, as shown in formula (1).
206. And the server screens out at least one target dialect video from the candidate dialect videos according to the text similarity.
For example, the server obtains a preset text similarity threshold set, and filters out a target text similarity threshold (50%) corresponding to the target language from the preset text similarity threshold set. And comparing the target text similarity threshold with the text similarity of the candidate dialect videos, and screening out videos of which the text similarity does not exceed the target text similarity threshold from the candidate dialect videos on the basis of the comparison result, thereby obtaining the target dialect videos.
207. And the server generates a corpus corresponding to the dialect based on the audio content and the subtitle content of the target dialect video.
For example, the server may screen target subtitle content of a target dialect video from the subtitle content, extract a timeline corresponding to the target subtitle content from the target dialect video, use audio content, the target subtitle content, and the timeline of the target dialect video as initial corpus information, send the initial corpus to the verification server, so as to manually correct and modify the corpus, and then adjust a part of the timeline, so as to obtain an ASR corpus of the dialect.
Optionally, after the server generates the corpus of the dialect, a dialect recognition model may be trained based on the corpus to obtain a trained dialect recognition model, and the speech to be recognized is recognized based on the trained dialect recognition model to obtain the text content corresponding to the speech to be recognized.
As can be seen from the above, after the server acquires at least one candidate dialect video and performs text recognition on a video frame of the candidate dialect video to obtain subtitle content of the candidate dialect video, audio content is extracted from the candidate dialect video and converted into text content, then similarity between the subtitle content and the text content is calculated to obtain text similarity of the candidate dialect video, then at least one target dialect video is screened out from the candidate dialect video according to the text similarity, and corpus corresponding to the dialect is generated based on the audio content and the subtitle content of the target dialect video; according to the scheme, the subtitle content can be identified in the candidate dialect video, the audio content of the candidate dialect video is converted into the text content, then the target dialect video is accurately screened out according to the similarity between the subtitle content and the text content, and the subtitle content of the target dialect video can be used as the reference of manual labeling, so that the accuracy of dialect corpus generation can be greatly improved.
In order to better implement the above method, an embodiment of the present invention further provides a corpus generating device, which may be integrated in an electronic device, such as a server or a terminal, where the terminal may include a tablet computer, a notebook computer, and/or a personal computer.
For example, as shown in fig. 11, the corpus generating device may include an acquiring unit 301, a converting unit 302, a calculating unit 303, a filtering unit 304, and a generating unit 305 as follows:
(1) an acquisition unit 301;
The acquisition unit 301 is configured to obtain at least one candidate video, and to perform text recognition on the video frames of the candidate video to obtain the subtitle content of the candidate video.
For example, the acquisition unit 301 may be specifically configured to obtain a base video set of the target language according to preset keywords, identify the video type and the confidence of the video type of each video in the base video set, and screen out at least one candidate video from the base video set based on the video type and the confidence. The candidate video is then framed, key video frames are screened out of the framed video frames, a target position area is located in the key video frames to obtain the subtitle area of the candidate video, and the text corresponding to the subtitle area is recognized in the video frames to obtain the subtitle content of the candidate video.
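A rough sketch of this subtitle-extraction step is given below, using OpenCV for frame sampling and Tesseract for text recognition purely as examples (this embodiment does not name specific tools); the one-second sampling interval and the assumption that subtitles sit in the bottom fifth of the frame are illustrative.

```python
import cv2
import pytesseract  # OCR engine, used here only as an example

# Sketch of the acquisition step: sample frames from a candidate video and
# run OCR on a fixed subtitle region. OpenCV, Tesseract, the 1-second
# sampling interval, and the bottom-fifth subtitle area are all assumptions.

def extract_subtitle_content(video_path, every_n_seconds=1.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(fps * every_n_seconds))
    lines, frame_index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_index % step == 0:
            height = frame.shape[0]
            subtitle_area = frame[int(height * 0.8):, :]  # assumed subtitle region
            text = pytesseract.image_to_string(subtitle_area, lang="chi_sim").strip()
            if text and (not lines or text != lines[-1]):  # skip repeated frames
                lines.append(text)
        frame_index += 1
    cap.release()
    return lines
```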
(2) A conversion unit 302;
The conversion unit 302 is configured to extract audio content from the candidate video and convert the audio content into text content.
For example, the conversion unit 302 may be specifically configured to separate the audio data from the candidate video to obtain the audio content; alternatively, it may extract the audio data from the candidate video to obtain initial audio content, perform silence detection on the initial audio content, and filter the silent content out of the initial audio content based on the detection result to obtain the audio content of the candidate video. An ASR service is then used to convert the audio content of the video into text content, or another speech-to-text technique may be used.
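The following sketch illustrates the audio-preparation part of this step, using pydub for demuxing and silence detection purely as an example; the silence parameters are assumptions, and the exported audio would then be handed to an ASR service (not shown) to obtain the text content.

```python
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

# Sketch of the conversion step: pull the audio track out of a candidate
# video, drop silent stretches, and export the remainder for ASR.
# pydub (which relies on ffmpeg for demuxing) and the silence parameters
# are illustrative assumptions.

def extract_speech_audio(video_path, out_path="speech.wav"):
    audio = AudioSegment.from_file(video_path)  # demuxes the audio track
    spans = detect_nonsilent(audio, min_silence_len=700, silence_thresh=-40)
    speech = AudioSegment.empty()
    for start_ms, end_ms in spans:  # keep only the non-silent segments
        speech += audio[start_ms:end_ms]
    speech.export(out_path, format="wav")
    return out_path  # this file would then be sent to an ASR service
```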
(3) A calculation unit 303;
The calculation unit 303 is configured to calculate the similarity between the subtitle content and the text content to obtain the text similarity of the candidate video.
For example, the calculation unit 303 may be specifically configured to identify the caption character string in the caption content and the text character string in the text content, calculate the number of conversion operations between the two strings to obtain a class edit distance between the caption character string and the text character string, fuse the caption character string and the text character string to obtain a character string distance, calculate the distance difference between the class edit distance and the character string distance, and calculate the ratio of the distance difference to the character string distance to obtain the text similarity of the candidate video.
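A self-contained sketch of this similarity computation is given below; interpreting "fusing the two character strings" as summing their lengths is an assumption, which makes the result the familiar Levenshtein ratio.

```python
# Sketch of the similarity step: a standard Levenshtein (edit) distance,
# followed by the ratio described above:
#   similarity = (len(a) + len(b) - distance) / (len(a) + len(b))
# Reading "fusing the two strings" as summing their lengths is an assumption.

def edit_distance(a, b):
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def text_similarity(subtitle_text, asr_text):
    distance = edit_distance(subtitle_text, asr_text)
    fused_length = len(subtitle_text) + len(asr_text)
    return (fused_length - distance) / fused_length if fused_length else 1.0

print(text_similarity("今天天气很好", "今天天气真好"))  # ≈ 0.92
```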
(4) A screening unit 304;
The screening unit 304 is configured to screen at least one target video in the target language from the candidate videos according to the text similarity.
For example, the screening unit 304 may be specifically configured to obtain a preset set of text similarity thresholds and select from it the target text similarity threshold corresponding to the target language, compare that threshold with the text similarity of each candidate video, and, based on the comparison result, screen out of the candidate videos the videos whose text similarity does not exceed the threshold, thereby obtaining the target videos corresponding to the target language.
(5) A generating unit 305;
The generating unit 305 is configured to generate the corpus corresponding to the target language based on the audio content and the subtitle content of the target video.
For example, the generating unit 305 may be specifically configured to screen the target subtitle content of the target video out of the subtitle content, extract the time axis corresponding to the target subtitle content from the target video, take the audio content of the target video, the target subtitle content, and the time axis as the initial corpus information, and send the initial corpus to a verification server for verification, so as to obtain the corpus of the target language.
In specific implementation, the above units may be implemented as independent entities, or may be arbitrarily combined and implemented as one or several entities; for the specific implementation of the above units, reference may be made to the foregoing method embodiments, which are not described herein again.
As can be seen from the above, in this embodiment, the acquisition unit 301 acquires at least one candidate video and performs text recognition on its video frames to obtain the subtitle content of the candidate video; the conversion unit 302 extracts the audio content from the candidate video and converts it into text content; the calculation unit 303 calculates the similarity between the subtitle content and the text content to obtain the text similarity of the candidate video; the screening unit 304 then screens out at least one target video in the target language from the candidate videos according to the text similarity; and the generating unit 305 generates the corpus corresponding to the target language based on the audio content and the subtitle content of the target video. Because this scheme recognizes the subtitle content directly in the candidate video, converts the audio content into text content, and then accurately screens out the target videos of the target language according to the similarity between the two, the subtitle content of the target videos can serve as a reference for manual labeling, which greatly improves the accuracy of corpus generation.
An embodiment of the present invention further provides an electronic device. Fig. 12 shows a schematic structural diagram of the electronic device according to the embodiment of the present invention. Specifically:
The electronic device may include a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. Those skilled in the art will appreciate that the structure shown in fig. 12 does not constitute a limitation of the electronic device, which may include more or fewer components than those shown, combine certain components, or arrange the components differently. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by operating or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to use of the electronic device, and the like. Further, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further includes a power supply 403 for supplying power to the various components. Preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that charging, discharging, and power consumption management are implemented through the power management system. The power supply 403 may also include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other such components.
The electronic device may further include an input unit 404, and the input unit 404 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402, thereby implementing various functions as follows:
obtaining at least one candidate video and performing text recognition on the video frames of the candidate video to obtain the subtitle content of the candidate video; extracting audio content from the candidate video and converting it into text content; calculating the similarity between the subtitle content and the text content to obtain the text similarity of the candidate video; screening out at least one target video of a target language from the candidate videos according to the text similarity; and generating the corpus corresponding to the target language based on the audio content and the subtitle content of the target video.
For example, the electronic device obtains a base video set of the target language according to preset keywords, identifies the video type and the confidence of the video type of each video in the base video set, and screens out at least one candidate video from the base video set based on the video type and the confidence. The candidate video is framed, key video frames are screened out of the framed video frames, a target position area is located in the key video frames to obtain the subtitle area of the candidate video, and the text corresponding to the subtitle area is recognized in the video frames to obtain the subtitle content of the candidate video. The audio data is separated from the candidate video to obtain the audio content; alternatively, the audio data may be extracted from the candidate video to obtain initial audio content, silence detection is performed on the initial audio content, and the silent content is filtered out of the initial audio content based on the detection result to obtain the audio content of the candidate video. An ASR service is then used to convert the audio content of the video into text content, or another speech-to-text technique may be used. The caption character string in the caption content and the text character string in the text content are identified, the number of conversion operations between the two strings is calculated to obtain a class edit distance, the caption character string and the text character string are fused to obtain a character string distance, the distance difference between the class edit distance and the character string distance is calculated, and the ratio of the distance difference to the character string distance is calculated to obtain the text similarity of the candidate video. A preset set of text similarity thresholds is obtained, and the target text similarity threshold corresponding to the target language is selected from it. The target text similarity threshold is compared with the text similarity of each candidate video, and based on the comparison result the videos whose text similarity does not exceed the threshold are screened out of the candidate videos, thereby obtaining the target videos corresponding to the target language. The target subtitle content of the target video is screened out of the subtitle content, the time axis corresponding to the target subtitle content is extracted from the target video, the audio content of the target video, the target subtitle content, and the time axis are taken as the initial corpus information, and the initial corpus is sent to a verification server for verification to obtain the corpus of the target language.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
As can be seen from the above, in the embodiment of the present invention, after at least one candidate video is obtained and text recognition is performed on its video frames to obtain the subtitle content of the candidate video, the audio content is extracted from the candidate video and converted into text content; the similarity between the subtitle content and the text content is then calculated to obtain the text similarity of the candidate video, at least one target video in the target language is screened out from the candidate videos according to the text similarity, and the corpus corresponding to the target language is generated based on the audio content and the subtitle content of the target video. Because this scheme recognizes the subtitle content directly in the candidate video, converts the audio content into text content, and then accurately screens out the target videos of the target language according to the similarity between the two, the subtitle content of the target videos can serve as a reference for manual labeling, which greatly improves the accuracy of corpus generation.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present invention provide a computer-readable storage medium in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any of the corpus generation methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
obtaining at least one candidate video and performing text recognition on the video frames of the candidate video to obtain the subtitle content of the candidate video; extracting audio content from the candidate video and converting it into text content; calculating the similarity between the subtitle content and the text content to obtain the text similarity of the candidate video; screening out at least one target video of a target language from the candidate videos according to the text similarity; and generating the corpus corresponding to the target language based on the audio content and the subtitle content of the target video.
For example, a base video set of the target language is obtained according to preset keywords, the video type and the confidence of the video type of each video in the base video set are identified, and at least one candidate video is screened out from the base video set based on the video type and the confidence. The candidate video is framed, key video frames are screened out of the framed video frames, a target position area is located in the key video frames to obtain the subtitle area of the candidate video, and the text corresponding to the subtitle area is recognized in the video frames to obtain the subtitle content of the candidate video. The audio data is separated from the candidate video to obtain the audio content; alternatively, the audio data may be extracted from the candidate video to obtain initial audio content, silence detection is performed on the initial audio content, and the silent content is filtered out of the initial audio content based on the detection result to obtain the audio content of the candidate video. An ASR service is then used to convert the audio content of the video into text content, or another speech-to-text technique may be used. The caption character string in the caption content and the text character string in the text content are identified, the number of conversion operations between the two strings is calculated to obtain a class edit distance, the caption character string and the text character string are fused to obtain a character string distance, the distance difference between the class edit distance and the character string distance is calculated, and the ratio of the distance difference to the character string distance is calculated to obtain the text similarity of the candidate video. A preset set of text similarity thresholds is obtained, and the target text similarity threshold corresponding to the target language is selected from it. The target text similarity threshold is compared with the text similarity of each candidate video, and based on the comparison result the videos whose text similarity does not exceed the threshold are screened out of the candidate videos, thereby obtaining the target videos corresponding to the target language. The target subtitle content of the target video is screened out of the subtitle content, the time axis corresponding to the target subtitle content is extracted from the target video, the audio content of the target video, the target subtitle content, and the time axis are taken as the initial corpus information, and the initial corpus is sent to a verification server for verification to obtain the corpus of the target language.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the computer-readable storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the computer-readable storage medium may execute the steps in any corpus generating method provided in the embodiment of the present invention, beneficial effects that can be achieved by any corpus generating method provided in the embodiment of the present invention may be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
According to an aspect of the present application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method provided in the various optional implementations of the corpus generation aspect or the speech recognition aspect described above.
The corpus generation method and apparatus, electronic device, and computer-readable storage medium provided by the embodiments of the present invention are described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (14)

1. A corpus generating method, comprising:
acquiring at least one candidate video, and performing text recognition on video frames of the candidate video to obtain subtitle content of the candidate video;
extracting audio content from the candidate video, and converting the audio content into text content;
calculating the similarity between the subtitle content and the text content to obtain the text similarity of the candidate videos;
screening at least one target video of a target language from the candidate videos according to the text similarity;
and generating the corpus corresponding to the target language based on the audio content and the subtitle content of the target video.
2. The corpus generating method according to claim 1, wherein said calculating a similarity between said subtitle content and said text content to obtain a text similarity of said candidate video comprises:
recognizing a caption character string in the caption content, and recognizing a text character string in the text content;
calculating the conversion operation times between the caption character strings and the text character strings to obtain the class editing distance between the caption character strings and the text character strings;
and determining the text similarity of the candidate videos based on the subtitle character strings, the text character strings and the class editing distance.
3. The corpus generation method according to claim 2, wherein said determining the text similarity of the candidate videos based on the caption string, the text string and the class edit distance comprises:
fusing the subtitle character string and the text character string to obtain a character string distance;
calculating a distance difference between the class editing distance and the character string distance;
and calculating the ratio of the distance difference to the character string distance to obtain the text similarity of the candidate video.
4. The corpus generation method according to any one of claims 1 to 3, wherein the performing text recognition on the video frames of the candidate video to obtain the subtitle content of the candidate video comprises:
framing the candidate video, and screening out a key video frame from the framed video frames;
positioning a target position area in the key video frame to obtain a subtitle area of the candidate video;
and identifying a text corresponding to the subtitle area in the video frame to obtain the subtitle content of the candidate video.
5. The corpus generation method according to claim 4, wherein said screening out key video frames from the framed video frames comprises:
performing text recognition on the framed video frame to obtain a video frame text of the video frame;
classifying the video frames based on the video frame texts to obtain a video frame set corresponding to each video frame text;
and sequencing the video frames in the video frame set according to the playing time corresponding to the video frames, and screening out key video frames in the video frame set based on a sequencing result.
6. The corpus generation method according to claim 4, wherein said locating a target location area in said key video frame to obtain a subtitle area of said candidate video comprises:
screening at least one key video frame text of the key video frames from the video frame texts, and identifying text position information of each key video frame text from the key video frames;
screening target position information from the text position information based on the key video frame text;
and positioning a position area corresponding to the target position information in the key video frame to obtain a subtitle area of the candidate video.
7. The corpus generation method according to any one of claims 1 to 3, wherein said obtaining at least one candidate video comprises:
acquiring a basic video set of a target language according to preset keywords;
identifying a video type and a confidence level of the video type of each video in the base video set;
and screening at least one candidate video from the base video set based on the video type and the confidence coefficient.
8. The corpus generation method according to claim 7, wherein said identifying a video type and a confidence level of the video type of each video in the base video set comprises:
performing audio detection on an audio frame of each video in the basic video set to obtain an audio type of the audio frame;
performing silence detection on the video, and performing audio cutting on the video based on a detection result to obtain at least one audio clip;
and extracting the characteristics of the audio segments, and determining the video type of the video and the confidence coefficient of the video type based on the extracted audio characteristics and the audio type.
9. The corpus generation method according to claim 8, wherein said determining a video type of said video and a confidence level of said video type based on said extracted audio features and audio type comprises:
determining the voice type of the audio clip and the classification information of the voice type according to the audio type and the audio characteristics;
acquiring the audio time length of the audio clip, and determining the classification weight of the voice type based on the audio time length;
and according to the classification weight and the classification information, fusing the voice types corresponding to the audio segments of the videos to obtain the video types of the videos and the confidence degrees of the video types.
10. The corpus generating method according to any one of claims 1 to 3, wherein said generating the corpus corresponding to the target language based on the audio content and the subtitle content of the target video comprises:
screening target subtitle contents of the target video from the subtitle contents;
extracting a time axis corresponding to the target subtitle content from the target video;
and taking the audio content, the target subtitle content and the time axis of the target video as initial linguistic data, and sending the initial linguistic data to a verification server for verification to obtain the linguistic data of the target language.
11. A corpus generating device, comprising:
the acquisition unit is used for acquiring at least one candidate video and performing text recognition on video frames of the candidate video to obtain subtitle content of the candidate video;
the conversion unit is used for extracting audio content from the candidate videos and converting the audio content into text content;
the calculation unit is used for calculating the similarity between the subtitle content and the text content to obtain the text similarity of the candidate video;
the screening unit is used for screening at least one target video of a target language from the candidate videos according to the text similarity;
and the generating unit is used for generating the corpus corresponding to the target language based on the audio content and the subtitle content of the target video.
12. An electronic device, comprising a processor and a memory, wherein the memory stores an application program, and the processor is configured to run the application program in the memory to perform the steps of the corpus generation method according to any one of claims 1 to 10.
13. A computer program product comprising computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps in the corpus generating method according to any one of the claims 1 to 10.
14. A computer-readable storage medium storing instructions adapted to be loaded by a processor to perform the steps of the corpus generation method according to any one of claims 1 to 10.
CN202210572357.1A 2022-05-24 2022-05-24 Corpus generation method and device, electronic equipment and computer-readable storage medium Pending CN114996506A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210572357.1A CN114996506A (en) 2022-05-24 2022-05-24 Corpus generation method and device, electronic equipment and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210572357.1A CN114996506A (en) 2022-05-24 2022-05-24 Corpus generation method and device, electronic equipment and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN114996506A true CN114996506A (en) 2022-09-02

Family

ID=83028828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210572357.1A Pending CN114996506A (en) 2022-05-24 2022-05-24 Corpus generation method and device, electronic equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN114996506A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468054A (en) * 2023-04-26 2023-07-21 中央民族大学 Method and system for aided construction of Tibetan transliteration data set based on OCR technology
CN116468054B (en) * 2023-04-26 2023-11-07 中央民族大学 Method and system for aided construction of Tibetan transliteration data set based on OCR technology
CN116229943A (en) * 2023-05-08 2023-06-06 北京爱数智慧科技有限公司 Conversational data set generation method and device
CN116229943B (en) * 2023-05-08 2023-08-15 北京爱数智慧科技有限公司 Conversational data set generation method and device

Similar Documents

Publication Publication Date Title
KR101990023B1 (en) Method for chunk-unit separation rule and display automated key word to develop foreign language studying, and system thereof
CN107305541A (en) Speech recognition text segmentation method and device
CN114465737B (en) Data processing method and device, computer equipment and storage medium
CN112784696B (en) Lip language identification method, device, equipment and storage medium based on image identification
CN114996506A (en) Corpus generation method and device, electronic equipment and computer-readable storage medium
CN113766314B (en) Video segmentation method, device, equipment, system and storage medium
US20240064383A1 (en) Method and Apparatus for Generating Video Corpus, and Related Device
CN104750677A (en) Speech translation apparatus, speech translation method and speech translation program
CN112382295A (en) Voice recognition method, device, equipment and readable storage medium
Sharma et al. A comprehensive empirical review of modern voice activity detection approaches for movies and TV shows
CN114547373A (en) Method for intelligently identifying and searching programs based on audio
CN114880496A (en) Multimedia information topic analysis method, device, equipment and storage medium
CN113129895B (en) Voice detection processing system
KR20060100646A (en) Method and system for searching the position of an image thing
CN113761377A (en) Attention mechanism multi-feature fusion-based false information detection method and device, electronic equipment and storage medium
CN112231440A (en) Voice search method based on artificial intelligence
Zufferey et al. Towards automatic identification of discourse markers in dialogs: The case of like
Farhan et al. American Sign Language: Detection and Automatic Text Generation
CN115580758A (en) Video content generation method and device, electronic equipment and storage medium
WO2011039773A2 (en) Tv news analysis system for multilingual broadcast channels
CN112201225B (en) Corpus acquisition method and device, readable storage medium and electronic equipment
Hukkeri et al. Erratic navigation in lecture videos using hybrid text based index point generation
Bechet et al. Detecting person presence in tv shows with linguistic and structural features
Chittaragi et al. Sentence-based dialect identification system using extreme gradient boosting algorithm
Novitasari et al. Construction of English-French Multimodal Affective Conversational Corpus from TV Dramas

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination