CN115238129A - Method for extracting multilingual network audio and video data based on AI engine - Google Patents

Method for extracting multilingual network audio and video data based on AI engine

Info

Publication number: CN115238129A
Application number: CN202210947268.0A
Authority: CN (China)
Prior art keywords: audio, video data, data, text, platform
Priority date / filing date: 2022-08-09 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Publication date: 2022-10-25
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventor: 李斌斌 (Li Binbin)
Current assignee: Anhui Xinzhi Technology Co., Ltd. (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Anhui Xinzhi Technology Co., Ltd.
Application filed by Anhui Xinzhi Technology Co., Ltd.; priority to CN202210947268.0A; publication of CN115238129A

Classifications

    • G06F16/7834: Retrieval of video data characterised by metadata automatically derived from the content, using audio features
    • G06F16/71: Information retrieval of video data; indexing; data structures therefor; storage structures
    • G06F16/75: Information retrieval of video data; clustering; classification
    • G06F16/7844: Retrieval of video data characterised by metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G06F16/7867: Retrieval of video data characterised by metadata, using manually generated information, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G10L25/57: Speech or voice analysis techniques specially adapted for comparison or discrimination, for processing of video signals
    • H04N21/4884: Data services, e.g. news ticker, for displaying subtitles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Software Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a method for extracting multilingual network audio and video data based on an AI (artificial intelligence) engine, belonging to the technical field of AI. The method acquires target audio and video data, classifies the data into first data and second data, and establishes a database for storage. For the first data, the audio data are extracted and, based on the subtitles with a time axis, a VADNN algorithm classifies the audio of each time period as normal human voice audio, mixed human voice and environmental sound audio, or non-human voice audio. For the second data, the audio data are extracted, the long audio is divided into small audio segments by the VADNN algorithm and numbered from front to back; the T segmented audios are sent to a multilingual speech recognition engine for recognition to obtain the recognition text corresponding to each numbered audio, and the non-human voice parts in the segmented small audios are determined by combining the recognition texts with the VADNN algorithm results. Finally, the corresponding text segments are searched in the subtitles and the subtitle segments of all segmented audios are matched.

Description

Method for extracting multi-language network audio and video data based on AI engine
Technical Field
The invention belongs to the technical field of artificial intelligence (AI), and in particular relates to a method for extracting multilingual network audio and video data based on an AI engine.
Background
With the development of AI technology, the demand for data in various languages and of various types keeps increasing. To meet these data requirements, most AI companies rely on manual labeling, manual recording, manual verification and similar approaches. A purely manual approach has several problems: multilingual data places high demands on the quality and expertise of annotators, training labor is expensive, production efficiency is low, and data production costs are high. Considering that the network already contains a large amount of multilingual audio and video with subtitles, network radio stations and the like, and in order to solve the problems encountered in manually labeling such data, the invention provides a method for extracting multilingual network audio and video data based on an AI engine, which enables automatic extraction of large volumes of multilingual subtitled audio and video, network radio data and similar sources, reduces the requirements on the quality and expertise of annotators, lowers cost and improves data production efficiency.
Disclosure of Invention
In order to solve the problems existing in the above schemes, the invention provides a method for extracting multilingual network audio and video data based on an AI engine.
The purpose of the invention can be realized by the following technical scheme:
a method for extracting multi-language network audio and video data based on an AI engine specifically comprises the following steps:
Step one: acquiring target audio and video data based on the Internet, and classifying the target audio and video data to obtain first data and second data; establishing a database for storage;
Step two: extracting audio data of the first data and, using a VADNN algorithm together with the subtitles with a time axis, classifying the audio of each time period as normal human voice audio, mixed human voice and environmental sound audio, or non-human voice audio;
Step three: extracting audio data of the second data, dividing the long audio into small audio segments by using the VADNN algorithm, and numbering them from front to back as 001, 002, …, 00T; sending the T segmented audios to a multilingual speech recognition engine for recognition to obtain the recognition text corresponding to each numbered audio, and determining the non-human voice parts in the segmented small audios by combining the recognition texts with the VADNN algorithm results;
Step four: searching for the corresponding text segments in the subtitles and matching the subtitle segments of all segmented audios.
Further, in step four, preconditions and parameter settings need to be performed; the specific method comprises:
setting a minimum text segment length, denoted L, checking the length of each recognition text and filtering out recognition texts whose length is less than L;
and setting the first M characters to be used for matching, wherein M is a dynamic value and the initial parameter is set to N.
Further, the corresponding text segment is searched for in the subtitles in four ways, which are respectively:
a. the recognition text exactly matches a text segment in the subtitles;
b. the first character, a middle character, the last character and the text length of the recognition text match a text segment in the subtitles;
c. the first character, the last character and the text length of the recognition text match a text segment in the subtitles;
d. the first character or the last character, together with the text length, of the recognition text matches a text segment in the subtitles;
based on the four cases, the number of the current audio is recorded and the head and tail position information in the subtitles is calculated, wherein case a has the highest confidence and case d the lowest; combining the audio number information of the non-human voice segments, the audio number information determined under conditions a, b, c and d is filled into an array of T elements in priority order, with the high priority covering the low priority;
there are two positioning cases:
A. numbers 00(i-1) and 00(i+1) are matched, and the position information of audio number 00i is located directly;
B. numbers 00(i-2) and 00(i+1) are matched and 00(i-1) is non-human voice, and the position information of audio number 00i is then located;
all number information that can be located is found recursively, the value of M is changed, and all information is confirmed cyclically until the subtitle segments of all segmented audios are found.
Further, the method for acquiring the target audio and video data based on the Internet comprises the following steps:
setting limiting conditions for the target audio and video; acquiring from the Internet the network platforms whose target audio and video meet the limiting conditions, and marking them as candidate platforms; screening the candidate platforms to obtain the candidate platforms to be docked, and marking them as target platforms; providing a data acquisition module comprising a plurality of data acquisition units, each data acquisition unit being associated with a corresponding target platform; the data acquisition units are used for acquiring the target audio and video data in the associated target platforms, identifying the acquired target audio and video data and attaching corresponding identification tags.
Further, the method for screening the candidate platforms comprises the following steps:
marking the candidate platforms as j, where j = 1, 2, …, n and n is a positive integer; acquiring the amount of target audio and video data in each candidate platform, matching the corresponding data magnitude index according to the acquired amount and marking it as LZj; evaluating the implementation cost of docking the candidate platform and marking it as CBj; setting a quality model, acquiring target audio and video data from a number of candidate platforms, inputting them into the quality model for analysis, obtaining the corresponding quality scores and marking them as ZPj; calculating the corresponding priority values according to a priority formula, and sorting the candidate platforms by the calculated priority values; and acquiring the demand magnitude of the client, and selecting the corresponding target platforms according to the demand magnitude of the client and the ranking of the candidate platforms.
Further, the priority formula is Qj = b1 × LZj × (b2 × CBj + b3 × ZPj), where b1, b2 and b3 are all proportionality coefficients with value ranges 0 < b1 ≤ 1, 0 < b2 ≤ 1 and 0 < b3 ≤ 1.
Compared with the prior art, the invention has the beneficial effects that:
the method has the advantages that the corresponding data acquisition channel is established according to the actual condition of a client, the multi-language audio and video data generated in a large amount on the network are fully utilized, the multi-language AI engine is iteratively trained, various economic and time costs generated by the traditional production data are greatly reduced, the automatic extraction of big data such as multi-language audio and video with subtitles, network radio stations and the like is realized, the requirements on the quality and the specialty of a annotator are reduced, the cost is reduced, and the data making efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the iterative feedback engine mechanism of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in FIG. 1 and FIG. 2, a method for extracting multilingual network audio and video data based on an AI engine specifically comprises the following steps:
Step one: acquiring a large amount of target audio and video data based on the Internet, and classifying the target audio and video data to obtain first data and second data; establishing a database for storage;
the method for acquiring a large amount of target audio and video data based on the Internet comprises the following steps:
setting a limiting condition of a target audio and video, wherein the limiting condition is set according to the needs of an AI company, acquiring a network platform of the target audio and video meeting the limiting condition from the Internet, marking the network platform as a platform to be selected, screening the platform to be selected, acquiring a platform to be selected to be butted, and marking the platform to be selected as a target platform; the data acquisition module is arranged and comprises a plurality of data acquisition units, the number of the data acquisition units is the same as that of the target platforms, the data acquisition units are associated with the corresponding target platforms, the data acquisition units are used for acquiring corresponding target audio and video data in the associated target platforms, identifying the acquired target audio and video data and marking corresponding identification tags.
Each data acquisition unit is set up for its corresponding target platform and performs data acquisition and identification within that platform, judging whether a subtitle with a time axis is provided and identifying the language type of the subtitle. Because audio and video data on a network platform generally come with a corresponding introduction that carries subtitle, language and similar information, the specific data acquisition unit can be implemented with existing technology; moreover, the same data acquisition unit can be used for some network platforms whose operating modes are the same, so no detailed description is given.
The method for screening the candidate platforms comprises the following steps:
Marking the candidate platforms as j, where j = 1, 2, …, n and n is a positive integer; acquiring the amount of target audio and video data in each candidate platform, matching the corresponding data magnitude index according to the acquired amount and marking it as LZj; evaluating the implementation cost of docking the candidate platform and marking it as CBj; setting a quality model, acquiring target audio and video data from a number of candidate platforms, inputting them into the quality model for analysis, obtaining the corresponding quality scores and marking them as ZPj; calculating the corresponding priority values according to the priority formula Qj = b1 × LZj × (b2 × CBj + b3 × ZPj), where b1, b2 and b3 are proportionality coefficients with value ranges 0 < b1 ≤ 1, 0 < b2 ≤ 1 and 0 < b3 ≤ 1, and sorting the candidate platforms by the calculated priority values; acquiring the demand magnitude of the client, which is set according to the amount of target audio and video data required by the client, and selecting the corresponding target platforms according to the demand magnitude of the client and the ranking of the candidate platforms.
That is, target platforms are selected by accumulating the data magnitude indexes of the sorted candidate platforms, in ranking order, until the client's demand magnitude is covered.
The corresponding data magnitude index is matched according to the obtained amount of target audio and video data: an expert group divides the possible amounts of target audio and video data into different intervals and sets a data magnitude index for each interval, and the obtained amount is matched to an interval to yield the corresponding data magnitude index.
The implementation cost of docking a candidate platform is evaluated from the costs of the whole process from docking to acquisition, such as the cost of acquiring the corresponding target audio and video, the construction cost and the operation cost; determining such costs is common knowledge in the field.
The quality model is established based on a CNN or DNN network, and its specific construction and training process is common knowledge in the field. The target audio and video data taken from the several candidate platforms for quality evaluation do not require docking with those platforms: because only a small amount of data is needed, it can be obtained directly from the network without incurring implementation cost. However, obtaining data in this way is inefficient and yields only small quantities, so it can only be used for evaluation and does not meet subsequent development requirements. A code sketch of the screening and selection procedure is given below.
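As a concrete illustration of this screening step, the following Python sketch (a minimal sketch, not part of the patent) computes the priority value Qj = b1 × LZj × (b2 × CBj + b3 × ZPj) for each candidate platform and then accumulates data amounts down the ranking until the client's demand magnitude is covered. The field names, the interval table for the data magnitude index and the example coefficient values are illustrative assumptions; the patent fixes only the formula and the constraint 0 < b1, b2, b3 ≤ 1.

from dataclasses import dataclass

@dataclass
class CandidatePlatform:
    name: str
    data_amount_hours: float   # amount of target audio/video data found on the platform
    docking_cost: float        # CBj: evaluated implementation cost of docking (normalized)
    quality_score: float       # ZPj: score returned by the quality model (normalized)

# Illustrative interval table, as would be set by an expert group: (lower bound in hours, index LZj).
MAGNITUDE_INTERVALS = [(0, 1), (100, 2), (1_000, 3), (10_000, 4)]

def data_magnitude_index(amount_hours: float) -> int:
    """Match the acquired data amount to its data magnitude index LZj."""
    index = MAGNITUDE_INTERVALS[0][1]
    for lower_bound, idx in MAGNITUDE_INTERVALS:
        if amount_hours >= lower_bound:
            index = idx
    return index

def priority(p: CandidatePlatform, b1=1.0, b2=0.4, b3=0.6) -> float:
    """Qj = b1 * LZj * (b2 * CBj + b3 * ZPj), with 0 < b1, b2, b3 <= 1 as in the patent."""
    return b1 * data_magnitude_index(p.data_amount_hours) * (b2 * p.docking_cost + b3 * p.quality_score)

def select_target_platforms(platforms, demand_hours: float):
    """Sort by priority and accumulate data amounts until the client's demand magnitude is covered."""
    ranked = sorted(platforms, key=priority, reverse=True)
    selected, covered = [], 0.0
    for p in ranked:
        if covered >= demand_hours:
            break
        selected.append(p)
        covered += p.data_amount_hours
    return selected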
In another embodiment, the target audio-video data may be directly obtained by the existing method.
The target audio and video data are categorized according to whether a subtitle scheme with a time axis is provided: data with a time-axis subtitle scheme are the first data, and data without one are the second data. A possible labeling and storage sketch is shown below.
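The sketch below shows one possible way, assumed purely for illustration, to label an acquired item and split the collected data into first data (with a time-axis subtitle scheme) and second data (without one) before storing them; the item fields and the SQLite storage are assumptions, since the patent does not prescribe a particular database schema.

import sqlite3
from dataclasses import dataclass
from typing import Optional

@dataclass
class AcquiredItem:
    platform: str
    url: str
    has_timed_subtitles: bool         # is a subtitle scheme with a time axis provided?
    subtitle_language: Optional[str]  # identification tag attached by the acquisition unit

def store(items):
    """Classify items into first/second data and store them in a small database."""
    con = sqlite3.connect("av_data.db")
    con.execute("""CREATE TABLE IF NOT EXISTS av_items
                   (platform TEXT, url TEXT, language TEXT, category TEXT)""")
    for it in items:
        category = "first" if it.has_timed_subtitles else "second"
        con.execute("INSERT INTO av_items VALUES (?, ?, ?, ?)",
                    (it.platform, it.url, it.subtitle_language, category))
    con.commit()
    con.close()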
Step two: extracting audio data of the first data and, using a VADNN algorithm together with the subtitles with a time axis, classifying the audio of each time period as normal human voice audio, mixed human voice and environmental sound audio, or non-human voice audio.
The extraction of audio data from network audio and video and the working principle of the corresponding VADNN algorithm are common knowledge in the art and are therefore not described in detail; a rough illustration of the per-period classification is sketched below.
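The patent does not disclose the internals of the VADNN algorithm, so the following sketch only assumes a generic voice activity detection interface that returns a per-frame speech probability; the thresholds and the three labels mirror the classification named in step two, but their exact values are illustrative.

def classify_periods(speech_prob, subtitle_periods, voiced_thr=0.8, mixed_thr=0.3):
    """
    speech_prob(start_s, end_s) -> list of per-frame speech probabilities, assumed to be
    provided by a VAD model standing in for the patent's VADNN algorithm.
    subtitle_periods: list of (start_s, end_s) taken from the subtitles with a time axis.
    Returns one label per period: "human_voice", "mixed" or "non_human_voice".
    """
    labels = []
    for start, end in subtitle_periods:
        probs = speech_prob(start, end)
        voiced_ratio = sum(p > 0.5 for p in probs) / max(len(probs), 1)
        if voiced_ratio >= voiced_thr:
            labels.append("human_voice")      # predominantly clean speech
        elif voiced_ratio >= mixed_thr:
            labels.append("mixed")            # speech mixed with environment sound
        else:
            labels.append("non_human_voice")  # music, noise, silence, ...
    return labels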
Step three: extracting audio data of the second data, dividing the long audio into small audio segments by using the VADNN algorithm, and numbering them from front to back as 001, 002, …, 00T; sending the T segmented audios to a multilingual speech recognition engine for recognition to obtain the recognition text corresponding to each numbered audio, and determining the non-human voice parts in the segmented small audios by combining the recognition texts with the VADNN algorithm results (see the sketch below).
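A minimal sketch of this numbering and recognition step is given below, under the assumption that the VADNN segmentation and the multilingual speech recognition engine are available as callables; the names recognize and speech_ratio, and the threshold for deciding non-human voice, are illustrative and not specified by the patent.

def segment_and_recognize(segments, recognize, speech_ratio, non_voice_thr=0.2):
    """
    segments: list of (start_s, end_s) obtained by cutting the long audio with the VAD
    recognize: callable (start_s, end_s) -> recognized text (multilingual ASR engine)
    speech_ratio: callable (start_s, end_s) -> fraction of voiced frames (from the VAD)
    Returns a dict mapping numbers "001".."00T" to span, recognition text and voice flag.
    """
    results = {}
    for i, (start, end) in enumerate(segments, start=1):
        number = f"{i:03d}"                    # numbered from front to back: 001, 002, ..., 00T
        text = recognize(start, end).strip()
        # A segment is treated as non-human voice when the VAD finds little speech
        # or the recognizer returns nothing usable.
        is_voice = speech_ratio(start, end) > non_voice_thr and bool(text)
        results[number] = {"span": (start, end), "text": text, "is_voice": is_voice}
    return results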
Step four: searching for the corresponding text segments in the subtitles.
Preconditions and parameter settings are performed first:
a minimum text segment length L is set, and recognition texts whose length is less than L are filtered out; this avoids wrong positioning information caused by a too-short text segment matching multiple segments in the subtitles;
the first M characters are set for matching, where M is a dynamic value such as 1, 2, 3, …, N; the initial parameter is set to N.
The corresponding text segment is searched for in the subtitles in four ways:
a. the recognition text exactly matches a certain text segment in the subtitles;
b. the first character, a middle character, the last character and the text length of the recognition text match a certain text segment in the subtitles;
c. the first character, the last character and the text length of the recognition text match a certain text segment in the subtitles;
d. the first character or the last character, together with the text length, of the recognition text matches a certain text segment in the subtitles;
In each of the above four cases, the number of the current audio is recorded and the head and tail position information in the subtitles is calculated; case a has the highest confidence and case d the lowest. Combining the audio number information of the non-human voice segments, the audio number information determined under conditions a, b, c and d is filled into an array of T elements in priority order, with the high priority covering the low priority; the specific parts not disclosed here are common knowledge in the field. Two positioning cases occur:
A. numbers 00(i-1) and 00(i+1) are matched, and the position information of audio number 00i is located directly;
B. numbers 00(i-2) and 00(i+1) are matched and 00(i-1) is non-human voice, and the position information of audio number 00i is then located;
All number information that can be located is found recursively, the value of M is changed, and all information is confirmed cyclically until the optimal subtitle segments of all segmented audios are found; a code sketch of this matching procedure follows this paragraph.
The above formulas are all calculated on dimensionless numerical values; each formula is obtained by collecting a large amount of data and performing software simulation so as to approximate the real situation as closely as possible, and the preset parameters and preset thresholds in the formulas are set by those skilled in the art according to the actual situation or obtained by simulation on a large amount of data.
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims (6)

1. A method for extracting multilingual network audio and video data based on an AI engine is characterized by comprising the following specific steps:
Step one: acquiring target audio and video data based on the Internet, and classifying the target audio and video data to obtain first data and second data; establishing a database for storage;
Step two: extracting audio data of the first data and, using a VADNN algorithm together with the subtitles with a time axis, classifying the audio of each time period as normal human voice audio, mixed human voice and environmental sound audio, or non-human voice audio;
Step three: extracting audio data of the second data, dividing the long audio into small audio segments by using the VADNN algorithm, and numbering them from front to back as 001, 002, …, 00T; sending the T segmented audios to a multilingual speech recognition engine for recognition to obtain the recognition text corresponding to each numbered audio, and determining the non-human voice parts in the segmented small audios by combining the recognition texts with the VADNN algorithm results;
Step four: searching for the corresponding text segments in the subtitles and matching the subtitle segments of all segmented audios.
2. The method for extracting multilingual network audio and video data based on an AI engine according to claim 1, wherein preconditions and parameter settings are required in step four; the specific method comprises:
setting a minimum text segment length, denoted L, checking the length of each recognition text and filtering out recognition texts whose length is less than L;
and setting the first M characters to be used for matching, wherein M is a dynamic value and the initial parameter is set to N.
3. The method for extracting multilingual network audio and video data based on an AI engine according to claim 2, wherein the corresponding text segment is searched for in the subtitles in four ways, which are respectively:
a. the recognition text exactly matches a text segment in the subtitles;
b. the first character, a middle character, the last character and the text length of the recognition text match a text segment in the subtitles;
c. the first character, the last character and the text length of the recognition text match a text segment in the subtitles;
d. the first character or the last character, together with the text length, of the recognition text matches a text segment in the subtitles;
based on the four cases, the number of the current audio is recorded and the head and tail position information in the subtitles is calculated, wherein case a has the highest confidence and case d the lowest; combining the audio number information of the non-human voice segments, the audio number information determined under conditions a, b, c and d is filled into an array of T elements in priority order, with the high priority covering the low priority;
there are two positioning cases:
A. numbers 00(i-1) and 00(i+1) are matched, and the position information of audio number 00i is located directly;
B. numbers 00(i-2) and 00(i+1) are matched and 00(i-1) is non-human voice, and the position information of audio number 00i is then located;
all number information that can be located is found recursively, the value of M is changed, and all information is confirmed cyclically until the subtitle segments of all segmented audios are found.
4. The method for extracting multilingual network audio and video data based on an AI engine according to claim 1, wherein the method for acquiring the target audio and video data based on the Internet comprises:
setting limiting conditions for the target audio and video; acquiring from the Internet the network platforms whose target audio and video meet the limiting conditions, and marking them as candidate platforms; screening the candidate platforms to obtain the candidate platforms to be docked, and marking them as target platforms; providing a data acquisition module comprising a plurality of data acquisition units, each data acquisition unit being associated with a corresponding target platform; the data acquisition units are used for acquiring the corresponding target audio and video data in the associated target platforms, identifying the acquired target audio and video data and attaching corresponding identification tags.
5. The method for extracting multilingual network audio and video data based on an AI engine according to claim 4, wherein the method for screening the candidate platforms comprises:
marking the candidate platforms as j, where j = 1, 2, …, n and n is a positive integer; acquiring the amount of target audio and video data in each candidate platform, matching the corresponding data magnitude index according to the acquired amount and marking it as LZj; evaluating the implementation cost of docking the candidate platform and marking it as CBj; setting a quality model, acquiring target audio and video data from a number of candidate platforms, inputting them into the quality model for analysis, obtaining the corresponding quality scores and marking them as ZPj; calculating the corresponding priority values according to a priority formula, and sorting the candidate platforms by the calculated priority values; and acquiring the demand magnitude of the client, and selecting the corresponding target platforms according to the demand magnitude of the client and the ranking of the candidate platforms.
6. The method for extracting multilingual network audio and video data based on an AI engine according to claim 5, wherein the priority formula is Qj = b1 × LZj × (b2 × CBj + b3 × ZPj), where b1, b2 and b3 are all proportionality coefficients with value ranges 0 < b1 ≤ 1, 0 < b2 ≤ 1 and 0 < b3 ≤ 1.
CN202210947268.0A (filed 2022-08-09, priority date 2022-08-09): Method for extracting multilingual network audio and video data based on AI engine. Status: Pending. Published as CN115238129A (en).

Priority Applications (1)

Application number: CN202210947268.0A; priority date: 2022-08-09; filing date: 2022-08-09; title: Method for extracting multilingual network audio and video data based on AI engine

Publications (1)

Publication number: CN115238129A; publication date: 2022-10-25

Family

ID=83678639

Family Applications (1)

Application number: CN202210947268.0A; title: Method for extracting multilingual network audio and video data based on AI engine; priority date: 2022-08-09; filing date: 2022-08-09

Country Status (1)

Country: CN; publication: CN (1) CN115238129A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination