CN116486812A - Automatic generation method and system for multi-field lip language recognition sample based on corpus relation - Google Patents
- Publication number
- CN116486812A CN116486812A CN202310295664.4A CN202310295664A CN116486812A CN 116486812 A CN116486812 A CN 116486812A CN 202310295664 A CN202310295664 A CN 202310295664A CN 116486812 A CN116486812 A CN 116486812A
- Authority
- CN
- China
- Prior art keywords
- corpus
- lip
- sample
- file
- domain
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a multi-domain lip language recognition sample automatic generation method and system based on corpus relations, and relates to the technical field of lip language recognition. The method comprises the following steps: collecting data, preprocessing it, and storing it as an initial corpus; constructing a corpus, and processing it based on a basic dictionary and the relevant domain-specific dictionary to form a domain corpus; synthesizing speech, generating a voice sample for each text in the corpus; generating lip shapes, combining a digital face with each voice file to output the corresponding digital lip shape; annotating labels and generating a database, labeling the digital lip shapes to form a lip dataset; and incrementally updating the corpus to build a lip dataset that can be dynamically expanded and kept up to date. The method solves the problems of heavy workload and low efficiency in existing self-built lip language datasets, and greatly increases the number and category diversity of lip-shape samples, thereby effectively improving the generalization ability of lip language recognition models.
Description
Technical Field
The invention relates to the technical field of lip language identification, in particular to a multi-field lip language identification sample automatic generation method and system based on corpus relation.
Background
Lip recognition, also known as visual speech recognition, recognizes speech from silent video. The technology has very wide application prospect in a plurality of fields, including improving voice recognition in noisy environments, realizing silent dictation or dubbing and transcribing videos. In addition, it has important medical applications, and for people with vocal cord injury or hearing impairment, lip language is one of the effective non-verbal communication methods in their daily lives, and the development of this technology is also intended to help people with speech impairment live more conveniently.
Reviewing the development of lip recognition research, audiovisual datasets have tended to be the bottleneck restricting performance improvements on lip recognition tasks. Early datasets had limited vocabularies, and the acquisition environment was confined to laboratory scenarios. Vocabulary richness, category diversity, and authenticity of the collection environment later became the focus of new dataset construction, with the aim of advancing solutions to many problems in lip language recognition. In general, a training dataset should satisfy the following principles: accuracy, comprehensiveness, and representativeness. That is, the constructed lip dataset should be able to describe any sentence sequence accurately and representatively, and ensure that there is enough data to support reliable model predictions, while covering general features for which no samples occur.
Existing lip language datasets have the following problems: (1) Common public datasets draw on wide data sources and have a certain scale, but the massive unstructured data actually contains a large amount of useful information; training samples obtained without curation are often unbalanced in distribution, and comprehensiveness is difficult to guarantee. (2) Because public datasets are limited, most lip language recognition research is based on self-built datasets annotated manually, which involves a heavy workload, low efficiency, and frequent labeling errors, and the acquisition process often raises face-privacy concerns. (3) To date, most lip recognition work has focused on a single language, owing on the one hand to variability between languages and on the other hand to the lack of sample-balanced cross-language lip datasets. To improve the accuracy of lip language recognition, it is essential to establish a reasonable and accurate lip language dataset. Thus, exploring a suitable lip dataset construction method is an urgent problem for those skilled in the art.
Disclosure of Invention
The invention aims to solve the technical problems of the prior art, and provides a multi-field lip language recognition sample automatic generation method and system based on corpus relation, which can be suitable for various application scenes and effectively improve the quantity and quality of training samples of a lip language recognition model.
In order to solve the above technical problems, the invention adopts the following technical scheme. In one aspect, the invention provides a multi-domain lip language recognition sample automatic generation method based on corpus relations: text files on the network are collected through an automatic data collection pipeline and, after data preprocessing, stored as initial corpus data; sentence texts whose importance exceeds a set threshold are identified in the initial corpus data based on a basic dictionary and the relevant domain-specific dictionary to form a domain corpus; speech synthesis is performed on each sentence text in the domain corpus to generate a voice sample of the corresponding text content; each generated voice sample is combined with a digital face through the Wav2Lip technique to generate a corresponding digital lip sample; the synthesized digital lip samples are labeled to generate a lip dataset; and the existing lip corpus is incrementally updated to build a lip dataset that can be dynamically expanded and kept up to date. The method specifically comprises the following steps:
step 1: collecting and preprocessing corpus data and storing the corpus data as an initial corpus file;
according to the field keywords, a large number of published text files in the related field are collected through an automatic data collection pipeline, and after the collected text files are subjected to data preprocessing, corpus data of a required scale are obtained and stored as initial corpus files;
the specific method for preprocessing the data of the collected text file comprises the following steps:
truncating the collected text files as required, in descending order of file size, and performing text parsing and extraction on each file; taking the page paragraph as the basic unit, removing duplicates and deleting irrelevant information while preserving the continuity of the text content, dividing the paragraphs into sentences according to punctuation marks, and storing the sentences as an initial corpus file;
step 2: constructing a domain corpus based on the initial corpus file;
performing word segmentation, corpus labeling and stop-word removal on the text content of the initial corpus file based on a basic dictionary and a domain-specific dictionary, performing word frequency statistics on the segmented corpus, and retaining the high-frequency words and corresponding sentences in the current text content to form a domain corpus;
the word frequency statistics method for the segmented corpus comprises the following steps:
counting word frequency information of word segmentation in all corpus files and single corpus files, and calculating importance of the word segmentation by using a TF-IDF statistical method, wherein the importance is represented by the following formula:
TF-IDF = TF · IDF, with TF = n/N and IDF = log(D/(d+1))
wherein the TF value represents the frequency of occurrence of a given word in a single file, i.e. the term frequency; the IDF value reflects how rarely the given word appears across all files, i.e. the inverse document frequency; n represents the number of occurrences of the given word in a single corpus file, N the total number of word occurrences in that file, D the total number of corpus files, and d the number of corpus files containing the given word (the +1 avoids division by zero);
step 3: performing voice synthesis on the domain corpus to generate a voice file;
the interface of the speech technology is called to perform speech synthesis on the domain corpus constructed in step 2 and generate a voice file; when the text content of a single corpus file exceeds the specified length, the interface is called multiple times and the results are merged into one voice file; after a file is successfully synthesized, the interface is requested to return the begin and end timestamps of each word and sentence of the text within the voice file, and these are stored;
step 4: generating a digital lip based on the voice file;
the standard digital face model is adopted as a generated image basis to remove interference of irrelevant factors, and the voice file generated in the step 3 is converted into Lip animation based on the Wav2Lip synchronous model, so that the required Lip video is obtained;
the Wav2Lip synchronous model adopts a mode of combining PaddleGAN and Omniverse Audio2Face, a corresponding operation environment is configured, a required library package is introduced, a standard digital Face model and a voice file are introduced into the model, and a Face rotation angle is selected to generate a digital Lip;
the face rotation angle can be selected to be 0 degrees, 30 degrees, 45 degrees, 60 degrees, 75 degrees and 90 degrees;
step 5: performing label annotation on the digital lip to form a lip database;
taking the digital lip video generated in the step 4 as a sample of a lip language database, taking the word frequency statistic value in the step 2 and the timestamp in the step 3 as auxiliary labels, combining the real text content to form a sample label, and realizing the mapping of the final digital lip and the label to form a lip language database;
the sample tag consists of text content, language type, sentence metric value and video duration; wherein the language category comprises a plurality of languages; the sentence metric is calculated by word frequency statistics of all words composing the sample, and provides importance metric for the sample; the video duration comprises the total duration of a given lip video and the start and end time stamps of all the word segments in the video;
step 6: incrementally updating the corpus and expanding the lip database;
for the newly added text data, the operations of the step 1 and the step 2 are executed, and the corpus is intercepted again as required by combining the original corpus and the incremental corpus; for the newly generated corpus, steps 3 to 5 are repeated to realize expansion of the lip database.
On the other hand, the invention also provides a multi-field lip language identification sample automatic generation system based on corpus relation, which comprises an information acquisition module, a first corpus storage module, a second corpus storage module, a corpus extraction module, a lip-shaped sample generation module, a label annotation module, a third lip-shaped video storage module and a fourth lip-shaped video storage module:
the information acquisition module acquires massive text information in an acquisition range in various forms and processes the text information to obtain field corpus information; the collection range comprises a currently existing public lip language data set and text files related to field keywords;
the first corpus storage module is used for storing corpus information of the existing public lip language data set and the original lip video acquired by the information acquisition module;
the second corpus storage module is used for storing the domain corpus information acquired by the information acquisition module;
the corpus extraction module is used for performing word segmentation and word frequency calculation on the domain corpus information in the second corpus storage module, and updating the second corpus storage module after obtaining a domain corpus;
the lip sample generation module is used for generating a digital lip sample file for the corpus information stored in the first corpus storage module and the domain corpus stored in the second corpus storage module;
the label annotation module is used for carrying out label annotation on the generated digital lip sample file to form a field lip database and storing the field lip database into the fourth lip video storage module;
the third lip video storage module is used for storing the digital lip shapes generated by the lip sample generation module from the corpus information of the first corpus storage module, and combining them with the corresponding original lip videos to obtain a new lip database;
the fourth lip-shaped video storage module is used for storing a field lip-shaped database obtained after the field corpus information of the second corpus storage module is processed by the sample generation module and the tag annotation module.
The beneficial effects produced by adopting the above technical scheme are as follows. According to the corpus-relation-based multi-domain lip data automatic generation method for lip language recognition provided by the invention: (1) extraction of the relevant characteristics of an original corpus from large-scale data is supported, a professional corpus can be constructed for each domain, and a balanced corpus with uniformly distributed text characteristics is realized; (2) automatic data acquisition and intelligent generation of digital lip shapes are realized, which effectively reduces the labor cost of constructing a dataset, and incremental management of the corpus improves the sample number and category diversity available to lip language recognition research; (3) training and testing datasets can be provided for lip language recognition models, and combining original lip video samples with generated digital lip video samples can effectively improve the generalization ability of a lip language recognition model.
Drawings
Fig. 1 is a schematic flow chart of a multi-domain lip data automatic generation method based on corpus relation for lip language identification, which is provided by the embodiment of the invention;
FIG. 2 is a digital face model for use in synthesizing lips according to an embodiment of the present invention;
FIG. 3 is a digital lip sample generated in accordance with an embodiment of the present invention;
fig. 4 is a structural block diagram of a multi-domain lip data automatic generation system based on corpus relation for lip language identification provided by the embodiment of the invention;
fig. 5 is a schematic diagram of a sample generation flow of a lip language recognition model training method according to an embodiment of the present invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
In this embodiment, a method for automatically generating a multi-domain lip language recognition sample based on a corpus relationship, as shown in fig. 1, specifically includes the following steps:
step 1: collecting and preprocessing corpus data and storing the corpus data as an initial corpus file;
according to field keywords (such as daily expressions, criminal investigation information, news interviews and the like), acquiring a large number of text files in the related field through an automatic data collecting pipeline, preprocessing the acquired text files, acquiring corpus data of a required scale, and storing the corpus data as an initial corpus file;
in this embodiment, the automatic data collection pipeline is implemented using Python-based crawler technology. First, a random user-agent is set to simulate a browser, the target domain keywords are searched on sites such as Baidu and Weibo, and the URLs (uniform resource locators) of the web pages whose information needs to be captured are stored; then a request is sent to each website, the web page content is returned in binary form, the text content is extracted from the page, and the result is stored as a file;
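The collection pipeline of this embodiment can be sketched as follows. The user-agent strings and the `extract_text` helper are illustrative assumptions, not the patent's actual implementation; only the overall flow (random user-agent, binary response, text extraction) comes from the description above.

```python
# Sketch of the automatic data collection pipeline (step 1).
import random
import urllib.request
from html.parser import HTMLParser

USER_AGENTS = [  # rotated to simulate different browsers (example values)
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

class TextExtractor(HTMLParser):
    """Collects visible text, skipping script/style content."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], False
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True
    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    """Extract the visible text content from one web page."""
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)

def fetch(url: str) -> str:
    """Request a stored URL with a random user-agent; body arrives in binary."""
    req = urllib.request.Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

The extracted text would then be written to files and handed to the preprocessing step.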
in this embodiment, the specific method for preprocessing the data of the collected text file includes:
truncating the collected text files as required, in descending order of file size, and performing text parsing and extraction on each file; taking the page paragraph as the basic unit, removing duplicates and deleting irrelevant information (such as tables of contents, pictures, and website-structure content) while preserving the continuity of the text content, and dividing the paragraphs into sentences according to punctuation marks, taking into account the influence of special characters on context semantics (for example, the English apostrophe ' can indicate the possessive case of a noun) so as to avoid meaningless splits; the result obtained after data cleaning is stored as an initial corpus file in csv format;
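A minimal sketch of this preprocessing: paragraph-level de-duplication, punctuation-based sentence splitting that leaves apostrophes alone, and CSV output. The terminator set is an assumption covering common Chinese and English sentence-ending marks.

```python
# Preprocessing sketch for step 1: dedupe paragraphs, split into sentences,
# store as an initial corpus file (csv).
import csv
import re

SENTENCE_END = r"(?<=[。！？.!?])\s*"  # split only after terminal punctuation

def preprocess(paragraphs):
    """Return the sentence list obtained from deduplicated paragraphs."""
    seen, sentences = set(), []
    for para in paragraphs:
        para = para.strip()
        if not para or para in seen:  # drop empty and duplicate paragraphs
            continue
        seen.add(para)
        # The apostrophe is not a terminator, so "the model's lips"
        # stays inside one sentence instead of being split meaninglessly.
        for s in re.split(SENTENCE_END, para):
            if s:
                sentences.append(s)
    return sentences

def save_corpus(sentences, path):
    """Write one sentence per row, matching the csv format described above."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([s] for s in sentences)
```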
step 2: constructing a domain corpus based on the initial corpus file;
performing word segmentation, corpus labeling and stop-word removal on the text content of the initial corpus file based on a basic dictionary and a domain-specific dictionary, performing word frequency statistics on the segmented corpus, and retaining the high-frequency words and corresponding sentences in the current text content to form a domain corpus;
in this embodiment, after word segmentation, corpus labeling and stop-word removal are performed on the corpus documents, a preliminary domain professional corpus is formed; before word frequency statistics, the documents can be spot-checked by sampling to catch errors that may have occurred during processing, improving the construction quality of the corpus;
in this embodiment, the word frequency statistics method for the segmented corpus includes:
counting word frequency information of word segmentation in all corpus files and single corpus files, and calculating importance of the word segmentation by using a TF-IDF statistical method, wherein the importance is represented by the following formula:
TF-IDF = TF · IDF, with TF = n/N and IDF = log(D/(d+1))
wherein the TF value represents the frequency of occurrence of a given word in a single file, i.e. the term frequency; the IDF value reflects how rarely the given word appears across all files, i.e. the inverse document frequency; n represents the number of occurrences of the given word in a single corpus file, N the total number of word occurrences in that file, D the total number of corpus files, and d the number of corpus files containing the given word (the +1 avoids division by zero); TF-IDF thus means term frequency-inverse document frequency: the larger the TF-IDF value, the more frequently the word occurs in the corresponding file and the less frequently it occurs in other files, so common words are filtered out and the keywords of each file are retained as the domain corpus;
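Under the definitions above (n, N, D, d), the word-importance computation might be implemented as follows; the logarithmic IDF with +1 smoothing is an assumption, since the text only states TF-IDF = TF·IDF.

```python
# TF-IDF sketch: TF = n/N per file, IDF = log(D/(d+1)) (smoothing assumed).
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists (already segmented). Returns one
    {word: score} dict per corpus file."""
    D = len(docs)                  # total number of corpus files
    df = Counter()                 # d: number of files containing each word
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)      # n for each word in this file
        N = len(doc)               # total word occurrences in this file
        scores.append({w: (n / N) * math.log(D / (df[w] + 1))
                       for w, n in counts.items()})
    return scores
```

Words with high scores in one file (frequent there, rare elsewhere) are kept as the file's keywords; words with low or negative scores are the common words being filtered out.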
step 3: performing voice synthesis on the domain corpus to generate a voice file;
calling the interface of Baidu speech technology to perform speech synthesis on the domain corpus constructed in step 2 and generate a voice file in mp4 format; when the text content of a single corpus file exceeds the specified length, the interface is called multiple times and the results are merged into one voice file; after a file is successfully synthesized, the interface is requested to return the begin and end timestamps of each word and sentence of the text within the voice file, and these are stored;
in this embodiment, short texts are synthesized online via a POST request to the speech synthesis interface http://tsn.baidu.com/text2audio; since step 2 stores the high-frequency words and their corresponding sentences of the corpus text, the text length is checked before synthesis so as to switch to long-text online synthesis and avoid errors caused by overly long sentences; to cover all special characters reliably, the text is URL-encoded twice when passed to the interface;
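The endpoint and the double URL-encoding come from the description above; the request parameter names (tex/tok/cuid/ctp/lan) follow Baidu's TTS REST API, and the length threshold is an illustrative assumption.

```python
# Sketch of the short-text synthesis call in step 3.
import urllib.parse
import urllib.request

TTS_URL = "http://tsn.baidu.com/text2audio"
MAX_SHORT_LEN = 120  # assumed cut-off before switching to long-text synthesis

def encode_text(text: str) -> str:
    """URL-encode twice so every special character survives transport."""
    return urllib.parse.quote_plus(urllib.parse.quote_plus(text))

def synthesize_short(text: str, token: str) -> bytes:
    """POST one corpus sentence to the synthesis interface."""
    if len(text) > MAX_SHORT_LEN:
        raise ValueError("use the long-text (batch) interface instead")
    body = urllib.parse.urlencode({
        "tex": encode_text(text),   # doubly URL-encoded corpus sentence
        "tok": token,               # access token (assumed parameter name)
        "cuid": "lipgen", "ctp": 1, "lan": "zh",
    }).encode()
    with urllib.request.urlopen(TTS_URL, data=body) as resp:
        return resp.read()          # audio bytes on success, JSON error otherwise
```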
the synthesis results can be queried in batches according to an array of task_id values; the request interface is https://aip.baidubce.com/rpc/2.0/tts/v1/query, the request method is POST, and the information required from the returned result has the following structure:
step 4: generating a digital lip based on the voice file;
in order to make the generated dataset focus on the features of lip deformation, a standard digital face model is adopted as the basis of the generated images to remove the interference of irrelevant factors such as gender, age and illumination conditions; the voice file generated in step 3 is converted into lip animation based on the Wav2Lip synchronization model to obtain the required lip video, with the video frame rate set to 25 fps and the resolution to 360 x 240;
as shown in FIG. 2, the digital face model adopted in this embodiment supports face rotation angles of 0°, 30°, 45°, 60°, 75° and 90°; the Wav2Lip synchronization model combines PaddleGAN and Omniverse Audio2Face: the corresponding operating environment is configured, the required library packages are imported, the standard digital face model and the voice file are fed into the model, and a face rotation angle is selected to generate the digital lip shape; the lip shapes generated by PaddleGAN are more recognizable for Chinese, while Audio2Face performs better for Latin-script languages, so PaddleGAN is used to synthesize the lip videos of Chinese, Japanese and Korean, and Audio2Face is used in all other cases; FIG. 3 shows a lip sample synthesized in this embodiment.
In this embodiment, the PaddleGAN-based video lip synchronization model requires the environment to be configured before use: first download PaddleGAN and install the related packages, then replace the face parameter and the audio parameter of the lip-motion synthesis command with the paths of the digital face model and the voice file respectively, and run the command to generate a video synchronized with the audio. Generation efficiency depends on the performance of the computing device; in this embodiment an NVIDIA GeForce RTX 2060 SUPER is used, and the time required to generate a lip video is about 2.5 times the duration of the corresponding audio;
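Filling the face and audio parameters of the lip-motion command can be sketched as below; the script path and flag names follow PaddleGAN's application tools but should be treated as assumptions to verify against the installed version.

```python
# Sketch of invoking PaddleGAN's Wav2Lip lip-motion tool (step 4).
import subprocess
from pathlib import Path

def build_wav2lip_cmd(face: Path, audio: Path, outfile: Path) -> list:
    """Substitute the face/audio parameters into the synthesis command."""
    return [
        "python", "applications/tools/wav2lip.py",  # path inside the PaddleGAN repo (assumed)
        "--face", str(face),        # standard digital face model
        "--audio", str(audio),      # voice file synthesized in step 3
        "--outfile", str(outfile),  # generated lip video, synchronized with the audio
    ]

def generate_lip_video(face: Path, audio: Path, outfile: Path) -> None:
    # Generation time scales with hardware; roughly 2.5x the audio
    # duration on the RTX 2060 SUPER used in this embodiment.
    subprocess.run(build_wav2lip_cmd(face, audio, outfile), check=True)
```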
in this embodiment, the video lip synchronization model based on Omniverse Audio2Face can directly perform synchronized audio-video synthesis; the recording software Bandicam is used, settings such as the recording area and file naming are selected, and the synthesized video is recorded and saved by a script program;
step 5: performing label annotation on the digital lip to form a lip database;
taking the digital lip video generated in the step 4 as a sample of a lip language database, taking the word frequency statistic value in the step 2 and the timestamp in the step 3 as auxiliary labels, combining the real text content to form a sample label, and realizing the mapping of the final digital lip and the label to form a lip language database;
the sample tag consists of the text content, the language category, the sentence metric value and the video duration; the language category covers multiple languages (Chinese and English in this embodiment); the sentence metric is calculated from the word frequency statistics of all the words composing the sample and provides an importance measure for the sample; the video duration comprises the total duration of the given lip video and the begin and end timestamps of all segmented words in the video, providing a reference when blank or meaningless segments are removed while processing data in lip language recognition research;
a lip sample label synthesized by the embodiment has the following structural form:
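The embodiment's actual label listing is not preserved in this text. A hypothetical label covering the four fields described in step 5 (text content, language category, sentence metric, video duration with per-word timestamps) might look like the following; all field names and concrete values are invented for illustration.

```python
# Hypothetical lip-sample label matching the four fields of step 5.
sample_label = {
    "text": "hello world",        # real text content spoken in the video
    "language": "en",             # language category (e.g. zh / en)
    "sentence_metric": 0.37,      # importance from TF-IDF word statistics
    "duration": {
        "total": 1.48,            # total lip-video length in seconds
        "words": [                # begin/end timestamp of each segmented word
            {"w": "hello", "begin": 0.00, "end": 0.62},
            {"w": "world", "begin": 0.66, "end": 1.40},
        ],
    },
}
```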
step 6: incrementally updating the corpus and expanding the lip database;
for the newly added text data, the operations of the step 1 and the step 2 are executed, and the corpus is intercepted again as required by combining the original corpus and the incremental corpus; for the newly generated corpus, steps 3 to 5 are repeated to realize expansion of the lip database.
In this embodiment, as shown in fig. 4, the system for automatically generating a multi-domain lip language recognition sample based on a corpus relationship includes an information acquisition module, a first corpus storage module, a second corpus storage module, a corpus extraction module, a lip-shaped sample generation module, a tag annotation module, a third lip-shaped video storage module and a fourth lip-shaped video storage module:
the information acquisition module acquires massive text information in an acquisition range through various forms such as web crawlers, file batch uploading, manual input and the like, and processes the text information to obtain field corpus information; the collection range comprises a currently existing public lip language data set and text files related to field keywords;
the first corpus storage module is used for storing corpus information of the existing public lip language data set and the original lip video acquired by the information acquisition module;
the second corpus storage module is used for storing the domain corpus information acquired by the information acquisition module;
the corpus extraction module is used for performing word segmentation and word frequency calculation on the domain corpus information in the second corpus storage module, and updating the second corpus storage module after obtaining a domain corpus;
the lip sample generation module is used for generating a digital lip sample file for the corpus information stored in the first corpus storage module and the domain corpus stored in the second corpus storage module;
the label annotation module is used for carrying out label annotation on the generated digital lip sample file to form a field lip database and storing the field lip database into the fourth lip video storage module;
the third lip-shaped video storage module is used for storing the digital lip samples generated by the sample generation module from the corpus information of the first corpus storage module, and combining them with the corresponding original lip-shaped videos to obtain a new lip-shaped database;
the fourth lip-shaped video storage module is used for storing a field lip-shaped database obtained after the field corpus information of the second corpus storage module is processed by the sample generation module and the tag annotation module.
In this embodiment, the information collection module includes an automatic data collection pipeline module related to a web crawler, a file uploading module for file batch processing, an input module for manually inputting text, and a corpus information classification storage module for collected text content. The corpus information classification storage module transmits the text of the current public lip language data set to the first corpus storage module, and transmits the text related to the field to the second corpus storage module;
the corpus extraction module comprises a domain keyword library and a corpus processing module, wherein the keyword library is used for storing a basic dictionary and a domain-specific dictionary, and the corpus processing module performs corpus processing, namely word segmentation, word frequency calculation and high-frequency word screening, on the corpus information of the second corpus storage module; the obtained domain corpus is stored back into the second corpus storage module;
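The word-importance screening in the corpus extraction module can be sketched with a plain TF-IDF computation. Whitespace tokenization stands in for dictionary-based word segmentation, and the documents and function name are illustrative:

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]], doc_index: int, word: str) -> float:
    """TF-IDF = (n / N) * log(D / d): n occurrences of the word in one document,
    N tokens in that document, D documents in total, d documents containing the word."""
    counts = Counter(docs[doc_index])
    tf = counts[word] / len(docs[doc_index])
    d = sum(1 for doc in docs if word in doc)
    idf = math.log(len(docs) / d) if d else 0.0
    return tf * idf

docs = [text.split() for text in ["lip video lip", "video corpus", "corpus sample"]]
score = tf_idf(docs, 0, "lip")  # "lip" is frequent in doc 0 and absent elsewhere, so it scores high
```

Words whose score exceeds the threshold would be kept as high-frequency domain words, together with the sentences that contain them.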
the second corpus storage module manages the corpus according to the domain classification, allows incremental update of the corpus, performs word frequency sequencing on newly added corpus information and the corpus in the existing domain corpus, and expands the corresponding domain corpus as required;
the lip language sample generation module encapsulates steps 3 and 4 of the multi-field lip language recognition sample automatic generation method based on the corpus relation, forming a voice synthesis module and a lip generation module. First, the voice synthesis module judges whether a corpus sample already has a corresponding audio file; if so, the file is passed directly to the lip generation module, otherwise a voice file is synthesized. Then, the lip generation module imports its initialization information, including the choice of lip synchronization model and digital face model, the video frame rate and the resolution, and generates and stores a digital lip sample file. Samples generated from the corpus in the second corpus storage module are transmitted to the tag annotation module, while samples generated from the corpus in the first corpus storage module are transmitted directly to the third lip video storage module;
step 5 in the automatic generation method of the multi-field lip language recognition sample based on the corpus relation is packaged by the label annotation module, the structured labels and the digital lips are mapped one by one, and the structured labels and the digital lips are transmitted to the fourth lip video storage module for storage;
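The structured label mapped to each digital lip can be sketched as a small JSON record holding the fields claim 7 describes (text content, language category, sentence metric, duration, per-word timestamps). The field names and values here are illustrative, not the patent's actual schema:

```python
import json

def make_sample_label(text: str, language: str, word_freqs: dict,
                      duration: float, timestamps: dict) -> str:
    """Build one structured label; the sentence metric is derived from the
    word-frequency statistics of the words that compose the sample."""
    label = {
        "text": text,
        "language": language,
        "sentence_metric": sum(word_freqs.get(w, 0) for w in text.split()),
        "duration": duration,
        "timestamps": timestamps,  # start/end time of each word in the video
    }
    return json.dumps(label, ensure_ascii=False)

label = make_sample_label(
    "lip corpus", "en",
    word_freqs={"lip": 6, "corpus": 5},
    duration=1.2,
    timestamps={"lip": [0.0, 0.5], "corpus": [0.5, 1.2]},
)
```

Serializing the label alongside the video file gives the one-to-one mapping stored in the fourth lip video storage module.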
the third lip-shaped video storage module stores the sample data obtained by the sample generation module in association with the original data set of the existing public lip language data set, so that the combined data can serve as a training data set in later lip language recognition research.
In this embodiment, the multi-domain lip language recognition sample automatic generation system based on corpus relationship may be used to generate a lip language data set serving as training samples and test samples for a lip language recognition model. As shown in fig. 5, the training method of the lip language recognition model is based on the existing public lip language data set: the public data set is stored in the first corpus storage module, and the final training and test samples are obtained from the third lip video storage module.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions, which are defined by the scope of the appended claims.
Claims (8)
1. A multi-field lip language identification sample automatic generation method based on corpus relation is characterized in that: collecting text files in a network through an automatic data collection pipeline, preprocessing the data, and storing the preprocessed data as initial corpus data; recognizing sentence texts whose importance degree is larger than a set threshold in the initial corpus data based on a basic dictionary and a related domain-specific dictionary to form a domain corpus; performing voice synthesis on each sentence text in the domain corpus to generate a voice sample of the corresponding text content; intelligently generating a corresponding digital Lip sample from the generated voice sample combined with a digital face through the Wav2Lip technology; labeling the synthesized digital lip samples to generate a lip data set; and incrementally updating the existing lip corpus to build a lip data set that can be dynamically expanded and updated in real time.
2. The automatic generation method of the multi-domain lip language recognition sample based on the corpus relation according to claim 1, wherein the method is characterized by comprising the following steps of: the method comprises the following steps:
step 1: collecting and preprocessing corpus data and storing the corpus data as an initial corpus file;
according to the field keywords, a large number of published text files in the related field are collected through an automatic data collection pipeline, and after the collected text files are subjected to data preprocessing, corpus data of a required scale are obtained and stored as initial corpus files;
step 2: constructing a domain corpus based on the initial corpus file;
performing word segmentation, corpus labeling and word removal operation on text content of an initial corpus file based on a basic dictionary and a domain-specific dictionary, performing word frequency statistics on the segmented corpus, and reserving high-frequency words and corresponding sentences in the current text content to form a domain corpus;
step 3: performing voice synthesis on the domain corpus to generate a voice file;
a speech technology interface is called to perform voice synthesis on the field corpus constructed in step 2 and generate a voice file; when the text content of a single corpus file exceeds the specified length, the interface is called multiple times and the results are merged into one voice file; after the file is successfully synthesized, the interface is requested to return the start and end timestamps of each word and sentence of the text within the voice file, and these are stored;
step 4: generating a digital lip based on the voice file;
a standard digital face model is adopted as the basis of the generated images to eliminate interference from irrelevant factors, and the voice file generated in step 3 is converted into Lip animation based on the Wav2Lip synchronous model, thereby obtaining the required Lip video;
step 5: performing label annotation on the digital lip to form a lip database;
taking the digital lip video generated in the step 4 as a sample of a lip language database, taking the word frequency statistic value in the step 2 and the timestamp in the step 3 as auxiliary labels, combining the real text content to form a sample label, and realizing the mapping of the final digital lip and the label to form a lip language database;
step 6: incrementally updating the corpus and expanding the lip database;
for newly added text data, the operations of step 1 and step 2 are executed, and the corpus is re-extracted as required from the combination of the original corpus and the incremental corpus; for the newly generated corpus, steps 3 to 5 are repeated to expand the lip database.
3. The automatic generation method of the multi-domain lip language recognition sample based on the corpus relation according to claim 2, wherein the method is characterized by comprising the following steps of: the specific method for preprocessing the data of the collected text file in the step 1 is as follows:
the collected text files are truncated as required in descending order of file size, and text parsing and extraction are performed on each file; taking the page paragraph as the basic unit, duplicate content and irrelevant information are removed while keeping the text content consistent, paragraphs are split into sentences according to punctuation marks, and the sentences are stored as an initial corpus file.
4. The automatic generation method of multi-domain lip language recognition samples based on corpus relation according to claim 3, wherein the method is characterized by comprising the following steps: the specific method for word frequency statistics of the segmented corpus in the step 2 is as follows:
counting the word frequency information of the segmented words in all corpus files and in each single corpus file, and calculating the importance of a segmented word by the TF-IDF statistical method, represented by the following formulas:

TF-IDF = TF · IDF, where TF = n/N and IDF = log(D/d)

wherein the TF value represents the frequency of occurrence of a given word in a single file, i.e. the term frequency; the IDF value measures how rarely the given word appears across all files, i.e. the inverse document frequency; n represents the number of occurrences of the given word in a single corpus file, N represents the total number of occurrences of all words in that corpus file, D represents the total number of corpus files, and d represents the number of corpus files containing the given word.
5. The automatic generation method of the multi-domain lip language recognition sample based on the corpus relation according to claim 4, wherein the method is characterized by comprising the following steps of: the Wav2Lip synchronous model is implemented by combining PaddleGAN and Omniverse Audio2Face: the corresponding runtime environment is configured, the required library packages are imported, a standard digital face model and the voice file are fed into the model, and a face rotation angle is selected to generate the digital lip.
6. The automatic generation method of the multi-domain lip language recognition sample based on the corpus relation according to claim 5, wherein the method is characterized by comprising the following steps: the face rotation angle may be selected to be 0 °, 30 °, 45 °, 60 °, 75 °, 90 °.
7. The automatic generation method of the multi-domain lip language recognition sample based on the corpus relation according to claim 6, wherein the method is characterized by comprising the following steps: step 5, the sample label consists of text content, language type, sentence metric value and video duration; wherein the language category comprises a plurality of languages; the sentence metric is calculated by word frequency statistics of all words composing the sample, and provides importance metric for the sample; the video duration includes the total duration of a given lip video, as well as the start and end time stamps of all the tokens in the video.
8. An automatic generation system of multi-domain lip language recognition samples based on corpus relation, realized based on the method of claim 1, characterized in that: the system comprises an information acquisition module, a first corpus storage module, a second corpus storage module, a corpus extraction module, a lip-shaped sample generation module, a label annotation module, a third lip-shaped video storage module and a fourth lip-shaped video storage module:
the information acquisition module acquires massive text information in an acquisition range in various forms and processes the text information to obtain field corpus information; the collection range comprises a currently existing public lip language data set and text files related to field keywords;
the first corpus storage module is used for storing corpus information of the existing public lip language data set and the original lip video acquired by the information acquisition module;
the second corpus storage module is used for storing the domain corpus information acquired by the information acquisition module;
the corpus extraction module is used for performing word segmentation and word frequency calculation on the domain corpus information in the second corpus storage module, and updating the second corpus storage module after obtaining a domain corpus;
the lip sample generation module is used for generating a digital lip sample file for the corpus information stored in the first corpus storage module and the domain corpus stored in the second corpus storage module;
the label annotation module is used for carrying out label annotation on the generated digital lip sample file to form a field lip database and storing the field lip database into the fourth lip video storage module;
the third lip-shaped video storage module is used for storing the digital lip samples generated by the sample generation module from the corpus information of the first corpus storage module, and combining them with the corresponding original lip-shaped videos to obtain a new lip-shaped database;
the fourth lip-shaped video storage module is used for storing a field lip-shaped database obtained after the field corpus information of the second corpus storage module is processed by the sample generation module and the tag annotation module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310295664.4A CN116486812A (en) | 2023-03-24 | 2023-03-24 | Automatic generation method and system for multi-field lip language recognition sample based on corpus relation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116486812A true CN116486812A (en) | 2023-07-25 |
Family
ID=87218581
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310295664.4A Pending CN116486812A (en) | 2023-03-24 | 2023-03-24 | Automatic generation method and system for multi-field lip language recognition sample based on corpus relation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116486812A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117292030A (en) * | 2023-10-27 | 2023-12-26 | 海看网络科技(山东)股份有限公司 | Method and system for generating three-dimensional digital human animation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||