CN113035199A - Audio processing method, device, equipment and readable storage medium - Google Patents

Audio processing method, device, equipment and readable storage medium

Info

Publication number
CN113035199A
Authority
CN
China
Prior art keywords
audio
sign language
target
text
language gesture
Prior art date
Legal status
Pending
Application number
CN202110139349.3A
Other languages
Chinese (zh)
Inventor
田园
Current Assignee
Shenzhen Skyworth RGB Electronics Co Ltd
Original Assignee
Shenzhen Skyworth RGB Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Skyworth RGB Electronics Co Ltd filed Critical Shenzhen Skyworth RGB Electronics Co Ltd
Priority to CN202110139349.3A priority Critical patent/CN113035199A/en
Publication of CN113035199A publication Critical patent/CN113035199A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/1815: Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning

Abstract

The invention discloses an audio processing method, an audio processing apparatus, audio processing equipment and a readable storage medium. The method comprises the following steps: acquiring audio to be processed, and converting the audio to be processed into a target text; performing feature extraction on the target text to obtain text feature data corresponding to the target text; and acquiring a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model, and displaying the target sign language gesture image. The audio is thus converted into a corresponding sign language gesture image, which diversifies the modes of information transmission and thereby improves the user experience.

Description

Audio processing method, device, equipment and readable storage medium
Technical Field
The present invention relates to the field of audio processing technologies, and in particular, to an audio processing method, apparatus, device, and readable storage medium.
Background
Information itself is intangible; for it to be understood and accepted by people, it must be presented in some form. For example, when a television terminal plays a drama, a video or news, the information is usually conveyed by video combined with audio or text, so the mode of information transmission is very limited.
However, according to recent research data, the number of people with hearing impairment in China has reached 220 million, of whom more than 70 million have hearing loss. Most current playback terminals support only audio output when playing video; for example, most live news programs of mainstream media are simulcast with synchronized subtitles but have no sign language presenter. Because this mode of information transmission is so limited, hearing-impaired viewers cannot understand the news content when watching live news programs, which affects their viewing experience.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide an audio processing method, an audio processing apparatus, audio processing equipment and a readable storage medium, and aims to solve the technical problem that the existing mode of information transmission is too limited, which degrades the user experience.
In order to achieve the above object, the present invention provides an audio processing method, including the steps of:
acquiring audio to be processed, and converting the audio to be processed into a target text;
performing feature extraction on the target text to obtain text feature data corresponding to the target text;
and acquiring a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model, and displaying the target sign language gesture image.
Preferably, the step of converting the audio to be processed into a target text comprises:
extracting voice from the audio to be processed to obtain a target voice audio in the audio to be processed;
and performing semantic recognition on the target voice audio to obtain a target text.
Preferably, the step of extracting the voice of the audio to be processed to obtain the target voice audio in the audio to be processed includes:
acquiring the audio characteristics of the audio to be processed;
inputting the audio features into a preset audio separation model so as to obtain an audio feature separation result corresponding to the audio features through the preset audio separation model, wherein the audio feature separation result comprises target human voice audio features;
and acquiring the target human voice audio in the audio to be processed based on the target human voice audio features.
Preferably, before the step of obtaining the sign language gesture image corresponding to the text feature data through the preset sign language gesture conversion model, the method further includes:
acquiring an initial model and a plurality of text training data;
determining a sign language gesture prediction result corresponding to the text training data through the initial model;
acquiring a sign language gesture real result corresponding to the text training data, and determining a loss function based on the sign language gesture prediction result and the sign language gesture real result;
updating the model parameters of the initial model by gradient descent, and taking as the final model parameters the model parameters obtained when the loss function converges or the number of training rounds reaches a preset number of training iterations;
and determining a preset sign language gesture conversion model according to the final model parameters.
Preferably, after the step of converting the audio to be processed into the target text, the method further includes:
performing word segmentation processing on the target text to obtain multiple groups of text word segments corresponding to the target text;
performing semantic recognition on each group of text participles respectively to obtain semantic recognition results corresponding to each group of text participles;
traversing a preset sign language gesture text word bank based on the semantic recognition result to obtain a target sign language gesture text matched with the semantic recognition result in the preset sign language gesture text word bank;
and acquiring multi-frame sign language gesture images corresponding to the multiple groups of text participles based on the target sign language gesture text, and displaying the multi-frame sign language gesture images.
Preferably, the step of displaying the plurality of frames of sign language gesture images comprises:
determining the position information of each group of text participles corresponding to each frame of the sign language gesture image in the target text;
sorting the multi-frame sign language gesture images based on the position information to obtain a sorting result;
and converting the multi-frame sign language gesture images into target sign language gesture images according to the sorting result, and displaying the target sign language gesture images.
Preferably, the step of displaying the target sign language gesture image comprises:
determining an audio frame timestamp of the audio to be processed;
determining an image frame timestamp of the target sign language gesture image based on the audio frame timestamp to display the target sign language gesture image based on the image frame timestamp.
Further, to achieve the above object, the present invention also provides an audio processing apparatus, including:
the acquisition module is used for acquiring audio to be processed and converting the audio to be processed into a target text;
the extraction module is used for extracting the features of the target text to obtain text feature data corresponding to the target text;
and the output module is used for acquiring a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model and displaying the target sign language gesture image.
Further, to achieve the above object, the present invention also provides an audio processing device, which includes a memory, a processor, and an audio processing program stored on the memory and executable on the processor, wherein the audio processing program, when executed by the processor, implements the steps of the audio processing method as described above.
Further, to achieve the above object, the present invention also provides a readable storage medium, on which an audio processing program is stored, the audio processing program implementing the steps of the audio processing method as described above when executed by a processor.
Compared with the existing terminal video playing mode, the method and the device of the present invention acquire the audio to be processed and convert it into a target text; perform feature extraction on the target text to obtain text feature data corresponding to the target text; and obtain a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model and display it. The audio is thus converted into a corresponding sign language gesture image, which diversifies the modes of information transmission and thereby improves the user experience.
Drawings
FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the audio processing apparatus of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of an audio processing method according to the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of an audio processing method according to the present invention;
FIG. 4 is a flowchart illustrating an audio processing method according to a third embodiment of the present invention;
FIG. 5 is a block diagram of an audio processing apparatus according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides an audio processing device, and referring to fig. 1, fig. 1 is a schematic structural diagram of a hardware operating environment according to an embodiment of the audio processing device of the invention.
As shown in fig. 1, the audio processing apparatus may include: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may optionally be a storage device separate from the aforementioned processor 1001.
Those skilled in the art will appreciate that the hardware configuration of the audio processing device shown in fig. 1 does not constitute a limitation of the audio processing device; it may include more or fewer components than those shown, combine some components, or arrange the components differently.
As shown in fig. 1, a memory 1005, which is a readable storage medium, may include therein an operating system, a network communication module, a user interface module, and an audio processing program. The operating system is a program for managing and controlling hardware and software resources of the audio processing device and supports the running of a network communication module, a user interface module, an audio processing program and other programs or software; the network communication module is used to manage and control the network interface 1004; the user interface module is used to manage and control the user interface 1003.
In the hardware structure of the audio processing device shown in fig. 1, the network interface 1004 is mainly used for connecting to a background server and performing data communication with the background server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; the processor 1001 may call an audio processing program stored in the memory 1005 and perform the following operations:
acquiring audio to be processed, and converting the audio to be processed into a target text;
performing feature extraction on the target text to obtain text feature data corresponding to the target text;
and acquiring a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model, and displaying the target sign language gesture image.
Further, the processor 1001 may call an audio processing program stored in the memory 1005 and perform the following operations:
extracting voice from the audio to be processed to obtain a target voice audio in the audio to be processed;
and performing semantic recognition on the target voice audio to obtain a target text.
Further, the processor 1001 may call an audio processing program stored in the memory 1005 and perform the following operations:
acquiring the audio characteristics of the audio to be processed;
inputting the audio features into a preset audio separation model so as to obtain an audio feature separation result corresponding to the audio features through the preset audio separation model, wherein the audio feature separation result comprises target human voice audio features;
and acquiring the target human voice audio in the audio to be processed based on the target human voice audio features.
Further, the processor 1001 may call an audio processing program stored in the memory 1005 and perform the following operations:
acquiring an initial model and a plurality of text training data;
determining a sign language gesture prediction result corresponding to the text training data through the initial model;
acquiring a sign language gesture real result corresponding to the text training data, and determining a loss function based on the sign language gesture prediction result and the sign language gesture real result;
updating the model parameters of the initial model by gradient descent, and taking as the final model parameters the model parameters obtained when the loss function converges or the number of training rounds reaches a preset number of training iterations;
and determining a preset sign language gesture conversion model according to the final model parameters.
Further, the processor 1001 may call an audio processing program stored in the memory 1005 and perform the following operations:
performing word segmentation processing on the target text to obtain multiple groups of text word segments corresponding to the target text;
performing semantic recognition on each group of text participles respectively to obtain semantic recognition results corresponding to each group of text participles;
traversing a preset sign language gesture text word bank based on the semantic recognition result to obtain a target sign language gesture text matched with the semantic recognition result in the preset sign language gesture text word bank;
and acquiring multi-frame sign language gesture images corresponding to the multiple groups of text participles based on the target sign language gesture text, and displaying the multi-frame sign language gesture images.
Further, the processor 1001 may call an audio processing program stored in the memory 1005 and perform the following operations:
determining the position information of each group of text participles corresponding to each frame of the sign language gesture image in the target text;
sorting the multi-frame sign language gesture images based on the position information to obtain a sorting result;
and converting the multi-frame sign language gesture images into target sign language gesture images according to the sorting result, and displaying the target sign language gesture images.
Further, the processor 1001 may call an audio processing program stored in the memory 1005 and perform the following operations:
determining an audio frame timestamp of the audio to be processed;
determining an image frame timestamp of the target sign language gesture image based on the audio frame timestamp to display the target sign language gesture image based on the image frame timestamp.
The invention also provides an audio processing method.
Referring to fig. 2, fig. 2 is a flowchart illustrating an audio processing method according to a first embodiment of the invention.
While a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in an order different than that shown or described herein. Specifically, the audio processing method of the embodiment includes:
step S10, acquiring audio to be processed, and converting the audio to be processed into a target text;
it should be noted that the execution main body of this embodiment may be an intelligent terminal device having an audio processing function and a display function, where the intelligent terminal device may be an electronic device such as a computer or a mobile phone, or may also be another device capable of implementing the same or similar functions.
It should be understood that information itself is intangible; for it to be understood and accepted, it must be presented in some form. For example, when current television terminals play videos such as live news programs, those programs are mostly simulcast with synchronized subtitles but without a sign language presenter, so viewers with hearing impairment cannot understand the news content, which degrades their viewing experience.
In this embodiment, the audio to be processed may be recorded audio, such as audio data in a documentary, a movie, a television play, and the like, or may also be real-time audio, such as real-time audio data in a news live program or a video call, and the embodiment does not limit this.
It should be understood that, when the audio to be processed is real-time audio, the original audio data in the live video may be obtained and analyzed in real time. For example, feature extraction may be performed on the original audio data to separate the target human voice audio data from the non-human voice audio data (such as background music data and noise data). The target human voice audio data is then monitored for interruptions: if, during detection, human voice audio data is detected from the 10th to the 120th second and then detected again at the 130th second, it is determined that an interruption exists, and an interruption time is determined (for example, the 120th second, the 130th second, or any second in between). Based on the interruption time, the target human voice audio data is divided into a first segment and a second segment of human voice audio data, where the first segment refers to the human voice audio data before the interruption and the second segment refers to the human voice audio data after the interruption. The first segment is then taken as the audio to be processed and is processed, while the second segment is taken as the new target human voice audio data, and the above steps are repeated until all the target human voice audio data has been processed.
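For illustration only (the embodiment itself prescribes no code), the following Python sketch shows one way the interruption-based segmentation described above could be implemented, assuming a hypothetical per-second voice-activity mask as input:

```python
import numpy as np

def split_on_interruption(voice_mask, min_gap_s=5):
    """Split a per-second voice-activity mask into voice segments that are
    separated by silent gaps (interruptions) of at least min_gap_s seconds.
    Returns a list of (start_second, end_second) pairs."""
    segments, start, gap = [], None, 0
    for t, active in enumerate(voice_mask):
        if active:
            if start is None:
                start = t            # a new voice segment begins
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap_s:     # interruption confirmed: close the segment
                segments.append((start, t - gap + 1))
                start, gap = None, 0
    if start is not None:
        segments.append((start, len(voice_mask)))
    return segments

# Example from the description: voice during seconds 10-120, then again at 130.
mask = np.zeros(200, dtype=bool)
mask[10:121] = True
mask[130:] = True
print(split_on_interruption(mask))   # [(10, 121), (130, 200)]
```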
When the audio to be processed is pre-recorded audio, the complete recording may optionally be taken as the audio to be processed, or the recording may be split into several segments, each of which is taken as the audio to be processed; this embodiment does not limit this.
It should be understood that, in order to improve the accuracy of the audio processing result, the audio to be processed needs to be denoised or similarly preprocessed before it is converted into the target text, so as to remove the influence of noise on the audio processing result.
In addition, for convenience of understanding, the present embodiment specifically describes the above-mentioned implementation of converting the audio to be processed into the target text:
extracting voice from the audio to be processed to obtain a target voice audio in the audio to be processed;
and performing semantic recognition on the target voice audio to obtain a target text matched with the target voice audio.
It should be understood that, in general, the audio to be processed is mixed audio, that is, the audio includes not only a human voice audio portion but also other environmental audio portions such as a background music audio portion, and therefore, in order to improve the accuracy of the audio processing result, in this embodiment, human voice extraction is performed on the audio to be processed to obtain a target human voice audio in the audio to be processed, so as to obtain a target text according to the target human voice audio.
Preferably, in this embodiment, a pre-trained audio separation model is adopted to extract human voice from the audio to be processed, specifically, feature extraction is performed on the audio to be processed to obtain a feature vector, the feature vector is input to the pre-trained audio separation model, the human voice audio feature vector is output through the pre-trained audio separation model, and a target human voice audio is extracted from the audio to be processed through the human voice audio feature vector.
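As a rough illustration of this step, the sketch below (Python with NumPy; the mask-based model interface is an assumption, not part of the embodiment) applies a separation model's soft vocal mask to a magnitude spectrogram:

```python
import numpy as np

def extract_vocals(mixture_spec, separation_model):
    """Run a (pre-trained) separation model over a magnitude spectrogram
    and keep only the vocal component via the model's soft mask."""
    vocal_mask = separation_model(mixture_spec)   # values assumed in [0, 1]
    return mixture_spec * vocal_mask

# Dummy stand-ins so the sketch runs without a real trained model.
rng = np.random.default_rng(0)
spectrogram = np.abs(rng.normal(size=(513, 100)))          # (freq, time)
dummy_model = lambda s: (s > s.mean()).astype(float)       # placeholder mask
vocals_only = extract_vocals(spectrogram, dummy_model)
```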
Preferably, in another embodiment, when the audio to be processed is a recorded audio that is recorded in advance, it can be understood that the recorded audio is a synthesized audio, that is, the human voice audio and the background music audio are subjected to channel addition, so that the audio to be processed can be decomposed into multiple sub-audios corresponding to multiple different channels based on the channel characteristics, a sub-audio corresponding to the human voice channel is screened out from the multiple different channels based on the channel identification, and the sub-audio corresponding to the human voice channel is taken as the target human voice audio.
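A minimal sketch of this channel-based alternative, assuming Python with the soundfile library and a known voice-channel index (the "channel identification" itself is not specified by the embodiment):

```python
import soundfile as sf  # assumed audio reader; any audio I/O library works

def pick_voice_channel(path, voice_channel_index=0):
    """Treat each channel of a multichannel recording as one sub-audio and
    return the channel assumed (via external metadata) to carry the voice."""
    data, sr = sf.read(path)             # data shape: (samples, channels)
    if data.ndim == 1:
        return data, sr                  # already mono
    return data[:, voice_channel_index], sr
```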
Specifically, in order to improve the accuracy of the human voice extraction result, this embodiment uses a model to extract the human voice from the audio to be processed. For ease of understanding, this embodiment specifically describes an implementation of extracting the human voice from the audio to be processed to obtain the target human voice audio in the audio to be processed, wherein:
acquiring the audio characteristics of the audio to be processed;
inputting the audio features into a preset audio separation model so as to obtain an audio feature separation result corresponding to the audio features through the preset audio separation model, wherein the audio feature separation result comprises target human voice audio features;
and acquiring the target human voice audio in the audio to be processed based on the target human voice audio features.
It should be understood that the preset audio separation model refers to a model trained in advance on a number of audio training samples. In this embodiment, it is preferable to perform pre-denoising, pre-emphasis, framing, windowing, fast Fourier transform and other processing on the audio to be processed to obtain Mel cepstral coefficients, and to use the Mel cepstral coefficients as the audio features of the audio to be processed.
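For illustration, the Mel cepstral feature extraction described above could look like the following sketch, assuming Python with the librosa library (librosa performs the framing, windowing, FFT and Mel filtering internally; the embodiment names no tool):

```python
import librosa

def audio_features(path, n_mfcc=13):
    """Compute Mel cepstral coefficients as the audio features."""
    y, sr = librosa.load(path, sr=16000)      # load and resample
    y, _ = librosa.effects.trim(y)            # crude stand-in for pre-denoising
    y = librosa.effects.preemphasis(y)        # pre-emphasis step
    # framing, windowing, FFT and Mel filtering happen inside mfcc()
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
```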
After the target human voice audio is obtained, phonetic feature extraction is performed on it to determine its phonetic attribute, such as Chinese phonetic symbols or English phonetic symbols. If the phonetic attribute corresponding to the target human voice audio is detected to be Chinese phonetic symbols, the target human voice audio is divided into several groups of phonetic symbols based on a preset Chinese phonetic symbol division rule, and at least one candidate word matching each group of audio is determined based on a preset Chinese word library. The candidate words matching each group of audio are then combined one by one to obtain several groups of texts and the context fit weight corresponding to each group, and the text corresponding to the maximum context fit weight is selected as the target text. Alternatively, a preset audio recognition engine may be called to recognize the target human voice audio and obtain the target text; using a preset audio recognition engine speeds up recognition.
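A toy Python sketch of the candidate-combination step, where the candidate lists and the context fit scorer are placeholders for the preset Chinese word library and language model assumed above:

```python
from itertools import product

def best_transcript(candidates_per_group, context_fit):
    """candidates_per_group: one candidate-word list per phonetic group;
    context_fit: scores a full word sequence (the 'context fit weight').
    Returns the combination with the maximum score."""
    return max(product(*candidates_per_group), key=context_fit)

# Toy data; a real scorer would come from a language model or word library.
candidates = [["今天", "金天"], ["天气", "田七"], ["很好", "很耗"]]
score = lambda seq: sum(w in ("今天", "天气", "很好") for w in seq)
print("".join(best_transcript(candidates, score)))  # -> 今天天气很好
```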
In addition, in another embodiment, when the phonetic attribute corresponding to the target human voice audio is a non-Chinese phonetic symbol, such as an English phonetic symbol, it should be understood that, in order to improve the accuracy of audio interpretation, the phonetic attribute should first be converted into Chinese phonetic symbols, and the target human voice audio should then be divided into several groups of phonetic symbols based on the preset Chinese phonetic symbol division rule for character recognition, thereby improving the accuracy of the character recognition result.
Step S20, extracting the features of the target text to obtain text feature data corresponding to the target text;
and step S30, acquiring a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model, and displaying the target sign language gesture image.
Specifically, the text feature data includes the keywords in the target text, the attributes of each word in the target text, such as personal pronouns, particles, adjectives and verbs, and feature data characterizing the meaning of each word in the target text.
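For illustration, a sketch of extracting such word-attribute features, assuming Python with the jieba part-of-speech tagger (any tagger would do; the embodiment does not name one):

```python
import jieba.posseg as pseg  # assumed Chinese POS tagger; any tagger works

def text_features(target_text):
    """Return each word of the target text with its part-of-speech tag
    (pronoun, verb, adjective, particle and so on)."""
    return [(p.word, p.flag) for p in pseg.cut(target_text)]

print(text_features("我今天要去上班"))
# e.g. [('我', 'r'), ('今天', 't'), ('要', 'v'), ('去', 'v'), ('上班', 'v')]
```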
In this embodiment, a preset sign language gesture conversion model is used to determine the target sign language gesture image corresponding to the target text. In this step, the text feature data is input into the preset sign language gesture conversion model, and the model outputs the corresponding target sign language gesture image.
In addition, it should be noted that, since the target text is composed of several words or phrases and each word or phrase is generally mapped to a single frame of sign language gesture image, the target sign language gesture image in this embodiment is composed of multiple frames of sign language gesture images. In some embodiments, when the multiple frames are displayed after being obtained, the sign language gesture images may become unsynchronized or mismatched with the audio content, which hurts the user experience. Therefore, in this embodiment, to ensure that the sign language gesture images are synchronized with the audio content in the video, an implementation is provided for displaying the target sign language gesture image, specifically:
determining display sequence information of each frame of sign language gesture image in the target sign language gesture image;
and displaying the target sign language gesture image according to the display sequence information.
Specifically, the display order information refers to the left-to-right ordering, within the target text, of the word or phrase corresponding to each frame of sign language gesture image. For example, if 'I want to go to work today' is segmented into the text participles 'I', 'today', 'go' and 'go to work', the sign language gesture image corresponding to 'I' is arranged first, the image corresponding to 'today' second, and so on. Based on the display order information, the multiple frames of sign language gesture images are converted into a coherent, dynamic sign language gesture for display and output.
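A minimal Python sketch of this ordering rule, with placeholder image names standing in for the real gesture frames:

```python
def order_gesture_frames(gesture_by_word, segmented_text):
    """Arrange one gesture image per participle in the participle's
    left-to-right position within the target text."""
    return [gesture_by_word[w] for w in segmented_text if w in gesture_by_word]

participles = ["I", "today", "go", "go to work"]          # segmented text
images = {w: f"gesture_{i}.png" for i, w in enumerate(participles)}
print(order_gesture_frames(images, participles))
# ['gesture_0.png', 'gesture_1.png', 'gesture_2.png', 'gesture_3.png']
```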
In addition, when outputting the sign language gesture images, care should be taken to keep the display time of the output images synchronized with the playing time of the audio to be processed. Preferably, when the audio to be processed is pre-recorded audio, the audio frame timestamp of the audio data corresponding to each frame of sign language gesture image in the audio to be processed is determined, and the image frame timestamp of each frame in the target sign language gesture image is determined from that audio frame timestamp. Preferably, when the audio to be processed is real-time audio, an audio stream of a certain duration is cached in advance, the cached stream is then processed, and the resulting sign language gesture images are output in real time.
In addition, in an embodiment, the step of displaying the dynamic sign language gesture includes:
determining an audio frame timestamp of the audio to be processed;
determining an image frame timestamp of the target sign language gesture image based on the audio frame timestamp to display the target sign language gesture image based on the image frame timestamp.
Specifically, the audio frame timestamp refers to playing time information of each audio stream of the audio to be processed, and the image frame timestamp refers to display time information of each sign language gesture image in the target sign language gesture image.
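For illustration, a small Python sketch that derives image frame timestamps from audio frame timestamps; the fixed buffer length models the pre-cached audio stream of the real-time case and is an assumed value:

```python
def schedule_gestures(audio_frame_ts, buffer_s=2.0):
    """Give each gesture frame the timestamp of its matching audio frame,
    shifted by the assumed length of the pre-cached real-time buffer."""
    return {frame: ts + buffer_s for frame, ts in audio_frame_ts.items()}

audio_ts = {0: 0.0, 1: 0.8, 2: 1.5}      # play times of matched audio frames
print(schedule_gestures(audio_ts))        # {0: 2.0, 1: 2.8, 2: 3.5}
```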
In addition, when the audio to be processed has a corresponding video stream, the audio to be processed and the target sign language gesture image are output and displayed synchronously on the target terminal.
It should be understood that the target terminal refers to the playing terminal corresponding to the video to be processed. Optionally, the target sign language gesture image is displayed next to the area of the display screen in which the video stream corresponding to the audio to be processed is shown, for example at the lower left, lower right, upper left or upper right of that area; alternatively, it may be displayed within that area. This embodiment does not limit this.
In this embodiment, the display time of each frame of the target sign language gesture image is determined from the playing time of each frame of the audio stream of the audio to be processed, which ensures that the content of the audio to be processed and the content of the target sign language gesture image are displayed synchronously. In addition, the display position of the video stream corresponding to the audio to be processed and the display position of the target sign language gesture image on the display screen of the target terminal are determined, so that the display frames remain compatible while the modes of information transmission are diversified, further improving the user experience.
Compared with the existing terminal video playing mode, this embodiment acquires the audio to be processed and converts it into a target text; performs feature extraction on the target text to obtain text feature data corresponding to the target text; and obtains a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model and displays it. The audio is thus converted into a corresponding sign language gesture image, which diversifies the modes of information transmission and thereby improves the user experience.
Further, based on the first embodiment of the audio processing method of the present invention, a second embodiment of the audio processing method of the present invention is proposed.
Referring to fig. 3, fig. 3 is a flowchart illustrating an audio processing method according to a second embodiment of the invention.
The second embodiment of the audio processing method is different from the first embodiment of the audio processing method in that, before the step of obtaining the multi-frame sign language gesture images corresponding to the multiple groups of text participles through the preset sign language gesture conversion model, the method further includes:
step S301, acquiring an initial model and a plurality of text training data;
step S302, determining a sign language gesture prediction result corresponding to the text training data through the initial model;
step S303, obtaining a sign language gesture real result corresponding to the text training data, and determining a loss function based on the sign language gesture prediction result and the sign language gesture real result;
step S304, updating the model parameters of the initial model by gradient descent, and taking as the final model parameters the model parameters obtained when the loss function converges or the number of training rounds reaches a preset number of training iterations;
and S305, determining a preset sign language and gesture conversion model according to the final model parameters.
Specifically, the text training data refers to text data labeled with corresponding sign language gesture features. Optionally, in this embodiment, an initial model is constructed based on a deep learning network or a convolutional neural network; the text features of each piece of text training data are extracted and used as input values to the initial model, and the initial model outputs a sign language gesture prediction result. Optionally, the prediction result is a sign language gesture image, or a feature vector corresponding to one. For example, when the prediction result is a sign language gesture prediction feature vector, the true sign language gesture feature vector of the real result corresponding to the text training data is extracted, the offset between the predicted and true feature vectors is calculated for each piece of text training data, and the loss function of the initial model is determined from these offsets. Finally, the model parameters of the initial model are updated by gradient descent, the model parameters obtained when the loss function converges or the preset number of training iterations is reached are taken as the final model parameters, and the preset sign language gesture conversion model is obtained, through which the sign language gesture images corresponding to each group of text participles can then be acquired.
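A condensed PyTorch sketch of this training procedure; the network shape, data and hyper-parameters are illustrative assumptions, not values from the embodiment:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # gradient descent
loss_fn = nn.MSELoss()          # offset between predicted and true vectors

text_feats = torch.randn(32, 128)       # stand-in text training features
true_gestures = torch.randn(32, 64)     # stand-in real gesture feature vectors

max_rounds, tol, prev_loss = 100, 1e-4, float("inf")
for round_ in range(max_rounds):        # stop on preset iteration count...
    optimizer.zero_grad()
    loss = loss_fn(model(text_feats), true_gestures)
    loss.backward()
    optimizer.step()
    if abs(prev_loss - loss.item()) < tol:   # ...or when the loss converges
        break
    prev_loss = loss.item()
```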
In this embodiment, a preset sign language gesture conversion model is established in advance so that the target sign language gesture image corresponding to the text feature data can be obtained through it, which increases the speed of converting text into sign language gesture images and further improves the user experience.
Further, based on the first embodiment of the audio processing method of the present invention, a third embodiment of the audio processing method of the present invention is proposed.
Referring to fig. 4, fig. 4 is a flowchart illustrating an audio processing method according to a third embodiment of the invention.
The third embodiment of the audio processing method is different from the first embodiment in that, after the step of converting the audio to be processed into the target text, the method further includes:
step S40, performing word segmentation processing on the target text to obtain multiple groups of text word segmentation corresponding to the target text;
step S50, semantic recognition is respectively carried out on each group of text participles to obtain semantic recognition results corresponding to each group of text participles;
step S60, traversing a preset sign language gesture text word bank based on the semantic recognition result to obtain a target sign language gesture text matched with the semantic recognition result in the preset sign language gesture text word bank;
and step S70, acquiring multiframe sign language gesture images corresponding to the multiple groups of text participles based on the target sign language gesture text, and displaying the multiframe sign language gesture images.
It should be understood that a specific word may be represented by a specific sign language gesture, so before sign language gesture conversion is performed on the target text, the target text may be split into multiple groups of text participles. For example, stop words (e.g., 'yes', 'get', 'place') or transition words ('though', 'but', 'however', etc.) in the target text may be located first. Alternatively, the target text may be split based on a preset dictionary: the target text is first divided into several candidate groups of participles, each group is checked against the preset dictionary, and if at least one group does not exist in the dictionary, the target text is re-divided around that group until every group exists in the dictionary. The target text may also be segmented with a pre-trained word segmentation model; this embodiment does not limit this.
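For illustration, a sketch of the dictionary-based splitting, assuming Python with the jieba segmenter standing in for the preset dictionary and word segmentation model:

```python
import jieba  # assumed Chinese segmenter; the embodiment names no tool

def segment(target_text, user_dict_path=None):
    """Split the target text into participle groups; user_dict_path points
    to an optional file of extra terms (the 'preset dictionary')."""
    if user_dict_path:
        jieba.load_userdict(user_dict_path)
    return jieba.lcut(target_text)

print(segment("我今天要去上班"))  # e.g. ['我', '今天', '要', '去', '上班']
```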
In addition, it should be understood that, in order to avoid inaccurate sign language gesture images when the model effect of the preset sign language gesture conversion model is poor, this embodiment uses a preset sign language gesture text lexicon to obtain the sign language gesture images matching each group of text participles.
Specifically, the semantic recognition result refers to word parameters such as part of speech and word meaning. After the semantic recognition result corresponding to each group of text participles is obtained, the preset sign language gesture text lexicon is traversed to obtain, from the lexicon, the target sign language gesture texts whose word parameters (part of speech, word meaning and the like) are identical or highly similar to those of the text participles, for example 'me' and 'I', 'good' and 'not bad', or 'go' and 'leave'. The sign language gesture image corresponding to the target sign language gesture text matched with each group of text participles in the lexicon is then determined.
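A toy Python sketch of the lexicon traversal; string similarity via difflib stands in for the part-of-speech and word-meaning comparison, and the lexicon contents are placeholders:

```python
import difflib

def match_lexicon_entry(word, lexicon, threshold=0.5):
    """Traverse the gesture-text lexicon and return the entry identical to,
    or most similar to, the input word (string similarity stands in for the
    word-sense comparison a real system would use)."""
    if word in lexicon:
        return word
    score, entry = max(
        (difflib.SequenceMatcher(None, word, e).ratio(), e) for e in lexicon)
    return entry if score >= threshold else None

lexicon = {"me": "me.png", "good": "good.png", "go": "go.png"}
print(match_lexicon_entry("goood", lexicon))  # nearest entry: 'good'
```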
Furthermore, in an embodiment, in order to ensure that the sign language gesture images are synchronized with the audio content, an implementation is provided when displaying the plurality of frames of sign language gesture images, in particular:
determining the position information of each group of text participles corresponding to each frame of the sign language gesture image in the target text;
sorting the multi-frame sign language gesture images based on the position information to obtain a sorting result;
and converting the multi-frame sign language gesture images into target sign language gesture images according to the sorting result, and displaying the target sign language gesture images.
Specifically, the position information refers to the left-to-right ordering of each group of text participles within the target text. For example, if 'I want to go to work today' is segmented into the text participles 'I', 'today', 'go' and 'go to work', the sign language gesture image corresponding to 'I' is arranged first, the image corresponding to 'today' second, and so on; based on the sorting result, the multiple frames of sign language gesture images are converted into a coherent, dynamic sign language gesture.
In addition, when outputting the sign language gesture images, care should be taken to keep the display time of the output images synchronized with the audio playing time in the video to be processed. Preferably, when the video to be processed is a pre-recorded video, the audio frame timestamp of the audio data corresponding to each frame of sign language gesture image in the video to be processed is determined, and the output timestamp of each frame of sign language gesture image is determined from that audio frame timestamp. Preferably, when the audio to be processed is real-time audio, a video stream of a certain duration is buffered in advance, its audio is then processed, and the resulting sign language gesture images are output in real time.
In addition, in an embodiment, the step of displaying the dynamic sign language gesture includes:
determining an audio frame timestamp of the audio to be processed;
determining an image frame timestamp of the target sign language gesture image based on the audio frame timestamp to display the target sign language gesture image based on the image frame timestamp.
Specifically, the audio frame timestamp refers to playing time information of each audio stream of the audio to be processed, and the image frame timestamp refers to display time information of each sign language gesture image in the target sign language gesture image.
In addition, when the audio to be processed has a corresponding video stream, the audio to be processed and the target sign language gesture image are output and displayed synchronously on the target terminal.
It should be understood that the target terminal refers to the playing terminal corresponding to the video to be processed. Optionally, the target sign language gesture image is displayed next to the area of the display screen in which the video stream corresponding to the audio to be processed is shown, for example at the lower left, lower right, upper left or upper right of that area; alternatively, it may be displayed within that area. This embodiment does not limit this.
In this embodiment, the display time of each frame of the target sign language gesture image is determined from the playing time of each frame of the audio stream of the audio to be processed, which ensures that the content of the audio to be processed and the content of the target sign language gesture image are displayed synchronously. In addition, the display position of the video stream corresponding to the audio to be processed and the display position of the target sign language gesture image on the display screen of the target terminal are determined, so that the display frames remain compatible while the modes of information transmission are diversified, further improving the user experience.
In this embodiment, semantic recognition is performed on each group of text participles to obtain the semantic recognition result corresponding to each group; a preset sign language gesture text lexicon is traversed based on the semantic recognition results to obtain the target sign language gesture texts in the lexicon matching those results; and the multiple frames of sign language gesture images corresponding to the multiple groups of text participles are obtained based on the target sign language gesture texts and displayed. This avoids the technical problem of inaccurate sign language gesture images when the model effect of the preset sign language gesture conversion model is poor.
The invention also provides an audio processing device. Referring to fig. 5, the audio processing apparatus includes:
the acquisition module 10 is configured to acquire a to-be-processed audio and convert the to-be-processed audio into a target text;
an extraction module 20, configured to perform feature extraction on the target text to obtain text feature data corresponding to the target text;
and the output module 30 is configured to obtain a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model, and display the target sign language gesture image.
In addition, the embodiment of the invention also provides a readable storage medium.
The readable storage medium has stored thereon an audio processing program which, when executed by the processor, implements the steps of the audio processing method as described above.
The readable storage medium of the present invention may be a computer readable storage medium, and the specific implementation manner of the readable storage medium of the present invention is substantially the same as that of each embodiment of the audio processing method, which is not described herein again.
The present invention is described above in connection with the accompanying drawings, but the invention is not limited to the above embodiments, which are only illustrative and not restrictive. Those skilled in the art can make various changes without departing from the spirit and scope of the invention as defined by the appended claims, and all changes that come within the meaning and range of equivalency of the specification, drawings and claims are intended to be embraced therein.

Claims (10)

1. An audio processing method, characterized in that the audio processing method comprises the steps of:
acquiring audio to be processed, and converting the audio to be processed into a target text;
performing feature extraction on the target text to obtain text feature data corresponding to the target text;
and acquiring a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model, and displaying the target sign language gesture image.
2. The audio processing method of claim 1, wherein the step of converting the audio to be processed into a target text comprises:
extracting voice from the audio to be processed to obtain a target voice audio in the audio to be processed;
and performing semantic recognition on the target voice audio to obtain a target text.
3. The audio processing method according to claim 2, wherein the step of performing human voice extraction on the audio to be processed to obtain a target human voice audio in the audio to be processed comprises:
acquiring the audio characteristics of the audio to be processed;
inputting the audio features into a preset audio separation model so as to obtain an audio feature separation result corresponding to the audio features through the preset audio separation model, wherein the audio feature separation result comprises target human voice audio features;
and acquiring the target human voice audio in the audio to be processed based on the target human voice audio features.
4. The audio processing method according to claim 1, wherein before the step of obtaining the sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model, the method further comprises:
acquiring an initial model and a plurality of text training data;
determining a sign language gesture prediction result corresponding to the text training data through the initial model;
acquiring a sign language gesture real result corresponding to the text training data, and determining a loss function based on the sign language gesture prediction result and the sign language gesture real result;
updating the model parameters of the initial model by gradient descent, and taking as the final model parameters the model parameters obtained when the loss function converges or the number of training rounds reaches a preset number of training iterations;
and determining a preset sign language gesture conversion model according to the final model parameters.
5. The audio processing method of claim 1, wherein the step of converting the audio to be processed into the target text is followed by further comprising:
performing word segmentation processing on the target text to obtain multiple groups of text word segments corresponding to the target text;
performing semantic recognition on each group of text participles respectively to obtain semantic recognition results corresponding to each group of text participles;
traversing a preset sign language gesture text word bank based on the semantic recognition result to obtain a target sign language gesture text matched with the semantic recognition result in the preset sign language gesture text word bank;
and acquiring multi-frame sign language gesture images corresponding to the multiple groups of text participles based on the target sign language gesture text, and displaying the multi-frame sign language gesture images.
6. The audio processing method of claim 5, wherein the step of displaying the plurality of frames of sign language gesture images comprises:
determining the position information of each group of text participles corresponding to each frame of the sign language gesture image in the target text;
sorting the multi-frame sign language gesture images based on the position information to obtain a sorting result;
and converting the multi-frame sign language gesture images into target sign language gesture images according to the sorting result, and displaying the target sign language gesture images.
7. The audio processing method of any of claims 1 to 6, wherein the step of displaying the target sign language gesture image comprises:
determining an audio frame timestamp of the audio to be processed;
determining an image frame timestamp of the target sign language gesture image based on the audio frame timestamp to display the target sign language gesture image based on the image frame timestamp.
8. An audio processing apparatus, characterized in that the audio processing apparatus comprises:
the acquisition module is used for acquiring audio to be processed and converting the audio to be processed into a target text;
the extraction module is used for extracting the features of the target text to obtain text feature data corresponding to the target text;
and the output module is used for acquiring a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model and displaying the target sign language gesture image.
9. An audio processing device, characterized in that the audio processing device comprises a memory, a processor and an audio processing program stored on the memory and executable on the processor, which audio processing program, when executed by the processor, implements the steps of the audio processing method according to any of claims 1-7.
10. A readable storage medium, having stored thereon an audio processing program which, when executed by a processor, implements the steps of the audio processing method according to any one of claims 1-7.
CN202110139349.3A 2021-02-01 2021-02-01 Audio processing method, device, equipment and readable storage medium Pending CN113035199A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110139349.3A CN113035199A (en) 2021-02-01 2021-02-01 Audio processing method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110139349.3A CN113035199A (en) 2021-02-01 2021-02-01 Audio processing method, device, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113035199A true CN113035199A (en) 2021-06-25

Family

ID=76459672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110139349.3A Pending CN113035199A (en) 2021-02-01 2021-02-01 Audio processing method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113035199A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020140718A1 (en) * 2001-03-29 2002-10-03 Philips Electronics North America Corporation Method of providing sign language animation to a monitor and process therefor
CN108766459A (en) * 2018-06-13 2018-11-06 北京联合大学 Target speaker method of estimation and system in a kind of mixing of multi-person speech
CN109740447A (en) * 2018-12-14 2019-05-10 深圳壹账通智能科技有限公司 Communication means, equipment and readable storage medium storing program for executing based on artificial intelligence
CN110730360A (en) * 2019-10-25 2020-01-24 北京达佳互联信息技术有限公司 Video uploading and playing methods and devices, client equipment and storage medium
CN110931042A (en) * 2019-11-14 2020-03-27 北京欧珀通信有限公司 Simultaneous interpretation method and device, electronic equipment and storage medium
CN111640450A (en) * 2020-05-13 2020-09-08 广州国音智能科技有限公司 Multi-person audio processing method, device, equipment and readable storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113709548A (en) * 2021-08-09 2021-11-26 北京达佳互联信息技术有限公司 Image-based multimedia data synthesis method, device, equipment and storage medium
CN113709548B (en) * 2021-08-09 2023-08-25 北京达佳互联信息技术有限公司 Image-based multimedia data synthesis method, device, equipment and storage medium
CN113722513A (en) * 2021-09-06 2021-11-30 北京字节跳动网络技术有限公司 Multimedia data processing method and equipment
CN113722513B (en) * 2021-09-06 2022-12-20 抖音视界有限公司 Multimedia data processing method and equipment
CN114157920A (en) * 2021-12-10 2022-03-08 深圳Tcl新技术有限公司 Playing method and device for displaying sign language, smart television and storage medium
CN114157920B (en) * 2021-12-10 2023-07-25 深圳Tcl新技术有限公司 Method and device for playing sign language, intelligent television and storage medium
CN113923521A (en) * 2021-12-14 2022-01-11 深圳市大头兄弟科技有限公司 Video scripting method
CN113923521B (en) * 2021-12-14 2022-03-08 深圳市大头兄弟科技有限公司 Video scripting method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination