CN113035199B - Audio processing method, device, equipment and readable storage medium

Audio processing method, device, equipment and readable storage medium

Info

Publication number
CN113035199B
CN113035199B (application CN202110139349.3A)
Authority
CN
China
Prior art keywords
audio
sign language
language gesture
text
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110139349.3A
Other languages
Chinese (zh)
Other versions
CN113035199A (en)
Inventor
田园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Skyworth RGB Electronics Co Ltd
Original Assignee
Shenzhen Skyworth RGB Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Skyworth RGB Electronics Co Ltd filed Critical Shenzhen Skyworth RGB Electronics Co Ltd
Priority to CN202110139349.3A priority Critical patent/CN113035199B/en
Publication of CN113035199A publication Critical patent/CN113035199A/en
Application granted granted Critical
Publication of CN113035199B publication Critical patent/CN113035199B/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/1815 - Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an audio processing method, an audio processing device, audio processing equipment and a readable storage medium, wherein the method comprises the following steps: acquiring audio to be processed and converting the audio to be processed into a target text; extracting features of the target text to obtain text feature data corresponding to the target text; and acquiring, through a preset sign language gesture conversion model, a target sign language gesture image corresponding to the text feature data, and displaying the target sign language gesture image. The audio is thereby converted into a corresponding sign language gesture image, which diversifies the modes of information transmission and thus improves the user experience.

Description

Audio processing method, device, equipment and readable storage medium
Technical Field
The present invention relates to the field of audio processing technologies, and in particular, to an audio processing method, apparatus, device, and readable storage medium.
Background
Information itself is intangible: for information to be understood and accepted by people, it must be represented in some concrete form. For example, when a television broadcasts dramas, videos or news, the information is usually conveyed by video combined with audio or text, so the mode of information transmission is quite limited.
However, according to recent research data, the number of people with hearing impairment in China has reached 220 million, of whom more than 70 million suffer from hearing loss. Most current playing terminals support only audio when playing video; for example, most live news programs of mainstream media are broadcast without a synchronized sign language presenter or synchronized subtitles. Because the mode of information transmission is so limited, people with hearing impairment cannot understand the news content when watching such live news programs, which degrades their viewing experience.
The foregoing is provided merely to facilitate understanding of the technical solution of the present invention and does not constitute an admission that it is prior art.
Disclosure of Invention
The main purpose of the present invention is to provide an audio processing method, an audio processing device, audio processing equipment and a readable storage medium, aiming to solve the technical problem that the user experience suffers because the existing mode of information transmission is too limited.
To achieve the above object, the present invention provides an audio processing method including the steps of:
acquiring audio to be processed, and converting the audio to be processed into a target text;
extracting features of the target text to obtain text feature data corresponding to the target text;
and acquiring a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model, and displaying the target sign language gesture image.
Preferably, the step of converting the audio to be processed into target text includes:
extracting the voice of the audio to be processed to obtain target voice audio in the audio to be processed;
and carrying out semantic recognition on the target voice audio to obtain a target text.
Preferably, the step of extracting the voice of the audio to be processed to obtain the target voice audio in the audio to be processed includes:
acquiring audio characteristics of the audio to be processed;
inputting the audio features into a preset audio separation model to obtain an audio feature separation result corresponding to the audio features through the preset audio separation model, wherein the audio feature separation result comprises target human voice audio features;
and acquiring target voice audio in the audio to be processed based on the target voice audio characteristics.
Preferably, before the step of obtaining the sign language gesture image corresponding to the text feature data through the preset sign language gesture conversion model, the method further includes:
acquiring an initial model and a plurality of text training data;
determining sign language gesture prediction results corresponding to the text training data through the initial model;
acquiring a sign language gesture real result corresponding to the text training data, and determining a loss function based on the sign language gesture prediction result and the sign language gesture real result;
updating model parameters of the initial model in a gradient descent mode, and taking the model parameters corresponding to the convergence of the loss function or the model training round reaching a preset training iteration round as final model parameters;
and determining a preset sign language gesture conversion model according to the final model parameters.
Preferably, after the step of converting the audio to be processed into the target text, the method further includes:
performing word segmentation on the target text to obtain several groups of text segments corresponding to the target text;
respectively performing semantic recognition on each group of text segments to obtain a semantic recognition result corresponding to each group of text segments;
traversing a preset sign language gesture text word stock based on the semantic recognition result to obtain a target sign language gesture text matched with the semantic recognition result in the preset sign language gesture text word stock;
and acquiring multi-frame sign language gesture images corresponding to the several groups of text segments based on the target sign language gesture text, and displaying the multi-frame sign language gesture images.
Preferably, the step of displaying the multi-frame sign language gesture image includes:
determining position information, in the target text, of the group of text segments corresponding to each frame of sign language gesture image;
sorting the multi-frame sign language gesture images based on the position information to obtain a sorting result;
and converting the multi-frame sign language gesture images into a target sign language gesture image according to the sorting result, and displaying the target sign language gesture image.
Preferably, the step of displaying the target sign language gesture image includes:
determining an audio frame timestamp of the audio to be processed;
and determining an image frame time stamp of the target sign language gesture image based on the audio frame time stamp, so as to display the target sign language gesture image based on the image frame time stamp.
Further, to achieve the above object, the present invention also provides an audio processing apparatus including:
The acquisition module is used for acquiring the audio to be processed and converting the audio to be processed into a target text;
The extraction module is used for extracting the characteristics of the target text so as to obtain text characteristic data corresponding to the target text;
The output module is used for acquiring a target sign language gesture image corresponding to the text characteristic data through a preset sign language gesture conversion model and displaying the target sign language gesture image.
Further, in order to achieve the above object, the present invention also provides an audio processing apparatus including a memory, a processor, and an audio processing program stored on the memory and executable on the processor, the audio processing program implementing the steps of the audio processing method as described above when executed by the processor.
Further, to achieve the above object, the present invention also provides a readable storage medium having stored thereon an audio processing program which, when executed by a processor, implements the steps of the audio processing method as described above.
Compared with the existing terminal video playing mode, the present invention acquires the audio to be processed and converts it into a target text; extracts features of the target text to obtain text feature data corresponding to the target text; and acquires, through a preset sign language gesture conversion model, a target sign language gesture image corresponding to the text feature data and displays it. The audio is thus converted into a corresponding sign language gesture image, the modes of information transmission are diversified, and the user experience is further improved.
Drawings
FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart of a first embodiment of an audio processing method according to the present invention;
FIG. 3 is a flowchart of a second embodiment of an audio processing method according to the present invention;
FIG. 4 is a flowchart of a third embodiment of an audio processing method according to the present invention;
Fig. 5 is a schematic functional block diagram of an audio processing apparatus according to an embodiment of the invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides an audio processing device. Referring to fig. 1, fig. 1 is a schematic structural diagram of the hardware operating environment of an embodiment of the audio processing device of the invention.
As shown in fig. 1, the audio processing device may include: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005, wherein the communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a Display and an input unit such as a Keyboard, and optionally may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or non-volatile memory, such as disk storage. The memory 1005 may optionally also be a storage device separate from the aforementioned processor 1001.
It will be appreciated by those skilled in the art that the hardware structure of the audio processing device shown in fig. 1 does not constitute a limitation of the audio processing device, which may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and an audio processing program may be included in the memory 1005 as one type of readable storage medium. The operating system is a program for managing and controlling hardware and software resources of the audio processing device, and supports the operation of a network communication module, a user interface module, an audio processing program and other programs or software; the network communication module is used to manage and control the network interface 1004; the user interface module is used to manage and control the user interface 1003.
In the hardware structure of the audio processing device shown in fig. 1, the network interface 1004 is mainly used for connecting to a background server, and performing data communication with the background server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; the processor 1001 may call an audio processing program stored in the memory 1005 and perform the following operations:
acquiring audio to be processed, and converting the audio to be processed into a target text;
extracting features of the target text to obtain text feature data corresponding to the target text;
and acquiring a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model, and displaying the target sign language gesture image.
Further, the processor 1001 may call an audio processing program stored in the memory 1005, and perform the following operations:
extracting the voice of the audio to be processed to obtain target voice audio in the audio to be processed;
and carrying out semantic recognition on the target voice audio to obtain a target text.
Further, the processor 1001 may call an audio processing program stored in the memory 1005, and perform the following operations:
acquiring audio characteristics of the audio to be processed;
inputting the audio features into a preset audio separation model to obtain an audio feature separation result corresponding to the audio features through the preset audio separation model, wherein the audio feature separation result comprises target human voice audio features;
and acquiring target voice audio in the audio to be processed based on the target voice audio characteristics.
Further, the processor 1001 may call an audio processing program stored in the memory 1005, and perform the following operations:
acquiring an initial model and a plurality of text training data;
determining sign language gesture prediction results corresponding to the text training data through the initial model;
acquiring a sign language gesture real result corresponding to the text training data, and determining a loss function based on the sign language gesture prediction result and the sign language gesture real result;
updating model parameters of the initial model in a gradient descent mode, and taking the model parameters corresponding to the convergence of the loss function or the model training round reaching a preset training iteration round as final model parameters;
and determining a preset sign language gesture conversion model according to the final model parameters.
Further, the processor 1001 may call an audio processing program stored in the memory 1005, and perform the following operations:
performing word segmentation on the target text to obtain several groups of text segments corresponding to the target text;
respectively performing semantic recognition on each group of text segments to obtain a semantic recognition result corresponding to each group of text segments;
traversing a preset sign language gesture text word stock based on the semantic recognition result to obtain a target sign language gesture text matched with the semantic recognition result in the preset sign language gesture text word stock;
and acquiring multi-frame sign language gesture images corresponding to the several groups of text segments based on the target sign language gesture text, and displaying the multi-frame sign language gesture images.
Further, the processor 1001 may call an audio processing program stored in the memory 1005, and perform the following operations:
determining position information, in the target text, of the group of text segments corresponding to each frame of sign language gesture image;
sorting the multi-frame sign language gesture images based on the position information to obtain a sorting result;
and converting the multi-frame sign language gesture images into a target sign language gesture image according to the sorting result, and displaying the target sign language gesture image.
Further, the processor 1001 may call an audio processing program stored in the memory 1005, and perform the following operations:
determining an audio frame timestamp of the audio to be processed;
and determining an image frame time stamp of the target sign language gesture image based on the audio frame time stamp, so as to display the target sign language gesture image based on the image frame time stamp.
The invention also provides an audio processing method.
Referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of an audio processing method according to the present invention.
The embodiments of the present invention provide an audio processing method. It should be noted that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order than the one shown or described herein. Specifically, the audio processing method of this embodiment includes:
Step S10, acquiring audio to be processed, and converting the audio to be processed into a target text;
It should be noted that the execution body of this embodiment may be an intelligent terminal device with audio processing and display functions; the intelligent terminal device may be an electronic device such as a computer or a mobile phone, or another device capable of implementing the same or similar functions, which is not limited in this embodiment. In this embodiment, a television terminal is taken as an example for illustration.
It should be understood that information itself is intangible and must be represented in some concrete form if it is to be understood and accepted by a person. For example, when some television terminals play videos such as live news programs, most of those programs are not accompanied by a synchronized sign language presenter or synchronized subtitles, so viewers with hearing impairment cannot understand the news content while watching, which impairs their viewing experience.
In this embodiment, the audio to be processed may be pre-recorded audio, such as audio data in documentaries or movie shows, or real-time audio, such as that of a live news program or real-time audio data during a video call; this embodiment is not limited in this respect.
It should be understood that when the audio to be processed is real-time audio, the original audio data in the live video may be acquired and analyzed in real time. For example, feature extraction is performed on the original audio data to separate the target voice data from the non-voice data (such as background music data and noise data). The target voice data is then monitored to determine whether an interruption exists: if, during the detection period, voice is detected from 10 s to 120 s and is next detected only at 130 s, an interruption is determined to exist, and an interruption time (for example, any second between 120 s and 130 s) is selected. The target voice data is then divided, based on the interruption time, into a first segment and a second segment, where the first segment is the human voice audio before the interruption and the second segment is the audio after it; the first segment is treated as the current audio to be processed, and the second segment is held as the next target audio to be processed.
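As an illustration of this interruption check, the following sketch (in Python; the 0.5 s window and the energy threshold are assumptions of this sketch, not values from the disclosure) locates the first silent window in the separated voice track and splits the track there:

    # A minimal sketch, assuming the voice track is a mono float array.
    import numpy as np

    def find_interruption(voice: np.ndarray, sr: int,
                          win_s: float = 0.5, threshold: float = 1e-3):
        """Return the start time (seconds) of the first silent window, or None."""
        win = int(win_s * sr)
        for start in range(0, len(voice) - win, win):
            frame = voice[start:start + win]
            if np.sqrt(np.mean(frame ** 2)) < threshold:  # RMS energy ~ silence
                return start / sr
        return None

    def split_at(voice: np.ndarray, sr: int, t: float):
        """Split the track into the segments before and after the interruption."""
        i = int(t * sr)
        return voice[:i], voice[i:]

The first returned segment would then be passed on for recognition, while the second is buffered as the next audio to be processed.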
When the audio to be processed is pre-recorded audio, optionally, the complete recording is used as the audio to be processed, or the recording may be split into several audio segments with each segment used in turn as the audio to be processed; this embodiment is not limited in this respect.
It should also be understood that, in order to improve the accuracy of the audio processing result, noise removal and similar pre-processing should be performed on the audio to be processed before it is converted into the target text, so as to eliminate the influence of noise on the processing result.
In addition, for easy understanding, the embodiment specifically describes the above implementation manner of converting the audio to be processed into the target text:
extracting the voice of the audio to be processed to obtain target voice audio in the audio to be processed;
and carrying out semantic recognition on the target voice audio to obtain target text matched with the target voice audio.
It should be understood that, in general, the audio to be processed is mixed audio, that is, it contains not only a human voice portion but also other environmental portions such as background music. Therefore, in order to improve the accuracy of the audio processing result, in this embodiment voice extraction is performed on the audio to be processed to obtain the target voice audio, from which the target text is then obtained.
Preferably, in this embodiment a pre-trained audio separation model is used to extract the voice from the audio to be processed. Specifically, feature extraction is performed on the audio to be processed to obtain feature vectors, and the feature vectors are input to the pre-trained audio separation model, which outputs voice audio feature vectors; the target voice audio is then extracted from the audio to be processed by means of these voice audio feature vectors.
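By way of illustration, the separation step might be applied as in the sketch below, where separation_model is a hypothetical pre-trained network that maps a magnitude spectrogram to a soft voice mask; the disclosure does not prescribe this particular structure:

    # A sketch of masking-based voice separation, assuming `separation_model`
    # returns a mask in [0, 1] with the same shape as the magnitude spectrogram.
    import numpy as np
    import librosa

    def extract_voice(mixture: np.ndarray, sr: int, separation_model) -> np.ndarray:
        stft = librosa.stft(mixture)                        # complex spectrogram
        mag, phase = np.abs(stft), np.angle(stft)
        voice_mask = separation_model(mag)                  # hypothetical model call
        voice_stft = voice_mask * mag * np.exp(1j * phase)  # keep mixture phase
        return librosa.istft(voice_stft, length=len(mixture))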
Preferably, in another embodiment, when the audio to be processed is pre-recorded, it can be understood that the recorded audio is synthesized audio, that is, the voice audio and the background music audio were mixed onto separate channels. The audio to be processed can therefore be decomposed, based on channel characteristics, into multiple sub-audio segments corresponding to multiple different channels; the sub-audio segment corresponding to the voice channel is selected from the different channels based on the channel identifier, and that sub-audio segment is used as the target voice audio.
Specifically, in order to improve the accuracy of the voice extraction result, this embodiment uses a model to extract the voice from the audio to be processed. For ease of understanding, this embodiment specifically describes the implementation of extracting the voice from the audio to be processed to obtain the target voice audio:
acquiring audio characteristics of the audio to be processed;
inputting the audio features into a preset audio separation model to obtain an audio feature separation result corresponding to the audio features through the preset audio separation model, wherein the audio feature separation result comprises target human voice audio features;
and acquiring target voice audio in the audio to be processed based on the target voice audio characteristics.
It should be understood that the above-mentioned preset audio separation model is a model trained in advance with a large amount of audio training data. In this embodiment, it is preferable to perform denoising, pre-emphasis, framing, windowing, fast Fourier transform and similar processing on the audio to be processed to obtain Mel-frequency cepstral coefficients (MFCCs), and to use the MFCCs as the audio features of the audio to be processed.
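For reference, such a front end can be sketched with the open-source librosa library standing in for the pipeline just described; the sampling rate, frame, hop and coefficient counts below are common defaults assumed for this sketch, not values given in the disclosure:

    # A minimal MFCC front end: pre-emphasis, then framing, windowing, FFT and
    # Mel cepstrum computation handled inside librosa.feature.mfcc.
    import numpy as np
    import librosa

    def audio_features(path: str, n_mfcc: int = 13) -> np.ndarray:
        y, sr = librosa.load(path, sr=16000)
        y = np.append(y[0], y[1:] - 0.97 * y[:-1])          # pre-emphasis filter
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=512, hop_length=160)
        return mfcc.T                                       # one row per frame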
After the target voice audio is obtained, phonetic-symbol features are extracted from it to determine the phonetic-symbol attribute of the target voice audio, such as Chinese phonetic symbols or English phonetic symbols. If the phonetic-symbol attribute of the target voice audio is detected to be Chinese, the target voice audio is divided into several groups of phonetic symbols based on a preset Chinese phonetic-symbol division rule, and at least one candidate word matching each group of audio is determined based on a preset Chinese word bank. The candidate words matching each group of audio are then combined one by one to obtain several candidate texts together with the context-fit weight corresponding to each text, and finally the text with the largest context-fit weight is selected as the target text.
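The selection of the text with the largest context-fit weight can be illustrated with the toy sketch below; bigram_score is a hypothetical stand-in for whatever language model supplies the weights, and the exhaustive enumeration is only practical for short utterances:

    # Toy sketch: pick the candidate-word combination with the highest
    # accumulated context-fit weight.
    from itertools import product

    def best_transcript(candidates, bigram_score):
        """candidates: one list of candidate words per phonetic-symbol group."""
        best, best_w = None, float("-inf")
        for words in product(*candidates):
            w = sum(bigram_score(a, b) for a, b in zip(words, words[1:]))
            if w > best_w:
                best, best_w = words, w
        return "".join(best)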
In addition, in another embodiment, when the phonetic-symbol attribute of the target voice audio is non-Chinese, such as English phonetic symbols, it should be understood that, in order to improve the accuracy of audio interpretation, the phonetic-symbol attribute of the target voice audio should first be converted into Chinese phonetic symbols; the target voice audio is then divided into several groups of phonetic symbols based on the preset Chinese phonetic-symbol division rule for word recognition, thereby improving the accuracy of the word recognition result.
Step S20, extracting characteristics of the target text to obtain text characteristic data corresponding to the target text;
and step S30, acquiring a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model, and displaying the target sign language gesture image.
Specifically, the text feature data include the keywords in the target text, the attributes of each word in the target text, such as personal pronoun, modal particle, adjective or verb, and feature data characterizing the meaning of each word in the target text.
In this embodiment, a preset sign language gesture conversion model is used to determine the target sign language gesture image corresponding to the target text. In this step, the text feature data are input into the preset sign language gesture conversion model, which outputs the corresponding target sign language gesture image.
Furthermore, it should be noted that since the target text is composed of several words or phrases, and one word or phrase is generally mapped to one sign language gesture image, in this embodiment the target sign language gesture image is composed of multiple frames of sign language gesture images. In some embodiments, after the multiple frames of sign language gesture images are obtained, displaying them out of step with the audio content leaves the sign language gestures unsynchronized or mismatched, which harms the user experience. Therefore, in order to ensure that the sign language gesture images are synchronized with the audio content in the video, this embodiment provides an implementation, specifically for displaying the target sign language gesture image:
determining display sequence information of each frame of sign language gesture image in the target sign language gesture image;
and displaying the target sign language gesture image according to the display order information.
Specifically, the display order information refers to the order, from left to right in the target text, of the word or phrase corresponding to each frame of sign language gesture image. For example, for "I am going to work today", the corresponding text is divided into "I", "today" and "go to work"; the sign language gesture image corresponding to "I" is ranked first, the image corresponding to "today" second, and so on. Based on the display order information, the multi-frame sign language gesture images are converted into a continuous dynamic sign language sequence for display output.
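A minimal sketch of this ordering step, with illustrative names and the assumption that each mapped word occurs in the target text, might be:

    # Order one gesture image per word by the word's position in the target text.
    def order_gesture_frames(text: str, word_to_frame: dict) -> list:
        words = sorted(word_to_frame, key=text.index)   # left-to-right position
        return [word_to_frame[w] for w in words]

    # order_gesture_frames("I am going to work today",
    #                      {"today": t_img, "I": i_img, "work": w_img})
    # yields [i_img, w_img, t_img], following the order of the words in the text.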
In addition, when outputting the sign language gesture images, care should be taken that the display time of the output images stays synchronized with the playing time of the audio to be processed. Preferably, when the audio to be processed is pre-recorded, the audio frame time stamp of the audio data corresponding to each frame of sign language gesture image in the audio to be processed is determined, and the image frame time stamp of each frame in the target sign language gesture image is determined according to that audio frame time stamp. Preferably, when the audio to be processed is real-time audio, an audio stream of a certain duration is buffered in advance, the buffered stream is then processed, and the sign language gesture images resulting from the audio processing are output in real time.
In addition, in an embodiment, the step of displaying the dynamic sign language gesture includes:
determining an audio frame timestamp of the audio to be processed;
and determining an image frame time stamp of the target sign language gesture image based on the audio frame time stamp, so as to display the target sign language gesture image based on the image frame time stamp.
Specifically, the audio frame time stamp refers to the playing time information of each frame of the audio stream of the audio to be processed, and the image frame time stamp refers to the display time information of each frame of sign language gesture image in the target sign language gesture image.
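A sketch of this alignment, assuming recognition supplies (word, start-time) pairs for the audio stream, might be:

    # Give each gesture frame the time stamp of the audio frame in which its
    # source word begins, so display follows playback.
    def image_timestamps(word_spans, frames):
        """word_spans: (word, start_time_s) pairs; frames: word -> image."""
        schedule = [(t, frames[w]) for w, t in word_spans if w in frames]
        return sorted(schedule, key=lambda p: p[0])   # playback order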
In addition, when the audio to be processed has a corresponding video stream, the audio to be processed and the target sign language gesture image are output and displayed synchronously on the target terminal.
It should be understood that the above-mentioned target terminal refers to the playing terminal corresponding to the video to be processed. Optionally, the target sign language gesture image is displayed within the area of the display screen that shows the video stream corresponding to the audio to be processed, for example at the lower left, lower right, upper left or upper right of that area; alternatively, the target sign language gesture image may be displayed outside that area. This embodiment is not limited in this respect.
In this embodiment, the display time of each frame of sign language gesture image in the target sign language gesture image is determined according to the playing time of each frame of the audio stream of the audio to be processed, which ensures that the audio to be processed and the target sign language gesture image are displayed synchronously; and the display position of the video stream corresponding to the audio to be processed and the display position of the target sign language gesture image on the display screen of the target terminal are coordinated, so that the display picture remains harmonious while the modes of information transmission are diversified, further improving the user experience.
Compared with the existing terminal video playing mode, in this embodiment the audio to be processed is acquired and converted into a target text; features of the target text are extracted to obtain the text feature data corresponding to the target text; and a target sign language gesture image corresponding to the text feature data is acquired through a preset sign language gesture conversion model and displayed. The audio is thus converted into a corresponding sign language gesture image, the modes of information transmission are diversified, and the user experience is further improved.
Further, based on the first embodiment of the audio processing method of the present invention, a second embodiment of the audio processing method of the present invention is presented.
Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of an audio processing method according to the present invention.
The second embodiment of the audio processing method differs from the first embodiment in that, before the step of acquiring the multi-frame sign language gesture images corresponding to the several groups of text segments through a preset sign language gesture conversion model, the method further includes:
Step S301, acquiring an initial model and a plurality of text training data;
Step S302, determining sign language gesture prediction results corresponding to the text training data through the initial model;
Step S303, acquiring a sign language gesture real result corresponding to the text training data, and determining a loss function based on the sign language gesture prediction result and the sign language gesture real result;
Step S304, updating model parameters of the initial model in a gradient descent mode, and taking the model parameters corresponding to the convergence of the loss function or the model training reaching a preset training iteration round as final model parameters;
Step S305, determining a preset sign language gesture conversion model according to the final model parameters.
Specifically, the text training data are text data annotated with the corresponding sign language gesture features. Optionally, in this embodiment an initial model is constructed based on a deep learning network or a convolutional neural network. The text features of each piece of text training data are then extracted and used as input values of the initial model, and the initial model outputs sign language gesture prediction results. Optionally, a sign language gesture prediction result may be a sign language gesture image, or the feature vector corresponding to a sign language gesture image. For example, when the prediction result is a sign language gesture prediction feature vector, the sign language gesture real feature vector of the real result corresponding to each piece of text training data is extracted, the deviation value between the predicted and real feature vectors is calculated for each piece of text training data, and the loss function of the initial model is determined from these deviation values. Finally, the model parameters of the initial model are updated by gradient descent, the model parameters at which the loss function converges or the training reaches the preset number of iteration rounds are taken as the final model parameters, and the preset sign language gesture conversion model is determined from the final model parameters.
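A compressed training sketch under stated assumptions follows: the initial model is a small feed-forward network, the deviation value is the mean-squared error between predicted and real gesture feature vectors, and convergence is approximated by a loss floor; none of these specifics are prescribed by the disclosure:

    # Minimal full-batch gradient-descent training loop for the conversion model.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)   # plain gradient descent
    loss_fn = nn.MSELoss()                               # deviation value

    def train(text_feats, gesture_feats, max_epochs=100, tol=1e-4):
        for epoch in range(max_epochs):                  # preset iteration rounds
            opt.zero_grad()
            loss = loss_fn(model(text_feats), gesture_feats)
            loss.backward()
            opt.step()
            if loss.item() < tol:                        # treated as convergence
                break
        return model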
In this embodiment, the preset sign language gesture conversion model is constructed in advance, and the target sign language gesture image corresponding to the text feature data is then obtained through that model, which increases the speed of converting text into sign language gesture images and further improves the user experience.
Further, based on the first embodiment of the audio processing method of the present invention, a third embodiment of the audio processing method of the present invention is proposed.
Referring to fig. 4, fig. 4 is a flowchart illustrating a third embodiment of an audio processing method according to the present invention.
The third embodiment of the audio processing method differs from the first embodiment in that, after the step of converting the audio to be processed into the target text, the method further includes:
Step S40, performing word segmentation on the target text to obtain several groups of text segments corresponding to the target text;
Step S50, respectively performing semantic recognition on each group of text segments to obtain a semantic recognition result corresponding to each group of text segments;
Step S60, traversing a preset sign language gesture text word stock based on the semantic recognition result to obtain a target sign language gesture text matched with the semantic recognition result in the preset sign language gesture text word stock;
Step S70, acquiring multi-frame sign language gesture images corresponding to the several groups of text segments based on the target sign language gesture text, and displaying the multi-frame sign language gesture images.
It should be understood that some specific words may be represented by specific sign language gestures, so before the sign language gesture conversion is performed on the target text, the target text may be split into several groups of text segments. For example, the stop words (such as "get" or "place") or connective words ("although", "however", and so on) in the target text may be located first. Alternatively, the target text may be split based on a preset dictionary: the target text is first split randomly into groups of text segments, each group is then checked against the preset dictionary, and if at least one group does not exist in the preset dictionary, the target text is split again starting from that group, until every group of text segments exists in the preset dictionary.
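The dictionary-driven splitting can be illustrated with a forward-maximum-matching sketch; the greedy longest-match strategy and the tiny dictionary are illustrative assumptions rather than the specific procedure of the disclosure:

    # Take the longest dictionary entry at each position; fall back to a single
    # character so that every resulting segment exists in the dictionary.
    def fmm_segment(text: str, dictionary: set, max_len: int = 4) -> list:
        words, i = [], 0
        while i < len(text):
            for l in range(min(max_len, len(text) - i), 0, -1):
                if text[i:i + l] in dictionary or l == 1:
                    words.append(text[i:i + l])
                    i += l
                    break
        return words

    print(fmm_segment("我今天上班", {"我", "今天", "上班"}))  # ['我', '今天', '上班']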
In addition, it should be understood that, in order to avoid inaccurate sign language gesture images caused by a poorly performing preset sign language gesture conversion model, this implementation uses a preset sign language gesture text word stock to acquire the sign language gesture images matched with each group of text segments.
Specifically, the semantic recognition result refers to word parameters such as part of speech and word meaning. After the semantic recognition result corresponding to each group of text segments is obtained, the preset sign language gesture text word stock is traversed to obtain, from the word stock, the target sign language gesture texts whose word parameters, such as part of speech and meaning, are identical or highly similar to those of the text segments, for example near-synonym pairs such as "good" and "fine" or "walk" and "go"; the sign language gesture image corresponding to the target sign language gesture text matched with each group of text segments in the preset sign language gesture text word stock is then determined.
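A sketch of this lexicon lookup follows, with the standard-library difflib ratio standing in for whatever word-sense similarity measure the embodiment intends:

    # Traverse the gesture-text word stock and keep the entry most similar to
    # the recognized word.
    from difflib import SequenceMatcher

    def match_gesture_text(word: str, lexicon: list) -> str:
        return max(lexicon,
                   key=lambda entry: SequenceMatcher(None, word, entry).ratio())

    # match_gesture_text("walking", ["walk", "run", "stand"]) -> "walk"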
Furthermore, in an embodiment, in order to ensure that the sign language gesture images are synchronized with the audio content, an implementation is provided, specifically for displaying the multi-frame sign language gesture images:
determining position information, in the target text, of the group of text segments corresponding to each frame of sign language gesture image;
sorting the multi-frame sign language gesture images based on the position information to obtain a sorting result;
and converting the multi-frame sign language gesture images into a target sign language gesture image according to the sorting result, and displaying the target sign language gesture image.
Specifically, the above position information refers to the left-to-right order of the characters of each group of text segments in the target text. For example, for "I am going to work today", the corresponding text segments are "I", "today" and "go to work"; the sign language gesture image corresponding to "I" is ranked first, the image corresponding to "today" second, and so on, and based on the sorting result the multi-frame sign language gesture images are converted into a continuous dynamic sign language sequence.
In addition, when outputting the sign language gesture images, care should be taken that the display time of the output images stays synchronized with the audio playing time of the video to be processed. Preferably, when the video to be processed is a pre-recorded video, the audio frame time stamp of the audio data corresponding to each frame of sign language gesture image in the video to be processed is determined, and the output time stamp of each frame of sign language gesture image is determined according to that audio frame time stamp. Preferably, when the audio to be processed is real-time audio, a video stream of a certain duration is intercepted in advance, audio processing is then performed on the video stream, and the sign language gesture images resulting from the audio processing are output in real time.
In addition, in an embodiment, the step of displaying the dynamic sign language gesture includes:
determining an audio frame timestamp of the audio to be processed;
and determining an image frame time stamp of the target sign language gesture image based on the audio frame time stamp, so as to display the target sign language gesture image based on the image frame time stamp.
Specifically, the audio frame time stamp refers to the playing time information of each frame of the audio stream of the audio to be processed, and the image frame time stamp refers to the display time information of each frame of sign language gesture image in the target sign language gesture image.
In addition, when the audio to be processed has a corresponding video stream, the audio to be processed and the target sign language gesture image are output and displayed synchronously on the target terminal.
It should be understood that the above-mentioned target terminal refers to the playing terminal corresponding to the video to be processed. Optionally, the target sign language gesture image is displayed within the area of the display screen that shows the video stream corresponding to the audio to be processed, for example at the lower left, lower right, upper left or upper right of that area; alternatively, the target sign language gesture image may be displayed outside that area. This embodiment is not limited in this respect.
In this embodiment, the display time of each frame of sign language gesture image in the target sign language gesture image is determined according to the playing time of each frame of the audio stream of the audio to be processed, which ensures that the audio to be processed and the target sign language gesture image are displayed synchronously; and the display position of the video stream corresponding to the audio to be processed and the display position of the target sign language gesture image on the display screen of the target terminal are coordinated, so that the display picture remains harmonious while the modes of information transmission are diversified, further improving the user experience.
In this embodiment, semantic recognition is performed on each group of text segments to obtain the semantic recognition result corresponding to each group; a preset sign language gesture text word stock is traversed based on the semantic recognition results to obtain the target sign language gesture texts matched with the semantic recognition results in the word stock; and the multi-frame sign language gesture images corresponding to the several groups of text segments are acquired based on the target sign language gesture texts and displayed. This avoids the technical problem that the acquired sign language gesture images are inaccurate because the preset sign language gesture conversion model performs poorly.
The invention also provides an audio processing device. Referring to fig. 5, the audio processing apparatus includes:
The acquisition module 10 is used for acquiring audio to be processed and converting the audio to be processed into a target text;
the extracting module 20 is configured to perform feature extraction on the target text to obtain text feature data corresponding to the target text;
and the output module 30 is configured to obtain a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model, and display the target sign language gesture image.
In addition, the embodiment of the invention also provides a readable storage medium.
The readable storage medium has stored thereon an audio processing program which, when executed by a processor, implements the steps of the audio processing method as described above.
The readable storage medium of the present invention may be a computer-readable storage medium. The specific implementation of the readable storage medium is substantially the same as that of the embodiments of the audio processing method described above and will not be repeated here.
The embodiments of the present invention have been described above with reference to the drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Those of ordinary skill in the art, under the teaching of the present invention, may make many modifications without departing from the spirit of the present invention and the scope of the appended claims, and all such modifications fall within the protection of the present invention. Likewise, any equivalent structure or equivalent process transformation made using the description and drawings of the present invention, and any direct or indirect application in other related technical fields, is included within the patent protection scope of the present invention.

Claims (9)

1. An audio processing method, characterized in that the audio processing method comprises the steps of:
acquiring audio to be processed, and converting the audio to be processed into a target text;
extracting features of the target text to obtain text feature data corresponding to the target text;
acquiring a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model, and displaying the target sign language gesture image;
before the step of obtaining the target sign language gesture image corresponding to the text feature data through the preset sign language gesture conversion model, the method further comprises the following steps:
acquiring an initial model and a plurality of text training data;
determining sign language gesture prediction results corresponding to the text training data through the initial model, wherein the sign language gesture prediction results are sign language gesture prediction feature vectors;
acquiring a sign language gesture real characteristic vector of a sign language gesture real result corresponding to the text training data;
determining a deviation value between the sign language gesture prediction feature vector corresponding to each text training data and the sign language gesture real feature vector, and determining a loss function based on the deviation value;
updating model parameters of the initial model in a gradient descent mode, and taking the model parameters corresponding to the convergence of the loss function or the model training round reaching a preset training iteration round as final model parameters;
and determining a preset sign language gesture conversion model according to the final model parameters.
2. The audio processing method of claim 1, wherein the step of converting the audio to be processed into target text comprises:
extracting the voice of the audio to be processed to obtain target voice audio in the audio to be processed;
and carrying out semantic recognition on the target voice audio to obtain a target text.
3. The audio processing method according to claim 2, wherein the step of performing human voice extraction on the audio to be processed to obtain target human voice audio in the audio to be processed comprises:
acquiring audio characteristics of the audio to be processed;
inputting the audio features into a preset audio separation model to obtain an audio feature separation result corresponding to the audio features through the preset audio separation model, wherein the audio feature separation result comprises target human voice audio features;
and acquiring target voice audio in the audio to be processed based on the target voice audio characteristics.
4. The audio processing method according to claim 1, further comprising, after the step of converting the audio to be processed into a target text:
performing word segmentation processing on the target text to obtain several groups of text segments corresponding to the target text;
respectively performing semantic recognition on each group of text segments to obtain a semantic recognition result corresponding to each group of text segments;
traversing a preset sign language gesture text word stock based on the semantic recognition result to obtain a target sign language gesture text matched with the semantic recognition result in the preset sign language gesture text word stock;
and acquiring multi-frame sign language gesture images corresponding to the several groups of text segments based on the target sign language gesture text, and displaying the multi-frame sign language gesture images.
5. The audio processing method of claim 4, wherein the step of displaying the multi-frame sign language gesture image comprises:
determining position information, in the target text, of the group of text segments corresponding to each frame of sign language gesture image;
sorting the multi-frame sign language gesture images based on the position information to obtain a sorting result;
and converting the multi-frame sign language gesture images into a target sign language gesture image according to the sorting result, and displaying the target sign language gesture image.
6. The audio processing method according to any one of claims 1 to 5, wherein the step of displaying the target sign language gesture image includes:
determining an audio frame timestamp of the audio to be processed;
and determining an image frame time stamp of the target sign language gesture image based on the audio frame time stamp, so as to display the target sign language gesture image based on the image frame time stamp.
7. An audio processing apparatus, characterized in that the audio processing apparatus comprises:
The acquisition module is used for acquiring the audio to be processed and converting the audio to be processed into a target text;
The extraction module is used for extracting the characteristics of the target text so as to obtain text characteristic data corresponding to the target text;
The output module is used for acquiring a target sign language gesture image corresponding to the text characteristic data through a preset sign language gesture conversion model and displaying the target sign language gesture image;
The output module is also used for acquiring an initial model and a plurality of text training data; determining sign language gesture prediction results corresponding to the text training data through the initial model, wherein the sign language gesture prediction results are sign language gesture prediction feature vectors; acquiring a sign language gesture real characteristic vector of a sign language gesture real result corresponding to the text training data;
The output module is further used for determining deviation values between the sign language gesture prediction feature vectors and the sign language gesture real feature vectors corresponding to the text training data, and determining a loss function based on the deviation values; updating model parameters of the initial model in a gradient descent mode, and taking the model parameters corresponding to the convergence of the loss function or the model training round reaching a preset training iteration round as final model parameters; and determining a preset sign language gesture conversion model according to the final model parameters.
8. An audio processing device comprising a memory, a processor and an audio processing program stored on the memory and executable on the processor, which audio processing program when executed by the processor implements the steps of the audio processing method according to any of claims 1-6.
9. A readable storage medium, characterized in that the readable storage medium has stored thereon an audio processing program, which when executed by a processor, implements the steps of the audio processing method according to any of claims 1-6.
CN202110139349.3A 2021-02-01 2021-02-01 Audio processing method, device, equipment and readable storage medium Active CN113035199B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110139349.3A CN113035199B (en) 2021-02-01 2021-02-01 Audio processing method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110139349.3A CN113035199B (en) 2021-02-01 2021-02-01 Audio processing method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113035199A CN113035199A (en) 2021-06-25
CN113035199B 2024-05-07

Family

ID=76459672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110139349.3A Active CN113035199B (en) 2021-02-01 2021-02-01 Audio processing method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113035199B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113709548B (en) * 2021-08-09 2023-08-25 北京达佳互联信息技术有限公司 Image-based multimedia data synthesis method, device, equipment and storage medium
CN113722513B (en) * 2021-09-06 2022-12-20 抖音视界有限公司 Multimedia data processing method and equipment
CN114157920B (en) * 2021-12-10 2023-07-25 深圳Tcl新技术有限公司 Method and device for playing sign language, intelligent television and storage medium
CN113923521B (en) * 2021-12-14 2022-03-08 深圳市大头兄弟科技有限公司 Video scripting method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108766459A (en) * 2018-06-13 2018-11-06 北京联合大学 Target speaker method of estimation and system in a kind of mixing of multi-person speech
CN109740447A (en) * 2018-12-14 2019-05-10 深圳壹账通智能科技有限公司 Communication means, equipment and readable storage medium storing program for executing based on artificial intelligence
CN110730360A (en) * 2019-10-25 2020-01-24 北京达佳互联信息技术有限公司 Video uploading and playing methods and devices, client equipment and storage medium
CN110931042A (en) * 2019-11-14 2020-03-27 北京欧珀通信有限公司 Simultaneous interpretation method and device, electronic equipment and storage medium
CN111640450A (en) * 2020-05-13 2020-09-08 广州国音智能科技有限公司 Multi-person audio processing method, device, equipment and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020140718A1 (en) * 2001-03-29 2002-10-03 Philips Electronics North America Corporation Method of providing sign language animation to a monitor and process therefor

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108766459A (en) * 2018-06-13 2018-11-06 北京联合大学 Target speaker method of estimation and system in a kind of mixing of multi-person speech
CN109740447A (en) * 2018-12-14 2019-05-10 深圳壹账通智能科技有限公司 Communication means, equipment and readable storage medium storing program for executing based on artificial intelligence
CN110730360A (en) * 2019-10-25 2020-01-24 北京达佳互联信息技术有限公司 Video uploading and playing methods and devices, client equipment and storage medium
CN110931042A (en) * 2019-11-14 2020-03-27 北京欧珀通信有限公司 Simultaneous interpretation method and device, electronic equipment and storage medium
CN111640450A (en) * 2020-05-13 2020-09-08 广州国音智能科技有限公司 Multi-person audio processing method, device, equipment and readable storage medium

Also Published As

Publication number Publication date
CN113035199A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
CN113035199B (en) Audio processing method, device, equipment and readable storage medium
KR102401942B1 (en) Method and apparatus for evaluating translation quality
US11917344B2 (en) Interactive information processing method, device and medium
CN112115706B (en) Text processing method and device, electronic equipment and medium
CN110517689B (en) Voice data processing method, device and storage medium
CN109218629B (en) Video generation method, storage medium and device
CN112040263A (en) Video processing method, video playing method, video processing device, video playing device, storage medium and equipment
CN109256133A (en) A kind of voice interactive method, device, equipment and storage medium
CN111883107B (en) Speech synthesis and feature extraction model training method, device, medium and equipment
CN110781328A (en) Video generation method, system, device and storage medium based on voice recognition
CN112399269B (en) Video segmentation method, device, equipment and storage medium
CN113392273A (en) Video playing method and device, computer equipment and storage medium
JP2012181358A (en) Text display time determination device, text display system, method, and program
US20240064383A1 (en) Method and Apparatus for Generating Video Corpus, and Related Device
CN110784662A (en) Method, system, device and storage medium for replacing video background
CN110517668A (en) A kind of Chinese and English mixing voice identifying system and method
CN113411674A (en) Video playing control method and device, electronic equipment and storage medium
CN109858005B (en) Method, device, equipment and storage medium for updating document based on voice recognition
CN109376145B (en) Method and device for establishing movie and television dialogue database and storage medium
KR102345625B1 (en) Caption generation method and apparatus for performing the same
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
CN116962787A (en) Interaction method, device, equipment and storage medium based on video information
CN116089601A (en) Dialogue abstract generation method, device, equipment and medium
CN115171645A (en) Dubbing method and device, electronic equipment and storage medium
CN115967833A Video generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant