CN113035199B - Audio processing method, device, equipment and readable storage medium

Audio processing method, device, equipment and readable storage medium

Info

Publication number
CN113035199B
CN113035199B (application CN202110139349.3A)
Authority
CN
China
Prior art keywords
audio
sign language
language gesture
text
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110139349.3A
Other languages
Chinese (zh)
Other versions
CN113035199A (en)
Inventor
田园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Skyworth RGB Electronics Co Ltd
Original Assignee
Shenzhen Skyworth RGB Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Skyworth RGB Electronics Co Ltd filed Critical Shenzhen Skyworth RGB Electronics Co Ltd
Priority to CN202110139349.3A priority Critical patent/CN113035199B/en
Publication of CN113035199A publication Critical patent/CN113035199A/en
Application granted granted Critical
Publication of CN113035199B publication Critical patent/CN113035199B/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/1815 - Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an audio processing method, an audio processing device, audio processing equipment and a readable storage medium, wherein the method comprises the following steps: acquiring audio to be processed and converting the audio to be processed into a target text; extracting features of the target text to obtain text feature data corresponding to the target text; and acquiring, through a preset sign language gesture conversion model, a target sign language gesture image corresponding to the text feature data, and displaying the target sign language gesture image. The audio is thereby converted into a corresponding sign language gesture image, which diversifies the modes of information transmission and thus improves the user experience.

Description

Audio processing method, device, equipment and readable storage medium
Technical Field
The present invention relates to the field of audio processing technologies, and in particular, to an audio processing method, apparatus, device, and readable storage medium.
Background
Information itself is intangible: for information to be understood and accepted by people, it must be represented in some concrete form. For example, when a television broadcasts dramas, videos or news, the information is usually conveyed by video combined with audio or text, so the mode of information transmission is quite limited.
However, according to recent research data, the number of people with hearing impairment in China has reached 220 million, of whom more than 70 million suffer from hearing loss. Most current playing terminals support only audio when playing video; for example, most live news programs of mainstream media are broadcast without a synchronized sign language presenter or synchronized subtitles. Because the mode of information transmission is so limited, people with hearing impairment cannot understand the news content when watching such live news programs, which degrades their viewing experience.
The foregoing is provided merely to facilitate understanding of the technical solution of the present invention and does not constitute an admission that it is prior art.
Disclosure of Invention
The main purpose of the present invention is to provide an audio processing method, an audio processing device, audio processing equipment and a readable storage medium, aiming to solve the technical problem that the user experience suffers because the existing mode of information transmission is too limited.
To achieve the above object, the present invention provides an audio processing method including the steps of:
acquiring audio to be processed, and converting the audio to be processed into a target text;
extracting features of the target text to obtain text feature data corresponding to the target text;
and acquiring a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model, and displaying the target sign language gesture image.
Preferably, the step of converting the audio to be processed into target text includes:
extracting the voice of the audio to be processed to obtain target voice audio in the audio to be processed;
and carrying out semantic recognition on the target voice audio to obtain a target text.
Preferably, the step of extracting the voice of the audio to be processed to obtain the target voice audio in the audio to be processed includes:
acquiring audio characteristics of the audio to be processed;
inputting the audio features into a preset audio separation model to obtain an audio feature separation result corresponding to the audio features through the preset audio separation model, wherein the audio feature separation result comprises target human voice audio features;
and acquiring target voice audio in the audio to be processed based on the target voice audio characteristics.
Preferably, before the step of obtaining the sign language gesture image corresponding to the text feature data through the preset sign language gesture conversion model, the method further includes:
acquiring an initial model and a plurality of text training data;
determining sign language gesture prediction results corresponding to the text training data through the initial model;
acquiring a sign language gesture real result corresponding to the text training data, and determining a loss function based on the sign language gesture prediction result and the sign language gesture real result;
updating model parameters of the initial model in a gradient descent mode, and taking the model parameters corresponding to the convergence of the loss function or the model training round reaching a preset training iteration round as final model parameters;
and determining a preset sign language gesture conversion model according to the final model parameters.
Preferably, after the step of converting the audio to be processed into the target text, the method further includes:
performing word segmentation on the target text to obtain several groups of text segments corresponding to the target text;
respectively performing semantic recognition on each group of text segments to obtain a semantic recognition result corresponding to each group of text segments;
traversing a preset sign language gesture text word stock based on the semantic recognition result to obtain a target sign language gesture text matched with the semantic recognition result in the preset sign language gesture text word stock;
and acquiring multi-frame sign language gesture images corresponding to the several groups of text segments based on the target sign language gesture text, and displaying the multi-frame sign language gesture images.
Preferably, the step of displaying the multi-frame sign language gesture image includes:
determining position information, in the target text, of the group of text segments corresponding to each frame of sign language gesture image;
sorting the multi-frame sign language gesture images based on the position information to obtain a sorting result;
and converting the multi-frame sign language gesture images into a target sign language gesture image according to the sorting result, and displaying the target sign language gesture image.
Preferably, the step of displaying the target sign language gesture image includes:
determining an audio frame timestamp of the audio to be processed;
and determining an image frame time stamp of the target sign language gesture image based on the audio frame time stamp, so as to display the target sign language gesture image based on the image frame time stamp.
Further, to achieve the above object, the present invention also provides an audio processing apparatus including:
The acquisition module is used for acquiring the audio to be processed and converting the audio to be processed into a target text;
The extraction module is used for extracting the characteristics of the target text so as to obtain text characteristic data corresponding to the target text;
The output module is used for acquiring a target sign language gesture image corresponding to the text characteristic data through a preset sign language gesture conversion model and displaying the target sign language gesture image.
Further, in order to achieve the above object, the present invention also provides an audio processing apparatus including a memory, a processor, and an audio processing program stored on the memory and executable on the processor, the audio processing program implementing the steps of the audio processing method as described above when executed by the processor.
Further, to achieve the above object, the present invention also provides a readable storage medium having stored thereon an audio processing program which, when executed by a processor, implements the steps of the audio processing method as described above.
Compared with the existing terminal video playing mode, the present invention acquires the audio to be processed and converts it into a target text; extracts features of the target text to obtain text feature data corresponding to the target text; and acquires, through a preset sign language gesture conversion model, a target sign language gesture image corresponding to the text feature data and displays it. The audio is thus converted into a corresponding sign language gesture image, the modes of information transmission are diversified, and the user experience is further improved.
Drawings
FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart of a first embodiment of an audio processing method according to the present invention;
FIG. 3 is a flowchart of a second embodiment of an audio processing method according to the present invention;
FIG. 4 is a flowchart of a third embodiment of an audio processing method according to the present invention;
Fig. 5 is a schematic functional block diagram of an audio processing apparatus according to an embodiment of the invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides an audio processing device. Referring to fig. 1, fig. 1 is a schematic structural diagram of the hardware operating environment of an embodiment of the audio processing device of the invention.
As shown in fig. 1, the audio processing device may include: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005, wherein the communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a Display and an input unit such as a Keyboard, and optionally may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or non-volatile memory, such as disk storage. The memory 1005 may optionally also be a storage device separate from the aforementioned processor 1001.
It will be appreciated by those skilled in the art that the hardware structure of the audio processing device shown in fig. 1 does not constitute a limitation of the audio processing device, which may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and an audio processing program may be included in the memory 1005 as one type of readable storage medium. The operating system is a program for managing and controlling hardware and software resources of the audio processing device, and supports the operation of a network communication module, a user interface module, an audio processing program and other programs or software; the network communication module is used to manage and control the network interface 1004; the user interface module is used to manage and control the user interface 1003.
In the hardware structure of the audio processing device shown in fig. 1, the network interface 1004 is mainly used for connecting to a background server, and performing data communication with the background server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; the processor 1001 may call an audio processing program stored in the memory 1005 and perform the following operations:
acquiring audio to be processed, and converting the audio to be processed into a target text;
extracting features of the target text to obtain text feature data corresponding to the target text;
and acquiring a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model, and displaying the target sign language gesture image.
Further, the processor 1001 may call an audio processing program stored in the memory 1005, and perform the following operations:
extracting the voice of the audio to be processed to obtain target voice audio in the audio to be processed;
and carrying out semantic recognition on the target voice audio to obtain a target text.
Further, the processor 1001 may call an audio processing program stored in the memory 1005, and perform the following operations:
acquiring audio characteristics of the audio to be processed;
inputting the audio features into a preset audio separation model to obtain an audio feature separation result corresponding to the audio features through the preset audio separation model, wherein the audio feature separation result comprises target human voice audio features;
and acquiring target voice audio in the audio to be processed based on the target voice audio characteristics.
Further, the processor 1001 may call an audio processing program stored in the memory 1005, and perform the following operations:
acquiring an initial model and a plurality of text training data;
determining sign language gesture prediction results corresponding to the text training data through the initial model;
acquiring a sign language gesture real result corresponding to the text training data, and determining a loss function based on the sign language gesture prediction result and the sign language gesture real result;
updating model parameters of the initial model in a gradient descent mode, and taking the model parameters corresponding to the convergence of the loss function or the model training round reaching a preset training iteration round as final model parameters;
and determining a preset sign language gesture conversion model according to the final model parameters.
Further, the processor 1001 may call an audio processing program stored in the memory 1005, and perform the following operations:
performing word segmentation on the target text to obtain several groups of text segments corresponding to the target text;
respectively performing semantic recognition on each group of text segments to obtain a semantic recognition result corresponding to each group of text segments;
traversing a preset sign language gesture text word stock based on the semantic recognition result to obtain a target sign language gesture text matched with the semantic recognition result in the preset sign language gesture text word stock;
and acquiring multi-frame sign language gesture images corresponding to the several groups of text segments based on the target sign language gesture text, and displaying the multi-frame sign language gesture images.
Further, the processor 1001 may call an audio processing program stored in the memory 1005, and perform the following operations:
determining position information, in the target text, of the group of text segments corresponding to each frame of sign language gesture image;
sorting the multi-frame sign language gesture images based on the position information to obtain a sorting result;
and converting the multi-frame sign language gesture images into a target sign language gesture image according to the sorting result, and displaying the target sign language gesture image.
Further, the processor 1001 may call an audio processing program stored in the memory 1005, and perform the following operations:
determining an audio frame timestamp of the audio to be processed;
and determining an image frame time stamp of the target sign language gesture image based on the audio frame time stamp, so as to display the target sign language gesture image based on the image frame time stamp.
The invention also provides an audio processing method.
Referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of an audio processing method according to the present invention.
The embodiments of the present invention provide an audio processing method. It should be noted that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order than the one shown or described herein. Specifically, the audio processing method of this embodiment includes:
Step S10, acquiring audio to be processed, and converting the audio to be processed into a target text;
It should be noted that the execution body of this embodiment may be an intelligent terminal device with audio processing and display functions; the intelligent terminal device may be an electronic device such as a computer or a mobile phone, or another device capable of implementing the same or similar functions, which is not limited in this embodiment. In this embodiment, a television terminal is taken as an example for illustration.
It should be understood that information itself is intangible and must be represented in some concrete form if it is to be understood and accepted by a person. For example, when some television terminals play videos such as live news programs, most of those programs are not accompanied by a synchronized sign language presenter or synchronized subtitles, so viewers with hearing impairment cannot understand the news content while watching, which impairs their viewing experience.
In this embodiment, the audio to be processed may be pre-recorded audio, such as audio data in documentaries or movie shows, or real-time audio, such as that of a live news program or real-time audio data during a video call; this embodiment is not limited in this respect.
It should be understood that when the audio to be processed is real-time audio, the original audio data in the live video may be acquired and analyzed in real time. For example, feature extraction is performed on the original audio data to separate the target voice data from the non-voice data (such as background music data and noise data). The target voice data is then monitored to determine whether an interruption exists: if, during the detection period, voice is detected from 10 s to 120 s and is next detected only at 130 s, an interruption is determined to exist, and an interruption time (for example, any second between 120 s and 130 s) is selected. The target voice data is then divided, based on the interruption time, into a first segment and a second segment, where the first segment is the human voice audio before the interruption and the second segment is the audio after it; the first segment is treated as the current audio to be processed, and the second segment is held as the next target audio to be processed.
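As an illustration of this interruption check, the following sketch (in Python; the 0.5 s window and the energy threshold are assumptions of this sketch, not values from the disclosure) locates the first silent window in the separated voice track and splits the track there:

    # A minimal sketch, assuming the voice track is a mono float array.
    import numpy as np

    def find_interruption(voice: np.ndarray, sr: int,
                          win_s: float = 0.5, threshold: float = 1e-3):
        """Return the start time (seconds) of the first silent window, or None."""
        win = int(win_s * sr)
        for start in range(0, len(voice) - win, win):
            frame = voice[start:start + win]
            if np.sqrt(np.mean(frame ** 2)) < threshold:  # RMS energy ~ silence
                return start / sr
        return None

    def split_at(voice: np.ndarray, sr: int, t: float):
        """Split the track into the segments before and after the interruption."""
        i = int(t * sr)
        return voice[:i], voice[i:]

The first returned segment would then be passed on for recognition, while the second is buffered as the next audio to be processed.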
When the audio to be processed is pre-recorded audio, optionally, the complete recording is used as the audio to be processed, or the recording may be split into several audio segments with each segment used in turn as the audio to be processed; this embodiment is not limited in this respect.
It should also be understood that, in order to improve the accuracy of the audio processing result, noise removal and similar pre-processing should be performed on the audio to be processed before it is converted into the target text, so as to eliminate the influence of noise on the processing result.
In addition, for easy understanding, the embodiment specifically describes the above implementation manner of converting the audio to be processed into the target text:
extracting the voice of the audio to be processed to obtain target voice audio in the audio to be processed;
and carrying out semantic recognition on the target voice audio to obtain target text matched with the target voice audio.
It should be understood that, in general, the audio to be processed is mixed audio, that is, it contains not only a human voice portion but also other environmental portions such as background music. Therefore, in order to improve the accuracy of the audio processing result, in this embodiment voice extraction is performed on the audio to be processed to obtain the target voice audio, from which the target text is then obtained.
Preferably, in this embodiment a pre-trained audio separation model is used to extract the voice from the audio to be processed. Specifically, feature extraction is performed on the audio to be processed to obtain feature vectors, and the feature vectors are input to the pre-trained audio separation model, which outputs voice audio feature vectors; the target voice audio is then extracted from the audio to be processed by means of these voice audio feature vectors.
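By way of illustration, the separation step might be applied as in the sketch below, where separation_model is a hypothetical pre-trained network that maps a magnitude spectrogram to a soft voice mask; the disclosure does not prescribe this particular structure:

    # A sketch of masking-based voice separation, assuming `separation_model`
    # returns a mask in [0, 1] with the same shape as the magnitude spectrogram.
    import numpy as np
    import librosa

    def extract_voice(mixture: np.ndarray, sr: int, separation_model) -> np.ndarray:
        stft = librosa.stft(mixture)                        # complex spectrogram
        mag, phase = np.abs(stft), np.angle(stft)
        voice_mask = separation_model(mag)                  # hypothetical model call
        voice_stft = voice_mask * mag * np.exp(1j * phase)  # keep mixture phase
        return librosa.istft(voice_stft, length=len(mixture))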
Preferably, in another embodiment, when the audio to be processed is pre-recorded, it can be understood that the recorded audio is synthesized audio, that is, the voice audio and the background music audio were mixed onto separate channels. The audio to be processed can therefore be decomposed, based on channel characteristics, into multiple sub-audio segments corresponding to multiple different channels; the sub-audio segment corresponding to the voice channel is selected from the different channels based on the channel identifier, and that sub-audio segment is used as the target voice audio.
Specifically, in order to improve the accuracy of the voice extraction result, this embodiment uses a model to extract the voice from the audio to be processed. For ease of understanding, this embodiment specifically describes the implementation of extracting the voice from the audio to be processed to obtain the target voice audio:
acquiring audio characteristics of the audio to be processed;
inputting the audio features into a preset audio separation model to obtain an audio feature separation result corresponding to the audio features through the preset audio separation model, wherein the audio feature separation result comprises target human voice audio features;
and acquiring target voice audio in the audio to be processed based on the target voice audio characteristics.
It should be understood that the above-mentioned preset audio separation model is a model trained in advance with a large amount of audio training data. In this embodiment, it is preferable to perform denoising, pre-emphasis, framing, windowing, fast Fourier transform and similar processing on the audio to be processed to obtain Mel-frequency cepstral coefficients (MFCCs), and to use the MFCCs as the audio features of the audio to be processed.
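For reference, such a front end can be sketched with the open-source librosa library standing in for the pipeline just described; the sampling rate, frame, hop and coefficient counts below are common defaults assumed for this sketch, not values given in the disclosure:

    # A minimal MFCC front end: pre-emphasis, then framing, windowing, FFT and
    # Mel cepstrum computation handled inside librosa.feature.mfcc.
    import numpy as np
    import librosa

    def audio_features(path: str, n_mfcc: int = 13) -> np.ndarray:
        y, sr = librosa.load(path, sr=16000)
        y = np.append(y[0], y[1:] - 0.97 * y[:-1])          # pre-emphasis filter
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=512, hop_length=160)
        return mfcc.T                                       # one row per frame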
After the target voice audio is obtained, phonetic-symbol features are extracted from it to determine the phonetic-symbol attribute of the target voice audio, such as Chinese phonetic symbols or English phonetic symbols. If the phonetic-symbol attribute of the target voice audio is detected to be Chinese, the target voice audio is divided into several groups of phonetic symbols based on a preset Chinese phonetic-symbol division rule, and at least one candidate word matching each group of audio is determined based on a preset Chinese word bank. The candidate words matching each group of audio are then combined one by one to obtain several candidate texts together with the context-fit weight corresponding to each text, and finally the text with the largest context-fit weight is selected as the target text.
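The selection of the text with the largest context-fit weight can be illustrated with the toy sketch below; bigram_score is a hypothetical stand-in for whatever language model supplies the weights, and the exhaustive enumeration is only practical for short utterances:

    # Toy sketch: pick the candidate-word combination with the highest
    # accumulated context-fit weight.
    from itertools import product

    def best_transcript(candidates, bigram_score):
        """candidates: one list of candidate words per phonetic-symbol group."""
        best, best_w = None, float("-inf")
        for words in product(*candidates):
            w = sum(bigram_score(a, b) for a, b in zip(words, words[1:]))
            if w > best_w:
                best, best_w = words, w
        return "".join(best)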
In addition, in another embodiment, when the phonetic-symbol attribute of the target voice audio is non-Chinese, such as English phonetic symbols, it should be understood that, in order to improve the accuracy of audio interpretation, the phonetic-symbol attribute of the target voice audio should first be converted into Chinese phonetic symbols; the target voice audio is then divided into several groups of phonetic symbols based on the preset Chinese phonetic-symbol division rule for word recognition, thereby improving the accuracy of the word recognition result.
Step S20, extracting characteristics of the target text to obtain text characteristic data corresponding to the target text;
and step S30, acquiring a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model, and displaying the target sign language gesture image.
Specifically, the text feature data include the keywords in the target text, the attributes of each word in the target text, such as personal pronoun, modal particle, adjective or verb, and feature data characterizing the meaning of each word in the target text.
In this embodiment, a preset sign language gesture conversion model is used to determine the target sign language gesture image corresponding to the target text. In this step, the text feature data are input into the preset sign language gesture conversion model, which outputs the corresponding target sign language gesture image.
Furthermore, it should be noted that since the target text is composed of several words or phrases, and one word or phrase is generally mapped to one sign language gesture image, in this embodiment the target sign language gesture image is composed of multiple frames of sign language gesture images. In some embodiments, after the multiple frames of sign language gesture images are obtained, displaying them out of step with the audio content leaves the sign language gestures unsynchronized or mismatched, which harms the user experience. Therefore, in order to ensure that the sign language gesture images are synchronized with the audio content in the video, this embodiment provides an implementation, specifically for displaying the target sign language gesture image:
determining display sequence information of each frame of sign language gesture image in the target sign language gesture image;
and displaying the target sign language gesture image according to the display order information.
Specifically, the display order information refers to the order, from left to right in the target text, of the word or phrase corresponding to each frame of sign language gesture image. For example, for "I am going to work today", the corresponding text is divided into "I", "today" and "go to work"; the sign language gesture image corresponding to "I" is ranked first, the image corresponding to "today" second, and so on. Based on the display order information, the multi-frame sign language gesture images are converted into a continuous dynamic sign language sequence for display output.
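A minimal sketch of this ordering step, with illustrative names and the assumption that each mapped word occurs in the target text, might be:

    # Order one gesture image per word by the word's position in the target text.
    def order_gesture_frames(text: str, word_to_frame: dict) -> list:
        words = sorted(word_to_frame, key=text.index)   # left-to-right position
        return [word_to_frame[w] for w in words]

    # order_gesture_frames("I am going to work today",
    #                      {"today": t_img, "I": i_img, "work": w_img})
    # yields [i_img, w_img, t_img], following the order of the words in the text.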
In addition, when outputting the sign language gesture images, care should be taken that the display time of the output images stays synchronized with the playing time of the audio to be processed. Preferably, when the audio to be processed is pre-recorded, the audio frame time stamp of the audio data corresponding to each frame of sign language gesture image in the audio to be processed is determined, and the image frame time stamp of each frame in the target sign language gesture image is determined according to that audio frame time stamp. Preferably, when the audio to be processed is real-time audio, an audio stream of a certain duration is buffered in advance, the buffered stream is then processed, and the sign language gesture images resulting from the audio processing are output in real time.
In addition, in an embodiment, the step of displaying the dynamic sign language gesture includes:
determining an audio frame timestamp of the audio to be processed;
and determining an image frame time stamp of the target sign language gesture image based on the audio frame time stamp, so as to display the target sign language gesture image based on the image frame time stamp.
Specifically, the audio frame time stamp refers to the playing time information of each frame of the audio stream of the audio to be processed, and the image frame time stamp refers to the display time information of each frame of sign language gesture image in the target sign language gesture image.
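A sketch of this alignment, assuming recognition supplies (word, start-time) pairs for the audio stream, might be:

    # Give each gesture frame the time stamp of the audio frame in which its
    # source word begins, so display follows playback.
    def image_timestamps(word_spans, frames):
        """word_spans: (word, start_time_s) pairs; frames: word -> image."""
        schedule = [(t, frames[w]) for w, t in word_spans if w in frames]
        return sorted(schedule, key=lambda p: p[0])   # playback order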
In addition, when the audio to be processed has a corresponding video stream, the audio to be processed and the target sign language gesture image are output and displayed synchronously on the target terminal.
It should be understood that the above-mentioned target terminal refers to the playing terminal corresponding to the video to be processed. Optionally, the target sign language gesture image is displayed within the area of the display screen that shows the video stream corresponding to the audio to be processed, for example at the lower left, lower right, upper left or upper right of that area; alternatively, the target sign language gesture image may be displayed outside that area. This embodiment is not limited in this respect.
In this embodiment, the display time of each frame of sign language gesture image in the target sign language gesture image is determined according to the playing time of each frame of the audio stream of the audio to be processed, which ensures that the audio to be processed and the target sign language gesture image are displayed synchronously; and the display position of the video stream corresponding to the audio to be processed and the display position of the target sign language gesture image on the display screen of the target terminal are coordinated, so that the display picture remains harmonious while the modes of information transmission are diversified, further improving the user experience.
Compared with the existing terminal video playing mode, in this embodiment the audio to be processed is acquired and converted into a target text; features of the target text are extracted to obtain the text feature data corresponding to the target text; and a target sign language gesture image corresponding to the text feature data is acquired through a preset sign language gesture conversion model and displayed. The audio is thus converted into a corresponding sign language gesture image, the modes of information transmission are diversified, and the user experience is further improved.
Further, based on the first embodiment of the audio processing method of the present invention, a second embodiment of the audio processing method of the present invention is presented.
Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of an audio processing method according to the present invention.
The second embodiment of the audio processing method differs from the first embodiment in that, before the step of acquiring the multi-frame sign language gesture images corresponding to the several groups of text segments through a preset sign language gesture conversion model, the method further includes:
Step S301, acquiring an initial model and a plurality of text training data;
Step S302, determining sign language gesture prediction results corresponding to the text training data through the initial model;
Step S303, acquiring a sign language gesture real result corresponding to the text training data, and determining a loss function based on the sign language gesture prediction result and the sign language gesture real result;
Step S304, updating model parameters of the initial model in a gradient descent mode, and taking the model parameters corresponding to the convergence of the loss function or the model training reaching a preset training iteration round as final model parameters;
Step S305, determining a preset sign language gesture conversion model according to the final model parameters.
Specifically, the text training data are text data annotated with the corresponding sign language gesture features. Optionally, in this embodiment an initial model is constructed based on a deep learning network or a convolutional neural network. The text features of each piece of text training data are then extracted and used as input values of the initial model, and the initial model outputs sign language gesture prediction results. Optionally, a sign language gesture prediction result may be a sign language gesture image, or the feature vector corresponding to a sign language gesture image. For example, when the prediction result is a sign language gesture prediction feature vector, the sign language gesture real feature vector of the real result corresponding to each piece of text training data is extracted, the deviation value between the predicted and real feature vectors is calculated for each piece of text training data, and the loss function of the initial model is determined from these deviation values. Finally, the model parameters of the initial model are updated by gradient descent, the model parameters at which the loss function converges or the training reaches the preset number of iteration rounds are taken as the final model parameters, and the preset sign language gesture conversion model is determined from the final model parameters.
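A compressed training sketch under stated assumptions follows: the initial model is a small feed-forward network, the deviation value is the mean-squared error between predicted and real gesture feature vectors, and convergence is approximated by a loss floor; none of these specifics are prescribed by the disclosure:

    # Minimal full-batch gradient-descent training loop for the conversion model.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)   # plain gradient descent
    loss_fn = nn.MSELoss()                               # deviation value

    def train(text_feats, gesture_feats, max_epochs=100, tol=1e-4):
        for epoch in range(max_epochs):                  # preset iteration rounds
            opt.zero_grad()
            loss = loss_fn(model(text_feats), gesture_feats)
            loss.backward()
            opt.step()
            if loss.item() < tol:                        # treated as convergence
                break
        return model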
In this embodiment, the preset sign language gesture conversion model is constructed in advance, and the target sign language gesture image corresponding to the text feature data is then obtained through that model, which increases the speed of converting text into sign language gesture images and further improves the user experience.
Further, based on the first embodiment of the audio processing method of the present invention, a third embodiment of the audio processing method of the present invention is proposed.
Referring to fig. 4, fig. 4 is a flowchart illustrating a third embodiment of an audio processing method according to the present invention.
The third embodiment of the audio processing method differs from the first embodiment in that, after the step of converting the audio to be processed into the target text, the method further includes:
Step S40, performing word segmentation on the target text to obtain several groups of text segments corresponding to the target text;
Step S50, respectively performing semantic recognition on each group of text segments to obtain a semantic recognition result corresponding to each group of text segments;
Step S60, traversing a preset sign language gesture text word stock based on the semantic recognition result to obtain a target sign language gesture text matched with the semantic recognition result in the preset sign language gesture text word stock;
Step S70, acquiring multi-frame sign language gesture images corresponding to the several groups of text segments based on the target sign language gesture text, and displaying the multi-frame sign language gesture images.
It should be understood that some specific words may be represented by specific sign language gestures, so before the sign language gesture conversion is performed on the target text, the target text may be split into several groups of text segments. For example, the stop words (such as "get" or "place") or connective words ("although", "however", and so on) in the target text may be located first. Alternatively, the target text may be split based on a preset dictionary: the target text is first split randomly into groups of text segments, each group is then checked against the preset dictionary, and if at least one group does not exist in the preset dictionary, the target text is split again starting from that group, until every group of text segments exists in the preset dictionary.
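The dictionary-driven splitting can be illustrated with a forward-maximum-matching sketch; the greedy longest-match strategy and the tiny dictionary are illustrative assumptions rather than the specific procedure of the disclosure:

    # Take the longest dictionary entry at each position; fall back to a single
    # character so that every resulting segment exists in the dictionary.
    def fmm_segment(text: str, dictionary: set, max_len: int = 4) -> list:
        words, i = [], 0
        while i < len(text):
            for l in range(min(max_len, len(text) - i), 0, -1):
                if text[i:i + l] in dictionary or l == 1:
                    words.append(text[i:i + l])
                    i += l
                    break
        return words

    print(fmm_segment("我今天上班", {"我", "今天", "上班"}))  # ['我', '今天', '上班']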
In addition, it should be understood that, in order to avoid inaccurate sign language gesture images caused by a poorly performing preset sign language gesture conversion model, this implementation uses a preset sign language gesture text word stock to acquire the sign language gesture images matched with each group of text segments.
Specifically, the semantic recognition result refers to word parameters such as part of speech and word meaning. After the semantic recognition result corresponding to each group of text segments is obtained, the preset sign language gesture text word stock is traversed to obtain, from the word stock, the target sign language gesture texts whose word parameters, such as part of speech and meaning, are identical or highly similar to those of the text segments, for example near-synonym pairs such as "good" and "fine" or "walk" and "go"; the sign language gesture image corresponding to the target sign language gesture text matched with each group of text segments in the preset sign language gesture text word stock is then determined.
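A sketch of this lexicon lookup follows, with the standard-library difflib ratio standing in for whatever word-sense similarity measure the embodiment intends:

    # Traverse the gesture-text word stock and keep the entry most similar to
    # the recognized word.
    from difflib import SequenceMatcher

    def match_gesture_text(word: str, lexicon: list) -> str:
        return max(lexicon,
                   key=lambda entry: SequenceMatcher(None, word, entry).ratio())

    # match_gesture_text("walking", ["walk", "run", "stand"]) -> "walk"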
Furthermore, in an embodiment, in order to ensure that the sign language gesture images are synchronized with the audio content, an implementation is provided, specifically for displaying the multi-frame sign language gesture images:
determining position information, in the target text, of the group of text segments corresponding to each frame of sign language gesture image;
sorting the multi-frame sign language gesture images based on the position information to obtain a sorting result;
and converting the multi-frame sign language gesture images into a target sign language gesture image according to the sorting result, and displaying the target sign language gesture image.
Specifically, the above position information refers to the left-to-right order of the characters of each group of text segments in the target text. For example, for "I am going to work today", the corresponding text segments are "I", "today" and "go to work"; the sign language gesture image corresponding to "I" is ranked first, the image corresponding to "today" second, and so on, and based on the sorting result the multi-frame sign language gesture images are converted into a continuous dynamic sign language sequence.
In addition, when outputting the sign language gesture images, care should be taken that the display time of the output images stays synchronized with the audio playing time of the video to be processed. Preferably, when the video to be processed is a pre-recorded video, the audio frame time stamp of the audio data corresponding to each frame of sign language gesture image in the video to be processed is determined, and the output time stamp of each frame of sign language gesture image is determined according to that audio frame time stamp. Preferably, when the audio to be processed is real-time audio, a video stream of a certain duration is intercepted in advance, audio processing is then performed on the video stream, and the sign language gesture images resulting from the audio processing are output in real time.
In addition, in an embodiment, the step of displaying the dynamic sign language gesture includes:
determining an audio frame timestamp of the audio to be processed;
and determining an image frame time stamp of the target sign language gesture image based on the audio frame time stamp, so as to display the target sign language gesture image based on the image frame time stamp.
Specifically, the audio frame time stamp refers to the playing time information of each frame of the audio stream of the audio to be processed, and the image frame time stamp refers to the display time information of each frame of sign language gesture image in the target sign language gesture image.
In addition, when the audio to be processed has a corresponding video stream, the audio to be processed and the target sign language gesture image are output and displayed synchronously on the target terminal.
It should be understood that the above-mentioned target terminal refers to the playing terminal corresponding to the video to be processed. Optionally, the target sign language gesture image is displayed within the area of the display screen that shows the video stream corresponding to the audio to be processed, for example at the lower left, lower right, upper left or upper right of that area; alternatively, the target sign language gesture image may be displayed outside that area. This embodiment is not limited in this respect.
In this embodiment, the display time of each frame of sign language gesture image in the target sign language gesture image is determined according to the playing time of each frame of the audio stream of the audio to be processed, which ensures that the audio to be processed and the target sign language gesture image are displayed synchronously; and the display position of the video stream corresponding to the audio to be processed and the display position of the target sign language gesture image on the display screen of the target terminal are coordinated, so that the display picture remains harmonious while the modes of information transmission are diversified, further improving the user experience.
In this embodiment, semantic recognition is performed on each group of text segments to obtain the semantic recognition result corresponding to each group; a preset sign language gesture text word stock is traversed based on the semantic recognition results to obtain the target sign language gesture texts matched with the semantic recognition results in the word stock; and the multi-frame sign language gesture images corresponding to the several groups of text segments are acquired based on the target sign language gesture texts and displayed. This avoids the technical problem that the acquired sign language gesture images are inaccurate because the preset sign language gesture conversion model performs poorly.
The invention also provides an audio processing device. Referring to fig. 5, the audio processing apparatus includes:
The acquisition module 10 is used for acquiring audio to be processed and converting the audio to be processed into a target text;
the extracting module 20 is configured to perform feature extraction on the target text to obtain text feature data corresponding to the target text;
and the output module 30 is configured to obtain a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model, and display the target sign language gesture image.
In addition, the embodiment of the invention also provides a readable storage medium.
The readable storage medium has stored thereon an audio processing program which, when executed by a processor, implements the steps of the audio processing method as described above.
The readable storage medium of the present invention may be a computer-readable storage medium. The specific implementation of the readable storage medium is substantially the same as that of the embodiments of the audio processing method described above and will not be repeated here.
The embodiments of the present invention have been described above with reference to the drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Those of ordinary skill in the art, under the teaching of the present invention, may make many modifications without departing from the spirit of the present invention and the scope of the appended claims, and all such modifications fall within the protection of the present invention. Likewise, any equivalent structure or equivalent process transformation made using the description and drawings of the present invention, and any direct or indirect application in other related technical fields, is included within the patent protection scope of the present invention.

Claims (9)

1. An audio processing method, characterized in that the audio processing method comprises the steps of:
acquiring audio to be processed, and converting the audio to be processed into a target text;
extracting features of the target text to obtain text feature data corresponding to the target text;
acquiring a target sign language gesture image corresponding to the text feature data through a preset sign language gesture conversion model, and displaying the target sign language gesture image;
before the step of obtaining the target sign language gesture image corresponding to the text feature data through the preset sign language gesture conversion model, the method further comprises the following steps:
acquiring an initial model and a plurality of text training data;
determining sign language gesture prediction results corresponding to the text training data through the initial model, wherein the sign language gesture prediction results are sign language gesture prediction feature vectors;
acquiring a sign language gesture real characteristic vector of a sign language gesture real result corresponding to the text training data;
determining a deviation value between the sign language gesture prediction feature vector corresponding to each text training data and the sign language gesture real feature vector, and determining a loss function based on the deviation value;
updating model parameters of the initial model in a gradient descent mode, and taking the model parameters corresponding to the convergence of the loss function or the model training round reaching a preset training iteration round as final model parameters;
and determining a preset sign language gesture conversion model according to the final model parameters.
2. The audio processing method of claim 1, wherein the step of converting the audio to be processed into target text comprises:
extracting the voice of the audio to be processed to obtain target voice audio in the audio to be processed;
and carrying out semantic recognition on the target voice audio to obtain a target text.
3. The audio processing method according to claim 2, wherein the step of performing human voice extraction on the audio to be processed to obtain target human voice audio in the audio to be processed comprises:
acquiring audio characteristics of the audio to be processed;
inputting the audio features into a preset audio separation model to obtain an audio feature separation result corresponding to the audio features through the preset audio separation model, wherein the audio feature separation result comprises target human voice audio features;
and acquiring target voice audio in the audio to be processed based on the target voice audio characteristics.
4. The audio processing method according to claim 1, further comprising, after the step of converting the audio to be processed into a target text:
performing word segmentation processing on the target text to obtain several groups of text segments corresponding to the target text;
respectively performing semantic recognition on each group of text segments to obtain a semantic recognition result corresponding to each group of text segments;
traversing a preset sign language gesture text word stock based on the semantic recognition result to obtain a target sign language gesture text matched with the semantic recognition result in the preset sign language gesture text word stock;
and acquiring multi-frame sign language gesture images corresponding to the several groups of text segments based on the target sign language gesture text, and displaying the multi-frame sign language gesture images.
5. The audio processing method of claim 4, wherein the step of displaying the multi-frame sign language gesture image comprises:
determining position information, in the target text, of the group of text segments corresponding to each frame of sign language gesture image;
sorting the multi-frame sign language gesture images based on the position information to obtain a sorting result;
and converting the multi-frame sign language gesture images into a target sign language gesture image according to the sorting result, and displaying the target sign language gesture image.
6. The audio processing method according to any one of claims 1 to 5, wherein the step of displaying the target sign language gesture image includes:
determining an audio frame timestamp of the audio to be processed;
and determining an image frame time stamp of the target sign language gesture image based on the audio frame time stamp, so as to display the target sign language gesture image based on the image frame time stamp.
7. An audio processing apparatus, characterized in that the audio processing apparatus comprises:
The acquisition module is used for acquiring the audio to be processed and converting the audio to be processed into a target text;
The extraction module is used for extracting the characteristics of the target text so as to obtain text characteristic data corresponding to the target text;
The output module is used for acquiring a target sign language gesture image corresponding to the text characteristic data through a preset sign language gesture conversion model and displaying the target sign language gesture image;
The output module is also used for acquiring an initial model and a plurality of text training data; determining sign language gesture prediction results corresponding to the text training data through the initial model, wherein the sign language gesture prediction results are sign language gesture prediction feature vectors; acquiring a sign language gesture real characteristic vector of a sign language gesture real result corresponding to the text training data;
The output module is further used for determining deviation values between the sign language gesture prediction feature vectors and the sign language gesture real feature vectors corresponding to the text training data, and determining a loss function based on the deviation values; updating model parameters of the initial model in a gradient descent mode, and taking the model parameters corresponding to the convergence of the loss function or the model training round reaching a preset training iteration round as final model parameters; and determining a preset sign language gesture conversion model according to the final model parameters.
8. An audio processing device comprising a memory, a processor and an audio processing program stored on the memory and executable on the processor, which audio processing program when executed by the processor implements the steps of the audio processing method according to any of claims 1-6.
9. A readable storage medium, characterized in that the readable storage medium has stored thereon an audio processing program, which when executed by a processor, implements the steps of the audio processing method according to any of claims 1-6.
CN202110139349.3A 2021-02-01 2021-02-01 Audio processing method, device, equipment and readable storage medium Active CN113035199B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110139349.3A CN113035199B (en) 2021-02-01 2021-02-01 Audio processing method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110139349.3A CN113035199B (en) 2021-02-01 2021-02-01 Audio processing method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113035199A CN113035199A (en) 2021-06-25
CN113035199B 2024-05-07

Family

ID=76459672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110139349.3A Active CN113035199B (en) 2021-02-01 2021-02-01 Audio processing method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113035199B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113709548B (en) * 2021-08-09 2023-08-25 北京达佳互联信息技术有限公司 Image-based multimedia data synthesis method, device, equipment and storage medium
CN113722513B (en) * 2021-09-06 2022-12-20 抖音视界有限公司 Multimedia data processing method and equipment
CN114157920B (en) * 2021-12-10 2023-07-25 深圳Tcl新技术有限公司 Method and device for playing sign language, intelligent television and storage medium
CN113923521B (en) * 2021-12-14 2022-03-08 深圳市大头兄弟科技有限公司 Video scripting method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108766459A (en) * 2018-06-13 2018-11-06 北京联合大学 Target speaker method of estimation and system in a kind of mixing of multi-person speech
CN109740447A (en) * 2018-12-14 2019-05-10 深圳壹账通智能科技有限公司 Communication means, equipment and readable storage medium storing program for executing based on artificial intelligence
CN110730360A (en) * 2019-10-25 2020-01-24 北京达佳互联信息技术有限公司 Video uploading and playing methods and devices, client equipment and storage medium
CN110931042A (en) * 2019-11-14 2020-03-27 北京欧珀通信有限公司 Simultaneous interpretation method and device, electronic equipment and storage medium
CN111640450A (en) * 2020-05-13 2020-09-08 广州国音智能科技有限公司 Multi-person audio processing method, device, equipment and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020140718A1 (en) * 2001-03-29 2002-10-03 Philips Electronics North America Corporation Method of providing sign language animation to a monitor and process therefor

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108766459A (en) * 2018-06-13 2018-11-06 北京联合大学 Target speaker method of estimation and system in a kind of mixing of multi-person speech
CN109740447A (en) * 2018-12-14 2019-05-10 深圳壹账通智能科技有限公司 Communication means, equipment and readable storage medium storing program for executing based on artificial intelligence
CN110730360A (en) * 2019-10-25 2020-01-24 北京达佳互联信息技术有限公司 Video uploading and playing methods and devices, client equipment and storage medium
CN110931042A (en) * 2019-11-14 2020-03-27 北京欧珀通信有限公司 Simultaneous interpretation method and device, electronic equipment and storage medium
CN111640450A (en) * 2020-05-13 2020-09-08 广州国音智能科技有限公司 Multi-person audio processing method, device, equipment and readable storage medium

Also Published As

Publication number Publication date
CN113035199A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
CN113035199B (en) Audio processing method, device, equipment and readable storage medium
KR102401942B1 (en) Method and apparatus for evaluating translation quality
US11917344B2 (en) Interactive information processing method, device and medium
CN112115706B (en) Text processing method and device, electronic equipment and medium
CN110517689B (en) Voice data processing method, device and storage medium
CN109218629B (en) Video generation method, storage medium and device
CN112040263A (en) Video processing method, video playing method, video processing device, video playing device, storage medium and equipment
CN109256133A (en) A kind of voice interactive method, device, equipment and storage medium
CN111883107B (en) Speech synthesis and feature extraction model training method, device, medium and equipment
CN110781328A (en) Video generation method, system, device and storage medium based on voice recognition
CN112399269B (en) Video segmentation method, device, equipment and storage medium
CN113392273A (en) Video playing method and device, computer equipment and storage medium
JP2012181358A (en) Text display time determination device, text display system, method, and program
US20240064383A1 (en) Method and Apparatus for Generating Video Corpus, and Related Device
CN110784662A (en) Method, system, device and storage medium for replacing video background
CN110517668A (en) A kind of Chinese and English mixing voice identifying system and method
CN113411674A (en) Video playing control method and device, electronic equipment and storage medium
CN109858005B (en) Method, device, equipment and storage medium for updating document based on voice recognition
CN109376145B (en) Method and device for establishing movie and television dialogue database and storage medium
KR102345625B1 (en) Caption generation method and apparatus for performing the same
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
CN116962787A (en) Interaction method, device, equipment and storage medium based on video information
CN116089601A (en) Dialogue abstract generation method, device, equipment and medium
CN115171645A (en) Dubbing method and device, electronic equipment and storage medium
CN115967833A Video generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant