CN115938383A - Audio and video processing method and device, server and storage medium - Google Patents

Audio and video processing method and device, server and storage medium

Info

Publication number
CN115938383A
Authority
CN
China
Prior art keywords
audio
display area
sound
video
text information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211320945.2A
Other languages
Chinese (zh)
Inventor
邓鹏 (Deng Peng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Jiuzhou Electric Appliance Co Ltd
Original Assignee
Shenzhen Jiuzhou Electric Appliance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Jiuzhou Electric Appliance Co Ltd
Priority to CN202211320945.2A
Publication of CN115938383A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application discloses an audio/video processing method and apparatus, a server, and a storage medium, belonging to the technical field of audio/video processing. The method includes: receiving an audio/video file sent by a terminal device; acquiring non-human voice data from the audio/video file, where the non-human voice data is the background sound of the audio/video; determining the text information, sound occurrence time, and sound playing duration corresponding to the non-human voice data; determining a display area corresponding to the text information; and sending the display area, the text information, the sound occurrence time, and the sound playing duration to the terminal device, so that the terminal device displays the text information according to the display area, the sound occurrence time, and the sound playing duration when playing the audio/video file. By converting the non-human voice data in the audio/video file into text information displayed within the video frame, the application helps hearing-impaired viewers better understand the details of the audio/video.

Description

Audio and video processing method and device, server and storage medium
Technical Field
The present application relates to the field of audio and video processing technologies, and in particular, to an audio and video processing method and apparatus, a server, and a storage medium.
Background
In the related art, in order to enable hearing-impaired viewers to better watch and understand an audio/video product, the main approach is to add captions for the hearing impaired, such as closed captions (CC), to the product.
However, with the popularity of live streaming, hearing-impaired viewers can only see the main captions of a live stream, such as interactive messages sent by other viewers in the comment area or text converted from the streamer's speech; they cannot perceive other sound-related information, such as rain, thunder, or background music. As a result, hearing-impaired viewers cannot fully watch and understand the details of live video.
Summary of the application
The main purpose of the application is to provide an audio/video processing method and apparatus, a server, and a storage medium, so as to solve the technical problem that hearing-impaired viewers can only see the main subtitles of a live stream.
To achieve the above object, in a first aspect, the present application provides an audio/video processing method for a server, including:
receiving an audio/video file sent by a terminal device;
acquiring non-human voice data from the audio/video file, where the non-human voice data is the background sound of the audio/video;
determining the text information, sound occurrence time, and sound playing duration corresponding to the non-human voice data;
determining a display area corresponding to the text information;
and sending the display area, the text information, the sound occurrence time, and the sound playing duration to the terminal device, so that the terminal device displays the text information according to the display area, the sound occurrence time, and the sound playing duration when playing the audio/video file.
Optionally, determining the display area corresponding to the text information includes:
determining the sound channel from which the non-human voice data is output;
and determining the display area corresponding to the text information according to the sound channel.
Optionally, after receiving the audio/video file sent by the terminal device, the method further includes:
acquiring human voice data from the audio/video file;
determining the human voice text information, voice occurrence time, and voice playing duration corresponding to the human voice data;
determining a human voice display area corresponding to the human voice text information;
and sending the human voice text information, the voice occurrence time, the voice playing duration, and the human voice display area to the terminal device, so that the terminal device displays the human voice text information according to the voice occurrence time, the voice playing duration, and the human voice display area when playing the audio/video file.
Optionally, determining the human voice display area corresponding to the human voice text information includes:
acquiring subtitle information from the audio/video file;
determining a subtitle display area corresponding to the subtitle information;
determining a subtitle-free display area that does not conflict with the subtitle display area;
and determining the subtitle-free display area as the human voice display area corresponding to the human voice text information.
In a second aspect, the present application provides an audio/video processing method for a terminal device, including:
sending an audio/video file to a server, so that the server acquires non-human voice data from the audio/video file and determines the text information corresponding to the non-human voice data, the sound occurrence time, the sound playing duration, and a display area corresponding to the text information;
receiving the display area, the text information, the sound occurrence time, and the sound playing duration sent by the server;
and displaying the text information in the audio/video according to the display area, the text information, the sound occurrence time, and the sound playing duration.
Optionally, before sending the audio/video file to the server, the method further includes:
slicing the video to be played into segments of a preset duration to obtain the audio/video file.
In a third aspect, the present application provides an audio/video processing apparatus configured in a server, the apparatus including:
a first receiving module, configured to receive the audio/video file sent by the terminal device;
an acquisition module, configured to acquire non-human voice data from the audio/video file, where the non-human voice data is the background sound of the audio/video;
a first determining module, configured to determine the text information, sound occurrence time, and sound playing duration corresponding to the non-human voice data;
a second determining module, configured to determine a display area corresponding to the text information;
and a first sending module, configured to send the display area, the text information, the sound occurrence time, and the sound playing duration to the terminal device, so that the terminal device displays the text information according to the display area, the sound occurrence time, and the sound playing duration when playing the audio/video file.
In a fourth aspect, the present application provides an audio/video processing apparatus configured in a terminal device, the apparatus including:
a second sending module, configured to send the audio/video file to a server, so that the server acquires non-human voice data from the audio/video file and determines the text information corresponding to the non-human voice data, the sound occurrence time, the sound playing duration, and a display area corresponding to the text information;
a second receiving module, configured to receive the display area, the text information, the sound occurrence time, and the sound playing duration sent by the server;
and a display module, configured to display the text information in the audio/video according to the display area, the text information, the sound occurrence time, and the sound playing duration.
In a fifth aspect, the present application further provides a server, including a memory, a processor, and an audio/video processing program stored in the memory and executable on the processor, where the audio/video processing program is configured to implement the steps of the audio/video processing method described above.
In a sixth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the audio/video processing method of any of the embodiments of the present application.
An audio/video processing method provided by the embodiments of the application includes: receiving an audio/video file sent by a terminal device; acquiring non-human voice data from the audio/video file, where the non-human voice data is the background sound of the audio/video; determining the text information, sound occurrence time, and sound playing duration corresponding to the non-human voice data; determining a display area corresponding to the text information; and sending the display area, the text information, the sound occurrence time, and the sound playing duration to the terminal device, so that the terminal device displays the text information accordingly when playing the audio/video file. In this way, the application extracts the non-human voice data in the video, determines the corresponding text information, and displays that text in the corresponding display area according to the occurrence time and playing duration of the non-human sound while the video plays. Hearing-impaired viewers can thus learn sound-related information beyond the video's main subtitles, which helps them better watch and understand live video.
Drawings
Fig. 1 is a schematic diagram of an architecture of an audio/video processing method according to the present application;
fig. 2 is a schematic diagram of a hardware structure of an embodiment of an audio/video processing method according to the present application;
fig. 3 is a schematic flowchart of a first embodiment of an audio/video processing method according to the present application;
fig. 4 is a schematic flowchart of a second embodiment of the audio/video processing method according to the present application;
FIG. 5 is a schematic diagram illustrating a display of text information corresponding to left channel non-human voice data according to the present application;
FIG. 6 is a schematic diagram illustrating the display of text information corresponding to right channel non-human voice data according to the present application;
FIG. 7 is a schematic diagram illustrating display of text information corresponding to left channel and right channel non-human voice data according to the present application;
fig. 8 is a schematic diagram of a third embodiment of the audio/video processing method according to the present application;
fig. 9 is a schematic diagram of a fourth embodiment of the audio/video processing method according to the present application;
FIG. 10 is a schematic diagram illustrating the display of human voice text information corresponding to human voice data according to the present application;
fig. 11 is a schematic diagram of a first structural framework of the audio/video processing device according to the present application;
fig. 12 is a schematic diagram of a second structural framework of the audio/video processing device according to the present application.
The implementation, functional features and advantages of the object of the present application will be further explained with reference to the embodiments, and with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In the prior art, when hearing-impaired viewers watch a live stream, they can only see the streamer's main subtitles; they cannot see other sound-related information in the live video, such as rain, thunder, and background music. As a result, hearing-impaired viewers cannot fully watch and understand live video.
The application provides a solution: extract the non-human voice data in the video and determine the corresponding text information, while also determining the sound occurrence time, the sound playing duration, and the display area corresponding to the text information. The server sends the text information, sound occurrence time, sound playing duration, and display area to the terminal device, so that the terminal device displays the text information according to the display area, sound occurrence time, and sound playing duration when playing the audio/video. When a non-human sound occurs on the terminal, the corresponding text information is displayed in the display area, which helps hearing-impaired viewers better follow the details of the live stream and improves their viewing experience.
The following describes an audio/video processing system used to implement the technology of the present application:
referring to fig. 1, fig. 1 is an architecture schematic diagram of an audio and video processing system according to an exemplary embodiment. As shown in fig. 1, the audio/video processing system may include a server 11, a network 12, and a terminal device 13.
The server 11 may be a physical server comprising a separate host, or a virtual server carried by a cluster of hosts. During operation, the server 11 may run the server-side program of an application to implement its related service functions; for example, when the terminal device 13 uploads an audio/video file, the server 11 acts as the server side that receives and processes that file.
The network 12 may include various types of wired or wireless networks. In one embodiment, the network 12 may include the Public Switched Telephone Network (PSTN) and the Internet. The terminal device 13 may interact with the server 11 via the network 12.
The terminal device 13 may be an electronic device such as a doctor workstation, a smartphone, a tablet device, a laptop computer, or a palmtop computer (PDA), which is not limited by the embodiments of the present disclosure. In operation, the terminal device 13 may run a client-side program to implement the related service functions of the application. In other embodiments, the terminal device 13 may run applications with built-in display, modification, and similar functions; for example, when playing audio/video, the terminal device 13 may select which audio/video to play.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an audio and video processing device in a hardware operating environment according to an embodiment of the present application.
As shown in fig. 2, the audio/video processing terminal may include: a processor 1001, such as a Central Processing Unit (CPU); a communication bus 1002; a user interface 1003; a network interface 1004; and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in fig. 2 does not constitute a limitation of the audiovisual processing terminal, and may include more or fewer components than those shown, or some components in combination, or a different arrangement of components.
As shown in fig. 2, the memory 1005, which is a storage medium, may include therein an operating system, a data storage module, a network communication module, a user interface module, and an audio-video processing program.
In the audio/video processing terminal shown in fig. 2, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The audio/video processing terminal invokes the audio/video processing program stored in the memory 1005 through the processor 1001 and executes the audio/video processing method provided in the embodiments of the present application.
Based on, but not limited to, the above hardware structure, the present application provides a first embodiment of an audio/video processing method. Referring to fig. 3, fig. 3 shows a schematic flowchart of the first embodiment of the audio/video processing method of the present application.
It should be noted that, although a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in an order different than that shown or described herein.
In this embodiment, the audio/video processing method includes:
s10, receiving an audio and video file sent by terminal equipment;
the execution main body of the audio and video processing method is a server, the server is one of computers, and can provide calculation or application services for other clients in a network, for example, processing of extracting non-human voice data is performed on a video file sent by a terminal device. The server may be an intelligent AI server, which is not limited in this embodiment.
In this embodiment of the application, the terminal device may be a user device with display and interaction functions, such as a smartphone, a tablet device, or a notebook computer, which is not limited in this application. An audio/video file contains both audio and video content; it may be a recorded television clip, a recorded live-stream clip, or the like.
For example, when a user watches TV series A on a computer, the user may cut several clips from the series and splice them into a new audio/video segment, which can then be uploaded to the server as an audio/video file for processing.
S20, acquiring non-human voice data from the audio/video file, where the non-human voice data is the background sound of the audio/video;
In the embodiment of the present application, the non-human voice data may be any sound other than the primary sounds, such as the main character's voice or narration, in the audio file, including but not limited to the original background sound of the scene or background sound added by the producer. For example, when the audio/video file is a beachcombing video recorded by a user, the obtained non-human voice data may be the sound of waves breaking on the beach.
S30, determining the text information, sound occurrence time, and sound playing duration corresponding to the non-human voice data;
In the embodiment of the present application, the text information may be type information of the non-human voice data, such as "woof woof" when the non-human voice data is an animal call, for example a puppy barking, or "rumbling" when the non-human voice data is a sound of nature such as thunder. Alternatively, the text information may be descriptive information about the non-human voice data; for example, when rain occurs in the non-human voice data, the text information may briefly describe the rain, such as "the weather changes from moderate rain to heavy rain". The text information may also describe the origin of the non-human sound in the audio/video file; for example, when the non-human voice data is a driver honking to warn at a curve, the text information may read "after the driver presses the horn button on the steering wheel, the car emits a beeping sound".
The sound occurrence time may be time stamp information of the first occurrence of the non-human sound data in the audio/video file.
The sound playing duration may be the length of time the non-human sound lasts from the moment it appears in the audio/video file.
For example, in an audio/video file, a car horn sounds at 3 minutes 10 seconds and ends at 5 minutes 10 seconds. The text information corresponding to the non-human voice data is then the car horn, the sound occurrence time is 3 minutes 10 seconds, and the sound playing duration is 2 minutes, or 120 seconds. The unit for the sound playing duration may be seconds, which is not limited in this embodiment.
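For illustration only, the triple determined in this step can be modeled as a small record; the class name, field names, and label-to-text table below are hypothetical, not taken from the patent:

```python
from dataclasses import dataclass

# Hypothetical mapping from recognized sound classes to display text;
# the patent leaves the concrete labels to the audio AI model.
SOUND_TEXT = {
    "dog_bark": "woof woof",
    "thunder": "rumbling",
    "car_horn": "car horn",
}

@dataclass
class NonHumanSoundEvent:
    text: str            # text information shown to the viewer
    occurrence_s: float  # sound occurrence time, in seconds from file start
    duration_s: float    # sound playing duration, in seconds

# The car-horn example above: appears at 3 min 10 s (190 s) and lasts 120 s.
event = NonHumanSoundEvent(SOUND_TEXT["car_horn"], 190.0, 120.0)
```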
S40, determining a display area corresponding to the text information;
in the embodiment of the present application, the display area may be an area of the video frame corresponding to the non-human voice data that neither obscures the main picture nor interferes with the display of related subtitle information.
For example, in a live clothing-shopping stream, the lower left of the display screen may be the subtitle area where viewers post comments, and the upper left may show information about the live host, such as height and weight. To avoid overlapping the text information with this subtitle information, the display area for the text information is chosen elsewhere in the picture, for example the central area on the left of the display screen.
S50, sending the display area, the text information, the sound occurrence time, and the sound playing duration to the terminal device, so that the terminal device displays the text information according to the display area, the sound occurrence time, and the sound playing duration when playing the audio/video file.
Illustratively, after receiving the video file sent by the terminal device and the language the user wants displayed, the server stores the uploaded audio/video file and extracts its audio track together with the corresponding playing times. The audio AI then identifies the specific components of the background sound in the audio file, such as rain, thunder, car horns, wind, or collisions; that is, the non-human voice data is obtained from the audio/video file. The recognized non-human voice data is translated into the corresponding text information according to its type or other attributes; if a car horn is recognized, the corresponding text information is "car horn". The occurrence time and playing duration of the non-human sound are determined at the same time. A display area that does not conflict with the original subtitle information is then chosen in the display picture of each video frame as the display area for the text information. Finally, the server sends the determined text information, sound occurrence time, sound playing duration, and display area to the terminal device, and the terminal device displays the text information accordingly when playing the audio/video file.
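A compact sketch of this server-side flow is given below. It is an assumption-laden outline, not the patent's implementation: extract_audio and classify_background_sounds are stub placeholders for the demuxing and audio-AI recognition steps, and LABEL_TEXT stands in for the translation table.

```python
def extract_audio(av_file: bytes) -> bytes:
    # Placeholder: a real system would demux the audio track here.
    return av_file

def classify_background_sounds(audio: bytes) -> list[dict]:
    # Placeholder for the audio-AI step; returns one fake "car horn" event.
    return [{"label": "car_horn", "start_s": 190.0, "duration_s": 120.0}]

LABEL_TEXT = {"car_horn": "car horn", "rain": "sound of rain", "thunder": "rumbling"}

def process_av_file(av_file: bytes) -> list[dict]:
    """Illustrative server-side flow for one uploaded audio/video file."""
    events = classify_background_sounds(extract_audio(av_file))
    messages = []
    for ev in events:
        messages.append({
            "display_area": "left-center",             # chosen to avoid subtitles
            "text": LABEL_TEXT.get(ev["label"], ev["label"]),
            "occurrence_s": ev["start_s"],             # sound occurrence time
            "duration_s": ev["duration_s"],            # sound playing duration
        })
    return messages  # sent to the terminal device for display
```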
In this embodiment, the text information corresponding to the non-human voice data is displayed in the corresponding display area according to its occurrence time and playing duration, so that hearing-impaired viewers can see sound-related text beyond the main subtitles in the video frame, which helps them better watch and understand live video.
Further, as an embodiment, referring to fig. 4, the present application provides a second embodiment of an audio and video processing method, and referring to fig. 4, fig. 4 shows a flowchart of the second embodiment of the audio and video processing method.
In this embodiment, step S40 includes:
Step S401, determining the sound channel from which the non-human voice data is output;
Step S402, determining the display area corresponding to the text information according to the sound channel.
In the embodiment of the present application, the sound channels may be mutually independent audio signals collected or played back at different spatial positions when sound is recorded or played, such as the left channel and the right channel. The left channel typically carries the compressed low-frequency portion of the signal and is usually positioned toward the left of the sound source; the right channel typically carries the compressed mid- and high-frequency portion and is usually positioned toward the right of the sound source.
For example, after the non-human voice data is obtained through audio AI recognition, if it is output only from the left channel, the display area for the text information is a blank area in the left part of the video frame, as shown in fig. 5. If the non-human sound is output only from the right channel, the display area is a blank area in the right part of the video frame, as shown in fig. 6. If the non-human voice data is output from the left and right channels simultaneously, the text information may be displayed in the blank area on either the left or the right side of the video frame, or, as shown in fig. 7, in the blank areas on both sides at once.
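A toy version of this layout rule, matching FIGs. 5 to 7; the region names are assumptions, since the patent only requires a blank area on the matching side:

```python
from enum import Enum

class Channel(Enum):
    LEFT = "left"
    RIGHT = "right"
    BOTH = "both"

def display_areas_for(channel: Channel) -> list[str]:
    """Map the output channel of a non-human sound to blank frame regions."""
    if channel is Channel.LEFT:
        return ["left"]            # FIG. 5: left blank area only
    if channel is Channel.RIGHT:
        return ["right"]           # FIG. 6: right blank area only
    return ["left", "right"]       # FIG. 7: both sides at once
```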
In this embodiment, the display area of the text information in the video frame is determined by the output channel of the non-human voice data, so the position where text is displayed stays largely fixed. When watching the audio/video, a hearing-impaired viewer can thus learn the detailed content of the audio/video file by checking the text displayed in a fixed area, which makes it more convenient to read the text information corresponding to the non-human voice data.
Further, as an embodiment, referring to fig. 8, the present application provides a third embodiment of an audio and video processing method, and referring to fig. 8, fig. 8 shows a flowchart of the third embodiment of the audio and video processing method.
In this embodiment, after step S10, the method further includes:
Step S101, acquiring human voice data from the audio/video file;
Step S102, determining the human voice text information, voice occurrence time, and voice playing duration corresponding to the human voice data;
Step S103, determining a human voice display area corresponding to the human voice text information;
Step S104, sending the human voice text information, the voice occurrence time, the voice playing duration, and the human voice display area to the terminal device, so that the terminal device displays the human voice text information according to the voice occurrence time, the voice playing duration, and the human voice display area when playing the audio/video file.
In this embodiment of the present application, the human voice data may be the primary audio output in the audio/video file, such as the dialogue of the main characters. For example, when the audio/video file is a two-person comic dialogue, the human voice data obtained from it is the performers' dialogue with each other. The human voice text information may be the text of the human voice data; for example, when the human voice data is the lead actor's dialogue in a drama, the human voice text information may be the lines the lead actor speaks.
The voice occurrence time may be the timestamp at which the human voice data first occurs in the audio/video file; the voice playing duration may be the length of time the human voice lasts from the moment it appears.
For example, in the audio/video file, A starts speaking at 1 minute 20 seconds and finishes at 1 minute 50 seconds; the voice occurrence time is then 1 minute 20 seconds and the voice playing duration is half a minute, or 30 seconds. The unit for the duration may be seconds, which is not limited in this embodiment. The human voice display area may be an area of the corresponding video frame where the human voice text neither obscures the main picture nor interferes with the display of related subtitle information.
Illustratively, after receiving the audio/video file sent by the terminal device and the language the user wants displayed, the server stores the uploaded file and extracts its audio track together with the corresponding playing times. It then extracts the human voice data in the audio and uses AI recognition to translate it into text in the user's chosen language. After determining the voice occurrence time and voice playing duration, the server extracts the video frames corresponding to the human voice data and uses AI recognition to judge whether subtitle information is present in those frames. When subtitle information exists, the server obtains its subtitle display area, then determines a subtitle-free display area that does not conflict with it, and finally places the human voice display area for the human voice text information within that subtitle-free area. In other words: subtitle information is acquired from the audio/video file; the subtitle display area corresponding to the subtitle information is determined; a subtitle-free display area that does not conflict with the subtitle display area is determined; and the subtitle-free display area is taken as the human voice display area corresponding to the human voice text information.
The subtitle information may be keyword information in the video frame, such as the topic of the current live stream, or comment information appearing in the frame, such as comments sent by viewers. After determining this information, the server sends the human voice text information, voice occurrence time, voice playing duration, and human voice display area to the terminal device, so that the terminal device displays the human voice text information according to the human voice display area, voice occurrence time, and voice playing duration when playing the audio/video file.
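One plausible way to realize the subtitle-free display area step is a simple rectangle-overlap test. The candidate corner regions and the Rect layout below are illustrative assumptions, not the patent's method:

```python
Rect = tuple[int, int, int, int]  # (x, y, width, height), in pixels

def overlaps(a: Rect, b: Rect) -> bool:
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def pick_voice_area(frame_w: int, frame_h: int, subtitle: Rect) -> Rect:
    """Return a candidate region that does not conflict with the subtitle area."""
    w, h = frame_w // 3, frame_h // 5
    corners = [(0, 0, w, h), (frame_w - w, 0, w, h),
               (0, frame_h - h, w, h), (frame_w - w, frame_h - h, w, h)]
    for c in corners:
        if not overlaps(c, subtitle):
            return c                 # first subtitle-free corner found
    return corners[-1]               # fall back to the bottom-right corner
```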
In this embodiment, the human voice text information corresponding to the human voice data is displayed in the corresponding human voice display area according to the voice occurrence time and voice playing duration, so that hearing-impaired viewers can clearly and intuitively follow the main content of the audio/video while watching, which improves their viewing experience.
Further, as an embodiment, referring to fig. 9, the present application provides a fourth embodiment of the audio/video processing method. Referring to fig. 9, fig. 9 shows a flowchart of a fourth embodiment of the audio-video processing method.
In this embodiment, the audio/video processing method includes:
Step S60, sending an audio/video file to a server, so that the server acquires non-human voice data from the audio/video file and determines the text information corresponding to the non-human voice data, the sound occurrence time, the sound playing duration, and a display area corresponding to the text information;
Step S70, receiving the display area, the text information, the sound occurrence time, and the sound playing duration sent by the server;
Step S80, displaying the text information in the audio/video according to the display area, the text information, the sound occurrence time, and the sound playing duration.
Specifically, the terminal device may be a user device with display and interaction functions, such as a smartphone, a tablet device, or a notebook computer, which is not limited in this application. For example, the user can select the audio/video to watch on a mobile phone, such as a TV series, a short video, or a live stream.
In the embodiment of the application, after the terminal device uploads an audio/video file to the server, it receives, within a preset time, the text information corresponding to the non-human voice data together with the sound occurrence time, sound playing duration, and display area sent by the server, and displays the text information in the corresponding video frames accordingly when playing the audio/video. Because the non-human voice data in the audio/video file is expressed as text in the video frame, this helps hearing-impaired viewers follow the details of the audio/video and improves their viewing experience.
In another example, after receiving the audio/video file, the server processes it to obtain the human voice text information, voice occurrence time, voice playing duration, and human voice display area corresponding to the human voice data (see the third embodiment for details, which are not repeated here) and sends them to the terminal device. After receiving them, the terminal device displays the human voice text information in the corresponding video frames according to the human voice text information, voice occurrence time, voice playing duration, and human voice display area when playing the audio/video. To avoid a conflict between the human voice display area and the display area, the user may adjust the position of the human voice display area in the video frame. For example, when non-human voice data is output from the left and right channels simultaneously and text information is displayed in the blank areas on both sides of the video frame, the lower right of the video frame may be used as the human voice display area for the human voice text information, as shown in fig. 10.
In this embodiment, the terminal device displays the text information corresponding to the non-human voice data in the display area of the video frame according to the display area, text information, sound occurrence time, and sound playing duration, so that hearing-impaired viewers can better follow the details of the audio/video. Moreover, after the display area for the non-human voice data is determined, a subtitle-free display area is determined in the video frame, and the human voice display area for the human voice text information is placed within it. This prevents the display area from overlapping the human voice display area, which would otherwise make the captions hard for hearing-impaired viewers to read.
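On the terminal side, the display rule reduces to a time-window check at playback time: a text is drawn while the playback position lies inside its [occurrence, occurrence + duration) window. A minimal sketch, reusing the hypothetical message fields from the server sketch above:

```python
def texts_to_draw(messages: list[dict], t: float) -> list[tuple[str, str]]:
    """Return (display_area, text) pairs to overlay at playback time t (seconds)."""
    return [
        (m["display_area"], m["text"])
        for m in messages
        if m["occurrence_s"] <= t < m["occurrence_s"] + m["duration_s"]
    ]

# At t = 200 s, the car-horn message (starts at 190 s, lasts 120 s) is on screen.
msgs = [{"display_area": "left", "text": "car horn",
         "occurrence_s": 190.0, "duration_s": 120.0}]
print(texts_to_draw(msgs, 200.0))  # [('left', 'car horn')]
```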
As an embodiment, in a specific implementation, to improve the transmission efficiency of the terminal device, before step S60 the method may further include:
slicing the video to be played into segments of a preset duration to obtain the audio/video file.
In the embodiment of the present application, slicing may be the operation of cutting the complete video data into a series of data frames, with additional frame headers and trailers, at a preset time interval or at key frames. For example, after the terminal device connects to the server, it uploads the language the client wants displayed; once the user starts playback, the intelligent AI application begins recording the played audio/video and slices the recording every 10 seconds, producing audio/video files of 10 seconds each. The preset duration may be 5 seconds or 10 seconds, which is not limited in this embodiment.
In this embodiment, the video to be played is sliced into segments of a preset duration to obtain the audio/video files. Since the index file for a sliced audio/video file is usually only tens of kilobytes, this improves the efficiency with which the terminal device uploads audio/video files to the server and also improves the server's processing efficiency.
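The 10-second slicing could be done, for example, with ffmpeg's segment muxer; this is one possible implementation choice, not something the patent specifies:

```python
import subprocess

def slice_av(path: str, seconds: int = 10, out_pattern: str = "slice_%03d.ts") -> None:
    """Cut the recording into fixed-length slices without re-encoding."""
    subprocess.run([
        "ffmpeg", "-i", path,
        "-c", "copy", "-map", "0",       # copy all streams as-is
        "-f", "segment",
        "-segment_time", str(seconds),   # the preset slice duration
        "-reset_timestamps", "1",
        out_pattern,
    ], check=True)
```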
Based on the same inventive concept, the present application provides an audio/video processing apparatus for a server. Referring to fig. 11, fig. 11 is a schematic diagram of the first structural framework of the audio/video processing apparatus.
The first receiving module 10 is configured to receive the audio/video file sent by the terminal device;
the acquisition module 20 is configured to acquire non-human voice data from the audio/video file, where the non-human voice data is the background sound of the audio/video;
the first determining module 30 is configured to determine the text information, sound occurrence time, and sound playing duration corresponding to the non-human voice data;
the second determining module 40 is configured to determine a display area corresponding to the text information;
and the first sending module 50 is configured to send the display area, the text information, the sound occurrence time, and the sound playing duration to the terminal device, so that the terminal device displays the text information accordingly when playing the audio/video file.
According to the technical solution of this embodiment, through the cooperation of the above functional modules, the server receives the audio/video file sent by the terminal device; acquires the non-human voice data, i.e., the background sound, from the file; determines the corresponding text information, sound occurrence time, and sound playing duration; determines a display area for the text information; and sends all of these to the terminal device, which displays the text information accordingly during playback. Displaying the text corresponding to the non-human voice data in a display area of the video frame according to its occurrence time and playing duration lets hearing-impaired viewers see sound-related text beyond the main subtitles, helping them better watch and understand live video.
Based on the same inventive concept, the present application provides an audio/video processing apparatus for a terminal device. Referring to fig. 12, fig. 12 is a schematic diagram of the second structural framework of the audio/video processing apparatus.
The second sending module 60 is configured to send the audio/video file to a server, so that the server acquires non-human voice data from the audio/video file and determines the text information corresponding to the non-human voice data, the sound occurrence time, the sound playing duration, and a display area corresponding to the text information;
the second receiving module 70 is configured to receive the display area, the text information, the sound occurrence time, and the sound playing duration sent by the server;
and the display module 80 is configured to display the text information in the audio/video according to the display area, the text information, the sound occurrence time, and the sound playing duration.
It should be noted that, in this embodiment, various embodiments of the audio/video processing apparatus and technical effects achieved by the embodiments may refer to various implementation manners of the audio/video processing method in the foregoing embodiments, and details are not described here.
According to the technical solution of this embodiment, through the cooperation of the above functional modules, the terminal device sends the audio/video file to the server; the server processes it to obtain the text information corresponding to the non-human voice data, the sound occurrence time, the sound playing duration, and the display area for the text information, and sends these to the terminal device. When playing the audio/video file, the terminal device displays the text information in the display area of the video frame according to the display area, text information, sound occurrence time, and sound playing duration. In other words, the non-human voice data in the audio/video is shown as text in the video frame, which helps hearing-impaired viewers follow the details of the audio/video and improves their viewing experience.
In addition, an embodiment of the present application further provides a computer storage medium storing an audio/video processing program which, when executed by a processor, implements the steps of the audio/video processing method described above; a detailed description is therefore omitted, and the shared beneficial effects are likewise not repeated. For technical details not disclosed in the embodiments of the computer-readable storage medium, refer to the description of the method embodiments of the present application. By way of example, the program instructions may be deployed to be executed on one computing device, on multiple computing devices at one site, or distributed across multiple sites interconnected by a communication network.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where units illustrated as separate components may or may not be physically separate, and components illustrated as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, which may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus the necessary general-purpose hardware, or by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components, and the like. Generally, functions performed by computer programs can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function may vary, such as analog circuits, digital circuits, or dedicated circuits. For the present application, however, a software implementation is usually preferable. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, or an optical disk, including instructions that enable a computer device (which may be a personal computer, a server, or a network device) to execute the methods of the embodiments of the present application.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all the equivalent structures or equivalent processes that can be directly or indirectly applied to other related technical fields by using the contents of the specification and the drawings of the present application are also included in the scope of the present application.

Claims (10)

1. An audio/video processing method for a server, the method comprising:
receiving an audio/video file sent by a terminal device;
acquiring non-human voice data from the audio/video file, wherein the non-human voice data is the background sound of the audio/video;
determining text information, a sound occurrence time, and a sound playing duration corresponding to the non-human voice data;
determining a display area corresponding to the text information;
and sending the display area, the text information, the sound occurrence time, and the sound playing duration to the terminal device, so that the terminal device displays the text information according to the display area, the sound occurrence time, and the sound playing duration when playing the audio/video file.
2. The audio/video processing method according to claim 1, wherein the determining a display area corresponding to the text information comprises:
determining a sound channel from which the non-human voice data is output;
and determining the display area corresponding to the text information according to the sound channel.
3. The audio/video processing method according to claim 1, wherein after receiving the audio/video file sent by the terminal device, the method further comprises:
acquiring human voice data from the audio/video file;
determining human voice text information, a voice occurrence time, and a voice playing duration corresponding to the human voice data;
determining a human voice display area corresponding to the human voice text information;
and sending the human voice text information, the voice occurrence time, the voice playing duration, and the human voice display area to the terminal device, so that the terminal device displays the human voice text information according to the voice occurrence time, the voice playing duration, and the human voice display area when playing the audio/video file.
4. The audio/video processing method according to claim 3, wherein the determining a human voice display area corresponding to the human voice text information comprises:
acquiring subtitle information from the audio/video file;
determining a subtitle display area corresponding to the subtitle information;
determining a subtitle-free display area that does not conflict with the subtitle display area;
and determining the subtitle-free display area as the human voice display area corresponding to the human voice text information.
5. An audio/video processing method for a terminal device, the method comprising:
sending an audio/video file to a server, so that the server acquires non-human voice data from the audio/video file and determines text information corresponding to the non-human voice data, a sound occurrence time, a sound playing duration, and a display area corresponding to the text information;
receiving the display area, the text information, the sound occurrence time, and the sound playing duration sent by the server;
and displaying the text information in the audio/video according to the display area, the text information, the sound occurrence time, and the sound playing duration.
6. The audio/video processing method according to claim 5, wherein before sending the audio/video file to the server, the method further comprises:
slicing the video to be played into segments of a preset duration to obtain the audio/video file.
7. An audio/video processing apparatus provided in a server, the apparatus comprising:
a first receiving module, configured to receive an audio/video file sent by a terminal device;
an acquisition module, configured to acquire non-human voice data from the audio/video file, wherein the non-human voice data is the background sound of the audio/video;
a first determining module, configured to determine text information, a sound occurrence time, and a sound playing duration corresponding to the non-human voice data;
a second determining module, configured to determine a display area corresponding to the text information;
and a first sending module, configured to send the display area, the text information, the sound occurrence time, and the sound playing duration to the terminal device, so that the terminal device displays the text information according to the display area, the sound occurrence time, and the sound playing duration when playing the audio/video file.
8. An audio/video processing apparatus provided in a terminal device, the apparatus comprising:
a second sending module, configured to send an audio/video file to a server, so that the server acquires non-human voice data from the audio/video file and determines text information corresponding to the non-human voice data, a sound occurrence time, a sound playing duration, and a display area corresponding to the text information;
a second receiving module, configured to receive the display area, the text information, the sound occurrence time, and the sound playing duration sent by the server;
and a display module, configured to display the text information in the audio/video according to the display area, the text information, the sound occurrence time, and the sound playing duration.
9. A server, comprising: a processor, a memory, and an audio/video processing program stored in the memory, wherein the audio/video processing program, when executed by the processor, implements the steps of the audio/video processing method according to any one of claims 1 to 6.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores an audio/video processing program which, when executed by a processor, implements the audio/video processing method according to any one of claims 1 to 6.
CN202211320945.2A 2022-10-26 2022-10-26 Audio and video processing method and device, server and storage medium Pending CN115938383A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211320945.2A CN115938383A (en) 2022-10-26 2022-10-26 Audio and video processing method and device, server and storage medium


Publications (1)

Publication Number Publication Date
CN115938383A 2023-04-07

Family

ID=86698265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211320945.2A Pending CN115938383A (en) 2022-10-26 2022-10-26 Audio and video processing method and device, server and storage medium

Country Status (1)

Country Link
CN (1) CN115938383A (en)

Similar Documents

Publication Publication Date Title
US11463779B2 (en) Video stream processing method and apparatus, computer device, and storage medium
US11252444B2 (en) Video stream processing method, computer device, and storage medium
US11227620B2 (en) Information processing apparatus and information processing method
CN108903521B (en) Man-machine interaction method applied to intelligent picture frame and intelligent picture frame
US20190379930A1 (en) Media Content Identification on Mobile Devices
CN103988520A (en) Reception device, method for controlling same, distribution device, distribution method, program, and distribution system
US20140208351A1 (en) Video processing apparatus, method and server
CN112188267B (en) Video playing method, device and equipment and computer storage medium
CN112135155B (en) Audio and video connecting and converging method and device, electronic equipment and storage medium
CN112911318B (en) Live broadcast room background replacement method and device, electronic equipment and storage medium
US20150341694A1 (en) Method And Apparatus For Using Contextual Content Augmentation To Provide Information On Recent Events In A Media Program
CN108322791B (en) Voice evaluation method and device
CN114157920A (en) Playing method and device for displaying sign language, smart television and storage medium
CN114822568A (en) Audio playing method, device, equipment and computer readable storage medium
TWI512718B (en) Playing method and apparatus
CN105450970A (en) Information processing method and electronic equipment
CN111757187A (en) Multi-language subtitle display method, device, terminal equipment and storage medium
CN115938383A (en) Audio and video processing method and device, server and storage medium
CN115474088B (en) Video processing method, computer equipment and storage medium
CN110198457B (en) Video playing method and device, system, storage medium, terminal and server thereof
CN112565913B (en) Video call method and device and electronic equipment
CN115065835A (en) Live-broadcast expression display processing method, server, electronic equipment and storage medium
CN112135197B (en) Subtitle display method and device, storage medium and electronic equipment
CN114341866A (en) Simultaneous interpretation method, device, server and storage medium
CN113766342B (en) Subtitle synthesizing method and related device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination