CN113849689A - Audio and video data processing method and device, electronic equipment and medium - Google Patents

Audio and video data processing method and device, electronic equipment and medium Download PDF

Info

Publication number
CN113849689A
Authority
CN
China
Prior art keywords
audio
speech
time information
voice
speech elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111125712.2A
Other languages
Chinese (zh)
Inventor
吴悦
曹溪语
李晋芳
陈进生
王正宜
黄正伟
郑天悦
毕影全
张晶
秦志伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111125712.2A priority Critical patent/CN113849689A/en
Publication of CN113849689A publication Critical patent/CN113849689A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/685Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using automatically derived transcript of audio data, e.g. lyrics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present disclosure provides an audio and video data processing method, apparatus, device, medium, and program product, relating to the field of speech technology. The audio and video data processing method includes the following steps: processing audio and video data to obtain a first speech element set and first time information for the first speech element set; matching the first speech element set with a second speech element set, wherein the second speech element set is associated with text data; determining second time information for the text data based on the matching result between the first speech element set and the second speech element set and the first time information; and outputting the text data and the audio and video data in association based on the second time information.

Description

Audio and video data processing method and device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to the field of speech technologies, and more particularly, to an audio/video data processing method and apparatus, an electronic device, a medium, and a program product.
Background
In audio and video processing scenarios, corresponding text usually needs to be added to the audio and video, for example, subtitle information is added to the audio and video. In the related art, when text is added to audio and video, the degree of matching between the text and the audio and video is low, the labor cost is high, and the operation is cumbersome.
Disclosure of Invention
The present disclosure provides an audio and video data processing method, apparatus, electronic device, storage medium, and program product.
According to an aspect of the present disclosure, there is provided an audio and video data processing method, including: processing audio and video data to obtain a first speech element set and first time information for the first speech element set; matching the first speech element set with a second speech element set, wherein the second speech element set is associated with text data; determining second time information for the text data based on the first time information and a matching result between the first speech element set and the second speech element set; and outputting the text data and the audio and video data in association based on the second time information.
According to another aspect of the present disclosure, there is provided an audio and video data processing apparatus including a processing module, a matching module, a determining module, and an output module. The processing module is configured to process audio and video data to obtain a first speech element set and first time information for the first speech element set; the matching module is configured to match the first speech element set with a second speech element set, wherein the second speech element set is associated with text data; the determining module is configured to determine second time information for the text data based on the matching result between the first speech element set and the second speech element set and the first time information; and the output module is configured to output the text data and the audio and video data in association based on the second time information.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the audio and video data processing method.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the above audio and video data processing method.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the audiovisual data processing method described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 schematically shows a system architecture of an audio-video data processing method and apparatus according to an embodiment of the present disclosure;
fig. 2 schematically shows a flow chart of an audio-video data processing method according to an embodiment of the present disclosure;
fig. 3 schematically illustrates a schematic diagram of an audio-video data processing method according to an embodiment of the present disclosure;
figs. 4A-4B schematically illustrate schematic diagrams of an audio-video data processing method according to an embodiment of the present disclosure;
fig. 5 schematically shows a block diagram of an audiovisual data processing arrangement according to an embodiment of the present disclosure; and
fig. 6 is a block diagram of an electronic device for performing audiovisual data processing to implement an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
The embodiment of the disclosure provides an audio and video data processing method. The audio and video data processing method includes the following steps: processing the audio and video data to obtain a first speech element set and first time information for the first speech element set. Then, the first speech element set is matched with a second speech element set, the second speech element set being associated with the text data, and second time information for the text data is determined based on the matching result between the first speech element set and the second speech element set and the first time information. Next, based on the second time information, the text data and the audio and video data are output in association.
Fig. 1 schematically shows a system architecture of an audio-video data processing method and apparatus according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, a system architecture 100 according to this embodiment may include clients 101, 102, 103, a network 104, and a server 105. Network 104 is the medium used to provide communication links between clients 101, 102, 103 and server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use clients 101, 102, 103 to interact with server 105 over network 104 to receive or send messages, etc. Various messaging client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (examples only) may be installed on the clients 101, 102, 103.
Clients 101, 102, 103 may be a variety of electronic devices having display screens and supporting web browsing, including but not limited to smart phones, tablets, laptop and desktop computers, and the like. The clients 101, 102, 103 of the disclosed embodiments may run applications, for example.
The server 105 may be a server that provides various services, such as a back-office management server (for example only) that provides support for websites browsed by users using the clients 101, 102, 103. The background management server may analyze and perform other processing on the received data such as the user request, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the client. In addition, the server 105 may also be a cloud server, i.e., the server 105 has a cloud computing function.
It should be noted that the audio/video data processing method provided by the embodiment of the present disclosure may be executed by the server 105. Accordingly, the audio and video data processing device provided by the embodiment of the present disclosure may be disposed in the server 105. The audio and video data processing method provided by the embodiment of the present disclosure may also be executed by a server or a server cluster which is different from the server 105 and can communicate with the clients 101, 102, 103 and/or the server 105. Correspondingly, the audio-video data processing device provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the clients 101, 102, 103 and/or the server 105.
For example, the audio-video data and the text data may be transmitted through the clients 101, 102, 103, and after the server 105 receives the audio-video data and the text data from the clients 101, 102, 103 through the network 104, the server 105 may obtain time information for the text data based on the audio-video data and the text data and output the text data and the audio-video data in association based on the time information.
It should be understood that the number of clients, networks, and servers in FIG. 1 is merely illustrative. There may be any number of clients, networks, and servers, as desired for an implementation.
An embodiment of the present disclosure provides an audio and video data processing method, and an audio and video data processing method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 2 to 4B in conjunction with the system architecture of fig. 1. The audio-video data processing method of the embodiment of the present disclosure may be performed by the server 105 shown in fig. 1, for example.
Fig. 2 schematically shows a flowchart of an audio-video data processing method according to an embodiment of the present disclosure.
As shown in fig. 2, the audio-video data processing method 200 according to the embodiment of the present disclosure may include, for example, operations S210 to S240.
In operation S210, the audio and video data is processed to obtain a first speech element set and first time information for the first speech element set.
In operation S220, the first set of speech elements is matched with a second set of speech elements, the second set of speech elements being associated with text data.
In operation S230, second time information for the text data is determined based on the first time information and a matching result between the first voice element set and the second voice element set.
In operation S240, the text data and the audiovisual data are output in association based on the second time information.
Illustratively, when editing audio and video data, a user can upload the audio and video data and the corresponding text data so that the text data and the audio and video data are output in association. For example, speech recognition may be performed on the audio in the audio and video data to obtain a first speech element set, where the first speech element set includes a plurality of first speech elements, and each first speech element includes, for example, a phoneme, such as a vowel or a consonant. After the first speech element set is obtained, the time information at which each first speech element appears in the audio and video data is determined, and the time information corresponding to the plurality of first speech elements is determined as the first time information for the first speech element set.
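A minimal way to represent the first speech element set together with its first time information is a list of phoneme records, each carrying the time span in which it was recognized. The sketch below is illustrative only; the recognize_phonemes helper is a placeholder for whatever speech recognizer is actually used and is not an interface defined by the present disclosure.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SpeechElement:
    phoneme: str   # a vowel or consonant such as "f" or "u:"
    start: float   # time (seconds) at which the element appears in the audio and video data
    end: float     # time (seconds) at which the element ends

def recognize_phonemes(av_path: str) -> List[SpeechElement]:
    """Placeholder for speech recognition on the audio track: returns the first
    speech element set together with the first time information for each element."""
    raise NotImplementedError
```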
Then, a second speech element set for the text data is obtained. The second speech element set includes, for example, a plurality of second speech elements, and each second speech element includes, for example, a phoneme, such as a vowel or a consonant. For example, for each word in the text data, the phonemes of the word are determined to obtain the second speech element set.
After the first speech element set and the second speech element set are obtained, the first speech elements in the first speech element set may be matched with the second speech elements in the second speech element set to determine, for each first speech element, the matching second speech element, and the first time information for the first speech element is used as the time information for the matched second speech element, thereby obtaining the second time information for the text data.
The second time information indicates, for example, the time when the text data appears in the audio/video data, and therefore, the text data and the audio/video data can be output in association based on the second time information, so that the text data is output at a corresponding moment when the audio/video is played, and the text data is output as subtitle data of the audio/video data.
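Taken together, the overall flow of operations S210 to S240 can be sketched as follows. All helper names are placeholders for the steps described above rather than APIs defined by the present disclosure; only the wiring between the steps is illustrated.

```python
def process_audio_video(av_path: str, text: str):
    # S210: obtain the first speech element set and the first time information
    first_set = recognize_phonemes(av_path)            # see the sketch above
    # derive the second speech element set associated with the text data
    second_set = text_to_phonemes(text)                # hypothetical helper
    # S220: match first speech elements with second speech elements
    matches = match_elements(first_set, second_set)    # hypothetical helper, returns index pairs
    # S230: use the first time information of each matched first speech element
    # as the time information of the corresponding second speech element
    second_time_info = {j: (first_set[i].start, first_set[i].end) for i, j in matches}
    # S240: output the text data and the audio and video data in association
    return build_subtitles(text, second_time_info)     # hypothetical helper
```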
According to the embodiment of the disclosure, a first voice element set is obtained by processing audio and video data, the first voice element set is matched with a second voice element set aiming at text data, second time information aiming at the text data is determined based on a matching result and first time information aiming at the first voice element set, and the text data and the audio and video data are output in a correlation mode according to the second time information. Therefore, through the technical scheme of the embodiment of the disclosure, the matching of the corresponding subtitle data for the audio and video data is realized, the efficiency and the accuracy of subtitle matching are improved, the labor cost required by subtitle matching is reduced, and the operation complexity of subtitle matching is reduced.
Fig. 3 schematically illustrates a schematic diagram of an audio-video data processing method according to an embodiment of the present disclosure.
As shown in fig. 3, for the audio-video data 310, a plurality of audio frames, for example n audio frames, are extracted from the audio-video data 310, where n is an integer greater than or equal to 1. Then, the plurality of audio frames are processed to obtain a plurality of audio features in one-to-one correspondence with the plurality of audio frames. For example, feature extraction is performed on the n audio frames, resulting in n audio features 320. When processing an audio frame to obtain an audio feature, feature extraction may be performed by a pre-trained acoustic model. The acoustic model includes, for example, a time-delay neural network model. Then, according to the time information of the audio-video data, the time information of each audio frame is determined as the first time information for the first speech element set. For example, the time information corresponding to the n audio features 320 is t1~tn, and t1~tn is taken as the first time information. Any of t1~tn may be a time instant or a time period.
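As one concrete and purely illustrative way to obtain per-frame audio features and the times t1~tn, the sketch below uses MFCC features computed with librosa as a stand-in for the pre-trained acoustic model; the present disclosure only requires that some acoustic model (for example a time-delay neural network) produce one audio feature per audio frame, so both the library and the feature type are assumptions.

```python
import numpy as np
import librosa  # assumed available; used here only as a stand-in feature extractor

def extract_audio_features(audio_path: str, hop_length: int = 160):
    """Extract n audio frames from the audio track, compute one audio feature per frame,
    and keep the time of each frame as the first time information t1..tn."""
    y, sr = librosa.load(audio_path, sr=16000)
    feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)
    times = librosa.frames_to_time(np.arange(feats.shape[1]), sr=sr, hop_length=hop_length)
    return feats.T, times  # one feature row per frame, one timestamp per frame
```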
For the text data, a second set of speech elements corresponding to the text data is determined, the second set of speech elements being represented, for example, in the manner of the state diagram 330. The state diagram 330 includes, for example, a plurality of second speech elements, such as "f", "u:", "b", "a:", and the like. For example, at least one speech element corresponding to each word in the text data is determined in sequence, and the speech elements corresponding to all words in the text data are arranged in sequence to obtain the state diagram 330.
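A state diagram such as the state diagram 330 can be built by expanding the text into phonemes and giving every phoneme a fixed number of speech states arranged in order. In the sketch below, the pre-segmented word list and the pronunciation lexicon are assumptions; the present disclosure does not prescribe how the phonemes of each word are obtained.

```python
def text_to_states(words, lexicon, states_per_phoneme=3):
    """Build a linear state graph for the text data: every second speech element
    (phoneme) of every word contributes `states_per_phoneme` ordered speech states."""
    states = []
    for word in words:                          # words: pre-segmented text data
        for phoneme in lexicon.get(word, []):   # e.g. a lexicon entry -> ["f", "u:"]
            for s in range(states_per_phoneme):
                states.append((phoneme, s))     # (second speech element, state index)
    return states
```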
For example, for the plurality of audio features 320, each audio feature may be recognized by an acoustic model, resulting in a plurality of first speech elements in one-to-one correspondence with the plurality of audio features 320 as a first speech element set.
For example, for each audio feature 320, a plurality of candidate speech elements corresponding to the audio feature 320 and a plurality of target probabilities corresponding to the plurality of candidate speech elements are determined, each target probability of the plurality of target probabilities characterizing a probability that the recognition result of the audio feature is the corresponding candidate speech element. For example, for the first audio feature 320, the acoustic model outputs 4 candidate speech elements "f", "u:", "b", "a:", and 4 probabilities corresponding to the 4 candidate speech elements, for example 0.7, 0.1, and so on.
In an example, the candidate speech element "f" corresponding to the maximum probability may be used as the first speech element corresponding to the first audio feature 320, so as to obtain the first speech element corresponding to each audio feature. The first speech elements corresponding to the plurality of audio features are taken as the first speech element set.
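In this first example, selecting the first speech element amounts to a per-frame argmax over the target probabilities. The acoustic_model call below is a stand-in for the model described above, not an API defined by the present disclosure.

```python
import numpy as np

def pick_first_speech_elements(frame_features, candidate_phonemes, acoustic_model):
    """For each audio feature, keep the candidate speech element with the maximum
    target probability as the corresponding first speech element."""
    first_set = []
    for feat in frame_features:
        probs = acoustic_model(feat)   # one target probability per candidate speech element
        first_set.append(candidate_phonemes[int(np.argmax(probs))])
    return first_set
```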
Illustratively, the plurality of candidate speech elements for each audio feature 320 may include, for example, the second speech elements "f", "u:", "b", "a:".
In another example, for each audio feature 320, a candidate speech element is determined from a plurality of candidate speech elements as the first speech element corresponding to the audio feature based on the plurality of target probabilities and the audio semantic information for the audio feature 320. The audio semantic information is, for example, a context in the audio-video data 310.
For example, taking the third audio feature 320 as an example, the audio feature 320 corresponds to the candidate speech elements "f", "u:", "b", "a:", and the 4 candidate speech elements correspond to 4 probabilities, for example 0.5, 0.4, 0.05, and so on. Based on the probabilities and the context, the candidate speech element "u:" is determined as the first speech element corresponding to the third audio feature 320. It can be appreciated that this approach combines the probabilities and the context to determine the first speech element for each audio feature, making the recognition result of the first speech element more accurate. For example, if the context indicates that the pronunciation of the nearby audio is "fu", then when the first speech element corresponding to the second audio feature 320 is "f", the first speech element corresponding to the third audio feature 320 is more likely to be determined as "u:".
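The present disclosure does not fix how the target probabilities and the audio semantic information are combined; one simple, purely illustrative possibility is to re-weight the acoustic probabilities by how plausible each candidate is given the previously recognized phoneme.

```python
import numpy as np

def pick_with_context(probs, candidates, prev_phoneme, transition_prior):
    """Combine target probabilities with context: each candidate speech element is
    re-weighted by how likely it is to follow the previously recognized phoneme."""
    scores = [p * transition_prior.get((prev_phoneme, c), 1e-6)
              for p, c in zip(probs, candidates)]
    return candidates[int(np.argmax(scores))]
```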
In another example, for the state diagram 330, each second speech element in the plurality of second speech elements includes at least one speech state, and the disclosed embodiments are illustrated with each second speech element including 3 speech states. The 3 speech states corresponding to each second speech element may be different; different speech states differ, for example, in speech rate, timbre, tone, and the like. The state diagram 330 is composed of the plurality of speech states corresponding to the second speech elements.
The target probability for each audio feature 320 may include a probability that each audio feature 320 corresponds to multiple speech states of each second speech element, i.e., one target probability for each speech state. Then, each first voice element in the first voice element set is matched with each voice state based on the target probability and the audio semantic information to obtain a matching result of each first voice element and the voice state, so that each audio feature is matched into the state diagram 330 to obtain a matching path 331.
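A matching path such as the matching path 331 is typically obtained with a forced-alignment style dynamic program that walks monotonically through the linear state graph; the following Viterbi-like sketch is one minimal realization under that assumption, with state_log_prob standing in for the (log) target probability of a speech state given an audio feature.

```python
import numpy as np

def align_frames_to_states(frame_features, states, state_log_prob):
    """Match every audio frame to one speech state, moving monotonically through the
    linear state graph built from the text data (a Viterbi-style forced alignment)."""
    n, m = len(frame_features), len(states)
    score = np.full((n, m), -np.inf)
    back = np.zeros((n, m), dtype=int)
    score[0, 0] = state_log_prob(frame_features[0], states[0])
    for t in range(1, n):
        for j in range(m):
            stay = score[t - 1, j]
            advance = score[t - 1, j - 1] if j > 0 else -np.inf
            score[t, j] = max(stay, advance) + state_log_prob(frame_features[t], states[j])
            back[t, j] = j if stay >= advance else j - 1
    path, j = [], m - 1                     # the path must end in the final speech state
    for t in range(n - 1, -1, -1):
        path.append((t, j))
        j = back[t, j]
    return list(reversed(path))             # matching path: (frame index, state index) pairs
```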
Next, for the first speech element matching each speech state, the first time information corresponding to that first speech element is determined as the time information for the speech state. For example, for the first speech element (corresponding to the first audio feature) matching the first speech state of the second speech element "f", the first time information corresponding to that first speech element is t1, and the first time information t1 is determined as the time information for the first speech state of the second speech element "f".
The time information for each speech state is determined as the time information of the second speech element corresponding to that speech state. For example, the first audio feature 320 and the second audio feature 320 correspond to the first speech state and the second speech state of the second speech element "f", respectively; the first time information t1 of the first speech state and the first time information t2 of the second speech state are taken as the time information of the second speech element "f". It can be understood that matching of the speech elements is realized through the matching of the speech states, which improves the granularity of speech element matching and further improves the matching accuracy.
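Once the matching path is known, the first time information attached to each matched audio frame can be propagated to the speech states and from there to the second speech elements, as sketched below (assuming the 3-states-per-phoneme layout used above).

```python
from collections import defaultdict

def phoneme_time_info(path, states, frame_times, states_per_phoneme=3):
    """Propagate first time information along the matching path: every frame time
    matched to a speech state becomes time information of the second speech element
    (phoneme) that owns that state."""
    spans = defaultdict(list)
    for frame_idx, state_idx in path:
        phoneme, _ = states[state_idx]
        spans[(state_idx // states_per_phoneme, phoneme)].append(frame_times[frame_idx])
    # earliest/latest matched frame time per phoneme occurrence, in text order
    return [(ph, min(ts), max(ts)) for (_, ph), ts in sorted(spans.items())]
```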
Then, the second time information for the text data is determined based on the time information for each second speech element. For example, after the second time information for the text data is obtained, for each sentence in the text data, a time period corresponding to the sentence is determined. For example, the text between every two punctuation marks is a sentence, and the punctuation marks may be commas, periods, and the like. After the time period corresponding to each sentence is obtained, the corresponding sentence is output in each time period when the audio and video data are subsequently output, thereby outputting the subtitle data in the audio and video data.
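With the second time information available, each sentence of the text data can be given the time period spanned by its words, and the result can be written out in a common subtitle format. The sketch below assumes timed_words pairs each word (with its trailing punctuation, if any) with a (start, end) span derived from the phoneme times; SRT is used only as an example output format.

```python
import re

def sentences_with_periods(timed_words):
    """Group timed words into sentences at punctuation marks and assign each sentence
    the time period covered by its words (the second time information)."""
    subtitles, current, start = [], [], None
    for word, (t0, t1) in timed_words:
        if start is None:
            start = t0
        current.append(word)
        if re.match(r"[，。,.!?！？]", word[-1]):   # sentence boundary at punctuation
            subtitles.append(("".join(current), start, t1))
            current, start = [], None
    if current:                                     # trailing sentence without punctuation
        subtitles.append(("".join(current), start, timed_words[-1][1][1]))
    return subtitles

def to_srt(subtitles):
    """Render each sentence and its time period as an SRT subtitle block."""
    def fmt(t):
        h, rem = divmod(int(t * 1000), 3600000)
        m, rem = divmod(rem, 60000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"
    return "\n".join(f"{i}\n{fmt(t0)} --> {fmt(t1)}\n{text}\n"
                     for i, (text, t0, t1) in enumerate(subtitles, 1))
```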
According to the embodiment of the disclosure, a plurality of phonemes are obtained by processing the audio and video data, the plurality of phonemes are matched with the plurality of phonemes of the text data, and according to the matching result, the time information corresponding to the phonemes of the audio and video data is used as the time information of the phonemes of the text data, so that the text data and the audio and video data can be conveniently output in association according to the time information of the phonemes of the text data, and the subtitle data can thus be output in the audio and video data. Therefore, the embodiment of the disclosure realizes matching of corresponding subtitle data for the audio and video data, improves the efficiency and accuracy of subtitle matching, reduces the labor cost required for subtitle matching, and reduces the complexity of the subtitle matching operation.
Fig. 4A to 4B schematically show a schematic diagram of an audio-video data processing method according to an embodiment of the present disclosure.
As shown in figs. 4A to 4B, when a user performs audio/video editing through a client, the audio/video data 410 may be uploaded to an application program, and the text data 420 corresponding to the audio/video data 410 may be imported into the application program. The application program sends the audio/video data 410 and the text data 420 to the server for processing, and the server may execute the above method to obtain the second time information for the text data 420. Then, the audio and video data 410 and the text data 420 are output in association based on the second time information; the output result 430 includes the text data 420 as subtitle data of the audio and video data 410, and the output result 430 may be presented at the client. For example, when the audio/video is played to the scene in which "Hello, everyone" is spoken, the text "Hello, everyone" is displayed as a subtitle.
Fig. 5 schematically shows a block diagram of an audiovisual data processing arrangement according to an embodiment of the present disclosure.
As shown in fig. 5, the audio-video data processing apparatus 500 of the embodiment of the present disclosure includes, for example, a processing module 510, a matching module 520, a determining module 530, and an output module 540.
The processing module 510 may be configured to process the audio-video data to obtain a first set of speech elements and first time information for the first set of speech elements. According to the embodiment of the present disclosure, the processing module 510 may perform, for example, the operation S210 described above with reference to fig. 2, which is not described herein again.
The matching module 520 may be used to match the first set of speech elements with a second set of speech elements, where the second set of speech elements is associated with text data. According to the embodiment of the present disclosure, the matching module 520 may perform, for example, the operation S220 described above with reference to fig. 2, which is not described herein again.
The determining module 530 may be configured to determine second time information for the text data based on the matching result between the first set of speech elements and the second set of speech elements and the first time information. According to the embodiment of the present disclosure, the determining module 530 may, for example, perform operation S230 described above with reference to fig. 2, which is not described herein again.
The output module 540 may be configured to output the text data and the audiovisual data in association based on the second time information. According to the embodiment of the present disclosure, the output module 540 may, for example, perform the operation S240 described above with reference to fig. 2, which is not described herein again.
According to an embodiment of the present disclosure, the processing module 510 includes an extraction submodule, a processing submodule, a first determining submodule, and a second determining submodule. The extraction submodule is used for extracting a plurality of audio frames from the audio and video data; the processing submodule is used for processing the plurality of audio frames to obtain a plurality of audio features which are in one-to-one correspondence with the plurality of audio frames; the first determining submodule is used for determining a plurality of first speech elements which are in one-to-one correspondence with the plurality of audio features as the first speech element set; and the second determining submodule is used for determining the time information of each audio frame in the plurality of audio frames as the first time information according to the time information of the audio and video data.
According to an embodiment of the disclosure, for each of the plurality of audio features, the first determining sub-module comprises: a first determination unit and a second determination unit. A first determining unit, configured to determine a plurality of candidate speech elements corresponding to the audio feature and a plurality of target probabilities corresponding to the candidate speech elements, where each target probability in the plurality of target probabilities represents a probability that a recognition result of the audio feature is the corresponding candidate speech element; and a second determining unit, configured to determine, based on the plurality of target probabilities and the audio semantic information, one candidate speech element from the plurality of candidate speech elements as the first speech element corresponding to the audio feature.
According to an embodiment of the present disclosure, the second set of speech elements comprises a plurality of second speech elements, each second speech element of the plurality of second speech elements comprising at least one speech state; the matching module 520 is further configured to match each first speech element in the first set of speech elements with each speech state.
According to an embodiment of the present disclosure, the determining module 530 includes a third determining submodule, a fourth determining submodule, and a fifth determining submodule. The third determining submodule is configured to determine, for a first speech element that matches each speech state, first time information corresponding to the first speech element as time information for the speech state; the fourth determining submodule is configured to determine the time information for each speech state as the time information of the second speech element corresponding to the speech state; and the fifth determining submodule is configured to determine the second time information for the text data based on the time information for the second speech element.
According to an embodiment of the present disclosure, the output module 540 is further configured to output the text data as subtitle data of the audio and video data based on the second time information.
According to an embodiment of the present disclosure, a first speech element of the first set of speech elements comprises a phoneme and a second speech element of the second set of speech elements comprises a phoneme.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Fig. 6 is a block diagram of an electronic device for performing audiovisual data processing to implement an embodiment of the present disclosure.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. The electronic device 600 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the respective methods and processes described above, such as the audio-video data processing method. For example, in some embodiments, the audio-video data processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the audio-video data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the audio-video data processing method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable audiovisual data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. An audio and video data processing method, comprising:
processing audio and video data to obtain a first speech element set and first time information for the first speech element set;
matching the first speech element set with a second speech element set, wherein the second speech element set is associated with text data;
determining second time information for the text data based on the first time information and a matching result between the first speech element set and the second speech element set; and
outputting the text data and the audio and video data in association based on the second time information.
2. The method of claim 1, wherein the processing audio and video data to obtain a first speech element set and first time information for the first speech element set comprises:
extracting a plurality of audio frames from the audio and video data;
processing the plurality of audio frames to obtain a plurality of audio features in one-to-one correspondence with the plurality of audio frames;
determining a plurality of first speech elements in one-to-one correspondence with the plurality of audio features as the first speech element set; and
determining time information of each audio frame in the plurality of audio frames as the first time information according to time information of the audio and video data.
3. The method of claim 2, wherein the determining a plurality of first speech elements in one-to-one correspondence with the plurality of audio features as the first speech element set comprises, for each audio feature of the plurality of audio features:
determining a plurality of candidate speech elements corresponding to the audio feature and a plurality of target probabilities corresponding to the plurality of candidate speech elements, wherein each target probability of the plurality of target probabilities characterizes a probability that the recognition result of the audio feature is the corresponding candidate speech element; and
determining one candidate speech element from the plurality of candidate speech elements based on the plurality of target probabilities and audio semantic information, as the first speech element corresponding to the audio feature.
4. The method of claim 1 or 2, wherein the second speech element set comprises a plurality of second speech elements, each of the plurality of second speech elements comprising at least one speech state; and
the matching the first speech element set with a second speech element set comprises: matching each first speech element in the first speech element set with each speech state.
5. The method of claim 4, wherein the determining second time information for the text data based on the first time information and a matching result between the first speech element set and the second speech element set comprises:
for a first speech element matched with each speech state, determining first time information corresponding to the first speech element as time information for the speech state;
determining the time information for each speech state as time information of a second speech element corresponding to the speech state; and
determining the second time information for the text data based on the time information of the second speech element.
6. The method according to any one of claims 1-5, wherein the outputting the text data and the audio and video data in association based on the second time information comprises:
outputting the text data as subtitle data of the audio and video data based on the second time information.
7. The method of any one of claims 1-6, wherein a first speech element in the first speech element set comprises a phoneme, and a second speech element in the second speech element set comprises a phoneme.
8. An audio and video data processing apparatus, comprising:
a processing module configured to process audio and video data to obtain a first speech element set and first time information for the first speech element set;
a matching module configured to match the first speech element set with a second speech element set, wherein the second speech element set is associated with text data;
a determining module configured to determine second time information for the text data based on a matching result between the first speech element set and the second speech element set and the first time information; and
an output module configured to output the text data and the audio and video data in association based on the second time information.
9. The apparatus of claim 8, wherein the processing module comprises:
an extraction submodule configured to extract a plurality of audio frames from the audio and video data;
a processing submodule configured to process the plurality of audio frames to obtain a plurality of audio features in one-to-one correspondence with the plurality of audio frames;
a first determining submodule configured to determine a plurality of first speech elements in one-to-one correspondence with the plurality of audio features as the first speech element set; and
a second determining submodule configured to determine time information of each audio frame in the plurality of audio frames as the first time information according to time information of the audio and video data.
10. The apparatus of claim 9, wherein, for each of the plurality of audio features, the first determining submodule comprises:
a first determining unit configured to determine a plurality of candidate speech elements corresponding to the audio feature and a plurality of target probabilities corresponding to the plurality of candidate speech elements, wherein each target probability of the plurality of target probabilities characterizes a probability that the recognition result of the audio feature is the corresponding candidate speech element; and
a second determining unit configured to determine, based on the plurality of target probabilities and audio semantic information, one candidate speech element from the plurality of candidate speech elements as the first speech element corresponding to the audio feature.
11. The apparatus of claim 8 or 9, wherein the second speech element set comprises a plurality of second speech elements, each of the plurality of second speech elements comprising at least one speech state; and
the matching module is further configured to match each first speech element in the first speech element set with each speech state.
12. The apparatus of claim 11, wherein the determining module comprises:
a third determining submodule configured to determine, for a first speech element matched with each speech state, first time information corresponding to the first speech element as time information for the speech state;
a fourth determining submodule configured to determine the time information for each speech state as time information of a second speech element corresponding to the speech state; and
a fifth determining submodule configured to determine the second time information for the text data based on the time information of the second speech element.
13. The apparatus of any one of claims 8-12, wherein the output module is further configured to:
output the text data as subtitle data of the audio and video data based on the second time information.
14. The apparatus of any one of claims 8-13, wherein a first speech element in the first speech element set comprises a phoneme, and a second speech element in the second speech element set comprises a phoneme.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
CN202111125712.2A 2021-09-24 2021-09-24 Audio and video data processing method and device, electronic equipment and medium Pending CN113849689A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111125712.2A CN113849689A (en) 2021-09-24 2021-09-24 Audio and video data processing method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111125712.2A CN113849689A (en) 2021-09-24 2021-09-24 Audio and video data processing method and device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN113849689A true CN113849689A (en) 2021-12-28

Family

ID=78979990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111125712.2A Pending CN113849689A (en) 2021-09-24 2021-09-24 Audio and video data processing method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN113849689A (en)

Similar Documents

Publication Publication Date Title
KR20210132578A (en) Method, apparatus, device and storage medium for constructing knowledge graph
US11423907B2 (en) Virtual object image display method and apparatus, electronic device and storage medium
CN114861889B (en) Deep learning model training method, target object detection method and device
US20200409998A1 (en) Method and device for outputting information
CN109858045B (en) Machine translation method and device
US11646050B2 (en) Method and apparatus for extracting video clip
CN112149404A (en) Method, device and system for identifying risk content of user privacy data
US11990124B2 (en) Language model prediction of API call invocations and verbal responses
WO2024099171A1 (en) Video generation method and apparatus
CN110245334B (en) Method and device for outputting information
CN113850386A (en) Model pre-training method, device, equipment, storage medium and program product
CN115098729A (en) Video processing method, sample generation method, model training method and device
JP7182584B2 (en) A method for outputting information of parsing anomalies in speech comprehension
CN113869042A (en) Text title generation method and device, electronic equipment and storage medium
CN106896936B (en) Vocabulary pushing method and device
CN113468857A (en) Method and device for training style conversion model, electronic equipment and storage medium
CN113111658A (en) Method, device, equipment and storage medium for checking information
CN112948584A (en) Short text classification method, device, equipment and storage medium
CN117391067A (en) Content quality inspection method, device, equipment and storage medium
US20230076471A1 (en) Training method, text translation method, electronic device, and storage medium
CN116204624A (en) Response method, response device, electronic equipment and storage medium
CN113590447B (en) Buried point processing method and device
KR20200082232A (en) Apparatus for analysis of emotion between users, interactive agent system using the same, terminal apparatus for analysis of emotion between users and method of the same
CN113849689A (en) Audio and video data processing method and device, electronic equipment and medium
CN113221514A (en) Text processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination