CN112906650B

CN112906650B - Intelligent processing method, device, equipment and storage medium for teaching video

Info

Publication number: CN112906650B
Application number: CN202110315710.3A
Authority: CN
Inventors: 梁嘉兴
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-03-24
Filing date: 2021-03-24
Publication date: 2023-08-15
Anticipated expiration: 2041-03-24
Also published as: CN112906650A

Abstract

The disclosure provides an intelligent processing method, device, equipment and storage medium for teaching videos, relates to the technical field of computers, and particularly relates to the technical field of online teaching. The specific implementation scheme is as follows: performing language form processing on teaching audio in the teaching video to obtain a language form processing result of the teaching audio; respectively processing the action type and the mouth shape type of the teaching image in the teaching video to obtain an action type processing result and a mouth shape type processing result of the teaching image; and cross-checking at least two of the language form processing result, the action type processing result and the mouth shape type processing result to obtain a teaching video processing result. The embodiment of the disclosure can improve the processing efficiency of the teaching video.

Description

Intelligent processing method, device, equipment and storage medium for teaching video

Technical Field

The disclosure relates to the technical field of computers, in particular to the technical field of online teaching, and specifically relates to an intelligent processing method, device, equipment and storage medium of a teaching video.

Background

With the development of computer technology, users can learn through the internet in an electronic environment composed of communication technology, microcomputer technology, computer technology, artificial intelligence, network technology, multimedia technology, and the like.

In an online learning scene, a teacher can record teaching videos in advance. How to process teaching videos is important.

Disclosure of Invention

The disclosure provides an intelligent processing method, device, equipment and storage medium for teaching videos.

According to an aspect of the present disclosure, there is provided an intelligent processing method of a teaching video, including:

performing language form processing on teaching audio in the teaching video to obtain a language form processing result of the teaching audio;

respectively processing the action type and the mouth shape type of the teaching image in the teaching video to obtain an action type processing result and a mouth shape type processing result of the teaching image;

and cross-checking at least two of the language form processing result, the action type processing result and the mouth shape type processing result to obtain a teaching video processing result.

According to another aspect of the present disclosure, there is provided an intelligent processing device for teaching video, including:

the language form processing module is used for carrying out language form processing on teaching audio in the teaching video to obtain a language form processing result of the teaching audio;

the action mouth shape processing module is used for respectively processing action types and mouth shapes of teaching images in the teaching video to obtain action type processing results and mouth shape type processing results of the teaching images;

And the cross checking module is used for carrying out cross checking on at least two of the language form processing result, the action type processing result and the mouth shape type processing result so as to obtain a teaching video processing result.

According to still another aspect of the present disclosure, there is provided an electronic apparatus including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the intelligent processing method of teaching video provided by any of the embodiments of the present disclosure.

According to yet another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the intelligent processing method of the teaching video provided by any embodiment of the present disclosure.

According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the intelligent processing method of teaching video provided by any embodiment of the present disclosure.

According to the technology disclosed by the invention, the processing efficiency of the teaching video can be improved.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic diagram of an intelligent processing method of teaching video according to an embodiment of the disclosure;

FIG. 2 is a schematic diagram of another intelligent processing method of teaching video according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of yet another intelligent processing method of teaching video according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of an intelligent processing device for teaching video according to an embodiment of the disclosure;

fig. 5 is a block diagram of an electronic device for implementing the intelligent processing method of teaching video according to an embodiment of the disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The following describes in detail the solution provided by the embodiments of the present disclosure with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of an intelligent processing method of a teaching video according to an embodiment of the present disclosure, where the embodiment of the present disclosure may be applicable to a case of processing audio and images in the teaching video. The method can be executed by an intelligent processing device of teaching video, and the device can be realized in a hardware and/or software mode and can be configured in electronic equipment. Referring to fig. 1, the method specifically includes the following:

s110, carrying out language form processing on teaching audio in the teaching video to obtain a language form processing result of the teaching audio;

s120, respectively processing the action type and the mouth shape type of the teaching image in the teaching video to obtain an action type processing result and a mouth shape type processing result of the teaching image;

s130, performing cross check on at least two of the language form processing result, the action type processing result and the mouth shape type processing result to obtain a teaching video processing result.

The teaching video can be video which is recorded in advance by a teacher and used for students to learn online. The teaching video may include teaching audio and teaching image, and the teaching audio and the teaching image may be associated with each other by a time stamp, that is, the teaching audio and the teaching image are associated with each other at the same time.

The language forms may include spoken forms and written forms, wherein the spoken forms refer to spoken stop words or spoken Buddhist, etc. The action types may include valid actions and invalid actions; the effective actions are indispensable actions in the teaching process, and the ineffective actions are non-indispensable actions. The mouth shape types may include a valid mouth shape and an invalid mouth shape; the effective mouth shape is an indispensable mouth shape in the teaching process, and the ineffective mouth shape is a non-indispensable mouth shape in the teaching process.

Specifically, teaching audio in teaching video can be obtained, language form processing is carried out on the teaching audio, and word sets belonging to different language forms are obtained; the method comprises the steps of obtaining teaching images in a teaching video, respectively identifying actions and mouth shapes in the teaching images, determining the types of the actions and the mouth shapes, and obtaining action sets belonging to different types and mouth shape sets belonging to different types.

The cross check is to select at least one item from the language form processing result, the action type processing result and the mouth shape type processing result as a check standard, select at least one item as a check object, and check the check object by adopting a calibration standard. That is, in the case where the types are not identical between the check standard and the check object, the type of the check standard is regarded as the type of the check object, that is, the type of the check standard is regarded as being in control. For example, the action type processing result and/or the mouth shape processing result can be checked and adjusted by taking the language form processing result as a standard; the method can also adopt the action type processing result and/or the mouth shape processing result as the standard to check and adjust the language form processing result; two of language form processing results, action type processing results and mouth shape processing results can be adopted to verify and adjust the other one. Through cross checking among the language form, the action type and the mouth shape type, the accuracy of the checking result can be improved, namely the accuracy of the language form, the action type and the mouth shape type can be improved; and through automatic teaching video processing, need not to rely on the manual work to handle teaching video, can also improve teaching video's processing efficiency.

According to the technical scheme, the language form, the action type and the mouth shape type of the teaching video are automatically cross-checked, manual dependency is not needed, the processing efficiency of the teaching video can be improved, and the quality of the processing result of the teaching video can be improved.

Fig. 2 is a flow chart of another intelligent processing method for teaching video according to an embodiment of the disclosure. This embodiment is an alternative to the embodiments described above. Referring to fig. 2, the intelligent processing method for teaching video provided in this embodiment includes:

s210, carrying out language form processing on teaching audio in the teaching video to obtain a language form processing result of the teaching audio;

s220, respectively processing the action type and the mouth shape type of the teaching image in the teaching video to obtain an action type processing result and a mouth shape type processing result of the teaching image;

s230, checking the action type processing result and the mouth shape type processing result according to the language form processing result based on the timestamp association relation to obtain a new action type processing result and a new mouth shape type processing result;

S240, aligning the language form processing result, the new action type processing result and the new mouth shape type processing result based on the time stamp association relation to obtain a teaching video processing result.

In the embodiment of the disclosure, the language form processing result can be used as a verification standard, the action type processing result and the mouth shape type processing result are verified, and the new action type processing result and the new mouth shape type processing result are obtained by performing type adjustment on the action type processing result and the mouth shape type processing result which are different from the associated language form processing result types. And aligning the language form processing result, the new action type processing result and the new mouth type processing result based on the time stamp association relation, so that the types of the language form processing result, the new action type processing result and the new mouth type processing result associated with the same time stamp are the same, namely realizing the type alignment. The accuracy of language forms, actions and mouth types can be further improved through cross checking, the language forms, the action types and the mouth types which are associated with the same time stamp can be consistent, and the teaching video processing results can be further processed conveniently, so that the quality of teaching videos is further improved.

In an alternative embodiment, the linguistic form processing results include a set of spoken words and a set of written words; the action type processing result comprises an invalid action set and an effective action set; the mouth shape type processing results include an invalid mouth shape set and a valid mouth shape set.

Wherein, one word in the spoken language set and the written language set can be associated with at least one teaching video time stamp, one action in the invalid action set and the valid action set can be associated with at least one teaching video time stamp, and one mouth shape in the invalid mouth shape set and the valid mouth shape set can be associated with at least one teaching video time stamp; the words, actions and mouth shapes associated with the same teaching video time stamp are associated with each other.

In an alternative embodiment, based on the timestamp association relationship, verifying the action type processing result and the mouth shape type processing result according to the language form processing result includes: based on the timestamp association relationship, acquiring an action associated with the spoken word in the spoken word set, and adjusting the action associated with the spoken word into the invalid action set when the action associated with the spoken word belongs to the valid action set; based on the timestamp association, a mouth shape associated with the spoken words in the spoken word set is obtained, and the mouth shape associated with the spoken words is adjusted to the invalid mouth shape set when the mouth shape associated with the spoken words belongs to the valid mouth shape set.

Specifically, the spoken words in the spoken word set can be traversed, the actions associated with the spoken words are obtained based on the timestamp association relation, and when the associated actions belong to the valid action set, the associated actions are adjusted to be in the invalid action set, namely when the associated actions are valid actions, the associated actions are adjusted to be invalid actions; the mouth shape associated with the spoken utterance may be acquired, and the associated mouth shape may be adjusted to an invalid mouth shape when the associated mouth shape is a valid mouth shape. By adjusting the action associated with the spoken word to an invalid action and adjusting the mouth shape associated with the spoken word to an invalid mouth shape, the accuracy of the type of the action and the mouth shape can be improved. The method is particularly suitable for the situation that the accuracy of the language form processing result is higher than that of the action type processing result and the mouth shape type processing result.

In an optional implementation manner, the aligning the language form processing result, the new action type processing result and the new mouth type processing result based on the timestamp association relationship includes: aiming at target words in the spoken word set and the written word set, acquiring target actions associated with the target words and target mouth shapes associated with the target words based on a timestamp association relation; determining that the target word also belongs to the same type under the condition that the target action and the target mouth shape belong to the same type set; and under the condition that the target action and the target mouth shape belong to different types of sets, acquiring a labeling type, and taking the labeling type as the type of the target word, the target action and the target mouth shape.

The target words can belong to a spoken language set or a written language set, namely the target words can be spoken words or written words. And (3) through traversing each word (namely the target word) in the language form processing result, acquiring a target action and a target mouth shape associated with the target word based on the timestamp association relation.

Under the condition that the types of the target action and the target mouth shape are both valid, the target word can be determined to belong to a written word (namely, the target word is valid as well); in the event that both the target action and the type of target spoken form are invalid, it may be determined that the target word belongs to the spoken word (i.e., the target word is also invalid). And, in the case where the type of the target action is different from the type of the target mouth shape, that is, in the case where one type is valid and the other type is invalid, the labeling type determined based on the quantization standard may be acquired, and the labeling type is taken as the type of the three of the target word, the target action, and the target mouth shape. Wherein the spoken language form, the ineffective action and the type between the ineffective mouth forms are the same; the written language form, the effective actions and the types between the effective mouth shapes are the same. Through the type alignment among the language form, the action and the mouth shape, the accuracy of the types of the language form, the action and the mouth shape can be further improved.

In addition, the method can also adopt language form and action type to carry out alignment verification on the mouth shape type, namely traversing the mouth shapes in the effective mouth shape set and the ineffective mouth shape set to obtain words related to the mouth shape and actions related to the mouth shape; in the case that the associated word and the associated action belong to the same type set, determining that the mouth shape also belongs to the same type; and under the condition that the associated words and the associated actions belong to different types of sets, the labeling type is used as the type of the associated words, the associated actions and the mouth shape. The method can also adopt language form and mouth shape type to carry out alignment check on the action type, namely traversing the actions in the effective action set and the ineffective action set, and acquiring words related to the actions and mouth shapes related to the actions; in the case that the associated word and the associated mouth shape belong to the same type set, determining that the action also belongs to the same type; and when the associated word and the associated mouth shape belong to different types of sets, the labeling type is used as the type of the associated word, the action and the associated mouth shape.

In addition, in the teaching video processing result, if the word of any time stamp belongs to a spoken word, the action of the time stamp belongs to an invalid action, and the mouth shape is in an invalid mouth shape, the teaching audio and the teaching image of the time stamp can be directly deleted, so that invalid information in the teaching video is removed, the quality of the teaching video is improved, the duration of the teaching video is shortened, and the online learning efficiency is improved. It should be noted that the key knowledge in the teaching video can be marked with key points, so that the learning efficiency of the key knowledge is improved.

According to the technical scheme, the spoken language is adopted to verify the action type and the mouth shape type, and the language form, the action and the mouth shape are aligned, so that the quality of a teaching video processing result can be further improved; and invalid information in the teaching video processing result can be removed, so that the duration of the teaching video is shortened, and the online learning efficiency is improved.

Fig. 3 is a flow chart of another intelligent processing method for teaching video according to an embodiment of the disclosure. This embodiment is an alternative to the embodiments described above. Referring to fig. 3, the intelligent processing method for teaching video provided in this embodiment includes:

s310, extracting words in a language form of spoken words from teaching audio of a teaching video based on a spoken word dictionary, and replacing the words with written words;

s320, identifying overlapping words in the teaching audio, and performing de-overlapping processing on the overlapping words outside the overlapping word white list to obtain a language form processing result;

s330, respectively processing the action type and the mouth shape type of the teaching image in the teaching video to obtain an action type processing result and a mouth shape type processing result of the teaching image;

S340, performing cross check on at least two of the language form processing result, the action type processing result and the mouth shape type processing result to obtain a teaching video processing result.

Wherein the spoken word dictionary includes spoken words, and written words associated with the spoken words. Specifically, the whole teaching audio can be subjected to sentence breaking through semantic analysis, sentence breaking results are converted into text sentences, the text sentences are compared with spoken words in a spoken word dictionary, the spoken words in the text sentences are recognized, and the spoken words are replaced by written words.

Where overlapping words refer to words in which there are overlapping characters, i.e., in which there are characters that occur twice or more in succession, for example, "this is the overlapping word". The overlapped word white list comprises overlapped words conforming to grammar regulations, the overlapped word white list can be constructed through overlapped words commonly used in a statistical dictionary, for example, although overlapped words exist in warm sun, the overlapped words conform to grammar regulations, and the overlapped word white list can be added into the overlapped word white list.

Specifically, overlapping words and expressions of the teaching audio can be extracted based on a character string matching technology, so that overlapping words and expressions in the teaching audio are obtained; and matching the overlapped words in the teaching audio with the overlapped word whitelist, reserving the overlapped words belonging to the overlapped word whitelist in the teaching audio, and performing de-overlapping treatment on the overlapped words not belonging to the overlapped word whitelist. Here, the de-overlapping process refers to deleting the overlapped character in the overlapped word, for example, "this is adjusted to" this "by the de-overlapping process. The accuracy of the language form processing result can be improved by adopting a spoken language dictionary and an overlapped word white list to process the language form of the teaching audio.

In an alternative embodiment, the spoken dictionary is determined based on historical teaching audio of the user to whom the teaching video pertains. Specifically, the historical teaching audio of the user to which the teaching video belongs can be determined through manual annotation statistics, and the personalized spoken word dictionary of the user is obtained, so that the accuracy of language form processing is further improved.

In an optional implementation manner, performing action type and mouth shape type processing on the teaching image in the teaching video respectively, and obtaining an action type processing result and a mouth shape type processing result of the teaching image includes: respectively carrying out action and mouth shape recognition on the teaching images in the teaching video to obtain actions and mouth shapes in the teaching images; clustering actions and mouth shapes in each teaching image to obtain at least two actions and at least two mouth shapes; dividing at least one action into an active action set and dividing at least one action into an inactive action set; and, at least one of the mouth shapes is divided into an active mouth shape set and at least one of the mouth shapes is divided into an inactive mouth shape set.

Specifically, part of the actions in at least two actions can be randomly divided into an effective action set, the rest of the actions are divided into an ineffective action set, and part of the mouth shapes in at least two mouth shapes can be randomly divided into an effective mouth shape set, and the rest of the mouth shapes are divided into an ineffective mouth shape set, so that the determination efficiency of action type processing results and mouth shape type processing results is improved.

Because the number of actions and mouth shapes in the teaching image is large, clustering is carried out on the actions and mouth shapes, and the number of the actions and mouth shapes is reduced on the basis of clustering results of the actions and mouth shapes, so that the efficiency of cross checking of language forms, action types and mouth shapes is improved. And, by cross-checking at least two of the language form processing result, the action type processing result and the mouth shape type processing result, the accuracy of the language form processing result, the action type processing result and the mouth shape type processing result can be improved.

It should be noted that in the embodiment of the present disclosure, the pre-recorded teaching video may be processed, and the on-line processing may also be performed in the recording process of the teaching video, for example, the teaching video may be processed through a plug-in unit built in the recording device, or the teaching video may be processed through a video acquisition device integrated with a loudspeaker and a microphone adopted in the teaching process.

According to the technical scheme, the accuracy of the language form processing result, the action type processing result and the mouth shape type processing result can be improved; the language form processing is carried out on teaching audios by adopting a spoken language dictionary and an overlapped word white list, and the accuracy of the language form processing result can be improved; and, can also raise the determination efficiency of the action type processing result, mouth shape type processing result.

Fig. 4 is a schematic diagram of an intelligent processing device for teaching video according to an embodiment of the present disclosure, where the embodiment may be suitable for a case where processing is performed on a teaching video in a language form, an action type, a mouth shape type, etc., and the device is configured in an electronic device, so as to implement the intelligent processing method for teaching video according to any embodiment of the present disclosure. The intelligent processing device 400 of the teaching video specifically includes the following steps:

the language form processing module 401 is configured to perform language form processing on the teaching audio in the teaching video, so as to obtain a language form processing result of the teaching audio;

the action mouth shape processing module 402 is configured to respectively perform action type and mouth shape type processing on a teaching image in the teaching video, so as to obtain an action type processing result and a mouth shape type processing result of the teaching image;

the cross checking module 403 is configured to cross check at least two of the language form processing result, the action type processing result, and the mouth shape type processing result, so as to obtain a teaching video processing result.

In an alternative embodiment, the cross-checking module 403 includes:

the verification unit is used for verifying the action type processing result and the mouth shape type processing result according to the language form processing result based on the time stamp association relation so as to obtain a new action type processing result and a new mouth shape type processing result;

And the alignment unit is used for aligning the language form processing result, the new action type processing result and the new mouth type processing result based on the time stamp association relation.

In an alternative embodiment, the verification unit is specifically configured to:

based on the timestamp association relationship, acquiring an action associated with the spoken word in the spoken word set, and adjusting the action associated with the spoken word into the invalid action set when the action associated with the spoken word belongs to the valid action set;

based on the timestamp association, a mouth shape associated with the spoken words in the spoken word set is obtained, and the mouth shape associated with the spoken words is adjusted to the invalid mouth shape set when the mouth shape associated with the spoken words belongs to the valid mouth shape set.

In an alternative embodiment, the alignment unit is specifically configured to:

Aiming at target words in the spoken word set and the written word set, acquiring target actions associated with the target words and target mouth shapes associated with the target words based on a timestamp association relation;

determining that the target word also belongs to the same type under the condition that the target action and the target mouth shape belong to the same type set;

and under the condition that the target action and the target mouth shape belong to different types of sets, acquiring a labeling type, and taking the labeling type as the type of the target word, the target action and the target mouth shape.

In an alternative embodiment, the language form processing module 401 includes:

a spoken word processing unit for extracting words in a language form of a spoken word from the teaching audio of the teaching video based on a spoken word dictionary, and replacing the words with written words;

and the overlapped word processing unit is used for identifying overlapped words in the teaching audio and carrying out de-overlapping processing on the overlapped words outside the overlapped word white list.

In an alternative embodiment, the spoken word dictionary is determined based on historical teaching audio of a user to whom the teaching video belongs.

In an alternative embodiment, the action profile processing module 402 includes:

the action mouth shape recognition unit is used for respectively carrying out action and mouth shape recognition on the teaching images in the teaching video to obtain action and mouth shape in the teaching images;

the action mouth shape clustering unit is used for clustering actions and mouth shapes in the teaching images to obtain at least two actions and at least two mouth shapes;

an action mouth shape dividing unit for dividing at least one action into an effective action set and dividing at least one action into an ineffective action set; and, at least one of the mouth shapes is divided into an active mouth shape set and at least one of the mouth shapes is divided into an inactive mouth shape set.

According to the technical scheme, the language form, the action type and the mouth shape type of the teaching video are automatically checked in a cross mode, manual work is not needed, the processing efficiency of the teaching video can be improved, and the quality of a processing result of the teaching video can be improved; and moreover, the quality of the teaching video processing result can be further improved by verifying the action type and the mouth shape type by adopting the spoken language and aligning the language form, the action and the mouth shape type.

According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.

Fig. 5 illustrates a schematic block diagram of an example electronic device 500 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 5, the apparatus 500 includes a computing unit 501 that can perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, ROM 502, and RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.

Various components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, etc.; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508 such as a magnetic disk, an optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 501 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units executing machine learning model algorithms, a digital information processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the respective methods and processes described above, for example, an intelligent processing method of a teaching video. For example, in some embodiments, the intelligent processing method of the teaching video may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into RAM 503 and executed by computing unit 501, one or more steps of the intelligent processing method of teaching video described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform intelligent processing of the teaching video in any other suitable way (e.g., by means of firmware).

Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems On Chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs executing on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.

It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, so long as the information desired by the technical solution of the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims

1. An intelligent processing method of teaching video comprises the following steps:

respectively processing the action type and the mouth shape type of the teaching image in the teaching video to obtain an action type processing result and a mouth shape type processing result of the teaching image; the action type is effective action or ineffective action; the mouth shape type is an effective mouth shape or an ineffective mouth shape;

Cross-checking at least two of the language form processing result, the action type processing result and the mouth shape type processing result to obtain a teaching video processing result;

wherein the cross-checking at least two of the linguistic form processing result, the action type processing result, and the mouth shape type processing result comprises:

based on the time stamp association relation, verifying the action type processing result and the mouth shape type processing result according to the language form processing result to obtain a new action type processing result and a new mouth shape type processing result;

and aligning the language form processing result, the new action type processing result and the new mouth type processing result based on the timestamp association relation.

2. The method of claim 1, wherein the linguistic form processing results comprise a set of spoken words and a set of written words; the action type processing result comprises an invalid action set and an effective action set; the mouth shape type processing results include an invalid mouth shape set and a valid mouth shape set.

3. The method of claim 2, wherein the verifying the action type processing result and the mouth type processing result according to the language form processing result based on the timestamp association relationship comprises:

4. The method of claim 2, wherein the aligning the linguistic form processing result, the new action type processing result, and the new mouth form type processing result based on the timestamp association relationship comprises:

5. The method of claim 1, wherein the language processing of the teaching audio in the teaching video to obtain the language processing result of the teaching audio comprises:

extracting words in a language form of spoken words from teaching audio of the teaching video based on a spoken word dictionary, and replacing the words with written words;

and identifying overlapping words in the teaching audio, and performing de-overlapping processing on the overlapping words outside the overlapping word white list.

6. The method of claim 5, wherein the spoken word dictionary is determined from historical teaching audio of a user to whom the teaching video belongs.

7. The method of claim 1, wherein the respectively performing the action type and the mouth shape type processing on the teaching image in the teaching video to obtain an action type processing result and a mouth shape type processing result of the teaching image, includes:

respectively carrying out action and mouth shape recognition on the teaching images in the teaching video to obtain actions and mouth shapes in the teaching images;

Clustering actions and mouth shapes in each teaching image to obtain at least two actions and at least two mouth shapes;

dividing at least one action into an active action set and dividing at least one action into an inactive action set; and, at least one of the mouth shapes is divided into an active mouth shape set and at least one of the mouth shapes is divided into an inactive mouth shape set.

8. An intelligent processing device for teaching video, comprising:

the action mouth shape processing module is used for respectively processing action types and mouth shapes of teaching images in the teaching video to obtain action type processing results and mouth shape type processing results of the teaching images; the action type is effective action or ineffective action; the mouth shape type is an effective mouth shape or an ineffective mouth shape;

the cross checking module is used for carrying out cross checking on at least two of the language form processing result, the action type processing result and the mouth shape type processing result so as to obtain a teaching video processing result;

Wherein, the cross-checking module includes:

9. The apparatus of claim 8, wherein the linguistic form processing results comprise a set of spoken words and a set of written words; the action type processing result comprises an invalid action set and an effective action set; the mouth shape type processing results include an invalid mouth shape set and a valid mouth shape set.

10. The apparatus of claim 9, wherein the verification unit is specifically configured to:

11. The device according to claim 9, wherein the alignment unit is specifically configured to:

12. The apparatus of claim 8, wherein the language form processing module comprises:

13. The apparatus of claim 12, wherein the spoken word dictionary is determined from historical teaching audio of a user to which the teaching video belongs.

14. The apparatus of claim 8, wherein the action profile processing module comprises:

15. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.

16. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-7.