CN112906650A - Intelligent processing method, device and equipment for teaching video and storage medium - Google Patents


Info

Publication number
CN112906650A
CN112906650A · Application: CN202110315710.3A · Granted publication: CN112906650B
Authority
CN
China
Prior art keywords
action
processing result
teaching
mouth shape
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110315710.3A
Other languages
Chinese (zh)
Other versions
CN112906650B (en)
Inventor
梁嘉兴 (Liang Jiaxing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110315710.3A
Publication of CN112906650A
Application granted
Publication of CN112906650B
Legal status: Active (granted)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70: Information retrieval of video data
    • G06F 16/75: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70: Information retrieval of video data
    • G06F 16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/7867: Retrieval characterised by using manually generated information, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/23: Clustering techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Psychiatry (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Library & Information Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides an intelligent processing method, apparatus, and device for teaching videos, and a storage medium, and relates to the technical field of computers, in particular to the technical field of online teaching. The specific implementation scheme is as follows: performing language form processing on the teaching audio in a teaching video to obtain a language form processing result of the teaching audio; respectively performing action type and mouth shape type processing on the teaching images in the teaching video to obtain an action type processing result and a mouth shape type processing result of the teaching images; and performing a cross check on at least two of the language form processing result, the action type processing result, and the mouth shape type processing result to obtain a teaching video processing result. Embodiments of the disclosure can improve the processing efficiency of teaching videos.

Description

Intelligent processing method, device and equipment for teaching video and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an intelligent processing method, apparatus, device, and storage medium for teaching videos.
Background
With the development of computer technology, users can learn over the Internet in an electronic environment built from communication technology, microcomputer technology, computer technology, artificial intelligence, network technology, multimedia technology, and the like.
In an online learning scenario, a teacher may pre-record a teaching video. How such teaching videos are processed is therefore very important.
Disclosure of Invention
The present disclosure provides an intelligent processing method, apparatus, device and storage medium for teaching video.
According to an aspect of the present disclosure, there is provided an intelligent processing method of a teaching video, including:
performing language form processing on teaching audio in a teaching video to obtain a language form processing result of the teaching audio;
respectively carrying out action type and mouth shape type processing on the teaching image in the teaching video to obtain an action type processing result and a mouth shape type processing result of the teaching image;
and performing a cross check on at least two of the language form processing result, the action type processing result, and the mouth shape type processing result to obtain a teaching video processing result.
According to another aspect of the present disclosure, there is provided an intelligent processing device for teaching video, comprising:
the language form processing module is used for carrying out language form processing on the teaching audio in the teaching video to obtain a language form processing result of the teaching audio;
the action mouth shape processing module is used for respectively carrying out action type and mouth shape type processing on the teaching image in the teaching video to obtain an action type processing result and a mouth shape type processing result of the teaching image;
and the cross checking module is used for performing a cross check on at least two of the language form processing result, the action type processing result, and the mouth shape type processing result to obtain a teaching video processing result.
According to still another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method for intelligent processing of instructional video provided by any of the embodiments of the disclosure.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method for intelligent processing of instructional video provided by any of the embodiments of the present disclosure.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the intelligent processing method of instructional videos provided by any of the embodiments of the present disclosure.
According to the technology of the present disclosure, the processing efficiency of teaching videos can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of a method for intelligent processing of instructional videos, according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of another intelligent processing method of instructional video, according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of yet another intelligent processing method for instructional videos, according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an intelligent processing device for instructional videos, according to an embodiment of the present disclosure;
fig. 5 is a block diagram of an electronic device for implementing the intelligent processing method of instructional video according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The scheme provided by the embodiment of the disclosure is described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an intelligent processing method for teaching videos according to an embodiment of the present disclosure, which is applicable to a case of processing audio and images in teaching videos. The method can be executed by an intelligent processing device for teaching videos, which can be realized by hardware and/or software and can be configured in electronic equipment. Referring to fig. 1, the method specifically includes the following steps:
s110, performing language form processing on teaching audio in a teaching video to obtain a language form processing result of the teaching audio;
s120, respectively carrying out action type and mouth shape type processing on the teaching image in the teaching video to obtain an action type processing result and a mouth shape type processing result of the teaching image;
s130, performing cross check on at least two items of the language form processing result, the action type processing result and the mouth type processing result to obtain a teaching video processing result.
The teaching video can be a video which is pre-recorded by a teacher and is used for students to learn online. The teaching video can comprise teaching audio and teaching images, and the teaching audio and the teaching images can be associated through timestamps, namely the teaching audio and the teaching images associated with the same moment are associated with each other.
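For concreteness, such a timestamp association can be pictured as one time-aligned record per segment. The following is a minimal Python sketch; the `Segment` structure and all field names are illustrative assumptions, not part of the disclosure:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One time-aligned slice of the teaching video (hypothetical structure).

    A word, an action, and a mouth shape that carry the same timestamp
    range are considered associated with one another.
    """
    start_ms: int      # segment start time in milliseconds
    end_ms: int        # segment end time in milliseconds
    word: str          # word recognized from the teaching audio
    action: str        # action recognized from the teaching image
    mouth_shape: str   # mouth shape recognized from the teaching image

# The word, action, and mouth shape at 12.0-12.5 s are mutually associated.
seg = Segment(12000, 12500, word="theorem",
              action="point_at_board", mouth_shape="open_mid")
```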
The language form can include a spoken language form and a written language form, where the spoken language form refers to spoken filler words, pet phrases (verbal tics), and the like. The action types may include valid actions and invalid actions: a valid action is one that is essential to the teaching process, and an invalid action is one that is not. The mouth shape types may include valid and invalid mouth shapes: a valid mouth shape is one that is essential to the teaching process, and an invalid mouth shape is one that is not.
Specifically, the teaching audio in the teaching video can be obtained and subjected to language form processing to obtain word sets belonging to different language forms; and the teaching images in the teaching video can be obtained, the actions and mouth shapes in them recognized and their types determined, yielding action sets of different types and mouth shape sets of different types.
The cross check selects at least one of the language form processing result, the action type processing result, and the mouth shape type processing result as a verification standard, selects at least one of the others as a verification object, and checks the verification object against the verification standard. That is, where the types of the verification standard and the verification object are inconsistent, the type of the verification standard is taken as the type of the verification object. For example, the action type processing result and/or the mouth shape type processing result may be checked and adjusted with the language form processing result as the standard; or the action type processing result and/or the mouth shape type processing result may serve as the standard to check and adjust the language form processing result; or any two of the three results may be used to check and adjust the remaining one. Cross checking among the language form, the action type, and the mouth shape type improves the accuracy of the check result, that is, the accuracy of the language form, action type, and mouth shape type labels; in addition, because the teaching video is processed automatically rather than manually, the processing efficiency of the teaching video can be improved.
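As a rough illustration, the cross check can be read as a per-timestamp label reconciliation. The sketch below assumes each processing result has been reduced to a mapping from timestamps to type labels; the function name, the dictionaries, and the "valid"/"invalid" labels are assumptions for illustration, not the disclosed implementation:

```python
def cross_check(standard: dict, target: dict) -> dict:
    """Reconcile the target's type labels with the verification standard.

    Both arguments map a timestamp to a type label such as "valid" or
    "invalid"; wherever the two disagree, the standard's label prevails.
    """
    checked = dict(target)
    for ts, std_label in standard.items():
        if ts in checked and checked[ts] != std_label:
            checked[ts] = std_label  # the verification standard is taken as truth
    return checked

# e.g. check the action type result against the language form result
language_form = {12000: "invalid", 12500: "valid"}   # spoken word -> invalid
action_type   = {12000: "valid",   12500: "valid"}
new_action_type = cross_check(standard=language_form, target=action_type)
# new_action_type == {12000: "invalid", 12500: "valid"}
```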
According to this technical scheme, cross verification is performed automatically among the language form, action type, and mouth shape type of the teaching video without manual work, which improves both the processing efficiency of the teaching video and the quality of the teaching video processing result.
Fig. 2 is a schematic flow chart diagram of another method for intelligently processing teaching videos according to an embodiment of the present disclosure. The present embodiment is an alternative proposed on the basis of the above-described embodiments. Referring to fig. 2, the intelligent processing method for teaching video provided by this embodiment includes:
s210, performing language form processing on teaching audio in a teaching video to obtain a language form processing result of the teaching audio;
s220, respectively carrying out action type and mouth shape type processing on the teaching image in the teaching video to obtain an action type processing result and a mouth shape type processing result of the teaching image;
s230, verifying the action type processing result and the mouth shape type processing result according to the language form processing result based on the timestamp incidence relation to obtain a new action type processing result and a new mouth shape type processing result;
and S240, aligning the language form processing result, the new action type processing result and the new mouth shape type processing result based on the timestamp incidence relation to obtain a teaching video processing result.
In the embodiment of the present disclosure, the language form processing result may be used as the verification standard to verify the action type processing result and the mouth shape type processing result, and type adjustment is performed on those entries of the action type and mouth shape type processing results whose types differ from the associated language form processing result, so as to obtain a new action type processing result and a new mouth shape type processing result. Based on the timestamp association relationship, the language form processing result, the new action type processing result, and the new mouth shape type processing result are aligned, so that the results associated with the same timestamp have the same type, that is, type alignment is achieved. The cross check further improves the accuracy of the language form, action, and mouth shape types and makes the language form, action type, and mouth shape type associated with the same timestamp consistent, which facilitates further processing of the teaching video processing result and further improves the quality of the teaching video.
In an alternative embodiment, the language form processing result includes a spoken language set and a written language set; the action type processing result includes an invalid action set and a valid action set; and the mouth shape type processing result includes an invalid mouth shape set and a valid mouth shape set.
One word in the spoken language set or the written language set can be associated with at least one teaching video timestamp, one action in the invalid action set or the valid action set can be associated with at least one teaching video timestamp, and one mouth shape in the invalid mouth shape set or the valid mouth shape set can be associated with at least one teaching video timestamp; words, actions, and mouth shapes associated with the same teaching video timestamp are associated with each other.
In an optional implementation manner, the verifying the action type processing result and the mouth shape type processing result according to the language form processing result based on the timestamp association relationship includes: acquiring the action associated with a spoken word in the spoken language set based on the timestamp association relationship, and adjusting the action into the invalid action set if it belongs to the valid action set; and acquiring the mouth shape associated with a spoken word in the spoken language set based on the timestamp association relationship, and adjusting the mouth shape into the invalid mouth shape set if it belongs to the valid mouth shape set.
Specifically, the spoken words in the spoken language set can be traversed; for each spoken word, the associated action is acquired based on the timestamp association relationship and, if it belongs to the valid action set, adjusted into the invalid action set, that is, a valid associated action is re-labeled as invalid; likewise, the associated mouth shape can be acquired and, if valid, adjusted to invalid. Adjusting actions and mouth shapes associated with spoken words to invalid improves the accuracy of the action and mouth shape types. This approach is particularly suitable when the language form processing result is more accurate than the action type and mouth shape type processing results.
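A minimal sketch of this adjustment, assuming each set is keyed by timestamp (the set structures, names, and sample values are illustrative assumptions):

```python
def invalidate_by_spoken_words(spoken_ts, valid_set, invalid_set):
    """Move entries associated with spoken-word timestamps from the valid
    set into the invalid set; applies alike to actions and mouth shapes."""
    for ts in spoken_ts:
        if ts in valid_set:
            invalid_set[ts] = valid_set.pop(ts)
    return valid_set, invalid_set

valid_actions   = {12000: "raise_hand", 13000: "point_at_board"}
invalid_actions = {14000: "scratch_head"}
spoken_word_ts  = [12000]  # timestamps of words in the spoken language set
invalidate_by_spoken_words(spoken_word_ts, valid_actions, invalid_actions)
# valid_actions == {13000: "point_at_board"}; 12000 moved to invalid_actions
```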
In an optional implementation, the aligning the language form processing result, the new action type processing result, and the new mouth shape type processing result based on the timestamp association relationship includes: for a target word in the spoken language set or the written language set, acquiring the target action associated with the target word and the target mouth shape associated with the target word based on the timestamp association relationship; determining that the target word also belongs to the same type when the target action and the target mouth shape belong to sets of the same type; and, when the target action and the target mouth shape belong to sets of different types, acquiring an annotation type and taking the annotation type as the type of the target word, the target action, and the target mouth shape.
The target word can belong to the spoken language set or to the written language set; that is, the target word may be spoken language or written language. Each word in the language form processing result (i.e., each target word) is traversed, and the target action and target mouth shape associated with it are acquired based on the timestamp association relationship.
When the types of the target action and the target mouth shape are both valid, it is determined that the target word belongs to the written language (that is, the target word is also valid); when both are invalid, it is determined that the target word belongs to the spoken language (that is, the target word is also invalid). When the target action and the target mouth shape differ in type, that is, one is valid and the other invalid, an annotation type determined according to a quantization standard may be obtained, and the annotation type is taken as the type of the target word, the target action, and the target mouth shape. Here the spoken language form, invalid actions, and invalid mouth shapes are of the same type, and the written language form, valid actions, and valid mouth shapes are of the same type. Type alignment among the language form, the action, and the mouth shape further improves the accuracy of all three.
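One way to express this alignment rule for a single target word is the following sketch, under the assumption that types have been reduced to "valid"/"invalid" labels and that an externally supplied annotation serves as the tiebreaker (all names are illustrative):

```python
def align_types(action_type: str, mouth_type: str, annotation: str):
    """Align word, action, and mouth shape types for one timestamp.

    If the action and the mouth shape agree, the target word takes their
    shared type; if they disagree, all three fall back to the annotation.
    """
    if action_type == mouth_type:
        shared = action_type
        return shared, shared, shared          # word, action, mouth shape
    return annotation, annotation, annotation  # annotation breaks the tie

print(align_types("invalid", "invalid", annotation="valid"))
# -> ('invalid', 'invalid', 'invalid'): the word is typed as spoken language
print(align_types("valid", "invalid", annotation="valid"))
# -> ('valid', 'valid', 'valid'): the annotation type prevails
```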
It should be noted that the language form and the action type may also be used to align-check the mouth shape type: traverse the mouth shapes in the valid and invalid mouth shape sets and acquire the word and the action associated with each mouth shape; when the associated word and action belong to sets of the same type, determine that the mouth shape also belongs to that type; and when they belong to sets of different types, take the annotation type as the type of the associated word, the associated action, and the mouth shape. Likewise, the language form and the mouth shape type may be used to align-check the action type: traverse the actions in the valid and invalid action sets and acquire the word and the mouth shape associated with each action; when the associated word and mouth shape belong to sets of the same type, determine that the action also belongs to that type; and when they belong to sets of different types, take the annotation type as the type of the associated word, the action, and the associated mouth shape.
In addition, in the teaching video processing result, if the word at any timestamp belongs to the spoken language, the action at that timestamp belongs to an invalid action, and the mouth shape belongs to an invalid mouth shape, the teaching audio and teaching image at that timestamp can be deleted directly, thereby removing invalid information from the teaching video, improving its quality, shortening its duration, and improving online learning efficiency. It should be noted that key knowledge in the teaching video can also be highlighted, improving the learning efficiency for that knowledge.
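Under these assumptions, deleting fully invalid segments amounts to a simple filter over the aligned processing result (a sketch; the segment fields are hypothetical, carried over from the earlier `Segment` illustration):

```python
def prune_invalid_segments(segments: list) -> list:
    """Drop segments whose word, action, and mouth shape are all invalid.

    A segment whose word is spoken language, whose action is invalid, and
    whose mouth shape is invalid carries no teaching content, so removing
    it shortens the video without losing information.
    """
    return [s for s in segments
            if not (s["word_type"] == "spoken"
                    and s["action_type"] == "invalid"
                    and s["mouth_type"] == "invalid")]

segments = [
    {"ts": 12000, "word_type": "spoken",  "action_type": "invalid", "mouth_type": "invalid"},
    {"ts": 12500, "word_type": "written", "action_type": "valid",   "mouth_type": "valid"},
]
print(prune_invalid_segments(segments))  # only the 12500 segment remains
```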
According to the technical scheme of the embodiment of the disclosure, the action type and the mouth shape type are verified by using the spoken language, and the type alignment is performed on the language form, the action and the mouth shape, so that the quality of a teaching video processing result can be further improved; in addition, invalid information in the processing result of the teaching video can be removed, the duration of the teaching video is shortened, and the online learning efficiency is improved.
Fig. 3 is a schematic flow chart diagram of another method for intelligently processing teaching videos according to an embodiment of the present disclosure. The present embodiment is an alternative proposed on the basis of the above-described embodiments. Referring to fig. 3, the intelligent processing method for teaching video provided by this embodiment includes:
s310, extracting words with a language form of a spoken language from teaching audio of the teaching video based on the spoken language dictionary, and replacing the words with written language;
s320, identifying overlapped words in the teaching audio, and performing overlap removal processing on the overlapped words outside the white list of the overlapped words to obtain a language form processing result;
s330, respectively carrying out action type and mouth shape type processing on the teaching image in the teaching video to obtain an action type processing result and a mouth shape type processing result of the teaching image;
s340, performing cross check on at least two items of the language form processing result, the action type processing result and the mouth type processing result to obtain a teaching video processing result.
The spoken language dictionary includes spoken words and the written language associated with each spoken word. Specifically, the whole teaching audio can be broken into sentences through semantic analysis, each sentence-break result converted into a text sentence, the text sentence compared with the spoken words in the spoken language dictionary to recognize the spoken words it contains, and those spoken words replaced with written words.
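A minimal sketch of this lookup-and-replace, using English stand-ins for the dictionary entries (the dictionary contents and all names are illustrative assumptions, not the disclosed dictionary):

```python
# Hypothetical spoken-word dictionary: spoken form -> written-language form.
SPOKEN_DICT = {"gonna": "going to", "kinda": "somewhat", "um": ""}

def replace_spoken_words(sentence: str) -> str:
    """Replace spoken-language words in a transcribed sentence with their
    written-language equivalents; emptied filler words are dropped."""
    replaced = (SPOKEN_DICT.get(w.lower(), w) for w in sentence.split())
    return " ".join(w for w in replaced if w)

print(replace_spoken_words("um I am gonna prove this theorem"))
# -> "I am going to prove this theorem"
```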
Overlapped words are words containing overlapping characters, that is, a character that occurs twice or more in succession; for example, "this this" is an overlapped word. The overlapped word white list contains overlapped words that conform to the grammar specification and can be built from commonly used overlapped words in a statistical dictionary; for example, although "the warm, warm sunlight" contains an overlapped word, it conforms to the grammar specification and can be added to the overlapped word white list.
Specifically, the overlapped words in the teaching audio can be extracted based on a character string matching technique; the extracted overlapped words are then matched against the overlapped word white list, overlapped words belonging to the white list are retained, and overlapped words outside the white list undergo de-overlap processing. De-overlap processing removes the repeated characters in an overlapped word; for example, "this this" is adjusted to "this". Performing language form processing on the teaching audio with the spoken word dictionary and the overlapped word white list improves the accuracy of the language form processing result.
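The de-overlap step might look like the following regular-expression sketch; the white-list entries are illustrative English stand-ins for the original overlapped words:

```python
import re

# Overlapped words that conform to the grammar specification are kept.
OVERLAP_WHITELIST = {"bye bye", "very very"}

def remove_overlap(text: str) -> str:
    """Collapse an immediately repeated word unless it is white-listed."""
    def collapse(m: re.Match) -> str:
        pair = m.group(0)
        return pair if pair.lower() in OVERLAP_WHITELIST else m.group(1)
    # (\b\w+\b) \1\b matches a word immediately repeated once
    return re.sub(r"(\b\w+\b) \1\b", collapse, text)

print(remove_overlap("this this is is the key step"))  # -> "this is the key step"
print(remove_overlap("bye bye everyone"))              # -> "bye bye everyone"
```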
In an alternative embodiment, the spoken word dictionary is determined from historical teaching audio of the user to whom the teaching video belongs. Specifically, a personalized spoken word dictionary for the user can be obtained from statistics over manually labeled historical teaching audio, further improving the accuracy of the language form processing.
In an optional implementation manner, the performing action type and mouth shape type processing on the teaching images in the teaching video respectively to obtain an action type processing result and a mouth shape type processing result of the teaching images includes: respectively recognizing the actions and the mouth shapes in the teaching images of the teaching video; clustering the actions and the mouth shapes in the teaching images to obtain at least two actions and at least two mouth shapes; dividing at least one action into a valid action set and at least one action into an invalid action set; and dividing at least one mouth shape into a valid mouth shape set and at least one mouth shape into an invalid mouth shape set.
Specifically, some of the at least two actions may be randomly divided into the valid action set and the remaining actions into the invalid action set, and some of the at least two mouth shapes may be randomly divided into the valid mouth shape set and the remaining mouth shapes into the invalid mouth shape set, so as to improve the efficiency of determining the action type and mouth shape type processing results.
Because the number of actions and mouth shapes in the teaching images is large, the actions and mouth shapes are clustered first and subsequent processing is based on the clustering result; reducing the number of distinct actions and mouth shapes improves the efficiency of cross checking the language form, action type, and mouth shape type. Moreover, cross checking at least two of the language form, action type, and mouth shape type processing results improves the accuracy of all three.
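As an illustration of the clustering step, k-means over per-detection feature vectors would reduce thousands of raw detections to a handful of representative actions and mouth shapes. This is a sketch assuming scikit-learn and hypothetical pose features; the disclosure does not mandate a particular clustering algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
action_features = rng.random((500, 34))  # e.g. 17 pose keypoints x (x, y) per action

# Cluster raw detections into a small set of representative actions, so the
# cross check operates on cluster labels rather than on every single frame.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(action_features)
action_labels = kmeans.labels_            # cluster id per detected action

# Randomly split the clusters into an initial valid / invalid partition,
# to be corrected later by the cross check.
clusters = rng.permutation(8)
valid_action_clusters = set(clusters[:4])
invalid_action_clusters = set(clusters[4:])
# The same procedure applies to mouth-shape features.
```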
It should be noted that, in the embodiment of the present disclosure, a pre-recorded teaching video may be processed, or the processing may be performed online during recording; for example, the teaching video may be processed by a plug-in built into the recording device, or by a video capture device that integrates the loudspeaker and microphone used in teaching.
According to the technical scheme of the embodiment of the disclosure, the accuracy of the language form processing result, the action type processing result and the mouth shape type processing result can be improved; the oral word dictionary and the overlapped word white list are adopted to process the language form of the teaching audio, so that the accuracy of the language form processing result can be improved; moreover, the determination efficiency of the action type processing result and the mouth type processing result can be improved.
Fig. 4 is a schematic diagram of an intelligent processing apparatus for teaching video according to an embodiment of the present disclosure, where this embodiment is applicable to a case where a language type, an action type, a mouth shape type, and the like are processed on a teaching video, and the apparatus is configured in an electronic device, and can implement an intelligent processing method for teaching video according to any embodiment of the present disclosure. The intelligent processing device 400 for teaching video specifically includes the following:
the language form processing module 401 is configured to perform language form processing on a teaching audio in a teaching video to obtain a language form processing result of the teaching audio;
an action mouth shape processing module 402, configured to perform action type and mouth shape type processing respectively on the teaching images in the teaching video to obtain an action type processing result and a mouth shape type processing result of the teaching images;
and a cross checking module 403, configured to perform cross checking on at least two of the language form processing result, the action type processing result, and the mouth shape type processing result to obtain a teaching video processing result.
In an alternative embodiment, the cross-check module 403 includes:
the verification unit is used for verifying the action type processing result and the mouth shape type processing result according to the language form processing result based on the timestamp association relationship, so as to obtain a new action type processing result and a new mouth shape type processing result;
and the alignment unit is used for aligning the language form processing result, the new action type processing result, and the new mouth shape type processing result based on the timestamp association relationship.
In an alternative embodiment, the language form processing result includes a spoken language set and a written language set; the action type processing result includes an invalid action set and a valid action set; and the mouth shape type processing result includes an invalid mouth shape set and a valid mouth shape set.
In an optional implementation manner, the verification unit is specifically configured to:
acquire an action associated with a spoken word in the spoken language set based on the timestamp association relationship, and adjust the action into the invalid action set if it belongs to the valid action set;
and acquire a mouth shape associated with a spoken word in the spoken language set based on the timestamp association relationship, and adjust the mouth shape into the invalid mouth shape set if it belongs to the valid mouth shape set.
In an alternative embodiment, the alignment unit is specifically configured to:
for a target word in the spoken language set or the written language set, acquiring the target action associated with the target word and the target mouth shape associated with the target word based on the timestamp association relationship;
determining that the target word also belongs to the same type under the condition that the target action and the target mouth shape belong to the same type set;
and under the condition that the target action and the target mouth shape belong to different types of sets, acquiring a labeling type, and taking the labeling type as the types of the target word, the target action and the target mouth shape.
In an alternative embodiment, the language form processing module 401 includes:
a spoken language processing unit for extracting a word in a language form of a spoken language from a teaching audio of the teaching video based on a spoken language dictionary and replacing the word with a written language;
and the overlapped word processing unit is used for identifying overlapped words in the teaching audio and carrying out overlap removal processing on the overlapped words outside the white list of the overlapped words.
In an alternative embodiment, the spoken word dictionary is determined from historical instructional audio of a user to whom the instructional video pertains.
In an alternative embodiment, the action mouth shape processing module 402 comprises:
the action mouth shape recognition unit is used for respectively recognizing the action and the mouth shape of the teaching image in the teaching video to obtain the action and the mouth shape in the teaching image;
the action mouth shape clustering unit is used for clustering the actions and the mouth shapes in the teaching images to obtain at least two actions and at least two mouth shapes;
the action mouth shape dividing unit is used for dividing at least one action into a valid action set and dividing at least one action into an invalid action set; and dividing at least one mouth shape into a valid mouth shape set and at least one mouth shape into an invalid mouth shape set.
According to the technical scheme of the embodiment, the language form, the action type and the mouth shape type of the teaching video are automatically cross-checked without depending on manpower, so that the processing efficiency of the teaching video can be improved, and the quality of a teaching video processing result can be improved; moreover, the action type and the mouth shape type are verified by adopting the spoken language, and the type alignment is carried out on the language form, the action and the mouth shape, so that the quality of a teaching video processing result can be further improved.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the device 500 comprises a computing unit 501, which may perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 502 or loaded from a storage unit 508 into a random access memory (RAM) 503. The RAM 503 may also store various programs and data required for the operation of the device 500. The computing unit 501, the ROM 502, and the RAM 503 are connected to one another by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 501 performs the methods and processes described above, such as the intelligent processing method of teaching videos. For example, in some embodiments, the intelligent processing method of teaching videos may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the intelligent processing method of teaching videos described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured in any other suitable way (e.g., by means of firmware) to perform the intelligent processing method of teaching videos.
Various implementations of the systems and techniques described here may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs executing on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, a host product in a cloud computing service system that overcomes the drawbacks of high management difficulty and weak service scalability found in traditional physical hosts and virtual private server (VPS) services.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, and no limitation is imposed herein as long as the result desired by the technical solution of the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. An intelligent processing method of teaching videos comprises the following steps:
performing language form processing on teaching audio in a teaching video to obtain a language form processing result of the teaching audio;
respectively carrying out action type and mouth shape type processing on the teaching image in the teaching video to obtain an action type processing result and a mouth shape type processing result of the teaching image;
and performing cross check on at least two items of the language form processing result, the action type processing result and the mouth shape type processing result to obtain a teaching video processing result.
2. The method of claim 1, wherein the cross checking of at least two of the language form processing result, the action type processing result, and the mouth shape type processing result comprises:
verifying the action type processing result and the mouth shape type processing result according to the language form processing result based on a timestamp association relationship, to obtain a new action type processing result and a new mouth shape type processing result;
and aligning the language form processing result, the new action type processing result, and the new mouth shape type processing result based on the timestamp association relationship.
3. The method of claim 2, wherein the language form processing result includes a spoken language set and a written language set; the action type processing result includes an invalid action set and a valid action set; and the mouth shape type processing result includes an invalid mouth shape set and a valid mouth shape set.
4. The method of claim 3, wherein the verifying the action type processing result and the mouth type processing result according to the language form processing result based on the timestamp association comprises:
acquiring an action associated with a spoken word in the spoken language set based on a timestamp association relationship, and adjusting the action associated with the spoken word into the invalid action set if the action belongs to the valid action set;
acquiring a mouth shape associated with a spoken word in the spoken language set based on the timestamp association relationship, and adjusting the mouth shape associated with the spoken word into the invalid mouth shape set if the mouth shape belongs to the valid mouth shape set.
5. The method of claim 3, wherein the aligning the linguistic form processing result, the new action type processing result, and the new mouth type processing result based on a timestamp association comprises:
for a target word in the spoken language set or the written language set, acquiring a target action associated with the target word and a target mouth shape associated with the target word based on a timestamp association relationship;
determining that the target word also belongs to the same type under the condition that the target action and the target mouth shape belong to the same type set;
and under the condition that the target action and the target mouth shape belong to different types of sets, acquiring a labeling type, and taking the labeling type as the types of the target word, the target action and the target mouth shape.
6. The method of claim 1, wherein the performing the language form processing on the teaching audio in the teaching video to obtain the language form processing result of the teaching audio comprises:
extracting a word in a language form of a spoken language from teaching audio of the teaching video based on a spoken language dictionary, and replacing the word with a written language;
and identifying overlapped words in the teaching audio, and performing de-overlapping processing on the overlapped words outside the white list of the overlapped words.
7. The method of claim 6, wherein the spoken word dictionary is determined from a historical instructional audio of a user to whom the instructional video pertains.
8. The method of claim 1, wherein the performing action type and mouth type processing on the teaching image in the teaching video respectively to obtain an action type processing result and a mouth type processing result of the teaching image comprises:
respectively identifying the motion and the mouth shape of the teaching image in the teaching video to obtain the motion and the mouth shape in the teaching image;
clustering the actions and the mouth shapes in the teaching images to obtain at least two actions and at least two mouth shapes;
dividing at least one action into a valid action set, and dividing at least one action into an invalid action set; and dividing at least one mouth shape into a valid mouth shape set and at least one mouth shape into an invalid mouth shape set.
9. An intelligent processing device for teaching video, comprising:
the language form processing module is used for carrying out language form processing on the teaching audio in the teaching video to obtain a language form processing result of the teaching audio;
the action mouth shape processing module is used for respectively carrying out action type and mouth shape type processing on the teaching image in the teaching video to obtain an action type processing result and a mouth shape type processing result of the teaching image;
and the cross checking module is used for carrying out cross checking on at least two items of the language form processing result, the action type processing result and the mouth shape type processing result so as to obtain a teaching video processing result.
10. The apparatus of claim 9, wherein the cross-check module comprises:
the verification unit is used for verifying the action type processing result and the mouth shape type processing result according to the language form processing result based on the timestamp association relationship, so as to obtain a new action type processing result and a new mouth shape type processing result;
and the alignment unit is used for aligning the language form processing result, the new action type processing result, and the new mouth shape type processing result based on the timestamp association relationship.
11. The apparatus of claim 10, wherein the language form processing result includes a spoken language set and a written language set; the action type processing result includes an invalid action set and a valid action set; and the mouth shape type processing result includes an invalid mouth shape set and a valid mouth shape set.
12. The apparatus according to claim 11, wherein the verification unit is specifically configured to:
acquiring an action associated with a spoken word in the spoken language set based on a timestamp association relationship, and adjusting the action associated with the spoken word into the invalid action set if the action belongs to the valid action set;
acquiring a mouth shape associated with a spoken word in the spoken language set based on the timestamp association relationship, and adjusting the mouth shape associated with the spoken word into the invalid mouth shape set if the mouth shape belongs to the valid mouth shape set.
13. The apparatus according to claim 11, wherein the alignment unit is specifically configured to:
for a target word in the spoken language set or the written language set, acquiring a target action associated with the target word and a target mouth shape associated with the target word based on a timestamp association relationship;
determining that the target word also belongs to the same type under the condition that the target action and the target mouth shape belong to the same type set;
and under the condition that the target action and the target mouth shape belong to different types of sets, acquiring a labeling type, and taking the labeling type as the types of the target word, the target action and the target mouth shape.
14. The apparatus of claim 9, wherein the linguistic form processing module comprises:
a spoken language processing unit for extracting a word in a language form of a spoken language from a teaching audio of the teaching video based on a spoken language dictionary and replacing the word with a written language;
and the overlapped word processing unit is used for identifying overlapped words in the teaching audio and carrying out overlap removal processing on the overlapped words outside the white list of the overlapped words.
15. The apparatus of claim 14, wherein the spoken word dictionary is determined from a historical instructional audio of a user to whom the instructional video pertains.
16. The apparatus of claim 9, wherein the action mouth shape processing module comprises:
the action mouth shape recognition unit is used for respectively recognizing the action and the mouth shape of the teaching image in the teaching video to obtain the action and the mouth shape in the teaching image;
the action mouth shape clustering unit is used for clustering the actions and the mouth shapes in the teaching images to obtain at least two actions and at least two mouth shapes;
the action mouth shape dividing unit is used for dividing at least one action into an effective action set and dividing at least one action into an ineffective action set; and, dividing at least one die into a set of valid dies and at least one die into a set of invalid dies.
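The recognize-cluster-divide pipeline of claim 16, sketched with scikit-learn's KMeans over per-frame embeddings. The embeddings are assumed to come from upstream action and mouth shape recognizers, and the size-based validity rule is only a placeholder, since the claim does not fix a criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_divide(embeddings, timestamps, n_clusters=2):
    """Cluster per-frame embeddings (actions or mouth shapes) into at least
    two classes, then split the clusters into valid and invalid sets.
    embeddings: (n_frames, dim) array; timestamps: one entry per frame."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    clusters = {c: [timestamps[i] for i in np.flatnonzero(labels == c)]
                for c in range(n_clusters)}
    # Placeholder rule: the most frequent cluster is "valid", the rest "invalid".
    largest = max(clusters, key=lambda c: len(clusters[c]))
    valid = clusters[largest]
    invalid = [t for c, ts in clusters.items() if c != largest for t in ts]
    return valid, invalid
```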
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
CN202110315710.3A 2021-03-24 2021-03-24 Intelligent processing method, device, equipment and storage medium for teaching video Active CN112906650B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110315710.3A CN112906650B (en) 2021-03-24 2021-03-24 Intelligent processing method, device, equipment and storage medium for teaching video

Publications (2)

Publication Number Publication Date
CN112906650A true CN112906650A (en) 2021-06-04
CN112906650B CN112906650B (en) 2023-08-15

Family

ID=76106297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110315710.3A Active CN112906650B (en) 2021-03-24 2021-03-24 Intelligent processing method, device, equipment and storage medium for teaching video

Country Status (1)

Country Link
CN (1) CN112906650B (en)

Citations (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0082304A1 (en) * 1981-11-20 1983-06-29 Siemens Aktiengesellschaft Method of identifying a person by speech and face recognition, and device for carrying out the method
CN1130969A (en) * 1993-09-08 1996-09-11 Idt股份有限公司 Method and apparatus for data analysis
WO2002050798A2 (en) * 2000-12-18 2002-06-27 Digispeech Marketing Ltd. Spoken language teaching system based on language unit segmentation
US20030018475A1 (en) * 1999-08-06 2003-01-23 International Business Machines Corporation Method and apparatus for audio-visual speech detection and recognition
US20110053123A1 (en) * 2009-08-31 2011-03-03 Christopher John Lonsdale Method for teaching language pronunciation and spelling
JP2011070139A (en) * 2009-09-24 2011-04-07 悦子 ▲蔭▼山 Construction of work system for teaching of language learning, and teaching method of language learning
CN102663928A (en) * 2012-03-07 2012-09-12 天津大学 Electronic teaching method for deaf people to learn speaking
EP2562746A1 (en) * 2011-08-25 2013-02-27 Samsung Electronics Co., Ltd. Apparatus and method for recognizing voice by using lip image
US20130226587A1 (en) * 2012-02-27 2013-08-29 Hong Kong Baptist University Lip-password Based Speaker Verification System
KR20130117624A (en) * 2012-04-17 2013-10-28 삼성전자주식회사 Method and apparatus for detecting talking segments in a video sequence using visual cues
CN103561277A (en) * 2013-05-09 2014-02-05 陕西思智通教育科技有限公司 Transmission method and system for network teaching
US20150325240A1 (en) * 2014-05-06 2015-11-12 Alibaba Group Holding Limited Method and system for speech input
CN108062533A (en) * 2017-12-28 2018-05-22 北京达佳互联信息技术有限公司 Analytic method, system and the mobile terminal of user's limb action
CN109063587A (en) * 2018-07-11 2018-12-21 北京大米科技有限公司 data processing method, storage medium and electronic equipment
CN109377540A (en) * 2018-09-30 2019-02-22 网易(杭州)网络有限公司 Synthetic method, device, storage medium, processor and the terminal of FA Facial Animation
US20190130628A1 (en) * 2017-10-26 2019-05-02 Snap Inc. Joint audio-video facial animation system
CN109830132A (en) * 2019-03-22 2019-05-31 邱洵 A kind of foreign language language teaching system and teaching application method
CN109919434A (en) * 2019-01-28 2019-06-21 华中科技大学 A kind of classroom performance intelligent Evaluation method based on deep learning
CN110534109A (en) * 2019-09-25 2019-12-03 深圳追一科技有限公司 Audio recognition method, device, electronic equipment and storage medium
CN110610534A (en) * 2019-09-19 2019-12-24 电子科技大学 Automatic mouth shape animation generation method based on Actor-Critic algorithm
CN111091824A (en) * 2019-11-30 2020-05-01 华为技术有限公司 Voice matching method and related equipment
CN111612352A (en) * 2020-05-22 2020-09-01 北京易华录信息技术股份有限公司 Student expression ability assessment method and device
CN111741326A (en) * 2020-06-30 2020-10-02 腾讯科技(深圳)有限公司 Video synthesis method, device, equipment and storage medium
CN111739534A (en) * 2020-06-04 2020-10-02 广东小天才科技有限公司 Processing method and device for assisting speech recognition, electronic equipment and storage medium
CN111800646A (en) * 2020-06-24 2020-10-20 北京安博盛赢教育科技有限责任公司 Method, device, medium and electronic equipment for monitoring teaching effect
CN111915148A (en) * 2020-07-10 2020-11-10 北京科技大学 Classroom teaching evaluation method and system based on information technology
US20200372115A1 (en) * 2019-05-24 2020-11-26 International Business Machines Corporation Method and System for Language and Domain Acceleration with Embedding Alignment
CN112150638A (en) * 2020-09-14 2020-12-29 北京百度网讯科技有限公司 Virtual object image synthesis method and device, electronic equipment and storage medium
CN112508750A (en) * 2021-02-03 2021-03-16 北京联合伟世科技股份有限公司 Artificial intelligence teaching device, method, equipment and storage medium
CN112528768A (en) * 2020-11-26 2021-03-19 腾讯科技(深圳)有限公司 Action processing method and device in video, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Jiahua et al., "Mouth shape simulation technology and its application in web-based courses", Modern Educational Technology (《现代教育技术》), vol. 20, no. 3, pages 35-38 *

Also Published As

Publication number Publication date
CN112906650B (en) 2023-08-15

Similar Documents

Publication Publication Date Title
CN108962282B (en) Voice detection analysis method and device, computer equipment and storage medium
US10777207B2 (en) Method and apparatus for verifying information
WO2020207167A1 (en) Text classification method, apparatus and device, and computer-readable storage medium
US20180365209A1 (en) Artificial intelligence based method and apparatus for segmenting sentence
US11856277B2 (en) Method and apparatus for processing video, electronic device, medium and product
US9811517B2 (en) Method and system of adding punctuation and establishing language model using a punctuation weighting applied to chinese speech recognized text
CN112509566B (en) Speech recognition method, device, equipment, storage medium and program product
CN113657269A (en) Training method and device for face recognition model and computer program product
CN109670148A (en) Collection householder method, device, equipment and storage medium based on speech recognition
CN113657483A (en) Model training method, target detection method, device, equipment and storage medium
CN114639386A (en) Text error correction and text error correction word bank construction method
CN111144118A (en) Method, system, device and medium for identifying named entities in spoken text
CN111427996B (en) Method and device for extracting date and time from man-machine interaction text
CN116303951A (en) Dialogue processing method, device, electronic equipment and storage medium
US20220327803A1 (en) Method of recognizing object, electronic device and storage medium
CN112906650A (en) Intelligent processing method, device and equipment for teaching video and storage medium
CN115527520A (en) Anomaly detection method, device, electronic equipment and computer readable storage medium
CN115858776A (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN115906797A (en) Text entity alignment method, device, equipment and medium
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium
CN115631502A (en) Character recognition method, character recognition device, model training method, electronic device and medium
CN114218393A (en) Data classification method, device, equipment and storage medium
CN114528851A (en) Reply statement determination method and device, electronic equipment and storage medium
CN115098729A (en) Video processing method, sample generation method, model training method and device
CN114141236A (en) Language model updating method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant