CN111107442A - Method and device for acquiring audio and video files, server and storage medium


Info

Publication number
CN111107442A
CN111107442A (application CN201911167053.1A; granted publication CN111107442B)
Authority
CN
China
Prior art keywords
audio
target
video
teaching
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911167053.1A
Other languages
Chinese (zh)
Other versions
CN111107442B (en)
Inventor
颜洪奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dami Technology Co Ltd
Original Assignee
Beijing Dami Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dami Technology Co Ltd
Priority to CN201911167053.1A
Publication of CN111107442A
Application granted
Publication of CN111107442B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams

Abstract

The application relates to the technical field of internet, in particular to a method and a device for acquiring an audio and video file, a server and a storage medium. The method for acquiring the audio and video file comprises the following steps: acquiring a teaching video of an online classroom; the teaching video comprises teaching audio data and teaching video data; extracting at least one target audio clip based on the volume of the teaching audio data; determining at least one target video segment corresponding to the at least one target audio segment according to the time point; and synthesizing the at least one target audio clip and the at least one target video clip to obtain the wonderful audio and video file of the online classroom. According to the technical scheme of the embodiment of the application, the accuracy of obtaining the audio and video files can be improved, and the wonderful audio and video files in an online classroom can be obtained.

Description

Method and device for acquiring audio and video files, server and storage medium
Technical Field
The application relates to the technical field of internet, in particular to a method and a device for acquiring an audio and video file, a server and a storage medium.
Background
With the continuous development of the information society, more and more people choose to learn all kinds of knowledge to continuously improve themselves, and online network education has been accepted by a large number of users. In the online network education process, a user can record and identify the courses of students and teachers to obtain wonderful videos of the teachers or students. Acquiring wonderful videos makes it convenient for the user to monitor and analyze online courses in time. For example, a user may view a recorded full online lesson video and obtain a wonderful video of a teacher or a student.
The statements made in this background section are provided merely to illustrate and facilitate an understanding of the present disclosure, and are not to be construed as an admission by the applicant that they constitute prior art already known before the filing date of the present application.
Disclosure of Invention
The embodiment of the application provides an audio and video file acquisition method, an audio and video file acquisition device, a server and a storage medium, and can improve the accuracy of audio and video file acquisition.
In a first aspect, an embodiment of the present application provides an audio and video file acquisition method, including:
acquiring a teaching video of an online classroom; the teaching video comprises teaching audio data and teaching video data;
extracting at least one target audio clip based on the volume of the teaching audio data;
determining at least one target video segment corresponding to the at least one target audio segment according to the time point;
and synthesizing the at least one target audio clip and the at least one target video clip to obtain the wonderful audio and video file of the online classroom.
According to some embodiments, the determining at least one target video segment corresponding to the at least one target audio segment according to a point in time comprises:
identifying student audio data and teacher audio data in the at least one target audio segment;
identifying at least one first keyword of the student audio data and at least one second keyword of the teacher audio data;
and when the at least one first keyword and the at least one second keyword are matched, determining at least one target video segment corresponding to the at least one target audio segment according to the time point.
According to some embodiments, the determining at least one target video segment corresponding to the at least one target audio segment according to a point in time comprises:
obtaining courseware data of the online classroom;
identifying student audio data in the at least one target audio segment;
identifying at least one first keyword of the student audio data;
when the at least one first keyword and the at least one third keyword in the courseware data are matched, determining at least one target video clip corresponding to the at least one target audio clip according to the time point.
According to some embodiments, the determining at least one target video segment corresponding to the at least one target audio segment according to a point in time comprises:
acquiring at least one student sentence and at least one teacher sentence in the at least one target audio segment;
determining at least one target video clip corresponding to the at least one target audio clip according to a point in time when the semantics of the at least one student sentence and the semantics of the at least one teacher sentence match.
According to some embodiments, said extracting at least one target audio segment based on the volume of the instructional audio data comprises:
intercepting a plurality of audio clips to be identified in the teaching audio data;
and when the volume of the plurality of audio segments to be identified is greater than the preset volume, extracting the at least one target audio segment.
According to some embodiments, intercepting a plurality of audio pieces to be identified in the instructional audio data comprises:
periodically intercepting the plurality of segments to be identified based on the duration of the teaching audio data;
or randomly intercepting the plurality of segments to be identified based on the duration of the teaching audio data.
According to some embodiments, before extracting at least one target audio segment based on the audio volume of the instructional audio data, further comprising:
acquiring the format of the teaching video;
and when the format of the teaching video is not the preset format, converting the format of the teaching video into the preset format.
In a second aspect, an embodiment of the present application provides an apparatus for acquiring an audio/video file, including:
the video acquisition unit is used for acquiring a teaching video of an online classroom; the teaching video comprises teaching audio data and teaching video data;
a segment extraction unit for extracting at least one target audio segment based on the volume of the teaching audio data;
a section determining unit, configured to determine at least one target video section corresponding to the at least one target audio section according to a time point;
and the segment synthesis unit is used for synthesizing the at least one target audio segment and the at least one target video segment to obtain the wonderful audio and video file of the online classroom.
In a third aspect, an embodiment of the present application provides a server, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the method described in any one of the above when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method of any one of the above.
In a fifth aspect, embodiments of the present application provide a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, where the computer program is operable to cause a computer to perform some or all of the steps as described in the first aspect of embodiments of the present application. The computer program product may be a software installation package.
The embodiment of the application provides an audio and video file acquisition method, which comprises the following steps: acquiring a teaching video of an online classroom; the teaching video comprises teaching audio data and teaching video data, at least one target audio clip is extracted based on the volume of the teaching audio data, a target video clip corresponding to the at least one target audio clip is determined according to a time point, and the at least one target audio clip and the at least one target video clip are synthesized to obtain a wonderful audio and video file of an online classroom. According to the technical scheme of the embodiment of the application, the extracted at least one target audio clip and at least one target video clip are synthesized based on the volume of the teaching audio data, so that a wonderful audio and video file of an online classroom can be obtained, and the accuracy of obtaining the audio and video file can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic diagram illustrating an exemplary system architecture to which an apparatus for acquiring an audio-video file according to an embodiment of the present application may be applied;
fig. 2 shows a schematic flow chart of an audio and video file acquisition method according to an embodiment of the present application;
fig. 3 is a schematic flow chart of an audio/video file acquisition method according to another embodiment of the present application;
fig. 4 is a schematic flowchart illustrating an audio/video file acquisition method according to another embodiment of the present application;
fig. 5 shows a schematic structural diagram of an apparatus for acquiring an audio/video file according to an embodiment of the present application;
fig. 6 shows a schematic structural diagram of a server provided in an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
With the development of the internet, online education is also becoming popular with more and more people. In online education, many parents select different online classes for their own children, so that the children can learn online independently and fully improve their skills. Compared with a traditional fixed classroom, the audio and video of online education are more mobile and convenient, and their pictures and audio are more visual and attractive. Online education can also compensate for the geographic limitations of traditional education: no matter where a learner is, the same online education can be enjoyed. Online education therefore makes education fairer and gives it a wider scope of application.
Specifically, online network education is that a teacher end where a teacher is located communicates with a student end where a student is located through a network, so that remote teaching of the teacher and the student is achieved.
According to some embodiments, in one-to-one or one-to-many online teaching, the server can record the teaching of teachers and students in an online classroom to obtain teaching videos. The server can intercept wonderful audio and video files of teachers or students by playing back the recorded teaching videos. The user can grasp the class state of the student according to the wonderful audio and video files acquired by the server.
It is easy to understand that the server can obtain wonderful audio and video files of teachers or students by playing back the teaching video of the whole online classroom. The server can also randomly intercept wonderful audio and video files of teachers or students from a fixed-duration teaching video of the online classroom. However, the way the server randomly intercepts wonderful audio and video files of teachers or students is inaccurate. For example, between the tenth minute and the twentieth minute of the teaching video, the server may randomly intercept the sixteenth-minute and nineteenth-minute teaching videos and use them as the highlight audio and video files of the class. But the sixteenth-minute and nineteenth-minute teaching videos acquired by the server may merely be videos in which the teacher is testing the student's word-grasping ability. Therefore, the existing way for the server to acquire audio and video files is inaccurate, and the user experience is poor. The embodiment of the application provides an audio and video file acquisition method: at least one target audio segment is extracted based on the volume of the teaching audio data, and the at least one target audio segment and at least one target video segment are synthesized to obtain a wonderful audio and video file of the online classroom, which can improve the accuracy of audio and video file acquisition.
Optionally, the technical scheme of the embodiment of the application can be used for a method for acquiring one-to-many classroom audio and video files and can also be used for a method for acquiring one-to-one classroom audio and video files.
Fig. 1 is a schematic diagram illustrating an exemplary system architecture to which an apparatus for acquiring an audio-video file according to an embodiment of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminals 101, 102, 103, a network 104, and a plurality of servers 105. The network 104 is used to provide communication links between the terminals 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
It should be understood that the numbers of terminals 101, 102, 103, networks 104 and servers 105 in fig. 1 are merely illustrative. There may be any number of terminals 101, networks 104, and servers 105 as required in practice. For example, the server 105 may be a server cluster composed of multiple servers. The terminals 101, 102, 103 interact with the server 105 over the network 104 to receive or send messages and the like. The terminals 101, 102, 103 may be various electronic devices having display screens, including but not limited to personal computers, tablet computers, handheld devices, in-vehicle devices, wearable devices, computing devices or other processing devices connected to a wireless modem, and the like. Terminals may be called different names in different networks, for example: user equipment, access terminal, subscriber unit, subscriber station, mobile station, remote terminal, mobile device, user terminal, wireless communication device, user agent or user equipment, cellular telephone, cordless telephone, Personal Digital Assistant (PDA), terminal equipment in a 5G network or future evolution network, and the like.
The method for acquiring the audio/video file provided by the embodiment of the present application is generally executed by the server 105, and accordingly, the apparatus for acquiring the audio/video file is generally disposed in the server 105, but the present application is not limited thereto.
Fig. 2 shows a schematic flow diagram of an audio and video file acquisition method according to an embodiment of the present application.
As shown in fig. 2, the method for acquiring an audio/video file includes:
s201, obtaining teaching videos of an online classroom.
According to some embodiments, English is the predominant international common language in the world today and is the most widely used language in the world. Therefore, in online education, English education plays an increasingly important role. Many people can select an English online classroom so that English courses can be conveniently learned at any time. In an online course, a teacher interacts with a student according to the classroom content, and the student expresses his or her mind through audio or video.
It is easily understood that the server may obtain a teaching video of the classroom during the online classroom, wherein the teaching video may include teaching audio data and teaching video data. For example, in a classroom where mathematics is taught in English, the teacher asks a student in English "My favorite fruit is pineapple. How about you?". When the server detects that the teacher sends out the audio data, it stores the audio data and the video data of the teacher asking the question. After hearing the teacher's question, the student answers "Strawberry". When the server detects that the student sends out the audio data, it stores the audio data and the video data of the student answering the question.
Optionally, the server may further obtain a teaching video of the classroom from a memory of the terminal where the teacher is located and/or the terminal where the student is located. In an online education classroom, a teacher or a student can choose to store a teaching video of the classroom so that the teacher can repeatedly watch to find out deficiencies in the classroom for later correction and the student can repeatedly watch to consolidate the classroom. When the server receives an acquisition instruction of the audio/video file, the server can acquire the corresponding teaching video from a memory of a terminal where a teacher is located and/or a memory of a terminal where a student is located.
It is easy to understand that, after the classroom teaching of the online education is completed, the server can store the teaching video generated in the classroom of the online education to the memory of the server. When the server receives an acquisition instruction of the audio/video file, the server can acquire the corresponding teaching video from the memory.
S202, extracting at least one target audio clip based on the volume of the teaching audio data.
According to some embodiments, when the server acquires the teaching audio data of an online classroom, it can detect whether the volume of the teaching audio data is larger than a preset volume. When the server detects that the volume of the teaching audio data is larger than the preset volume, a target audio clip can be extracted, where the number of target audio clips is at least one. For example, the preset volume set by the server may be 45 decibels. The server can obtain the teaching audio data of teacher A and student A. When the server acquires the audio and video file of teacher A and detects that the volume of teacher A's teaching audio data is greater than 45 decibels, it can extract, for example, three target audio clips from teacher A's teaching audio data. When the server detects that the volume of teacher A's teaching audio data is smaller than 45 decibels, it can mark teacher A's teaching audio data so that the user can directly see the volume of teacher A's teaching audio data.
It is easy to understand that before the server detects whether the volume of the teaching audio data is greater than the preset volume, the server may intercept a plurality of audio clips to be recognized in the teaching audio data. For example, the server may periodically intercept a plurality of audio clips to be identified based on the duration of the instructional audio data. When the duration of the teaching audio data acquired by the server is 40 minutes, an audio clip to be recognized with a duration of 1 minute can be intercepted every 5 minutes.
Optionally, when the duration of the teaching audio data acquired by the server is 40 minutes, the server may further intercept one to-be-identified segment every 5 minutes, where the durations of the to-be-identified audio segments acquired by the server are different. For example, the server may intercept 8 audio clips to be identified, the time duration of the 8 audio clips to be identified being 1 minute, 20 seconds, 25 seconds, 36 seconds, 15 seconds, 24 seconds, 45 seconds and 56 seconds, respectively.
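As a concrete illustration of S202, the following Python sketch periodically intercepts clips from the teaching audio and keeps those whose volume exceeds a threshold. It assumes the teaching audio has already been exported as a mono 16-bit PCM WAV file, measures volume as RMS level in dBFS rather than the sound-pressure decibels mentioned above, and treats the 5-minute period, 1-minute clip length and threshold value as illustrative parameters not specified by the embodiment itself.

```python
# Minimal sketch of S202 under the assumptions stated above.
import wave
import numpy as np

def rms_dbfs(samples: np.ndarray) -> float:
    """RMS level of int16 samples in dB relative to full scale (0 dBFS = maximum)."""
    if samples.size == 0:
        return float("-inf")
    rms = np.sqrt(np.mean(samples.astype(np.float64) ** 2))
    return 20 * np.log10(max(rms, 1e-9) / 32768.0)

def extract_target_clips(wav_path: str,
                         period_s: int = 300,        # intercept a clip every 5 minutes
                         clip_s: int = 60,           # 1-minute clip to be identified
                         threshold_dbfs: float = -30.0):
    """Periodically intercept clips and keep those louder than the threshold."""
    with wave.open(wav_path, "rb") as wf:
        rate = wf.getframerate()
        audio = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)

    targets = []  # list of (start_seconds, duration_seconds) for the target audio clips
    for start in range(0, len(audio) // rate, period_s):
        clip = audio[start * rate:(start + clip_s) * rate]
        if rms_dbfs(clip) > threshold_dbfs:
            targets.append((start, min(clip_s, len(clip) // rate)))
    return targets
```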
S203, determining at least one target video clip corresponding to at least one target audio clip according to the time point.
According to some embodiments, when the server acquires the teaching video of the online classroom, it can record the time points of the teaching video. When the server extracts the at least one target audio clip, the at least one target video clip corresponding to the at least one target audio clip may be determined according to the time point of the target audio clip. For example, the server may extract 4 target audio clips, which may be a Q target audio clip, a W target audio clip, an E target audio clip, and an R target audio clip. The server can obtain the starting time point and the audio duration of the Q target audio clip, and determine the q target video clip corresponding to the Q target audio clip according to that starting time point and audio duration. For example, the starting time point and the audio duration of the Q target audio clip acquired by the server are 21:00 and 15 seconds, respectively. According to these, the server can determine, in the teaching video data, the q target video clip corresponding to the Q target audio clip, namely the video segment whose starting time point is 21:00 and whose video duration is 15 seconds. In the same way, the server can determine a w target video clip corresponding to the W target audio clip, an e target video clip corresponding to the E target audio clip, and an r target video clip corresponding to the R target audio clip.
It is readily understood that, as another example, the server may extract 4 target audio clips, which may be a Q target audio clip, a W target audio clip, an E target audio clip, and an R target audio clip. The server can obtain the termination time point and the audio duration of the Q target audio clip, and determine the q target video clip corresponding to the Q target audio clip according to that termination time point and audio duration. For example, the termination time point and the audio duration of the Q target audio clip acquired by the server are 21:25 and 15 seconds, respectively. According to these, the server can determine, in the teaching video data, the q target video clip corresponding to the Q target audio clip, namely the video segment whose termination time point is 21:25 and whose video duration is 15 seconds.
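The time-point mapping itself can be represented very simply. The sketch below is an illustration rather than the embodiment's own data structure; it shows the two variants described above, deriving the video clip either from the audio clip's starting time point and duration or from its termination time point and duration.

```python
# Sketch of S203: a target audio clip and its corresponding target video clip
# share the same time span within the lesson recording.
from dataclasses import dataclass

@dataclass
class Clip:
    start_s: float      # offset from the beginning of the lesson recording, in seconds
    duration_s: float

def video_clip_for(audio_clip: Clip) -> Clip:
    """Variant 1: the video clip shares the audio clip's starting time point and duration."""
    return Clip(start_s=audio_clip.start_s, duration_s=audio_clip.duration_s)

def video_clip_from_end(end_s: float, duration_s: float) -> Clip:
    """Variant 2: derive the video clip from the termination time point and the duration."""
    return Clip(start_s=end_s - duration_s, duration_s=duration_s)
```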
And S204, synthesizing the at least one target audio clip and the at least one target video clip to obtain a wonderful audio/video file of the online classroom.
According to some embodiments, when the server extracts a target audio clip and a target video clip corresponding to the target audio clip, the server may synthesize at least one target audio clip and at least one target video clip to obtain an audio/video clip. The server can synthesize the plurality of audio and video clips to obtain wonderful audio and video files in an online classroom.
It will be readily appreciated that, for example, the server may extract 4 target audio clips and 4 target video clips: the 4 target audio clips may be a Q target audio clip, a W target audio clip, an E target audio clip and an R target audio clip, and the 4 target video clips may be a q target video clip, a w target video clip, an e target video clip and an r target video clip. The server can synthesize the Q target audio clip and the q target video clip to obtain a Q1 audio/video file, synthesize the W target audio clip and the w target video clip to obtain a W1 audio/video file, synthesize the E target audio clip and the e target video clip to obtain an E1 audio/video file, and synthesize the R target audio clip and the r target video clip to obtain an R1 audio/video file. The server can then synthesize the Q1, W1, E1 and R1 audio/video files to obtain the wonderful audio and video file of the online classroom.
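One possible way to realize this per-clip synthesis and final concatenation is sketched below with the ffmpeg command-line tool invoked from Python. The embodiment does not name any particular muxer, so ffmpeg, the file names and the re-encoding choices are all assumptions.

```python
# Sketch of S204: cut each (start, duration) span from the full lesson recording,
# then concatenate the pieces with ffmpeg's concat demuxer.
import subprocess, tempfile, os

def cut_clip(src: str, start_s: float, duration_s: float, dst: str) -> None:
    # Re-encode the cut so the pieces share consistent parameters for later
    # concatenation; "-c copy" would be faster but can only cut on keyframes.
    subprocess.run(["ffmpeg", "-y", "-ss", str(start_s), "-t", str(duration_s),
                    "-i", src, dst], check=True)

def concat_clips(clip_paths: list, dst: str) -> None:
    # Write a concat list file of the form: file '/abs/path/part_0.mp4'
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for p in clip_paths:
            f.write(f"file '{os.path.abspath(p)}'\n")
        list_path = f.name
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", list_path, "-c", "copy", dst], check=True)
    os.remove(list_path)

def build_highlight(src: str, spans: list, dst: str = "highlight.mp4") -> None:
    parts = []
    for i, (start_s, duration_s) in enumerate(spans):
        part = f"part_{i}.mp4"          # illustrative intermediate file name
        cut_clip(src, start_s, duration_s, part)
        parts.append(part)
    concat_clips(parts, dst)
```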
Optionally, after obtaining the wonderful audio and video file, the server may send it to the terminal where the teacher is located or the terminal where the student is located. When the student at the student's terminal receives the wonderful audio and video file, the student can review his or her class state, which encourages the student's enthusiasm for learning. When the teacher at the teacher's terminal receives the wonderful audio and video file, the teacher can watch it, which both affirms good teaching and helps the teacher find deficiencies in the teaching so that they can be corrected in subsequent online classes.
The embodiment of the application provides an audio and video file acquisition method, which comprises the following steps: acquiring a teaching video of an online classroom; the teaching video comprises teaching audio data and teaching video data, at least one target audio segment is extracted based on the volume of the teaching audio data, a target video segment corresponding to the at least one target audio segment is determined according to a time point, and the at least one target audio segment and the at least one target video segment are synthesized to obtain a wonderful audio and video file of an online classroom. According to the technical scheme of the embodiment of the application, based on the volume of the teaching audio data, the extracted at least one target audio clip and at least one target video clip can be synthesized, a wonderful audio and video file of an online classroom can be obtained, and the accuracy of obtaining the audio and video file can be improved.
Fig. 3 shows a schematic flow chart of an audio/video file acquisition method according to another embodiment of the present application.
As shown in fig. 3, the method for acquiring an audio/video file includes:
s301, obtaining teaching videos of the online classroom.
The specific process is as described above, and is not described herein again.
S302, extracting at least one target audio clip based on the volume of the teaching audio data.
The specific process is as described above, and is not described herein again.
According to some embodiments, before the server detects whether the volume of the teaching audio data is larger than the preset volume, the server may intercept a plurality of audio clips to be identified in the teaching audio data. For example, the server may randomly intercept a plurality of audio clips to be identified based on the duration of the teaching audio data. The server can intercept a plurality of audio clips to be identified with the same duration, or a plurality of audio clips to be identified with different durations. When the duration of the teaching audio data acquired by the server is, for example, 40 minutes, the server may randomly intercept 8 audio clips to be identified, whose durations are 1 minute, 20 seconds, 25 seconds, 36 seconds, 15 seconds, 24 seconds, 45 seconds and 56 seconds, respectively, as in the sketch below.
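A minimal sketch of this random interception, with the clip-length range and clip count chosen only to mirror the example above:

```python
# Randomly choose (start, duration) spans within a recording of total_s seconds.
import random

def random_clip_spans(total_s: int, n_clips: int = 8,
                      min_s: int = 15, max_s: int = 60):
    """Return n_clips random (start, duration) spans, sorted by start time."""
    spans = []
    for _ in range(n_clips):
        duration = random.randint(min_s, max_s)
        start = random.randint(0, max(0, total_s - duration))
        spans.append((start, duration))
    return sorted(spans)
```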
S303, obtaining courseware data of the online classroom.
According to some embodiments, before the online classroom, the teacher prepares the corresponding courseware for the classroom knowledge points of the online classroom. The teacher may choose to store the courseware in the memory of the terminal where the teacher is located. The teacher can also send the courseware to the server through the terminal where the teacher is located, and the server stores the courseware when receiving it.
It is easy to understand that, when the server detects that the volume of the teaching audio data is greater than the preset volume, the server may obtain the courseware data of the online classroom from the terminal where the teacher is located, for example.
S304, identifying the student audio data in the at least one target audio segment.
According to some embodiments, when the server extracts at least one target audio segment, a recognition algorithm may be employed to identify student audio data in the target audio segment. For example, when the server extracts the T-target audio segment, the student audio data in the T-target audio segment can be identified by using a voiceprint recognition algorithm.
S305, at least one first keyword of the student audio data is identified.
According to some embodiments, upon identifying the student audio data in the at least one target audio clip, the server may employ a keyword recognition algorithm to identify at least one first keyword of the student audio data. For example, the server may identify the student audio data in the T target audio clip using a voiceprint recognition algorithm, and then identify the first keywords of the student audio data using a keyword recognition algorithm, for example "like" and "apple".
S306, when the at least one first keyword is matched with the at least one third keyword in the courseware data, determining at least one target video clip corresponding to the at least one target audio clip according to the time point.
According to some embodiments, when acquiring the courseware data of the online classroom, the server may acquire word-frequency data of the courseware data and extract a preset number of third keywords. When acquiring the courseware data of the online classroom, the server can also acquire the classroom title of the online classroom and extract the keywords of the classroom title as third keywords. When the server detects that the at least one first keyword and the at least one third keyword match, the time point of the target audio segment may be obtained. For example, the third keywords obtained by the server may be "like", "fruit", "apple" and "orange", and the first keywords obtained by the server are "apple" and "like". When the server detects that the first keywords and the third keywords match, the time point of the T target audio segment may be acquired, and according to that time point the server may determine the at least one target video segment corresponding to the at least one target audio segment.
The specific process is as described above, and is not described herein again.
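A minimal sketch of the keyword matching in S303-S306 is given below. The speech recognition that yields the first keywords is assumed to have already been performed; only the extraction of third keywords by word frequency and the matching test are shown, and both the frequency-based rule and the any-overlap criterion are assumptions.

```python
# Extract third keywords from courseware text by word frequency and test whether
# the recognized first keywords from student audio match them.
from collections import Counter

def courseware_keywords(courseware_text: str, top_n: int = 5) -> set:
    """Take the top_n most frequent alphabetic words of the courseware as third keywords."""
    words = [w.lower() for w in courseware_text.split() if w.isalpha()]
    return {w for w, _ in Counter(words).most_common(top_n)}

def keywords_match(first_keywords: set, third_keywords: set) -> bool:
    """Keep the clip if any recognized student keyword also appears in the courseware."""
    return bool(first_keywords & third_keywords)

# Example with the values used in the text above.
third = {"like", "fruit", "apple", "orange"}
first = {"apple", "like"}
assert keywords_match(first, third)
```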
S307, synthesizing the at least one target audio clip and the at least one target video clip to obtain a wonderful audio and video file of an online classroom.
According to some embodiments, when the server extracts the at least one target audio segment and the at least one target video segment, the server may synthesize the at least one target audio segment to obtain an audio file. The server can synthesize at least one target video segment to obtain a video file. The server can synthesize the audio file and the video file to obtain a wonderful audio and video file of the online classroom.
It will be readily appreciated that, for example, the server may extract 4 target audio clips and 4 target video clips, the 4 target audio clips being a Q target audio clip, a W target audio clip, an E target audio clip and an R target audio clip, and the 4 target video clips being a q target video clip, a w target video clip, an e target video clip and an r target video clip. The server may synthesize the Q, W, E and R target audio clips to obtain a QWER audio file, and synthesize the q, w, e and r target video clips to obtain a qwer video file. The server can then synthesize the QWER audio file and the qwer video file to obtain the wonderful audio/video file of the online classroom.
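The alternative synthesis order described here (concatenate all audio clips into one audio file and all video clips into one video file, then mux the two) could look as follows, again assuming ffmpeg; the file names are illustrative, and stream copy works only if the codecs are compatible with the output container.

```python
# Mux a concatenated audio file and a concatenated video file into one output.
import subprocess

def mux(audio_file: str, video_file: str, dst: str = "highlight.mp4") -> None:
    # Take the video stream from the video file and the audio stream from the
    # audio file, copying both streams without re-encoding.
    subprocess.run(["ffmpeg", "-y", "-i", video_file, "-i", audio_file,
                    "-map", "0:v:0", "-map", "1:a:0", "-c", "copy", dst],
                   check=True)

# Usage example (hypothetical file names):
# mux("qwer_audio.aac", "qwer_video.mp4")
```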
According to the technical scheme, when at least one first keyword of student audio data is matched with at least one third keyword of courseware data of an online classroom, a target video segment corresponding to at least one target audio segment can be determined according to a time point, the at least one target audio segment and the at least one target video segment are synthesized, a wonderful audio and video file of the online classroom can be obtained, and the accuracy of obtaining the audio and video file can be improved.
Fig. 4 shows a schematic flow chart of an audio-video file acquisition method according to another embodiment of the present application.
As shown in fig. 4, the method for acquiring an audio/video file includes:
s401, obtaining teaching videos of an online classroom.
The specific process is as described above, and is not described herein again.
S402, obtaining the format of the teaching video.
According to some embodiments, when the server acquires the teaching video of the online classroom, the format of the teaching video can be acquired. The formats of the teaching video include, but are not limited to, Audio Video Interleaved (AVI), FLV, MP4, F4V and ASF. The format of the teaching video obtained by the server may be, for example, MP4.
And S403, converting the format of the teaching video into a preset format when the format of the teaching video is not the preset format.
According to some embodiments, when the server acquires the format of the teaching video, it can detect whether the format of the teaching video is a preset format. For example, when the format of the teaching video acquired by the server is MP4, the time needed for the server to intercept the video may be affected by the long-tail effect of the MP4 format. Therefore, by detecting and, if necessary, converting the format of the teaching video, the server can reduce the time needed for video interception. The preset format set by the server may be the FLV format, and when the server detects that the format of the teaching video is MP4, it may convert the teaching video from the MP4 format into the FLV format.
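A sketch of this format check and conversion, assuming the ffprobe/ffmpeg tools are available; the choice of FLV as the preset format follows the example above, while the H.264/AAC re-encode is an assumption made so that the streams are valid inside an FLV container.

```python
# Detect the container format with ffprobe and convert to FLV if needed.
import subprocess, json

def container_format(path: str) -> str:
    out = subprocess.run(["ffprobe", "-v", "quiet", "-print_format", "json",
                          "-show_format", path],
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)["format"]["format_name"]

def ensure_flv(path: str, dst: str = "lesson.flv") -> str:
    if "flv" in container_format(path):
        return path                      # already in the preset format
    subprocess.run(["ffmpeg", "-y", "-i", path, "-c:v", "libx264",
                    "-c:a", "aac", dst], check=True)
    return dst
```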
S404, extracting at least one target audio clip based on the volume of the teaching audio data.
The specific process is as described above, and is not described herein again.
And S405, identifying the student audio data and the teacher audio data in the at least one target audio segment.
According to some embodiments, when the server extracts at least one target audio segment, a recognition algorithm may be employed to identify the student audio data and the teacher audio data in the target audio segment. For example, when the server extracts the T target audio segment, the student audio data and the teacher audio data in the T target audio segment may be identified by using a voiceprint recognition algorithm.
S406, at least one first keyword of the student audio data and at least one second keyword of the teacher audio data are identified.
According to some embodiments, upon identifying the student audio data in the at least one target audio clip, the server may employ a keyword recognition algorithm to identify at least one first keyword of the student audio data. For example, when the server identifies the student audio data in the T target audio clip by using the voiceprint recognition algorithm, the first keywords of the student audio data may be identified as "like" and "apple" by using a keyword recognition algorithm.
It is readily understood that, upon identifying the teacher audio data in the at least one target audio segment, the server may employ a keyword recognition algorithm to identify at least one second keyword of the teacher audio data. For example, when the server identifies the teacher audio data in the T target audio clip by using a voiceprint recognition algorithm, the second keywords of the teacher audio data may be identified as "like", "fruit", "apple" and "banana" by using a keyword recognition algorithm.
S407, determining at least one target video clip corresponding to the at least one target audio clip according to the time point when the at least one first keyword and the at least one second keyword match.
According to some embodiments, when the server acquires the at least one first keyword and the at least one second keyword, it may detect whether the keyword matching degree between the at least one first keyword and the at least one second keyword exceeds a preset matching degree; if so, the at least one target video clip corresponding to the at least one target audio clip may be determined according to the time point. For example, the first keywords obtained by the server from the T target audio segment may be "like" and "apple", and the second keywords may be "like", "fruit", "apple" and "banana". The preset keyword matching degree of the server can be 90%; when the server detects that the keyword matching degree between the first keywords and the second keywords is 95%, the time point of the T target audio clip can be obtained, and the at least one target video clip corresponding to the at least one target audio clip is determined according to that time point.
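How the "matching degree" is computed is not specified; the sketch below uses the share of first keywords that also appear among the second keywords, which is only one plausible interpretation, and the 0.9 threshold mirrors the example above.

```python
# Compute a keyword matching degree between student and teacher keywords.
def keyword_matching_degree(first_keywords: set, second_keywords: set) -> float:
    """Fraction of student (first) keywords that also occur among teacher (second) keywords."""
    if not first_keywords:
        return 0.0
    return len(first_keywords & second_keywords) / len(first_keywords)

first = {"like", "apple"}                       # from student audio
second = {"like", "fruit", "apple", "banana"}   # from teacher audio
if keyword_matching_degree(first, second) >= 0.9:   # preset matching degree
    print("keep the clip and locate the corresponding video segment by time point")
```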
It is easy to understand that, when the server acquires the at least one student sentence and the at least one teacher sentence in the at least one target audio segment, a matching degree of the semantics of the at least one student sentence and the semantics of the at least one teacher sentence may be detected. When the server detects that the matching degree of the semantics of the at least one student sentence and the semantics of the at least one teacher sentence exceeds the preset semantic matching degree, the at least one target video clip corresponding to the at least one target audio clip can be determined according to the time point. The server can improve the accuracy of audio and video file acquisition by detecting the semantic matching degree of at least one student sentence and at least one teacher sentence.
Optionally, the semantics of the student sentence in the Y target audio clip obtained by the server may be "The fruit I like is the orange", and the semantics of the teacher sentence in the Y target audio clip obtained by the server may be "What are your favorite fruits?". The semantic matching degree preset by the server is 85%. When the server detects that the matching degree between the semantics of the student sentence and the semantics of the teacher sentence in the Y target audio clip is 90%, the at least one target video clip corresponding to the at least one target audio clip can be determined according to the time point.
And S408, synthesizing the at least one target audio clip and the at least one target video clip to obtain a wonderful audio/video file of the online classroom.
The specific process is as described above, and is not described herein again.
According to the technical scheme, the first keywords of the student audio data and the second keywords of the teacher audio data are identified, when the first keywords are matched with the second keywords, the target video clip corresponding to the at least one target audio clip can be determined according to the time point, the at least one target audio clip and the at least one target video clip are synthesized, the wonderful audio and video files of an online classroom can be obtained, and the accuracy of obtaining the audio and video files can be improved.
Fig. 5 shows a schematic structural diagram of an apparatus for acquiring an audio/video file according to an embodiment of the present application.
As shown in fig. 5, the apparatus 500 for acquiring an audio/video file includes: a video acquisition unit 501, a clip extraction unit 502, a clip determination unit 503, and a clip synthesis unit 504. Wherein:
a video acquiring unit 501, configured to acquire a teaching video of an online classroom; the teaching video comprises teaching audio data and teaching video data;
a section extraction unit 502 for extracting at least one target audio section based on the audio volume of the teaching audio data;
a section determining unit 503, configured to determine a target video section corresponding to at least one target audio section according to the time point;
and the segment synthesis unit 504 synthesizes the at least one target audio segment and the at least one target video segment to obtain a wonderful audio/video file of an online classroom.
According to some embodiments, the section determining unit 503 is further configured to identify the student audio data and the teacher audio data in the at least one target audio section;
identifying at least one first keyword of the student audio data and at least one second keyword of the teacher audio data;
and when the at least one first keyword and the at least one second keyword are matched, determining at least one target video segment corresponding to the at least one target audio segment according to the time point.
According to some embodiments, the fragment determining unit 503 is further configured to obtain courseware data of an online classroom;
identifying student audio data in at least one target audio clip;
identifying at least one first keyword of the student audio data;
and when the at least one first keyword is matched with the at least one third keyword in the courseware data, determining at least one target video segment corresponding to the at least one target audio segment according to the time point.
According to some embodiments, the segment determining unit 503 is further configured to obtain at least one student sentence and at least one teacher sentence in the at least one target audio segment;
when the semantics of the at least one student sentence and the semantics of the at least one teacher sentence match, at least one target video segment corresponding to the at least one target audio segment is determined according to the time point.
According to some embodiments, the apparatus 500 for acquiring an audio/video file further includes a fragment intercepting unit 504, configured to intercept a plurality of audio fragments to be identified in the teaching audio data;
and when the volume of the plurality of audio segments to be identified is greater than the preset volume, extracting at least one target audio segment.
According to some embodiments, the fragment intercepting unit 504 is further configured to periodically intercept a plurality of fragments to be identified based on a duration of the teaching audio data;
or randomly intercepting a plurality of fragments to be identified based on the duration of the teaching audio data.
According to some embodiments, the apparatus 500 for acquiring an audio/video file further includes a format conversion unit 505, configured to acquire a format of a teaching video;
and when the format of the teaching video is not the preset format, converting the format of the teaching video into the preset format.
The embodiment of the application provides an acquisition device of an audio and video file, which acquires a teaching video of an online classroom through a video acquisition unit; the teaching video comprises teaching audio data and teaching video data, the segment extraction unit extracts at least one target audio segment based on the audio volume of the teaching audio data, the segment determination unit determines a target video segment corresponding to the at least one target audio segment according to a time point, and the segment synthesis unit synthesizes the at least one target audio segment and the at least one target video segment to obtain a wonderful audio and video file of an online classroom. According to the device for acquiring the audio and video files, the at least one target audio clip and the at least one target video clip are synthesized, so that the wonderful audio and video files in an online classroom can be acquired, and the accuracy of acquiring the audio and video files can be improved.
Please refer to fig. 6, which is a schematic structural diagram of a server according to an embodiment of the present disclosure.
As shown in fig. 6, the server 600 may include: at least one processor 601, at least one network interface 604, a user interface 603, a memory 605, at least one communication bus 602.
Wherein a communication bus 602 is used to enable the connection communication between these components.
The user interface 603 may include a Display screen (Display) and an antenna, and the optional user interface 603 may further include a standard wired interface and a wireless interface.
The network interface 604 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface).
Processor 601 may include one or more processing cores. The processor 601 connects various components throughout the server 600 using various interfaces and lines, and performs various functions of the server 600 and processes data by running or executing instructions, programs, code sets or instruction sets stored in the memory 605 and invoking data stored in the memory 605. Optionally, the processor 601 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 601 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is used for rendering and drawing the content to be displayed by the display screen; and the modem is used to handle wireless communications. It is understood that the modem may also not be integrated into the processor 601 but implemented by a separate chip.
The Memory 605 may include a Random Access Memory (RAM) or a Read-Only Memory (Read-Only Memory). Optionally, the memory 605 includes a non-transitory computer-readable medium. The memory 605 may be used to store instructions, programs, code, sets of codes, or sets of instructions. The memory 605 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like; the storage data area may store data and the like referred to in the above respective method embodiments. The memory 605 may optionally be at least one storage device located remotely from the processor 601. As shown in fig. 6, the memory 605, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and an application program for the acquisition of an audio-video file.
In the server 600 shown in fig. 6, the processor 601 may be configured to call an application program stored in the memory 605, and specifically perform the following operations:
acquiring a teaching video of an online classroom; the teaching video comprises teaching audio data and teaching video data;
extracting at least one target audio clip based on the volume of the teaching audio data;
determining at least one target video clip corresponding to at least one target audio clip according to the time point;
and synthesizing the at least one target audio clip and the at least one target video clip to obtain the wonderful audio and video file of the online classroom.
According to some embodiments, the processor 601, when performing the determining of the at least one target video segment corresponding to the at least one target audio segment according to the time point, specifically performs the following operations:
identifying student audio data and teacher audio data in the at least one target audio segment;
identifying at least one first keyword of the student audio data and at least one second keyword of the teacher audio data;
and when the at least one first keyword and the at least one second keyword are matched, determining at least one target video segment corresponding to the at least one target audio segment according to the time point.
According to some embodiments, the processor 601, when performing the determining of the at least one target video segment corresponding to the at least one target audio segment according to the time point, specifically performs the following operations:
acquiring courseware data of an online classroom;
identifying student audio data in at least one target audio clip;
identifying at least one first keyword of the student audio data;
and when the at least one first keyword is matched with the at least one third keyword in the courseware data, determining at least one target video segment corresponding to the at least one target audio segment according to the time point.
According to some embodiments, the processor 601, when performing the determining of the at least one target video segment corresponding to the at least one target audio segment according to the time point, specifically performs the following operations:
acquiring at least one student sentence and at least one teacher sentence in at least one target audio segment;
when the semantics of the at least one student sentence and the semantics of the at least one teacher sentence match, at least one target video segment corresponding to the at least one target audio segment is determined according to the time point.
According to some embodiments, the processor 601, when executing the volume based on the teaching audio data to extract at least one target audio segment, specifically performs the following operations:
intercepting a plurality of audio clips to be identified in teaching audio data;
and when the volume of the plurality of audio segments to be identified is greater than the preset volume, extracting at least one target audio segment.
According to some embodiments, when the processor 601 intercepts a plurality of audio segments to be recognized in the teaching audio data, the following operations are specifically performed:
periodically intercepting a plurality of fragments to be identified based on the duration of the teaching audio data;
or randomly intercepting a plurality of fragments to be identified based on the duration of the teaching audio data.
According to some embodiments, the processor 601, before executing the audio volume based on the teaching audio data to extract the at least one target audio segment, specifically performs the following operations:
acquiring a format of a teaching video;
and when the format of the teaching video is not the preset format, converting the format of the teaching video into the preset format.
The embodiment of the application provides a server, which is used for obtaining teaching videos of an online classroom; the teaching video comprises teaching audio data and teaching video data; extracting at least one target audio clip based on the volume of the teaching audio data; determining at least one target video clip corresponding to at least one target audio clip according to the time point; and synthesizing the at least one target audio clip and the at least one target video clip to obtain the wonderful audio and video file of the online classroom. According to the server, the at least one target audio clip and the at least one target video clip are synthesized, so that the wonderful audio and video files of an online classroom can be obtained, and the accuracy of obtaining the audio and video files can be improved.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described method. The computer-readable storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, DVD, CD-ROMs, microdrive, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.
Embodiments of the present application also provide a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, where the computer program is operable to make a computer execute part or all of the steps of any one of the audio-video file acquisition methods described in the above method embodiments.
It is clear to a person skilled in the art that the solution of the present application can be implemented by means of software and/or hardware. The "unit" and "module" in this specification refer to software and/or hardware that can perform a specific function independently or in cooperation with other components, where the hardware may be, for example, a Field-Programmable Gate Array (FPGA), an Integrated Circuit (IC), or the like.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of combined actions. However, those skilled in the art will recognize that the present application is not limited by the order of actions described, since some steps may be performed in other orders or concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is only one kind of logical-function division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interface, such as a microservice interface, and may be electrical or take other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program stored in a computer-readable memory, and the memory may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The above description is only an exemplary embodiment of the present disclosure, and the scope of the present disclosure should not be limited thereby. That is, all equivalent changes and modifications made in accordance with the teachings of the present disclosure are intended to be included within the scope of the present disclosure. Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. A method for acquiring an audio/video file, characterized by comprising the following steps:
acquiring a teaching video of an online classroom; the teaching video comprises teaching audio data and teaching video data;
extracting at least one target audio segment based on the volume of the teaching audio data;
determining at least one target video segment corresponding to the at least one target audio segment according to the time point;
and synthesizing the at least one target audio segment and the at least one target video segment to obtain a highlight audio and video file of the online classroom.
2. The method of claim 1, wherein the determining at least one target video segment corresponding to the at least one target audio segment according to the time point comprises:
identifying student audio data and teacher audio data in the at least one target audio segment;
identifying at least one first keyword of the student audio data and at least one second keyword of the teacher audio data;
and when the at least one first keyword matches the at least one second keyword, determining the at least one target video segment corresponding to the at least one target audio segment according to the time point.
3. The method of claim 1, wherein the determining at least one target video segment corresponding to the at least one target audio segment according to the time point comprises:
obtaining courseware data of the online classroom;
identifying student audio data in the at least one target audio segment;
identifying at least one first keyword of the student audio data;
and when the at least one first keyword matches at least one third keyword in the courseware data, determining the at least one target video segment corresponding to the at least one target audio segment according to the time point.
4. The method of claim 1, wherein the determining at least one target video segment corresponding to the at least one target audio segment according to the time point comprises:
acquiring at least one student sentence and at least one teacher sentence in the at least one target audio segment;
and when the semantics of the at least one student sentence and the semantics of the at least one teacher sentence match, determining the at least one target video segment corresponding to the at least one target audio segment according to the time point.
5. The method of claim 1, wherein the extracting at least one target audio segment based on the volume of the teaching audio data comprises:
intercepting a plurality of audio clips to be identified in the teaching audio data;
and when the volume of the plurality of audio segments to be identified is greater than a preset volume, extracting the at least one target audio segment.
6. The method of claim 5, wherein the intercepting a plurality of audio segments to be identified in the teaching audio data comprises:
periodically intercepting the plurality of audio segments to be identified based on the duration of the teaching audio data;
or randomly intercepting the plurality of audio segments to be identified based on the duration of the teaching audio data.
7. The method of claim 1, wherein, before the extracting at least one target audio segment based on the volume of the teaching audio data, the method further comprises:
acquiring the format of the teaching video;
and when the format of the teaching video is not the preset format, converting the format of the teaching video into the preset format.
8. An apparatus for acquiring an audio/video file, comprising:
a video acquisition unit, configured to acquire a teaching video of an online classroom, the teaching video comprising teaching audio data and teaching video data;
a segment extraction unit, configured to extract at least one target audio segment based on the volume of the teaching audio data;
a segment determining unit, configured to determine at least one target video segment corresponding to the at least one target audio segment according to a time point;
and a segment synthesis unit, configured to synthesize the at least one target audio segment and the at least one target video segment to obtain a highlight audio and video file of the online classroom.
9. A server comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of any one of the preceding claims 1 to 7.
CN201911167053.1A 2019-11-25 2019-11-25 Method and device for acquiring audio and video files, server and storage medium Active CN111107442B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911167053.1A CN111107442B (en) 2019-11-25 2019-11-25 Method and device for acquiring audio and video files, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911167053.1A CN111107442B (en) 2019-11-25 2019-11-25 Method and device for acquiring audio and video files, server and storage medium

Publications (2)

Publication Number Publication Date
CN111107442A true CN111107442A (en) 2020-05-05
CN111107442B CN111107442B (en) 2022-07-12

Family

ID=70421245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911167053.1A Active CN111107442B (en) 2019-11-25 2019-11-25 Method and device for acquiring audio and video files, server and storage medium

Country Status (1)

Country Link
CN (1) CN111107442B (en)

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120321282A1 (en) * 2011-01-05 2012-12-20 Tomohiro Konuma Interesting section extracting device, interesting section extracting method
CN103813215A (en) * 2012-11-13 2014-05-21 联想(北京)有限公司 Information collection method and electronic device
US20140147816A1 (en) * 2012-11-26 2014-05-29 ISSLA Enterprises, LLC Intralingual supertitling in language acquisition
US20150100303A1 (en) * 2013-10-04 2015-04-09 Mattersight Corporation Online classroom analytics system and methods
US20150279222A1 (en) * 2014-03-31 2015-10-01 Konica Minolta Laboratory U.S.A., Inc. Method and system for enhancing interactions between teachers and students
WO2016165334A1 (en) * 2015-09-17 2016-10-20 中兴通讯股份有限公司 Voice processing method and apparatus, and terminal device
CN105447795A (en) * 2015-12-31 2016-03-30 米科互动教育科技(北京)有限公司 Online curriculum processing method and device
US20180114453A1 (en) * 2016-10-21 2018-04-26 Vedantu Innovations Pvt Ltd. System for measuring effectiveness of an interactive online learning system
CN107154264A (en) * 2017-05-18 2017-09-12 北京大生在线科技有限公司 The method that online teaching wonderful is extracted
CN109697906A (en) * 2017-10-20 2019-04-30 深圳市鹰硕技术有限公司 It is a kind of that teaching method and system are followed based on internet teaching platform
WO2019090479A1 (en) * 2017-11-07 2019-05-16 郑永利 Interactive video teaching method and system
CN108521612A (en) * 2018-04-25 2018-09-11 腾讯科技(深圳)有限公司 Generation method, device, server and the storage medium of video frequency abstract
CN108920513A (en) * 2018-05-31 2018-11-30 深圳市图灵机器人有限公司 A kind of multimedia data processing method, device and electronic equipment
CN109189766A (en) * 2018-10-25 2019-01-11 重庆鲁班机器人技术研究院有限公司 Teaching plan acquisition methods, device and electronic equipment
CN109284390A (en) * 2018-11-29 2019-01-29 北京师范大学 A kind of teaching scene codes method based on classroom log
CN110087143A (en) * 2019-04-26 2019-08-02 北京谦仁科技有限公司 Method for processing video frequency and device, electronic equipment and computer readable storage medium
CN110136543A (en) * 2019-04-26 2019-08-16 北京大米科技有限公司 Online teaching interactive approach, relevant device, storage medium and system
CN110322738A (en) * 2019-07-03 2019-10-11 北京易真学思教育科技有限公司 A kind of course optimization method, device and system
CN110427977A (en) * 2019-07-10 2019-11-08 上海交通大学 A kind of detection method of class interaction

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022091230A1 (en) * 2020-10-27 2022-05-05 株式会社I’mbesideyou Information extraction device
JPWO2022091230A1 (en) * 2020-10-27 2022-05-05
JP2022075661A (en) * 2020-10-27 2022-05-18 株式会社I’mbesideyou Information extraction apparatus
JP7130290B2 (en) 2020-10-27 2022-09-05 株式会社I’mbesideyou information extractor
CN113096674A (en) * 2021-03-30 2021-07-09 联想(北京)有限公司 Audio processing method and device and electronic equipment
CN113096674B (en) * 2021-03-30 2023-02-17 联想(北京)有限公司 Audio processing method and device and electronic equipment
CN115412764A (en) * 2022-08-30 2022-11-29 上海硬通网络科技有限公司 Video editing method, device, equipment and storage medium
CN115412764B (en) * 2022-08-30 2023-09-29 上海硬通网络科技有限公司 Video editing method, device, equipment and storage medium
CN116226453A (en) * 2023-05-10 2023-06-06 北京小糖科技有限责任公司 Method, device and terminal equipment for identifying dancing teaching video clips
CN116226453B (en) * 2023-05-10 2023-09-26 北京小糖科技有限责任公司 Method, device and terminal equipment for identifying dancing teaching video clips

Also Published As

Publication number Publication date
CN111107442B (en) 2022-07-12

Similar Documents

Publication Publication Date Title
CN111107442B (en) Method and device for acquiring audio and video files, server and storage medium
CN110033659B (en) Remote teaching interaction method, server, terminal and system
CN110600033B (en) Learning condition evaluation method and device, storage medium and electronic equipment
CN105679122A (en) Multifunctional college English teaching management system
CN106126524B (en) Information pushing method and device
CN110491218A (en) A kind of online teaching exchange method, device, storage medium and electronic equipment
CN110569364A (en) online teaching method, device, server and storage medium
CN111711834B (en) Recorded broadcast interactive course generation method and device, storage medium and terminal
CN103607457A (en) Note processing method, apparatus, terminal, server and system
CN110796338A (en) Online teaching monitoring method and device, server and storage medium
CN114401431B (en) Virtual person explanation video generation method and related device
CN111260975B (en) Method, device, medium and electronic equipment for multimedia blackboard teaching interaction
CN111966839B (en) Data processing method, device, electronic equipment and computer storage medium
CN114760274B (en) Voice interaction method, device, equipment and storage medium for online classroom
CN110867187A (en) Voice data processing method and device, storage medium and electronic equipment
CN111966803B (en) Dialogue simulation method and device, storage medium and electronic equipment
CN113420135A (en) Note processing method and device in online teaching, electronic equipment and storage medium
CN113038259A (en) Lesson quality feedback method and system for internet education
CN113240447A (en) Advertisement pushing method and device, storage medium and server
CN113382311A (en) Online teaching interaction method and device, storage medium and terminal
CN111598746A (en) Teaching interaction control method, device, terminal and storage medium
CN111859971A (en) Method, apparatus, device and medium for processing information
CN110970030A (en) Voice recognition conversion method and system
CN110728992A (en) Audio data processing method and device, server and storage medium
CN111785104B (en) Information processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant