CN112653902B - Speaker recognition method and device and electronic equipment - Google Patents

Speaker recognition method and device and electronic equipment

Info

Publication number
CN112653902B
Authority
CN
China
Prior art keywords
voice
speaker
file
viewpoint
content
Prior art date
Legal status
Active
Application number
CN201910959899.2A
Other languages
Chinese (zh)
Other versions
CN112653902A (en)
Inventor
王全剑
黄鹏
李波
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201910959899.2A
Publication of CN112653902A
Application granted
Publication of CN112653902B
Legal status: Active
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8106Monomedia components thereof involving special audio data, e.g. different tracks for different languages

Abstract

The embodiments of the application disclose a speaker recognition method and device, and an electronic apparatus. The method includes: separating an audio file and an image file from a video file to be recognized; performing voice segmentation on the audio file to obtain start/stop time information and speaker identification information for at least one speech segment, and performing face recognition on the image file to obtain face recognition results at different times; and aligning the speaker identification information with the face recognition results in time to determine the speaker corresponding to the at least one speech segment. This scheme improves the accuracy of speaker recognition.

Description

Speaker recognition method and device and electronic equipment
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speaker recognition method, device and electronic apparatus, and a video processing method, device and electronic apparatus.
Background
Voiceprint recognition, also known as speaker recognition, uses computer speech-processing techniques to analyze a speech signal and determine the identity of the speaker.
In scenarios such as live broadcasts and conferences, which may involve multiple speakers, speaker recognition must determine both which persons appear in the speech signal and which specific speech segments each person corresponds to. Taking a dialogue between user A and user B as an example, voice segmentation is first performed to find the speech change points between different speakers, the whole dialogue is split into several speech segments, and each segment is labeled with its start/stop time and a speaker id; speaker clustering is then performed to merge the speech segments that share the same speaker id.
Ideally, the above dialogue speech is processed to obtain two categories: one category corresponds to id1 of the user A and comprises all voice sections spoken by the user A in the conversation process; the other category corresponds to id2 of user B, which includes all speech segments that user B speaks during the conversation.
In practical applications, however, ambient noise makes segmentation errors quite likely during voice segmentation. For example, a speech segment belonging to user A may be split into two sub-segments labeled with speaker ids id1 and id2 respectively, so that during speaker clustering the sub-segment labeled id2 is merged into user B's category. The speaker's identity is recognized correctly, but the merged speech segments are wrong.
Alternatively, the two split sub-segments may be labeled id1 and id3, in which case not only are the merged speech segments wrong, but the speaker identities are wrong as well: a dialogue that originally involved only two users is recognized as involving three.
How to improve speaker recognition accuracy, especially in spoken-language communication activities involving multiple participants with strong noise interference such as background music and applause (for example, a multi-speaker lecture scene), has therefore become a technical problem to be solved.
Disclosure of Invention
The application provides a speaker recognition method and device and an electronic device that perform speaker recognition based on both voiceprint features and face features, which helps improve recognition accuracy.
The present application provides the following:
a speaker recognition method, comprising:
separating and obtaining an audio file and an image file from a video file to be identified;
performing voice segmentation on the audio file to obtain start and stop time information and speaker identification information corresponding to at least one voice section, and performing face recognition on the image file to obtain face recognition results corresponding to different times;
and aligning the speaker identification information and the face recognition result in time to determine the speaker corresponding to the at least one voice section.
A video file processing method, comprising:
the server side obtains a video file generated by a spoken-language communication activity;
performing speaker recognition according to the audio file and the image file separated from the video file, to obtain sets of speech segments respectively associated with different speakers;
performing speech recognition on the set of speech segments associated with a speaker to obtain the speaker's speaking content;
and establishing an association relationship between the speaker and the corresponding speaking content.
A video file processing method, comprising:
a client receives a viewpoint file sent by a server and obtained by processing a video of a spoken-language communication activity, wherein content viewpoints of different speakers are added to the viewpoint file, the content viewpoints being extracted from sets of speech segments associated with the different speakers, the sets being determined by speaker recognition on an audio file and an image file separated from the video;
and pushing the viewpoint file to a user associated with the client.
A speaker recognition device, comprising:
the file separation unit is used for separating and obtaining an audio file and an image file from a video file to be identified;
the voice segmentation unit is used for carrying out voice segmentation on the audio file to obtain start-stop time information and speaker identification information which respectively correspond to at least one voice section;
the face recognition unit is used for carrying out face recognition on the image file to obtain face recognition results corresponding to different time;
and the alignment processing unit is used for carrying out alignment processing on the speaker identification information and the face recognition result in time to determine the speaker corresponding to the at least one voice section.
A video file processing device is applied to a server and comprises:
a video file obtaining unit for obtaining a video file generated by the vocal language communication activity;
the speaker recognition unit is used for recognizing speakers according to the audio files and the image files separated from the video files to obtain a set of voice sections associated with different speakers;
a speaking content obtaining unit, configured to perform speech recognition on a set of speech segments associated with the speaker to obtain the speaking content of the speaker;
and the association relationship establishing unit is used for establishing the association relationship between the speaker and the corresponding speaking content.
A video file processing device applied to a client comprises:
a viewpoint file receiving unit, configured to receive a viewpoint file sent by a server and obtained by processing a video of a spoken-language communication activity, wherein content viewpoints of different speakers are added to the viewpoint file, the content viewpoints being extracted from sets of speech segments associated with the different speakers, the sets being determined by speaker recognition on an audio file and an image file separated from the video;
and the viewpoint file pushing unit is used for pushing the viewpoint file to a user associated with the client.
An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
separating and obtaining an audio file and an image file from a video file to be identified;
performing voice segmentation on the audio file to obtain start and stop time information and speaker identification information corresponding to at least one voice section, and performing face recognition on the image file to obtain face recognition results corresponding to different times;
and aligning the speaker identification information and the face recognition result in time to determine the speaker corresponding to the at least one voice section.
An electronic device, comprising:
one or more processors; and
memory associated with the one or more processors, the memory for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
obtaining a video file generated by the vocal language communication activity;
according to the audio file and the image file separated from the video file, carrying out speaker identification to obtain a set of voice sections respectively associated with different speakers;
carrying out voice recognition on the set of the voice sections associated with the speaker to obtain the speaking content of the speaker;
and establishing an association relation between the speaker and the corresponding speaking content.
An electronic device, comprising:
one or more processors; and
memory associated with the one or more processors, the memory for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
receiving a viewpoint file which is sent by a server and is obtained by processing a video of an audio language communication activity, wherein content viewpoints of different speakers are added in the viewpoint file, and the content viewpoints are extracted from a set of voice sections which are determined to be associated with the different speakers according to speaker recognition of an audio file and an image file which are separated from the video;
and pushing the viewpoint file to a user associated with the client.
According to the specific embodiments provided herein, the present application discloses the following technical effects:
according to the embodiment of the application, sound and image separation can be carried out on a video file, voice segmentation is carried out on a separated audio file, at least one voice section is obtained, and start-stop time information and speaker identification information which correspond to each voice section are obtained; meanwhile, the face recognition can be carried out on the separated image files, and face recognition results corresponding to different time are obtained. And then aligning the voiceprint features and the face features to obtain a speaker clustering result, namely determining a speaker corresponding to at least one voice segment. The alignment processing may be based on the start-stop time corresponding to the voice segment, and determining whether the face recognition results corresponding to the start-stop time correspond to the same speaker; alternatively, the alignment process may be to determine whether adjacent speech segments may correspond to the same speaker based on the face recognition result. Therefore, the speaker identity recognition is realized based on the voiceprint features and the face features of the user, and the problem of low recognition accuracy in the prior art is solved.
Of course, it is not necessary for any product to achieve all of the above-described advantages at the same time for practicing the present application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a schematic diagram of a system provided by an embodiment of the present application;
FIG. 2 is a flow chart of a first method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a first interactive text provided by an embodiment of the present application;
FIG. 4 is a diagram of a second interactive text provided by an embodiment of the present application;
FIG. 5 is a flow chart of a second method provided by an embodiment of the present application;
FIG. 6 is a flow chart of a third method provided by embodiments of the present application;
FIG. 7 is a flow chart of a fourth method provided by embodiments of the present application;
FIG. 8 is a schematic view of a first apparatus provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of a second apparatus provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of a third apparatus provided by an embodiment of the present application;
FIG. 11 is a schematic diagram of an architecture of a computer system provided by an embodiment of the present application;
fig. 12 is a schematic diagram of an architecture of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments that can be derived from the embodiments given herein by a person of ordinary skill in the art are intended to be within the scope of the present disclosure.
Example 1
In order to solve the problem of low speaker recognition accuracy caused by the influence of factors such as environmental noise and the like in a multi-person speaking scene, the embodiment of the application provides a tool for speaker recognition based on voiceprint features and face features. The following explains the recognition process with reference to a specific example.
The identification object related to the embodiment of the application is mainly a video file comprising audio and images. After the video file to be recognized is obtained, sound and image separation can be performed firstly, and the video file is split into an audio file capable of extracting voiceprint features and an image file capable of extracting face features. For example, audio files as well as image files may be separated from video files by the multimedia video processing tool FFmpeg (Fast Forward Mpeg).
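As an illustration of this separation step, the following is a minimal sketch that assumes FFmpeg is installed and is driven from Python via subprocess; the file names and the one-frame-per-second rate are assumptions for the example, not values fixed by the application.

```python
import subprocess

def separate_av(video_path: str, audio_path: str, frames_pattern: str) -> None:
    """Split a video into an audio track and per-second image frames with FFmpeg."""
    # Extract the audio stream as 16 kHz mono WAV (a common input format for diarization).
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", audio_path],
        check=True,
    )
    # Extract one image frame per second for later face recognition.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vf", "fps=1", frames_pattern],
        check=True,
    )

# Example (hypothetical paths):
# separate_av("video1.mp4", "video1.wav", "frames/frame_%04d.jpg")
```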
For an audio file, voice segmentation can be performed through a voiceprint tracking technology, voice jumping points among different speakers are found, the audio file is segmented into a plurality of voice sections, and the start-stop time and the speaker identification information corresponding to each voice section are marked. For example, the time length of the video file 1 is 10 minutes, and after the audio file separated from the video file 1 is subjected to voice segmentation, the related information of the voice segments shown in table 1 can be obtained.
TABLE 1
Speech segment | Start/stop time (s) | Speaker identification
Speech segment 1 | 0–11 | id1
Speech segment 2 | 12–550 | id2
Speech segment 3 | 551–600 | id1
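For illustration, the segmentation result of Table 1 could be carried in a simple record type such as the following; this is only a sketch of the data produced by voice segmentation, and the diarization backend itself is not specified by this embodiment.

```python
from dataclasses import dataclass

@dataclass
class SpeechSegment:
    segment_id: str   # e.g. "speech segment 1"
    start_s: float    # start time in seconds
    end_s: float      # end time in seconds
    speaker_id: str   # id produced by voice segmentation, e.g. "id1"

# The content of Table 1 expressed as data:
segments = [
    SpeechSegment("speech segment 1", 0, 11, "id1"),
    SpeechSegment("speech segment 2", 12, 550, "id2"),
    SpeechSegment("speech segment 3", 551, 600, "id1"),
]
```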
For an image file, face recognition can be performed on an image frame included in the image file, and which speakers are included in the image file is determined through face features. As an example, image frames may be extracted from an image file at preset time intervals, and identity information of a speaker corresponding to each image frame may be determined through a face recognition technology. For example, the face recognition result information shown in table 2 below can be obtained by extracting frames in seconds for an image file separated from the video file 1.
TABLE 2
Time (s) | Speaker identity
1 | User 1
2 | User 1
…… | ……
12 | User 2
13 | User 2
14 | User 1
…… | ……
600 | User 2
The speaker identity information may be a number defined during the face recognition process itself. For example, the user determined from the image frame at second 1 may be defined as user 1; if face recognition on the frames for seconds 2 to 11 indicates that they correspond to the same user 1, that number continues to be used. If, when recognizing the frame for second 12, the facial features are found to belong to a new user, that user may be defined as user 2, and so on, with a corresponding number defined for each user appearing in the image file.
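A sketch of this per-second labeling with incremental numbering is shown below. It assumes the third-party face_recognition library is available and that the frames were extracted one per second as in the earlier sketch; the tolerance value and the "user N" naming are assumptions for illustration.

```python
import face_recognition  # assumption: this third-party library is installed

def label_frames(frame_paths):
    """Assign an incremental 'user N' label to the face seen in each per-second frame."""
    known_encodings, labels_by_second = [], {}
    for second, path in enumerate(frame_paths, start=1):
        encodings = face_recognition.face_encodings(face_recognition.load_image_file(path))
        if not encodings:
            continue  # no face detected in this frame
        enc = encodings[0]
        matches = face_recognition.compare_faces(known_encodings, enc, tolerance=0.6)
        if True in matches:
            label = f"user {matches.index(True) + 1}"   # reuse the existing number
        else:
            known_encodings.append(enc)                  # a new face -> define a new number
            label = f"user {len(known_encodings)}"
        labels_by_second[second] = label
    return labels_by_second
```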
Alternatively, a speaker information library may be created in advance, storing the association between the identity information of at least one speaker (for example, the speaker's name) and face feature information. After face recognition extracts the face feature information from an image frame, the library can be queried to determine the identity of the corresponding speaker.
After the recognition results in tables 1 and 2 are obtained, the voiceprint feature and the face feature can be aligned to obtain a speaker clustering result.
The feature alignment process is illustrated below with reference to specific examples.
In one mode, a target face recognition result corresponding to the start-stop time can be obtained according to the start-stop time information corresponding to the voice segment, and whether the target face recognition result corresponds to the same speaker or not is determined.
Specifically, if the target face recognition result corresponds to a target user, that is, the voiceprint feature and the face feature are aligned within the starting time and the ending time, the association relationship between the speech segment and the target user can be directly established.
Taking the speech segment 1 shown in table 1 as an example, the corresponding start-stop time is 0 to 11s, and as can be seen from table 2, the identity information of the speakers identified by the face corresponding to the start-stop time is user 1, that is, the target face identification result only corresponds to one target user. At this time, the speech segment 1 may be determined as the audio generated by the user 1 during speaking, and the association relationship between the speech segment 1 and the user 1 may be established.
Specifically, if the target face recognition result corresponds to at least two users, one target user can be determined from among them; in other words, the at least two users are merged into a single target user so that the voiceprint features and face features are aligned within the start/stop time, and the speech segment is then associated with that target user.
Take speech segment 2 in Table 1 as an example: its start/stop time is 12–550 s, and Table 2 shows that the face recognition results within that window include both user 1 and user 2, i.e. the target face recognition result corresponds to two users. This may happen because, while one user is speaking, the camera briefly cuts to the other user; the actual speaker is usually one of the two, so a single target user can be determined and associated with the speech segment. For example, the user who appears most often within the start/stop time may be chosen as the target user.
In the example above, if the image frames identified as user 1 account for a much smaller proportion of the frames in 12–550 s than those identified as user 2, user 2 is chosen as the target user. That is, speech segment 2 is determined to be audio generated while user 2 was speaking, and the association between speech segment 2 and user 2 is established.
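A sketch of this first alignment mode follows, reusing the `segments` and `labels_by_second` structures assumed in the earlier sketches: the face label that appears most often within a segment's start/stop window is taken as the segment's speaker.

```python
from collections import Counter

def align_segment(segment, labels_by_second):
    """Pick the speaker for one speech segment by majority vote over per-second face labels."""
    window = [
        labels_by_second[s]
        for s in range(int(segment.start_s), int(segment.end_s) + 1)
        if s in labels_by_second
    ]
    if not window:
        return None  # no face result within this segment
    # If only one user appears, the features are aligned directly;
    # otherwise the user who appears most often within the start/stop time is chosen.
    return Counter(window).most_common(1)[0][0]

speaker_of_segment = {seg.segment_id: align_segment(seg, labels_by_second)
                      for seg in segments}
```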
In another mode, whether the adjacent speech segments correspond to the same speaker or not can be determined according to the face recognition result.
Specifically, the target start-stop time for performing alignment processing may be determined according to the start-stop time information corresponding to each of the adjacent voice segments; obtaining a target face recognition result corresponding to the target start-stop time; if the target face recognition result corresponds to a target user, the adjacent voice segments can be determined to correspond to the same speaker, and the adjacent voice segments and the target user can be associated.
Take speech segment 3 in Table 1 as an example: its start/stop time is 551–600 s. If Table 2 shows that the speaker identity within that window is user 2 (either because the face recognition results for the window contain only user 2, or because several users appear but user 2 appears most often), then the adjacent speech segments 2 and 3 were assigned different speaker ids by audio recognition but correspond to the same identity according to image recognition. This can happen when, under the influence of environmental noise, the voice of a single user is split into several ids during voice segmentation; in that case the adjacent speech segments can be mapped to the speaker identified from the images, aligning the voiceprint and face features within the target start/stop time. In other words, speech segments 2 and 3 can both be determined to be audio generated while user 2 was speaking, and the associations between speech segments 2 and 3 and user 2 are established.
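One possible realization of this second alignment mode, again under the assumed data structures and reusing `align_segment` from the previous sketch: adjacent segments whose face track points to the same identity are merged and attributed to that identity.

```python
def merge_adjacent(segments, labels_by_second):
    """Merge adjacent segments that the face track attributes to the same speaker."""
    merged = [segments[0]]
    for seg in segments[1:]:
        prev = merged[-1]
        # Dominant face identity of each of the two adjacent segments.
        who_prev = align_segment(prev, labels_by_second)
        who_cur = align_segment(seg, labels_by_second)
        if who_prev is not None and who_prev == who_cur:
            # Same face identity -> treat as one speaker and merge the segments.
            merged[-1] = SpeechSegment(prev.segment_id, prev.start_s, seg.end_s, prev.speaker_id)
        else:
            merged.append(seg)
    return merged
```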
Through the above alignment process, the speaker clustering results shown in table 3 below can be obtained.
TABLE 3
Speaker | Associated speech segments
User 1 | Speech segment 1 (0–11 s)
User 2 | Speech segment 2 (12–550 s), Speech segment 3 (551–600 s)
It should be noted that, for the case that the same speaker associates at least two speech segments, as an example, as shown in table 3, the association relationship between the speaker and the corresponding at least two speech segments may be established; alternatively, at least two speech segments may be merged, and then the association relationship between the speaker and the merged speech segments is established.
Therefore, the speaker identity recognition is realized based on the voiceprint features and the face features of the user, and the problem of low recognition accuracy in the prior art is solved.
It can be understood that the embodiment of the present application can perform speaker recognition based on a complete video file, and as the above example, can perform analysis processing on 0-600 s of audio files and image files; or, speaker recognition may be performed based on a portion of the video file in which the user is interested, for example, the video file corresponding to the 12 th to 600 th seconds is intercepted, and after sound and image separation, speaker recognition is performed according to the processing procedure described above. Regardless of the manner in which speaker recognition is performed, the audio files and image files may remain aligned in time.
In addition, it should be noted that the speaker recognition tool may be deployed on a terminal device associated with a user, and perform local analysis processing on a video file obtained by the terminal device; or the method can be deployed on a cloud server, analyze and process video files submitted by terminal equipment or other servers, and identify that the video files comprise several speakers and which voice segments each speaker corresponds to respectively.
It is to be appreciated that the speaker recognition tool provided by the embodiments of the present application can be applied in a variety of scenarios. As an example, the method can be applied to the voice language communication activities in which multiple persons participate, such as activities of multi-person meeting, multi-person lecture and the like.
Taking a multi-person conference scenario as an example, identity information of speakers (e.g., participants) may be identified and a set of speech segments generated when each speaker speaks on the conference may be determined. As one example, a meeting perspective set forth by a speaker can be extracted based on a set of speech segments to which the speaker corresponds. Taking a multi-speaker speech scene as an example, identity information of speakers (e.g., speakers) can be identified, and voice segments generated by the speakers during the speech process can be determined. As an example, a speaking viewpoint of a speaker can be extracted based on a set of speech segments corresponding to the speaker.
Or, the method can be applied to movie and television play scenes for identifying the identity information of different roles and the sets of the voice segments corresponding to the roles. As an example, a storyline viewpoint may be extracted based on a set of speech segments corresponding to a role.
The following explains the processing procedure of a video file by taking a video shot in a live speech scene as an example.
Example 2
For a live-streamed lecture, the prior art can only extract the viewpoints of different speakers by manual analysis after the live broadcast ends and then add them to the video file by hand, so the video processing consumes a large amount of labor and time.
Accordingly, the present embodiment provides a tool capable of automatically processing a video file, which may include a client and a server, as shown in fig. 1. The client can be deployed on the terminal device associated with the user, the server can be deployed on the cloud server, for example, on the live server, and the speaker recognition tool provided in embodiment 1 can serve as a function of the server to recognize the speaker identity in the lecture video and determine the set of voice segments associated with different speakers.
The following explains the processing procedure of the video file with reference to the flowchart shown in fig. 2.
S101: the server side obtains a video file generated by a spoken-language communication activity.
In this example, the video file generated by the spoken-language communication activity, i.e. the video of the lecture, may be produced by the live-broadcast server that streams the lecture and submitted to the server side. If the live-broadcast server submits the video after the live broadcast ends, the server side performs post-processing and adds the corresponding content viewpoints (here, the lecturers' lecture viewpoints) to the file for users to review. If the live-broadcast server submits the video while the broadcast is still in progress, the server side can process it in real time and add the corresponding lecture viewpoints to the part of the video that has already been broadcast, so that a viewer may either keep following the ongoing lecture online or choose to review the portion that has already been broadcast.
For example, for a two-hour live lecture that has been running for 30 minutes, the server side may analyze the video generated by those 30 minutes and add lecture viewpoints, allowing a user to review the first 30 minutes of content and jump directly to the speech segment corresponding to a chosen viewpoint.
S102: and according to the audio file and the image file separated from the video file, carrying out speaker identification to obtain a set of voice sections respectively associated with different speakers.
The specific implementation process can be described in the above embodiment 1, and is not described herein again.
Assuming video file 1 is the video generated by the lecture, the recognition result shown in Table 3 is obtained after speaker recognition: speech segment 1 is the audio produced while speaker 1 was speaking, and speech segments 2 and 3 are the audio produced while speaker 2 was speaking.
As an example, a speaker information library may be created before the lecture begins, storing the face feature information of the different speakers, so that when the server side performs face recognition on the image file it can determine a speaker's identity from the library according to the speaker's facial features. The library may also store the voiceprint features of different speakers, so that when the server side performs voiceprint tracking on the audio file it can determine a speaker's identity from the library according to the voiceprint features. The speaker information library may be created by the server side of this embodiment, or created by another server and provided to it; this is not limited by the embodiments of the application.
Preferably, a noise threshold value can be preset, and when it is determined that noise may affect voice segmentation, speaker recognition is performed by performing feature alignment on voiceprint features and face features.
In one approach, if a noise collection device is deployed at the lecture venue, the server side, after obtaining the noise information it collects, determines whether the noise value exceeds a preset threshold (for example, a preset decibel value). If it does, the noise may affect the accuracy of voice segmentation, so the audio file and image file are separated from the video file and speaker recognition is performed according to the scheme of embodiment 1. If it does not, the influence of noise on voice segmentation is small, so only the audio file needs to be extracted from the video file and speaker recognition can be performed from the audio alone.
In another approach, after the server side obtains the video file and completes the audio/video separation, it can decide whether noise interference exists from the volume variation in the audio file. If the volume varies within a limited range, the variation is likely caused by the speaker's own speech, so the audio file alone can be used for speaker recognition. If the volume varies sharply, the cause may be noise interference such as applause or background music, so the audio file and image file are separated from the video file and speaker recognition is performed according to the scheme of embodiment 1.
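Purely as an illustration of the second routing decision, the sketch below estimates the swing in short-term loudness of the separated audio track; the soundfile library, the block length, and the 20 dB swing threshold are all assumptions, not parameters named by the application.

```python
import numpy as np
import soundfile as sf  # assumption: the separation step exported a WAV file

def needs_multimodal_recognition(audio_path: str, swing_threshold_db: float = 20.0) -> bool:
    """Route to voiceprint+face recognition when loudness variation suggests strong noise."""
    samples, _ = sf.read(audio_path)
    if samples.ndim > 1:
        samples = samples.mean(axis=1)  # down-mix to mono
    # Short-term RMS level per ~1 s block, in dB.
    blocks = np.array_split(samples, max(1, len(samples) // 16000))
    levels = [20 * np.log10(np.sqrt(np.mean(b ** 2)) + 1e-12) for b in blocks]
    # Large swings in level (applause, background music) -> use both modalities.
    return (max(levels) - min(levels)) > swing_threshold_db
```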
S103: and carrying out voice recognition on the set of the voice sections associated with the speaker to obtain the speaking content of the speaker.
S104: and establishing an association relation between the speaker and the corresponding speaking content.
Specifically, the speech segments can be converted into the recognition texts by a speech recognition technology, so as to obtain the speaking content of the speaker.
In the example above, speech recognition on speech segment 1 yields speaking content 1 (in this example, embodied as the lecturer's lecture content), and the association between speaker 1 and content 1 is established; speech recognition on speech segments 2 and 3 yields speaking contents 2 and 3, and the association between speaker 2 and contents 2 and 3 is established.
The above example can obtain the association shown in table 4 below through a video processing process.
TABLE 4
Speaker | Associated speaking content
Speaker 1 | Speech content 1
Speaker 2 | Speech content 2, Speech content 3
As an example, after performing voice recognition on the voice segments 2 and 3, the speech content 2 and the speech content 3 may be combined to obtain the speech content 4, and then the association relationship between the speaker 2 and the speech content 4 is established.
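A sketch of steps S103/S104 is shown below, reusing the structures assumed earlier; `transcribe()` is a hypothetical stand-in for whatever speech-recognition service is used, not an API named by this application.

```python
def build_speaker_contents(speaker_of_segment, segments, transcribe):
    """Recognize each speaker's segments and associate the speaker with the merged text."""
    contents = {}
    for seg in segments:
        speaker = speaker_of_segment[seg.segment_id]
        text = transcribe(seg)                      # speech-to-text for this segment
        contents.setdefault(speaker, []).append(text)
    # Optionally merge the per-segment texts into one speaking content per speaker.
    return {speaker: " ".join(texts) for speaker, texts in contents.items()}
```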
S105: and extracting the content viewpoint of the speaker from the speaker-associated speaking content.
In the example above, once the lecture content associated with a lecturer is obtained, the lecturer's lecture viewpoint, i.e. the main points put forward during the lecture, can be extracted from it. For example, the viewpoint may be extracted in an abstractive or extractive manner; the specific implementation follows the related art and is not detailed here.
For speaker 1, the lecture viewpoint can be extracted from lecture content 1. For speaker 2, the lecture viewpoint can be extracted from lecture contents 2 and 3 separately, or from the combined lecture content 4.
In addition, since a speaker usually needs a longer stretch of time to develop a point, the lecture viewpoint is more likely to be found in speech segments with a long duration, so text recognition may be applied preferentially to the longer segments in the set. For example, text recognition may be performed on speech segment 2 and the viewpoint of speaker 2 extracted from lecture content 2.
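The application leaves the summarization method open; purely as an illustration of an extractive variant, the crude sketch below scores sentences by word frequency and keeps the top ones as candidate viewpoints. It is not the method claimed by the application.

```python
import re
from collections import Counter

def extract_viewpoints(speech_content: str, top_n: int = 2) -> list:
    """Return the top-scoring sentences of a speaker's content as candidate viewpoints."""
    sentences = [s.strip() for s in re.split(r"[。！？.!?]", speech_content) if s.strip()]
    freq = Counter(re.findall(r"\w+", speech_content.lower()))
    scored = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower())),
        reverse=True,
    )
    return scored[:top_n]
```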
In summary, the lecture viewpoints of a speaker can be extracted from the lecture video automatically. Compared with the prior-art approach in which viewpoints can only be extracted manually after the live broadcast ends, this speeds up video-file processing and reduces labor and time costs; moreover, the live content can be processed in real time during the broadcast, letting users either review finished portions or keep watching online, which offers a richer user experience.
S106: and post-processing the content viewpoint to generate viewpoint file and transmitting the viewpoint file to the client.
In the embodiment of the present application, the content viewpoint may be delivered to the client in different forms, and the following explains the post-processing process with reference to specific examples.
1. And if the intelligent equipment associated with the client is video playing equipment, the content viewpoint can be issued in a video form. For example, in a teaching scenario, a video with content viewpoints added can be sent to a client for a user to watch learning.
In one mode, the server may add the content view point to the video file, and send the video file containing the content view point to the client for video playing. That is, the viewpoint file generated by the post-processing is a video file containing the content viewpoint.
That is, post-processing can be performed on the basis of the original video to add the content viewpoints. Specifically, the node at which a content viewpoint is added can be determined from the start/stop time of its corresponding speech segment; for example, the viewpoint may be attached to the start-time node of that segment, so that when the user clicks a content viewpoint shown on the video progress bar, the player can jump straight to the corresponding playback segment.
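For instance, the anchoring described above could be carried as a small metadata record alongside the video so that the player can seek to the start-time node when a viewpoint on the progress bar is clicked; the field names below are illustrative assumptions.

```python
viewpoint_markers = [
    # viewpoint text, and the start-time node (s) of its corresponding speech segment
    {"viewpoint": "lecture viewpoint 1", "seek_to": 0},
    {"viewpoint": "lecture viewpoint 2", "seek_to": 12},
]

def seek_for(viewpoint_text, markers):
    """Return the playback position for a viewpoint clicked on the progress bar."""
    for m in markers:
        if m["viewpoint"] == viewpoint_text:
            return m["seek_to"]
    return 0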
Or, in another mode, a demonstration video simulating the communication activity can be generated according to the content viewpoint, and the demonstration video containing the content viewpoint is sent to the client to be played. That is, the viewpoint file generated by the post-processing is a presentation video including the viewpoint of the content.
That is, in order to facilitate the user to quickly know the main viewpoints involved in the communication activities, a demonstration video can be generated based on the content viewpoints, and the demonstration video can be used as an abstract video of an original video to show the content viewpoints of different speakers to the user as concisely as possible.
Specifically, the characters in the demonstration video can be determined according to the characters of the speakers or other virtual characters, and then the content view points corresponding to the speakers can be expressed in the form of communication conversation through different characters.
2. And if the intelligent equipment associated with the client is audio playing equipment, the content viewpoint can be issued in an audio form. For example, after the client is deployed on the smart sound box, the audio file with the content viewpoint added thereto may be sent to the client for the user to listen to.
Specifically, the content viewpoint may be added to the audio file (for example, an added node of the content viewpoint in the audio file may be determined according to the start and end time of a speech segment corresponding to the content viewpoint), and the audio file including the content viewpoint is sent to the client for audio playing. That is, the viewpoint file generated by the post-processing is an audio file containing the content viewpoint.
As an example, the client may simply play the viewpoint file after receiving it. Alternatively, the content viewpoints added to the file may be played first so that the user can choose a target viewpoint of interest; after the user inputs the target viewpoint (with a smart speaker, for instance, by saying it aloud), the client submits it to the server, and the server returns the corresponding audio clip for playback. The speakers' identity information may likewise be played first, letting the user pick a target speaker, with the client and server then cooperating to push the audio clips related to that speaker.
3. The content viewpoints may be delivered in text form.
Specifically, a communication text of the appropriate type can be generated from the content viewpoints according to the client's display type and sent to the client for display. That is, the viewpoint file generated by the post-processing is a communication text containing the content viewpoints.
As an example, the display type of the client may be text display, for example, the content viewpoint is displayed in the form of mail, post, and the like, and correspondingly, the server may generate a communication text of the text type shown in fig. 3, and issue the communication text to the client for display. Or, the display type of the client may be poster display, and correspondingly, the server may generate a communication text of the poster type shown in fig. 4, and issue the communication text to the client for display.
S107: the client receives the viewpoint file sent by the server, obtained by processing the video of the spoken-language communication activity, and pushes the viewpoint file to the user associated with the client.
In the embodiment of the application, the client may push the viewpoint file to the user in a video form, an audio form, and a text form, which may be specifically described above and will not be described herein again.
S108: the client submits a service request to the server, and the server is triggered to provide related services for the user associated with the client.
As can be seen from the foregoing description, the embodiments of the present application can implement data processing of video files. As a preferred scheme, the server may further define an array structure according to requirements, for storing the association relationship between the information of the speaker, the voice segment, the content viewpoint, and the like.
For the video file 1, the association shown in table 5 below can be obtained after video processing.
TABLE 5
(Table 5, shown only as an image in the original publication, records for each speaker of video file 1 the associated speech segments, the recognized speaking content, and the extracted content viewpoints.)
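One possible shape for the array structure mentioned above, continuing the running example; the record layout and field names are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SpeakerRecord:
    speaker: str                                          # e.g. "speaker 2"
    segments: List[str] = field(default_factory=list)     # associated speech segments
    contents: List[str] = field(default_factory=list)     # recognized speaking content
    viewpoints: List[str] = field(default_factory=list)   # extracted content viewpoints

# Illustrative records for video file 1 (values follow the running example):
records = [
    SpeakerRecord("speaker 1", ["speech segment 1"], ["speech content 1"], ["viewpoint 1"]),
    SpeakerRecord("speaker 2", ["speech segment 2", "speech segment 3"],
                  ["speech content 2", "speech content 3"], ["viewpoint 2"]),
]
```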
That is, the client may push the viewpoint file delivered by the server to the user, and may request the server to provide various services for the user. In the following, a speech scene is taken as an example to illustrate the implementation process of providing different services by the server.
1. And generating a personalized live broadcast cover for a forwarding request submitted by a user in the live broadcast process.
Specifically, in the process of watching the live speech on line, if the user wants to perform sharing operation on the live speech, the live speech forwarding request can be submitted to the server through the client associated with the user. Correspondingly, the server may obtain a speech viewpoint corresponding to the content that has completed live broadcasting, for example, a speech viewpoint extracted based on the content that has been live broadcasting for 30 minutes is sent to the client.
As an example, the client may display the lecture viewpoints delivered by the server so that the user can select a target viewpoint and submit it to the server, which then generates a live-broadcast cover from it. The server may, for example, add the target viewpoint to the cover; or it may capture a target image from the speech segment corresponding to the target viewpoint and build the cover from that image together with the viewpoint. Concretely, the start/stop time corresponding to the target viewpoint can be determined from Tables 3 and 5, and the target image captured from the segment within that time window.
After generating the live-broadcast cover from the target lecture viewpoint, the server can produce a forwarding link containing the cover and deliver it to the client as the user's exclusive link for sharing the live broadcast. When the user forwards the broadcast, their personalized cover is displayed. Compared with the prior art, in which only live links with a uniform cover can be forwarded, this forwarding service improves the user experience and can attract more viewers through personalized sharing, increasing the efficiency of propagation.
2. The playing segments which are interesting to the user can be directly positioned when the user looks back.
Specifically, whether the user is reviewing the content after the live broadcast has ended or reviewing already-broadcast content while the stream is still running, the user can submit a play request to the server through the client. The server then sends the lecture viewpoints to the client for display, the user selects a target viewpoint, and the selection is submitted back to the server. Using the associations in Tables 3 and 5, the server determines the speech segment corresponding to the target viewpoint, locates the playback segment in the video file according to that segment's start/stop time, and delivers it to the client for playing. In other words, the client may receive and play only the segments the user is interested in.
Alternatively, if the client already shows the lecture viewpoints added to the video file, for example on the progress bar, the client can capture the target viewpoint selected by the user, generate a play request containing it, and submit the request to the server. For instance, when the user clicks lecture viewpoint 2 on the progress bar, a play request for that viewpoint is generated; from the associations in Tables 3 and 5 the server determines that the user wants to watch the 12–550 s portion of the lecture and instructs the client to load and play that video resource. That is, the client can play the complete lecture video and skip to the segments the user is interested in according to the user's operations.
For a user listening to a speech in an audio form, after submitting a play request through a client, a server can determine a play segment corresponding to a target speech point from an audio file according to the target speech point determined by the user, and issue the play segment to the client for audio play.
3. According to the retrieval information input by the user, the playing segments which are interested by the user can be directly positioned.
Specifically, the client may provide an operation option for submitting the search information, and after obtaining the search information through the operation option, a search request may be generated and submitted to the server.
As one example, the retrieval information submitted by the user may be the identification of a lecturer, i.e. the user wants to watch the segments of one or more particular lecturers. After receiving the retrieval request, the server determines the speech segments associated with that speaker from the associations in Tables 3 and 5, locates the corresponding playback segments in the video file according to their start/stop times, and delivers them to the client for playing.
As another example, the retrieval information may be a content keyword, i.e. the user wants to watch the segments related to certain content. After receiving the request, the server finds the target lecture viewpoint that matches the keyword, determines the corresponding speech segment from Tables 3 and 5, locates the playback segment in the video file according to that segment's start/stop time, and delivers it to the client for playing.
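A sketch of the keyword retrieval path, reusing the `records` and `segments` structures assumed in earlier sketches: the keyword is matched against stored viewpoints, and the matched speaker's segment is mapped back to its start/stop time.

```python
def locate_segment_by_keyword(keyword, records, segments):
    """Map a content keyword to the start/stop time of the matching speaker's segment."""
    for record in records:
        for viewpoint in record.viewpoints:
            if keyword in viewpoint:
                # Return the time window of the speaker's first associated segment.
                seg = next((s for s in segments if s.segment_id == record.segments[0]), None)
                if seg is not None:
                    return record.speaker, (seg.start_s, seg.end_s)
    return None, None
```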
Preferably, in order to improve the accuracy of the user in retrieving the played segments based on the content keywords, the content keywords can be extracted from the lecture viewpoint in advance, and the content keywords are sent to the client for display, so that the user can select the interested content keywords from the content keywords and submit the content keywords to the server.
Similarly, for a user listening to a speech in an audio form, after submitting a retrieval request through the client, the server may determine a corresponding playing segment from the audio file according to the identification information or content keywords of the speaker determined by the user, and issue the playing segment to the client for audio playing.
It will be appreciated that the target speech viewpoint determined in the above example may be the same viewpoint or may be different viewpoints, mainly determined by the actual needs of the user.
Example 3
Embodiment 3 is a method for speaker recognition corresponding to embodiment 1, and referring to fig. 5, the method may specifically include:
s201: separating and obtaining an audio file and an image file from a video file to be identified;
s202: performing voice segmentation on the audio file to obtain start-stop time information and speaker identification information corresponding to at least one voice section, and performing face recognition on the image file to obtain face recognition results corresponding to different times;
s203: and aligning the speaker identification information and the face recognition result in time to determine the speaker corresponding to the at least one voice section.
Example 4
Embodiment 4 is corresponding to embodiment 2, and provides a video file processing method from the perspective of a server, and referring to fig. 6, the method may specifically include:
s301: the server side obtains a video file generated by a spoken-language communication activity;
s302: according to the audio file and the image file separated from the video file, carrying out speaker identification to obtain a set of voice sections respectively associated with different speakers;
s303: carrying out voice recognition on the set of the voice sections associated with the speaker to obtain the speaking content of the speaker;
s304: and establishing an association relationship between the speaker and the corresponding speaking content.
Example 5
Embodiment 5 corresponds to embodiment 2 and provides a video file processing method from the perspective of the client. Referring to fig. 7, the method may specifically include:
S401: the client receives a viewpoint file sent by the server and obtained by processing a video of a spoken-language communication activity, wherein content viewpoints of different speakers are added to the viewpoint file, and the content viewpoints are extracted from the sets of voice segments associated with the different speakers, as determined by performing speaker recognition on the audio file and the image file separated from the video;
S402: pushing the viewpoint file to a user associated with the client.
For the parts of embodiments 3 to 5 that are not described in detail, reference may be made to the descriptions in the foregoing embodiments, which are not repeated here.
Corresponding to embodiment 1, the embodiment of the present application further provides a speaker recognition apparatus, which can be applied to a server, and includes:
a file separation unit 501, configured to separate an audio file and an image file from a video file to be identified;
a voice segmentation unit 502, configured to perform voice segmentation on the audio file to obtain start-stop time information and speaker identification information corresponding to at least one voice segment;
a face recognition unit 503, configured to perform face recognition on the image file to obtain face recognition results corresponding to different times;
an alignment processing unit 504, configured to perform alignment processing on the speaker identification information and the face recognition result in time, and determine a speaker corresponding to the at least one speech segment.
Wherein the apparatus further comprises:
and the merging processing unit is used for merging at least one voice section corresponding to the speaker.
Wherein, the alignment processing unit is specifically configured to:
determining a target start-stop time for alignment processing according to the start-stop time information corresponding to adjacent voice segments; obtaining a target face recognition result corresponding to the target start-stop time; and if the target face recognition result corresponds to one target user, merging the adjacent voice segments and associating the merged voice segment with the target user.
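A sketch of this first alignment strategy, assuming each voice segment is a dict with start-stop times and the face recognition results are indexed by frame time as in the earlier sketch; the data layout is illustrative only.

```python
# Sketch: merge adjacent voice segments when only one face is recognized
# across their combined start-stop window, i.e. an over-split diarization
# result is corrected by the face recognition result.
from typing import Dict, List

def faces_in_window(faces: Dict[float, List[str]], start: float, end: float) -> set:
    return {u for t, users in faces.items() if start <= t <= end for u in users}

def merge_adjacent(segments: List[dict], faces: Dict[float, List[str]]) -> List[dict]:
    if not segments:
        return []
    merged = [dict(segments[0])]
    for seg in segments[1:]:
        prev = merged[-1]
        users = faces_in_window(faces, prev["start"], seg["end"])
        if len(users) == 1:
            # One face across both segments: treat them as a single segment
            # spoken by that user.
            prev["end"] = seg["end"]
            prev["user"] = users.pop()
        else:
            merged.append(dict(seg))
    return merged
```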
The alignment processing unit is specifically configured to:
obtaining a target face recognition result corresponding to the start-stop time according to the start-stop time information corresponding to the voice segment; and if the target face recognition result corresponds to at least two users, determining a target user from the target face recognition result and associating the voice segment with the target user.
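And a sketch of the second strategy, where several users are recognized within one voice segment and the user seen most often inside the segment's start-stop window is taken as the target user (again with the same illustrative data layout).

```python
# Sketch: when at least two users appear within a voice segment's window,
# pick the one recognized in the most frames as the target user.
from collections import Counter
from typing import Dict, List, Optional

def pick_target_user(faces: Dict[float, List[str]],
                     start: float, end: float) -> Optional[str]:
    counts = Counter(u for t, users in faces.items()
                     if start <= t <= end for u in users)
    if not counts:
        return None
    user, _ = counts.most_common(1)[0]
    return user
```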
Wherein the apparatus further comprises:
the device comprises an information base obtaining unit, a speaker information base obtaining unit and a speaker information base obtaining unit, wherein the speaker information base is used for obtaining a speaker information base, and the face characteristic information related to the identity information of at least one speaker is stored in the speaker information base;
and the identity information determining unit is used for determining the identity information of the target user from the speaker information base according to the face feature information of the target user.
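One common way to realize this lookup is to compare the target user's face feature vector against the vectors stored in the speaker information base, e.g. by cosine similarity; the threshold below is an arbitrary illustration, not a value specified by this application.

```python
# Sketch: resolve a target user's identity from a speaker information base
# that stores identity -> face feature vector pairs.
import math
from typing import Dict, List, Optional

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def lookup_identity(face_feature: List[float],
                    speaker_base: Dict[str, List[float]],
                    threshold: float = 0.7) -> Optional[str]:
    best_id, best_score = None, -1.0
    for identity, stored_feature in speaker_base.items():
        score = cosine(face_feature, stored_feature)
        if score > best_score:
            best_id, best_score = identity, score
    return best_id if best_score >= threshold else None
```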
Corresponding to embodiment 2, an embodiment of the present application further provides a video file processing apparatus, and referring to fig. 9, the apparatus is applied to a server, and includes:
a video file obtaining unit 601, configured to obtain a video file generated by a spoken-language communication activity;
a speaker recognition unit 602, configured to perform speaker recognition according to the audio file and the image file separated from the video file, and obtain a set of speech segments associated with different speakers;
a speaking content obtaining unit 603, configured to perform speech recognition on the set of speech segments associated with the speaker, so as to obtain the speaking content of the speaker;
an association relationship establishing unit 604, configured to establish an association relationship between the speaker and the corresponding speaking content.
Wherein the apparatus further comprises:
a content viewpoint extracting unit, configured to extract a content viewpoint of the speaker from the speaker-related speech content;
and the content viewpoint processing unit is used for carrying out post-processing on the content viewpoint and generating viewpoint files to be issued to the client.
Wherein, the content viewpoint processing unit is specifically configured to:
and if the intelligent equipment associated with the client is video playing equipment, adding the content viewpoint into the video file, and sending the video file containing the content viewpoint to the client for video playing.
Wherein, the content viewpoint processing unit is specifically configured to:
and if the intelligent equipment associated with the client is video playing equipment, generating a demonstration video simulating the communication activity according to the content viewpoints, and sending the demonstration video containing the content viewpoints to the client for video playing.
Wherein, the content viewpoint processing unit is specifically configured to:
and if the intelligent equipment associated with the client is audio playing equipment, adding the content viewpoint into the audio file, and sending the audio file containing the content viewpoint to the client for audio playing.
Wherein the content viewpoint processing unit is specifically configured to:
and generating a communication text of a corresponding type from the content viewpoint according to the display type of the client, and sending the communication text to the client for display.
Wherein, if a noise collection device is deployed at the venue of the communication activity, the apparatus further includes:
the noise information acquisition unit is used for acquiring the noise information acquired by the noise acquisition equipment;
and the speaker identification unit is used for separating the audio file and the image file from the video file to identify the speaker when the noise information indicates that the noise value exceeds a preset threshold value.
Wherein the apparatus further comprises:
and the audio file extracting unit is used for extracting an audio file from the video file when the noise information shows that the noise value does not exceed the preset threshold value, and carrying out speaker identification according to the audio file.
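The two units above amount to a simple routing decision on the measured noise value; a sketch follows, reusing the helpers from the sketch after embodiment 3 and treating the threshold as an illustrative placeholder.

```python
# Sketch: above the noise threshold, fall back to joint audio + image
# speaker identification; otherwise audio-only diarization suffices.
import subprocess

NOISE_THRESHOLD_DB = 60.0  # illustrative value only

def route_speaker_identification(video_path: str, noise_db: float):
    if noise_db > NOISE_THRESHOLD_DB:
        # Noisy venue: diarization alone is unreliable, so also use faces
        # (identify_speakers from the earlier sketch).
        return identify_speakers(video_path)
    # Quiet venue: extract only the audio file and diarize it directly.
    audio_path = "audio.wav"
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", audio_path], check=True)
    return diarize_audio(audio_path)
```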
The video file is obtained during the live broadcasting of the communication activity and corresponds to content whose live broadcast has been completed.
Wherein the apparatus further comprises:
a content viewpoint sending unit, configured to send, when a live broadcast forwarding request submitted by the client is obtained, the content viewpoints corresponding to the content that has completed live broadcast to the client, so that a user associated with the client can select a target content viewpoint from them;
a live broadcast cover generation unit, configured to receive the target content viewpoint submitted by the client and generate a live broadcast cover according to the target content viewpoint;
and a forwarding link issuing unit, configured to obtain a forwarding link containing the live broadcast cover and issue the forwarding link to the client for live broadcast forwarding.
The association relationship establishing unit is further configured to establish an association relationship among the speaker, the set of voice segments, and the content viewpoint.
Wherein the apparatus further comprises:
the content viewpoint sending unit is used for sending the content viewpoint to the client when a playing request submitted by the client is obtained, so that a user associated with the client can select a target content viewpoint from the content viewpoint;
a play segment determining unit, configured to receive the target content viewpoint submitted by the client; determining a target voice segment corresponding to the target content viewpoint, and determining a playing segment corresponding to the target content viewpoint from the video file according to the start-stop time of the target voice segment;
and the playing segment issuing unit is used for issuing the playing segment to the client for playing.
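Once the target voice segment's start-stop time is known, the playing segment can be cut out of the video file with a stream copy; a sketch assuming ffmpeg is available (the output path is arbitrary, and cutting on stream copy lands on packet boundaries rather than exact times).

```python
# Sketch: cut the playing segment [start, end] out of the video file so it
# can be issued to the client for playing.
import subprocess

def cut_playing_segment(video_path: str, start: float, end: float,
                        out_path: str = "playing_segment.mp4") -> str:
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_path,
        "-ss", str(start), "-to", str(end),
        "-c", "copy",   # stream copy: no re-encoding
        out_path,
    ], check=True)
    return out_path
```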
Wherein the apparatus further comprises:
a retrieval request obtaining unit, configured to obtain a retrieval request submitted by a client, where the retrieval request includes identification information of a speaker;
a playing segment determining unit, configured to determine a target voice segment associated with the speaker, and determine a playing segment corresponding to the speaker from the video file according to a start-stop time of the target voice segment;
and the playing segment issuing unit is used for issuing the playing segment to the client for playing.
Wherein the apparatus further comprises:
a retrieval request obtaining unit, configured to obtain a retrieval request submitted by a client, where the retrieval request includes a content keyword;
a play segment determining unit, configured to determine, from the content viewpoints, a target content viewpoint that matches the content keyword; obtaining a target voice segment corresponding to the target content viewpoint, and determining a playing segment corresponding to the target content viewpoint from the video file according to the start-stop time of the target voice segment;
and the playing segment issuing unit is used for issuing the playing segments to the client for playing.
Corresponding to embodiment 2, an embodiment of the present application further provides a video file processing apparatus, referring to fig. 10, where the apparatus is applied to a client, and includes:
a viewpoint file receiving unit 701, configured to receive a viewpoint file sent by the server and obtained by processing a video of a spoken-language communication activity, where content viewpoints of different speakers are added to the viewpoint file, and the content viewpoints are extracted from the sets of voice segments associated with the different speakers, as determined by performing speaker recognition on the audio file and the image file separated from the video;
a viewpoint file pushing unit 702, configured to push the viewpoint file to a user associated with the client.
In addition, an embodiment of the present application further provides an electronic device, including:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
separating and obtaining an audio file and an image file from a video file to be identified;
performing voice segmentation on the audio file to obtain start and stop time information and speaker identification information corresponding to at least one voice section, and performing face recognition on the image file to obtain face recognition results corresponding to different times;
and aligning the speaker identification information and the face recognition result in time to determine the speaker corresponding to the at least one voice section.
And an electronic device comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
obtaining a video file generated by a spoken-language communication activity;
according to the audio file and the image file separated from the video file, carrying out speaker identification to obtain a set of voice sections respectively associated with different speakers;
carrying out voice recognition on the set of the voice sections associated with the speaker to obtain the speaking content of the speaker;
and establishing an association relationship between the speaker and the corresponding speaking content.
And an electronic device comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
receiving a viewpoint file sent by a server and obtained by processing a video of a spoken-language communication activity, wherein content viewpoints of different speakers are added to the viewpoint file, and the content viewpoints are extracted from the sets of voice segments associated with the different speakers, as determined by performing speaker recognition on an audio file and an image file separated from the video;
and pushing the viewpoint file to a user associated with the client.
FIG. 11 illustrates an architecture of a computer system that may include, in particular, a processor 810, a video display adapter 811, a disk drive 812, an input/output interface 813, a network interface 814, and a memory 820. The processor 810, the video display adapter 811, the disk drive 812, the input/output interface 813, the network interface 814, and the memory 820 may be communicatively connected by a communication bus 830.
The processor 810 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the present application.
The memory 820 may be implemented in the form of a ROM (Read-Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 820 may store an operating system 821 for controlling the operation of the computer system 800 and a Basic Input Output System (BIOS) for controlling low-level operations of the computer system 800. In addition, a web browser 823, a data storage management system 824, a video file processing system 825, and the like may also be stored. The video file processing system 825 may be the server side that specifically implements the operations of the foregoing steps in the embodiments of the present application. In short, when the technical solution provided by the present application is implemented by software or firmware, the relevant program code is stored in the memory 820 and invoked for execution by the processor 810.
The input/output interface 813 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component within the device (not shown) or may be external to the device to provide corresponding functionality. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The network interface 814 is used for connecting a communication module (not shown in the figure) to realize communication interaction between this device and other devices. The communication module may communicate in a wired manner (e.g., USB or network cable) or in a wireless manner (e.g., mobile network, WiFi, or Bluetooth).
Bus 830 includes a pathway for communicating information between various components of the device, such as processor 810, video display adapter 811, disk drive 812, input/output interface 813, network interface 814, and memory 820.
It should be noted that although the above-mentioned devices only show the processor 810, the video display adapter 811, the disk drive 812, the input/output interface 813, the network interface 814, the memory 820, the bus 830, etc., in a specific implementation, the devices may also include other components necessary for normal operation. In addition, it will be understood by those skilled in the art that the above-described apparatus may also include only the components necessary to implement the embodiments of the present application, and need not include all of the components shown in the figures.
Fig. 12 illustrates an architecture of an electronic device. For example, the device 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, an aircraft, and so on.
Referring to fig. 12, device 900 may include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916.
The processing component 902 generally controls the overall operation of the device 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing element 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the methods provided by the disclosed solution. Further, processing component 902 can include one or more modules that facilitate interaction between processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operation at the device 900. Examples of such data include instructions for any application or method operating on device 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile or non-volatile storage devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A power supply component 906 provides power to the various components of the device 900. The power components 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 900.
The multimedia components 908 include a screen that provides an output interface between the device 900 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 900 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, audio component 910 includes a Microphone (MIC) configured to receive external audio signals when device 900 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 904 or transmitted via the communication component 916. In some embodiments, audio component 910 further includes a speaker for outputting audio signals.
I/O interface 912 provides an interface between processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing status assessment of various aspects of the device 900. For example, the sensor component 914 can detect an open/closed state of the device 900 and the relative positioning of components, such as the display and keypad of the device 900; it can also detect a change in position of the device 900 or of a component of the device 900, the presence or absence of user contact with the device 900, the orientation or acceleration/deceleration of the device 900, and a change in temperature of the device 900. The sensor component 914 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate communication between the device 900 and other devices in a wired or wireless manner. The device 900 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 904 comprising instructions, executable by the processor 920 of the device 900 to perform the methods provided by the present disclosure is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solutions of the present application or portions thereof that contribute to the prior art may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the method according to the embodiments or some portions of the embodiments of the present application.
All the embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, the system or system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The speaker recognition method, apparatus, and electronic device, and the video file processing methods, apparatuses, and electronic devices provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementation of the present application, and the description of the above embodiments is only intended to help in understanding the method and core idea of the present application. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and the scope of application. In view of the above, the content of this description should not be construed as limiting the present application.

Claims (27)

1. A speaker recognition method, comprising:
separating an audio file and an image file from a video file to be identified;
performing voice segmentation on the audio file to obtain start-stop time information and speaker identification information corresponding to at least one voice segment, and performing face recognition on the image file to obtain face recognition results corresponding to image frames at different time points; wherein the at least one voice segment obtained by the voice segmentation comprises: a voice segment whose start-stop time is incorrectly split and/or whose speaker identification information is incorrect due to the influence of noise information contained in the audio file;
and aligning, in time, the start-stop time information and speaker identification information of the voice segments obtained by the voice segmentation with the face recognition results, and determining the speaker corresponding to the at least one voice segment, so as to correct the start-stop time information and the speaker identification information of the voice segments obtained by the voice segmentation according to the face recognition results.
2. The method of claim 1, further comprising:
and merging the at least one voice segment corresponding to the speaker.
3. The method of claim 1,
the aligning the speaker identification information and the face recognition result in time comprises:
determining target start-stop time for alignment processing according to the start-stop time information corresponding to the adjacent voice segments;
obtaining a target face recognition result corresponding to the target start-stop time;
and if the target face recognition result corresponds to one target user, merging the adjacent voice segments, and associating the merged voice segment with the target user.
4. The method of claim 1,
the aligning the speaker identification information and the face recognition result in time comprises:
obtaining a target face recognition result corresponding to the start-stop time according to the start-stop time information corresponding to the voice segment;
and if the target face recognition result corresponds to at least two users, determining a target user from the target face recognition results, and associating the voice segment with the target user.
5. The method of claim 4,
determining a target user from the at least two users, comprising:
and determining the user that appears the largest number of times within the start-stop time as the target user.
6. The method of any of claims 3 to 5, further comprising:
obtaining a speaker information base, wherein the face characteristic information related to the identity information of at least one speaker is stored in the speaker information base;
and determining the identity information of the target user from the speaker information base according to the face feature information of the target user.
7. A video file processing method, comprising:
the server obtains a video file generated by a spoken-language communication activity;
performing speaker identification according to the audio file and the image file separated from the video file, to obtain sets of voice segments respectively associated with different speakers; wherein the sets of voice segments are obtained by: performing voice segmentation on the audio file to obtain start-stop time information and speaker identification information corresponding to at least one voice segment, wherein the at least one voice segment obtained by the voice segmentation comprises: a voice segment whose start-stop time is incorrectly split and/or whose speaker identification information is incorrect due to the influence of noise information contained in the audio file; and performing face recognition on the image file to obtain face recognition results corresponding to image frames at different time points, and correcting the start-stop time information and the speaker identification information of the voice segments obtained by the voice segmentation according to the face recognition results;
performing voice recognition on the set of voice segments associated with a speaker to obtain the speaking content of the speaker;
and establishing an association relationship between the speaker and the corresponding speaking content.
8. The method of claim 7, further comprising:
extracting a content viewpoint of the speaker from the speaker-associated speaking content;
and post-processing the content viewpoint to generate a viewpoint file and issuing the viewpoint file to the client.
9. The method of claim 8,
wherein post-processing the content viewpoint to generate a viewpoint file and issuing the viewpoint file to the client comprises:
and if the intelligent equipment associated with the client is video playing equipment, adding the content viewpoint into the video file, and sending the video file containing the content viewpoint to the client for video playing.
10. The method of claim 8,
wherein post-processing the content viewpoint to generate a viewpoint file and issuing the viewpoint file to the client comprises:
and if the intelligent equipment associated with the client is video playing equipment, generating a demonstration video simulating the communication activity according to the content viewpoints, and sending the demonstration video containing the content viewpoints to the client for video playing.
11. The method of claim 8,
wherein post-processing the content viewpoint to generate a viewpoint file and issuing the viewpoint file to the client comprises:
and if the intelligent equipment associated with the client is audio playing equipment, adding the content viewpoint into the audio file, and sending the audio file containing the content viewpoint to the client for audio playing.
12. The method of claim 8,
wherein post-processing the content viewpoint to generate a viewpoint file and issuing the viewpoint file to the client comprises:
and generating a communication text of a corresponding type from the content viewpoint according to the display type of the client, and sending the communication text to the client for display.
13. The method of claim 7, wherein, if a noise collection device is deployed at the venue of the communication activity, the method further comprises:
acquiring noise information acquired by the noise acquisition equipment;
and if the noise information indicates that the noise value exceeds a preset threshold value, separating the audio file and the image file from the video file.
14. The method of claim 13, further comprising:
and if the noise information indicates that the noise value does not exceed the preset threshold value, extracting an audio file from the video file, and identifying the speaker according to the audio file.
15. The method of claim 7,
the video file is obtained during the live broadcasting of the communication activity and corresponds to content whose live broadcast has been completed.
16. The method of claim 15, further comprising:
if a live broadcast forwarding request submitted by a client is obtained, sending the content viewpoint corresponding to the content whose live broadcast has been completed to the client, so that a user associated with the client can select a target content viewpoint;
receiving the target content viewpoint submitted by the client, and generating a live broadcast cover according to the target content viewpoint;
and acquiring a forwarding link containing the live broadcast cover, and issuing the forwarding link to the client for live broadcast forwarding.
17. The method of claim 8, further comprising:
and establishing an association relationship among the speaker, the set of voice segments, and the content viewpoint.
18. The method of claim 17, further comprising:
if a playing request submitted by a client is obtained, sending the content viewpoint to the client for a user associated with the client to select a target content viewpoint;
receiving the target content viewpoint submitted by the client;
determining a target voice segment corresponding to the target content viewpoint, and determining a playing segment corresponding to the target content viewpoint according to the start-stop time of the target voice segment;
and sending the playing fragments to the client for playing.
19. The method of claim 17, further comprising:
obtaining a retrieval request submitted by a client, wherein the retrieval request comprises identification information of a speaker;
determining a target voice segment associated with the speaker, and determining a playing segment corresponding to the speaker according to the starting and ending time of the target voice segment;
and sending the playing fragments to the client for playing.
20. The method of claim 17, further comprising:
obtaining a retrieval request submitted by a client, wherein the retrieval request comprises content keywords;
determining a target content viewpoint matched with the content keyword from the content viewpoints;
obtaining a target voice segment corresponding to the target content viewpoint, and determining a playing segment corresponding to the target content viewpoint according to the start-stop time of the target voice segment;
and sending the playing segments to the client for playing.
21. A video file processing method, comprising:
the method comprises the steps that a client receives a viewpoint file sent by a server and obtained by processing a video of a spoken-language communication activity, wherein content viewpoints of different speakers are added in the viewpoint file, and the content viewpoints are extracted from sets of voice segments associated with the different speakers, as determined by performing speaker recognition on an audio file and an image file separated from the video; wherein, when the speaker recognition is performed, voice segmentation is performed on the audio file to obtain start-stop time information and speaker identification information corresponding to at least one voice segment, and face recognition is performed on the image file to obtain face recognition results corresponding to image frames at different time points; wherein the at least one voice segment obtained by the voice segmentation comprises: a voice segment whose start-stop time is incorrectly split and/or whose speaker identification information is incorrect due to the influence of noise information contained in the audio file; and the start-stop time information and speaker identification information of the voice segments obtained by the voice segmentation are aligned in time with the face recognition results, and the speaker corresponding to the at least one voice segment is determined, so as to correct the start-stop time information and the speaker identification information of the voice segments obtained by the voice segmentation according to the face recognition results;
and pushing the viewpoint file to a user associated with the client.
22. A speaker recognition apparatus, comprising:
a file separation unit, configured to separate an audio file and an image file from a video file to be identified;
a voice segmentation unit, configured to perform voice segmentation on the audio file to obtain start-stop time information and speaker identification information corresponding to at least one voice segment; wherein the at least one voice segment obtained by the voice segmentation comprises: a voice segment whose start-stop time is incorrectly split and/or whose speaker identification information is incorrect due to the influence of noise information contained in the audio file;
a face recognition unit, configured to perform face recognition on the image file to obtain face recognition results corresponding to image frames at different time points;
and an alignment processing unit, configured to align, in time, the start-stop time information and speaker identification information of the voice segments obtained by the voice segmentation with the face recognition results, and determine the speaker corresponding to the at least one voice segment, so as to correct the start-stop time information and the speaker identification information of the voice segments obtained by the voice segmentation according to the face recognition results.
23. A video file processing apparatus, applied to a server, comprising:
a video file obtaining unit, configured to obtain a video file generated by a spoken-language communication activity;
a speaker recognition unit, configured to perform speaker identification according to the audio file and the image file separated from the video file, to obtain sets of voice segments respectively associated with different speakers; wherein the sets of voice segments are obtained by: performing voice segmentation on the audio file to obtain start-stop time information and speaker identification information corresponding to at least one voice segment, wherein the at least one voice segment obtained by the voice segmentation comprises: a voice segment whose start-stop time is incorrectly split and/or whose speaker identification information is incorrect due to the influence of noise information contained in the audio file; and performing face recognition on the image file to obtain face recognition results corresponding to image frames at different time points, and correcting the start-stop time information and the speaker identification information of the voice segments obtained by the voice segmentation according to the face recognition results;
a speaking content obtaining unit, configured to perform voice recognition on the set of voice segments associated with a speaker to obtain the speaking content of the speaker;
and an association relationship establishing unit, configured to establish an association relationship between the speaker and the corresponding speaking content.
24. A video file processing apparatus, applied to a client, comprising:
a viewpoint file receiving unit, configured to receive a viewpoint file sent by a server and obtained by processing a video of a spoken-language communication activity, wherein content viewpoints of different speakers are added in the viewpoint file, and the content viewpoints are extracted from sets of voice segments associated with the different speakers, as determined by performing speaker recognition on an audio file and an image file separated from the video; wherein, when the speaker recognition is performed, voice segmentation is performed on the audio file to obtain start-stop time information and speaker identification information corresponding to at least one voice segment, and face recognition is performed on the image file to obtain face recognition results corresponding to image frames at different time points; wherein the at least one voice segment obtained by the voice segmentation comprises: a voice segment whose start-stop time is incorrectly split and/or whose speaker identification information is incorrect due to the influence of noise information contained in the audio file; and the start-stop time information and speaker identification information of the voice segments obtained by the voice segmentation are aligned in time with the face recognition results, and the speaker corresponding to the at least one voice segment is determined, so as to correct the start-stop time information and the speaker identification information of the voice segments obtained by the voice segmentation according to the face recognition results;
and a viewpoint file pushing unit, configured to push the viewpoint file to a user associated with the client.
25. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
separating an audio file and an image file from a video file to be identified;
performing voice segmentation on the audio file to obtain start-stop time information and speaker identification information corresponding to at least one voice segment, and performing face recognition on the image file to obtain face recognition results corresponding to image frames at different time points; wherein the at least one voice segment obtained by the voice segmentation comprises: a voice segment whose start-stop time is incorrectly split and/or whose speaker identification information is incorrect due to the influence of noise information contained in the audio file;
and aligning, in time, the start-stop time information and speaker identification information of the voice segments obtained by the voice segmentation with the face recognition results, and determining the speaker corresponding to the at least one voice segment, so as to correct the start-stop time information and the speaker identification information of the voice segments obtained by the voice segmentation according to the face recognition results.
26. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
obtaining a video file generated by a spoken-language communication activity;
performing speaker identification according to the audio file and the image file separated from the video file, to obtain sets of voice segments respectively associated with different speakers; wherein the sets of voice segments are obtained by: performing voice segmentation on the audio file to obtain start-stop time information and speaker identification information corresponding to at least one voice segment, wherein the at least one voice segment obtained by the voice segmentation comprises: a voice segment whose start-stop time is incorrectly split and/or whose speaker identification information is incorrect due to the influence of noise information contained in the audio file; and performing face recognition on the image file to obtain face recognition results corresponding to image frames at different time points, and correcting the start-stop time information and the speaker identification information of the voice segments obtained by the voice segmentation according to the face recognition results;
performing voice recognition on the set of voice segments associated with a speaker to obtain the speaking content of the speaker;
and establishing an association relationship between the speaker and the corresponding speaking content.
27. An electronic device, comprising:
one or more processors; and
memory associated with the one or more processors, the memory for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
receiving a viewpoint file sent by a server and obtained by processing a video of a spoken-language communication activity, wherein content viewpoints of different speakers are added in the viewpoint file, and the content viewpoints are extracted from sets of voice segments associated with the different speakers, as determined by performing speaker recognition on an audio file and an image file separated from the video; wherein, when the speaker recognition is performed, voice segmentation is performed on the audio file to obtain start-stop time information and speaker identification information corresponding to at least one voice segment, and face recognition is performed on the image file to obtain face recognition results corresponding to image frames at different time points; wherein the at least one voice segment obtained by the voice segmentation comprises: a voice segment whose start-stop time is incorrectly split and/or whose speaker identification information is incorrect due to the influence of noise information contained in the audio file; and the start-stop time information and speaker identification information of the voice segments obtained by the voice segmentation are aligned in time with the face recognition results, and the speaker corresponding to the at least one voice segment is determined, so as to correct the start-stop time information and the speaker identification information of the voice segments obtained by the voice segmentation according to the face recognition results;
and pushing the viewpoint file to a user associated with the client.
CN201910959899.2A 2019-10-10 2019-10-10 Speaker recognition method and device and electronic equipment Active CN112653902B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910959899.2A CN112653902B (en) 2019-10-10 2019-10-10 Speaker recognition method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910959899.2A CN112653902B (en) 2019-10-10 2019-10-10 Speaker recognition method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112653902A CN112653902A (en) 2021-04-13
CN112653902B true CN112653902B (en) 2023-04-11

Family

ID=75343469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910959899.2A Active CN112653902B (en) 2019-10-10 2019-10-10 Speaker recognition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112653902B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113592116B (en) * 2021-09-28 2022-03-01 阿里云计算有限公司 Equipment state analysis method, device, equipment and storage medium
CN114299952B (en) * 2021-12-29 2022-08-19 湖北微模式科技发展有限公司 Speaker role distinguishing method and system combining multiple motion analysis
CN114299953B (en) * 2021-12-29 2022-08-23 湖北微模式科技发展有限公司 Speaker role distinguishing method and system combining mouth movement analysis
CN114630179A (en) * 2022-03-17 2022-06-14 维沃移动通信有限公司 Audio extraction method and electronic equipment
CN115880744B (en) * 2022-08-01 2023-10-20 北京中关村科金技术有限公司 Lip movement-based video character recognition method, device and storage medium
CN116303296B (en) * 2023-05-22 2023-08-25 天宇正清科技有限公司 Data storage method, device, electronic equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102572372B (en) * 2011-12-28 2018-10-16 中兴通讯股份有限公司 The extracting method and device of meeting summary
CN106548793A (en) * 2015-09-16 2017-03-29 中兴通讯股份有限公司 Storage and the method and apparatus for playing audio file
CN105224925A (en) * 2015-09-30 2016-01-06 努比亚技术有限公司 Video process apparatus, method and mobile terminal
CN105512348B (en) * 2016-01-28 2019-03-26 北京旷视科技有限公司 For handling the method and apparatus and search method and device of video and related audio
CN109410957B (en) * 2018-11-30 2023-05-23 福建实达电脑设备有限公司 Front human-computer interaction voice recognition method and system based on computer vision assistance

Also Published As

Publication number Publication date
CN112653902A (en) 2021-04-13

Similar Documents

Publication Publication Date Title
CN112653902B (en) Speaker recognition method and device and electronic equipment
CN110505491B (en) Live broadcast processing method and device, electronic equipment and storage medium
US9621851B2 (en) Augmenting web conferences via text extracted from audio content
CN108847214B (en) Voice processing method, client, device, terminal, server and storage medium
US8972262B1 (en) Indexing and search of content in recorded group communications
CN110708607B (en) Live broadcast interaction method and device, electronic equipment and storage medium
CN110602516A (en) Information interaction method and device based on live video and electronic equipment
CN109151565B (en) Method and device for playing voice, electronic equipment and storage medium
CN110691281B (en) Video playing processing method, terminal device, server and storage medium
CN105828101A (en) Method and device for generation of subtitles files
CN111294606B (en) Live broadcast processing method and device, live broadcast client and medium
WO2019047850A1 (en) Identifier displaying method and device, request responding method and device
CN112954390B (en) Video processing method, device, storage medium and equipment
CN109729367B (en) Method and device for providing live media content information and electronic equipment
CN112532931A (en) Video processing method and device and electronic equipment
CN107247794B (en) Topic guiding method in live broadcast, live broadcast device and terminal equipment
CN112151041B (en) Recording method, device, equipment and storage medium based on recorder program
CN116996702A (en) Concert live broadcast processing method and device, storage medium and electronic equipment
CN111161710A (en) Simultaneous interpretation method and device, electronic equipment and storage medium
CN113792178A (en) Song generation method and device, electronic equipment and storage medium
US10657202B2 (en) Cognitive presentation system and method
CN114727119A (en) Live broadcast and microphone connection control method and device and storage medium
CN111160051A (en) Data processing method and device, electronic equipment and storage medium
CN112584225A (en) Video recording processing method, video playing control method and electronic equipment
CN113259754A (en) Video generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant