CN117939238A - Character recognition method, system, computing device and computer-readable storage medium
- Publication number: CN117939238A
- Application number: CN202410034130.0A
- Authority: CN (China)
- Legal status: Pending
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
Abstract
Embodiments of the present disclosure provide a character recognition method, system, computing device, and computer-readable storage medium. The method includes: acquiring image data and audio data in a video; classifying personas using voiceprint features of the audio data, and identifying a first persona set corresponding to the audio data; judging, using image features of the image data, whether any persona in the first persona set corresponds to multiple people; if so, decomposing that persona into multiple personas to obtain a second persona set; for the multiple personas in the second persona set, judging whether they correspond to the same person based on the voiceprint features and/or image features corresponding to those personas; and if so, merging the multiple personas into the same persona to obtain a third persona set.
Description
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular to a character recognition method and system, a computing device, and a computer-readable storage medium.
Background
In a multi-person conversation scenario, role separation is a popular area of speech signal processing whose goal is to determine the number of speakers in a piece of multi-person conversation audio and which speaker is talking during which time period. It is currently widely applied in multi-person conference, customer-service telephone and sales scenarios. To segment a piece of audio by speaker conveniently and quickly, character recognition technology is required to provide persona information. Character recognition can greatly improve users' work efficiency in scenarios such as meeting summaries or text excerpts.
However, when the recording environment is noisy, the recording device is far from the speaker, the recorded audio is of poor quality, or the speakers' voices are similar, character recognition accuracy drops and the user experience suffers. How to recognize personas more accurately is therefore an important problem to be solved.
Disclosure of Invention
In view of this, the embodiments of the present specification provide a character recognition method. One or more embodiments of the present specification also relate to a character recognition system, a computing device, a computer-readable storage medium, and a computer program, which address the technical drawbacks of the prior art.
According to a first aspect of embodiments of the present specification, there is provided a character recognition method, including: acquiring image data and audio data in a video; classifying personas using voiceprint features of the audio data, and identifying a first persona set corresponding to the audio data; judging, using image features of the image data, whether any persona in the first persona set corresponds to multiple people; if so, decomposing that persona into multiple personas to obtain a second persona set; for the multiple personas in the second persona set, judging whether they correspond to the same person based on the voiceprint features and/or image features corresponding to those personas; and if so, merging the multiple personas into the same persona to obtain a third persona set.
According to a second aspect of embodiments of the present specification, there is provided a character recognition method, including: receiving a live video, and acquiring image data and audio data in the live video; classifying personas using voiceprint features of the audio data, and identifying a first persona set corresponding to the audio data; judging, using image features of the image data, whether any persona in the first persona set corresponds to multiple people; if so, decomposing that persona into multiple personas to obtain a second persona set; for the multiple personas in the second persona set, judging whether they correspond to the same person based on the voiceprint features and/or image features corresponding to those personas; if so, merging the multiple personas into the same persona to obtain a third persona set; and sending persona information corresponding to the third persona set to any one or more live users corresponding to the live video.
According to a third aspect of embodiments of the present specification, there is provided a character recognition system, including: a cloud-side device configured to receive a character recognition request, obtain a video according to the request, acquire image data and audio data in the video, classify personas using voiceprint features of the audio data, identify a first persona set corresponding to the audio data, judge, using image features of the image data, whether any persona in the first persona set corresponds to multiple people, if so, decompose that persona into multiple personas to obtain a second persona set, judge, for the multiple personas in the second persona set, whether they correspond to the same person based on the voiceprint features and/or image features corresponding to those personas, and if so, merge the multiple personas into the same persona to obtain a third persona set; and an end-side device configured to send the character recognition request to the cloud-side device and receive the third persona set returned by the cloud-side device.
According to a fourth aspect of embodiments of the present specification, there is provided a computing device comprising: a memory and a processor; the memory is configured to store computer-executable instructions that, when executed by the processor, perform the steps of the character recognition method described above.
According to a fifth aspect of embodiments of the present specification, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the above-described character recognition method.
According to a sixth aspect of the embodiments of the present specification, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the above-described character recognition method.
According to the character recognition method, after the image data and audio data in a video are acquired, personas are classified using the voiceprint features of the audio data, and a first persona set corresponding to the audio data is identified. The image features of the image data are then used to judge whether any persona in the first persona set corresponds to multiple people; if so, that persona is decomposed into multiple personas to obtain a second persona set. For the multiple personas in the second persona set, whether they correspond to the same person is judged based on the voiceprint features and/or image features corresponding to those personas; if so, they are merged into the same persona to obtain a third persona set. The method thus combines the voiceprint features in the audio with the image features in the video: it identifies personas by analyzing voiceprint features, uses the video's image features to further assist in judging whether a persona may mix multiple people, decomposes such personas, and then merges personas that are likely the same person, improving the accuracy of character recognition.
Drawings
FIG. 1 is a block diagram of a character recognition system according to one embodiment of the present disclosure;
FIG. 2 is a flow chart of a method of character recognition provided in one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of the audio recognition process provided in one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a face detection and correction persona set according to one embodiment of the present disclosure;
FIG. 5 is a flowchart of a method for character recognition according to another embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a live application scenario of a character recognition method according to another embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a character recognition device according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of a character recognition system according to one embodiment of the present disclosure;
FIG. 9 is a block diagram of a computing device provided in one embodiment of the present description.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. However, this description can be implemented in many other ways than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; this description is therefore not limited by the specific implementations disclosed below.
The terminology used in one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting. As used in this specification, in one or more embodiments, and in the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of the present description, "first" may also be referred to as "second", and similarly "second" may be referred to as "first". Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
Furthermore, it should be noted that the user information (including but not limited to user equipment information and user personal information) and data (including but not limited to data for analysis, stored data and presented data) involved in one or more embodiments of the present disclosure are information and data authorized by the user or sufficiently authorized by all parties; the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entries are provided for users to choose to authorize or refuse.
The audio features and face features referred to in one or more embodiments of the present description may be extracted by a machine learning model. The machine learning model may be a deep-learning large model with large-scale parameters, typically containing hundreds of millions, billions, or even more than hundreds of billions of parameters. In practical applications, a pretrained large model can be adapted to different tasks by fine-tuning on only a small number of samples. Large models are widely used in fields such as natural language processing (NLP) and computer vision: in computer vision they serve tasks such as visual question answering (VQA), image captioning (IC) and image generation, and in natural language processing they serve tasks such as text-based emotion classification, text summarization and machine translation. Typical application scenarios of large models include digital assistants, intelligent robots, search, online education, office software, e-commerce and intelligent design.
First, terms related to one or more embodiments of the present specification will be explained.
Role separation: given an audio signal of a multi-person conversation, determine how many speakers are present in the audio and which speaker is talking during which time period. Matching the audio with video to assist role separation can improve recognition accuracy.
Voiceprint recognition: in speech signal processing, given an audio signal containing a human voice, extract features that can represent the identity of the speaker. To improve recognition accuracy, recognition is performed as far as possible on audio units containing the voice of a single person.
Face detection: in image signal processing, given a picture, detect whether a face appears in the picture and output its position.
Video speaker detection: in video signal processing, given a continuous video of a face, detect whether that person is speaking.
Face recognition: in image signal processing, extract features that can represent the identity of the speaker. To improve recognition accuracy, recognition is performed as far as possible on face images containing a single person.
Clustering: given a set of feature vectors, divide the set into different clusters such that the features within each cluster are similar.
At present, character recognition technology is widely applied in multi-person conference, customer-service telephone and sales scenarios. However, when the recording environment is noisy, the recording device is far from the speaker, the recorded audio is of poor quality, or the speakers' voices are similar, recognition accuracy drops and the user experience suffers.
In view of this, the embodiments of the present disclosure provide a character recognition method that combines voiceprint features in the audio with face features in the video. Specifically, after the image data and audio data in a video are acquired, personas are first classified using the voiceprint features of the audio data, and a first persona set corresponding to the audio data is identified. The image features of the image data are then used to judge whether any persona in the first persona set corresponds to multiple people; if so, that persona is decomposed into multiple personas to obtain a second persona set. For the multiple personas in the second persona set, whether they correspond to the same person is judged based on the voiceprint features and/or image features corresponding to those personas; if so, they are merged into the same persona to obtain a third persona set. The method identifies personas by analyzing voiceprint features, uses the image features of the video to further assist in judging whether a persona may mix multiple people, decomposes such personas, and then merges personas that are likely the same person, improving the accuracy of persona recognition.
Specifically, in the present specification, a character recognition method is provided, and the present specification relates to a character recognition apparatus, a character recognition system, a computing device, and a computer-readable storage medium, one by one, which are described in detail in the following embodiments.
Referring to FIG. 1, FIG. 1 shows an architecture diagram of a character recognition system provided in an embodiment of the present specification; the system may include a cloud-side device and an end-side device.
When multiple end-side devices exist, communication connections can be established among them through the cloud-side device. In a character recognition scenario, the cloud-side device provides the character recognition service for the end-side devices, each of which can act as a sending end or a receiving end and communicate through the cloud-side device.
Specifically, the end side device is configured to send a role identification request to the cloud side device, where the role identification request may carry data of the video;
The cloud side equipment is used for receiving a role identification request, obtaining a video according to the role identification request, obtaining image data and audio data in the video, classifying the roles by utilizing voiceprint features of the audio data, identifying a first role set corresponding to the audio data, judging whether any role in the first role set corresponds to multiple persons by utilizing the image features of the image data, if so, decomposing any role into multiple roles, obtaining a second role set, judging whether the multiple roles correspond to the same person according to the voiceprint features and/or the image features corresponding to the multiple roles in the second role set, and if so, merging the multiple roles into the same role, and obtaining a third role set.
And the end side device is used for sending the role identification request to the cloud side device and receiving a third role set returned by the cloud side device.
The end side device and the cloud side device can be connected through a network. The network provides a medium for a communication link between the end-side device and the cloud-side device. The network may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others. The data transmitted by the end-side device may need to be encoded, transcoded, compressed, etc. before being distributed to the cloud-side device.
The end-side device may include a browser, an APP (Application), a web application such as an H5 (HTML5, Hypertext Markup Language version 5) application, a light application (also called an applet, a lightweight application), a cloud application, or the like. The application on the end-side device may be developed on the basis of a software development kit (SDK, Software Development Kit) of the corresponding service provided by the server side, for example an SDK based on real-time communication (RTC, Real-Time Communication). The end-side device may be an electronic device, or may run as an APP on such a device. The electronic device may, for example, have a display screen and support information browsing, and may be a personal mobile terminal such as a mobile phone, tablet computer or personal computer. Various other types of applications are also commonly deployed on electronic devices, such as human-machine dialogue applications, model training applications, text processing applications, web browsers, shopping applications, search applications, instant messaging tools, mailbox clients and social platform software.
The cloud-side device may include servers providing various services, such as servers providing communication services for multiple clients, servers performing background training to support the models used on clients, and servers processing data sent by clients. The cloud-side device may be implemented as a distributed server cluster consisting of multiple servers, or as a single server. The server may also be a server of a distributed system, or a server combined with a blockchain. The server may also be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN, Content Delivery Network), and big data and artificial intelligence platforms, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology.
For example, an application scenario of the method provided in the embodiment of the present disclosure may be a live scenario, and the end-side device may include user devices of a host and a viewer. For another example, an application scenario of the method provided in the embodiments of the present disclosure may be an online video conference scenario, and an end-side device of the video sending end may be a participating user device. It can be understood that in an online video conference scenario, the video transmitting end may also be a video receiving end, and the video receiving end may also be a video transmitting end, which is not described in detail in this embodiment.
It should be noted that, the role recognition method provided in the embodiments of the present disclosure may be executed by a cloud-side device, and in other embodiments of the present disclosure, an end-side device may also have a similar function to the cloud-side device, so as to execute the role recognition method provided in the embodiments of the present disclosure; in other embodiments, the role recognition method provided in the embodiments of the present disclosure may be performed by the cloud-side device and the end-side device together.
Referring to fig. 2, fig. 2 shows a flowchart of a role recognition method according to an embodiment of the present disclosure, which specifically includes the following steps.
Step 202: image data and audio data in the video are acquired.
The video may be real-time video (video data may include image data and audio data) in a live broadcast, an online conference, or the like, or may be recorded video.
The way the video is obtained is not limited; it can be set according to the requirements of the application scenario, which this specification does not limit. For example, in live broadcast or online conference scenarios, the video of the current conference can be acquired in real time during the conference; based on the method provided in the embodiments of the present specification, character recognition can then be performed during the live broadcast or online conference, and the recognized personas and their corresponding speech content displayed. For another example, in non-real-time scenarios such as meeting playback or recorded broadcasts, a previously recorded video can be acquired; the recognized personas and their corresponding speech content can then be displayed while the user browses a historical conference or reviews the video.
Step 204: and classifying the persona by utilizing the voiceprint characteristics of the audio data, and identifying a first persona set corresponding to the audio data.
The voiceprint features are features of voiceprints contained in the audio data that can characterize and identify a speaker.
For example, classifying voiceprint features to identify personas may include two parts: feature extraction and pattern matching. In the feature extraction stage, acoustic or linguistic features of the speaker's voiceprint with strong separability and high stability can be extracted and selected. Voiceprint features are typically unique to each speaker and may include acoustic features related to the anatomy of the human vocal mechanism (e.g., spectrum, cepstrum, formants, pitch, reflection coefficients), as well as semantics, pronunciation habits and other speaker-dependent characteristics. In the pattern matching stage, the extracted voiceprint features can be identified by clustering, or by matching against known voiceprint features in a database. For example, similarity measures such as Euclidean distance and cosine similarity can be used to judge how similar voiceprint features are, thereby achieving classification.
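For illustration only (the patent does not specify an implementation), the following Python sketch shows the pattern-matching stage under the assumption that voiceprint embeddings have already been extracted by some speaker encoder; the function names and the 0.75 threshold are hypothetical.

```python
# Minimal sketch of voiceprint pattern matching via cosine similarity.
# Assumes embeddings are fixed-length numpy vectors from a speaker encoder.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_speaker(embedding: np.ndarray, known_speakers: dict, threshold: float = 0.75):
    """Return the best-matching known speaker ID, or None if nothing exceeds
    the threshold (in which case a new persona would be created)."""
    best_id, best_sim = None, threshold
    for speaker_id, reference in known_speakers.items():
        sim = cosine_similarity(embedding, reference)
        if sim > best_sim:
            best_id, best_sim = speaker_id, sim
    return best_id
```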
Step 206: and judging whether any person character in the first person character set corresponds to a plurality of persons or not by utilizing the image characteristics of the image data.
For example: for any persona in the first persona set, obtain the face features corresponding to that persona's voiceprint features on the time axis, and judge, based on those face features, whether the persona corresponds to multiple people. The face features may be obtained by analyzing the image features of the image data.
The image data and the corresponding audio data are, in fact, captured in the same environment at the same time, so the image data shares a time axis with the audio data. The time axis refers to the time sequence from the start to the end of acquisition. It will be appreciated that when the voiceprint feature of a persona corresponds to a face feature on the time axis, the voiceprint feature and the face feature correspond to the same person. Voiceprint features and face features from the same environment and the same time can therefore be matched: if the face features corresponding to the voiceprint features of an identified persona indicate multiple people, that persona may correspond to multiple people and needs to be decomposed into multiple personas to correct the recognition result.
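A minimal sketch of this time-axis matching, assuming speech segments and face tracks are given as time intervals; all names are illustrative, not from the patent.

```python
# Match a persona's speech segments against face tracks on the shared timeline.
def overlapping_faces(speech_segments, face_tracks):
    """speech_segments: list of (t0, t1) for one persona's speech.
    face_tracks: list of (face_id, t0, t1) for detected speaking faces.
    Returns the set of face IDs whose tracks overlap any speech segment;
    more than one face ID suggests the persona mixes multiple people."""
    matched = set()
    for s0, s1 in speech_segments:
        for face_id, f0, f1 in face_tracks:
            if max(s0, f0) < min(s1, f1):  # non-empty time overlap
                matched.add(face_id)
    return matched
```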
The specific implementation for judging from face features whether multiple people are involved is not limited. For example, the judgment may be made by clustering or matching the face features.
Step 208: if so, decomposing any character into a plurality of character characters to obtain a second character set.
Since classification by voiceprint features may involve some misjudgment, and one persona may actually mix multiple people, the method provided in the embodiments of the present disclosure further separates personas based on step 208.
The specific manner of decomposing a persona into multiple personas is not limited; the decomposition may use the persona's voiceprint features and/or face features.
For example, the decomposition may use both the persona's voiceprint features and face features: the face features determine the number of personas to decompose into, and the voiceprint features are re-clustered into that number of personas, thereby obtaining the second persona set.
For another example, the face features of the persona may be used directly for decomposition. For example: if s people appear in the face feature set of a persona, the persona is directly decomposed into s personas, and s new speaker IDs are assigned.
For another example, the voiceprint features of the persona may be used directly for decomposition. For example: the segmentation granularity of the persona's active speech signal may be further narrowed to re-obtain audio segments of that persona, voiceprint features re-extracted from these segments, and the features re-clustered to separate multiple speakers.
The second persona set includes the decomposed plurality of personas and the personas determined to correspond to only one person based on the image features.
Step 210: for a plurality of personas in the second persona set, determining whether the plurality of personas correspond to the same person based on voiceprint features and/or image features corresponding to the plurality of personas.
Since there may be some misjudgment when decomposing the personas, the decomposed personas may actually be one person, and thus the method provided in the embodiments of the present disclosure further determines whether the personas need to be merged based on step 210.
For example: whether the plurality of personas correspond to the same person may be determined based on similarity between voiceprint features and/or image features corresponding to the plurality of personas. For example: the similarity between the voiceprint features and/or the image features of the two personas can be calculated, and when the similarity is higher than a preset similarity threshold, the fact that the two personas correspond to the same person can be determined.
For another example, voiceprint features and/or image features corresponding to the plurality of personas may be input into a trained machine learning model for identifying the same person, and the machine learning model may be used to determine whether the same person corresponds.
Step 212: and if so, combining the multiple personas into the same persona to obtain a third persona set.
The second persona set includes the decomposed personas and the personas determined, based on the image features, to correspond to only one person. Assuming the updated persona set is [spk_a, spk_b, ..., spk_n], if spk_a and spk_b are judged to be the same person, they may be combined into one persona.
After the image data and audio data in a video are acquired, personas are first classified using the voiceprint features of the audio data, and a first persona set corresponding to the audio data is identified. The image features of the image data are then used to judge whether any persona in the first persona set corresponds to multiple people; if so, that persona is decomposed into multiple personas to obtain a second persona set. Whether the multiple personas correspond to the same person is judged based on the voiceprint features and/or image features corresponding to those personas; if so, they are merged into the same persona to obtain a third persona set. The method thus identifies personas by analyzing voiceprint features, uses the video's image features to further assist in judging whether a persona may mix multiple people, decomposes such personas, and merges personas that are likely the same person, improving the accuracy of persona recognition.
In one or more embodiments of the present disclosure, face features are used to judge whether a persona corresponds to multiple people, and the persona is decomposed using both face features and voiceprint features, so that the decomposition is more accurate. Specifically, the judging, using the image features of the image data, whether any persona in the first persona set corresponds to multiple people includes:
judging whether the face features of any persona in the first persona set correspond to multiple people.
Correspondingly, the decomposing the persona into multiple personas to obtain a second persona set includes:
determining, based on the face features of the persona, the number of personas to decompose into;
and re-classifying using the voiceprint features of the persona, decomposing it into that number of personas, and obtaining the second persona set.
For example, the decomposition process may include: traverse all speakers in the persona set; when traversing to the k-th speaker ID (identification), collect all m voiceprint features of that speaker, and at the same time collect the set of face features that overlap with that speaker on the time axis. Suppose s people appear in the face feature set. If s = 1, the speaker is correctly identified and does not mix multiple people. If s > 1, the speaker ID confuses multiple people, so the m voiceprint features are re-clustered into s classes, and each class is assigned a new speaker ID. This step therefore unpacks and decomposes the speakers misjudged in the voiceprint classification results, ensuring that each ID contains only one speaker, and yields an updated persona set [spk_a, spk_b, ..., spk_n].
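The following sketch mirrors the decomposition loop just described, under the assumption that each speaker ID already carries its voiceprint features and the set of overlapping face IDs; KMeans stands in for the re-clustering method, which the text leaves unspecified.

```python
# Decompose speaker IDs that mix several people, as in the paragraph above.
from sklearn.cluster import KMeans

def decompose_personas(persona_set):
    """persona_set: {speaker_id: {"voiceprints": array of shape (m, d),
                                  "face_ids": set of overlapping face IDs}}"""
    updated = {}
    for speaker_id, data in persona_set.items():
        s = len(data["face_ids"])  # distinct faces seen speaking for this ID
        if s <= 1:
            updated[speaker_id] = data  # correctly identified; keep as-is
            continue
        # s > 1: the ID confuses several people; re-cluster into s classes
        labels = KMeans(n_clusters=s, n_init=10).fit_predict(data["voiceprints"])
        for k in range(s):
            new_id = f"{speaker_id}-{k}"  # assign a fresh speaker ID per class
            updated[new_id] = {"voiceprints": data["voiceprints"][labels == k],
                               "face_ids": set()}  # faces re-associated downstream
    return updated
```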
In one or more embodiments of the present disclosure, whether multiple personas correspond to the same person is determined based on similarity calculation, so that the personas can be merged more accurately. Specifically, the determining, for the multiple personas in the second persona set, whether they correspond to the same person based on the voiceprint features and/or image features corresponding to those personas includes:
Calculating the similarity of voiceprint features and/or image features corresponding to a plurality of personas in the second persona set;
and comparing the similarity of the voiceprint features and/or the image features corresponding to the multiple personas with a preset similarity threshold to determine whether the multiple personas correspond to the same person.
The similarity calculation may use various similarity measures, such as Euclidean distance or cosine similarity, to determine the similarity between voiceprint features, which this specification does not limit. For example: the similarity of the voiceprint features and/or face features of any two personas can be calculated and compared with a preset similarity threshold; when the threshold is exceeded, the two personas are determined to correspond to the same person and need to be merged. Assuming the updated persona set is [spk_a, spk_b, ..., spk_n], if spk_a and spk_b are judged to be the same person, they may be merged into one persona.
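As a hedged illustration of this merge step, the sketch below compares persona centroid embeddings pairwise against a preset cosine-similarity threshold and records matches with a small union-find so merges stay transitive; the 0.8 threshold is an assumption.

```python
# Merge personas whose centroid voiceprints exceed a similarity threshold.
import numpy as np

def merge_personas(centroids, threshold: float = 0.8):
    """centroids: {speaker_id: mean voiceprint vector}.
    Returns {speaker_id: merged root ID}; IDs sharing a root are one persona."""
    ids = list(centroids)
    parent = {i: i for i in ids}

    def find(x):  # union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            va, vb = centroids[a], centroids[b]
            sim = float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
            if sim > threshold:  # highly suspected to be the same person
                parent[find(b)] = find(a)
    return {i: find(i) for i in ids}
```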
In the above embodiment, the similarity of the image feature and the voiceprint feature is used to determine whether the decomposed personas are the same person, and if so, the personas are combined, so that the accuracy of persona recognition is improved by decomposing-combining the personas.
The method provided in the embodiments of the present disclosure does not limit the specific manner of identifying personas based on voiceprint features. For example, in one or more embodiments of the present disclosure, to make the recognition result more accurate, the audio data is segmented so that each audio segment contains only one person as far as possible. Specifically, the classifying personas using the voiceprint features of the audio data and identifying the first persona set corresponding to the audio data includes:
Segmenting the audio data to obtain audio segments;
Extracting voiceprint features of the audio segment;
And clustering the voiceprint features to identify a first person character set corresponding to the audio data.
For example, as shown in FIG. 3, the effective audio signal of the audio data may first be segmented using a window length of 1.5 s and a hop of 0.75 s. The 1.5 s window with a 0.75 s hop is for illustration only; in practical applications, the segment length can be set as required, which this specification does not limit. It will be appreciated that the smaller the segment length, the closer each segment comes to containing only one person. After segmentation, the voiceprint features of each segment are extracted to represent the speaker of that segment. The voiceprint features are then clustered to identify different speaker IDs.
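A sketch of this segmentation, assuming a mono waveform sampled at a known rate; the 1.5 s / 0.75 s values follow the example above and remain configurable.

```python
# Cut the active speech signal into overlapping fixed-length windows.
import numpy as np

def segment_audio(signal: np.ndarray, sample_rate: int,
                  window_s: float = 1.5, hop_s: float = 0.75):
    """Returns a list of windows of `window_s` seconds taken every `hop_s`
    seconds; a signal shorter than one window yields a single partial segment."""
    win, hop = int(window_s * sample_rate), int(hop_s * sample_rate)
    segments = []
    for start in range(0, max(len(signal) - win, 0) + 1, hop):
        segments.append(signal[start:start + win])
    return segments
```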
To make recognition more accurate and efficient, one or more embodiments of the present disclosure obtain face features by performing face detection on framed images, and decompose multi-person personas by traversing the personas in combination with the face features. Specifically, the judging whether the face features of any persona in the first persona set correspond to multiple people includes:
framing the image data to obtain framed images;
performing face detection on the framed images;
when a face is detected, judging whether the face in the framed image is in a speaking state, and if so, extracting face features from the framed image to obtain a face feature set;
traversing the personas in the first persona set, and, for any traversed persona, obtaining from the face feature set the face features corresponding to that persona's audio features on the time axis;
and judging, from the face features corresponding to the persona's audio features on the time axis, whether the persona corresponds to multiple people.
A framed image is the image corresponding to one frame of the image data. Face detection on framed images can be implemented by any general face detection method, which this specification does not limit. For example: features of the framed image, such as histogram features, color features, template features and structural features, can be used to achieve face detection. Once a face is detected, its face features can be further extracted, ensuring the accuracy of face recognition.
It will be appreciated that a framed image may contain a face that is speaking or a face that is not speaking. For a face that is not speaking, extracting face features to correct the persona would produce erroneous results. Therefore, to make recognition more accurate, this embodiment judges whether the face in the framed image is in a speaking state before extracting face features, and extracts them only if it is, which effectively improves the accuracy of persona decomposition.
The specific way of judging whether a face is in a speaking state is not limited. For example: when a face is detected, the positions of key points such as the eyes, nose and mouth can be extracted, and whether the face is speaking can be identified based on these positions together with preset recognition rules or a recognition model. For example, a machine learning model can be trained on the key-point positions (eyes, nose, mouth, etc.) of face image samples of speakers to obtain a recognition model for judging whether a person is speaking. When a face needs to be judged, the key-point positions of the face in the framed image are input into the recognition model to obtain a speaking or not-speaking result.
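Since the text leaves the speaking-state model unspecified, the sketch below substitutes a simple mouth-opening heuristic over hypothetical landmark names; a production system would use the trained recognition model described above.

```python
# Heuristic stand-in for speaking-state detection from facial key points.
def mouth_aspect_ratio(landmarks):
    """landmarks: dict with 'mouth_top', 'mouth_bottom', 'mouth_left' and
    'mouth_right' as (x, y) tuples (hypothetical key-point names)."""
    vertical = abs(landmarks["mouth_bottom"][1] - landmarks["mouth_top"][1])
    horizontal = abs(landmarks["mouth_right"][0] - landmarks["mouth_left"][0])
    return vertical / max(horizontal, 1e-6)

def is_speaking(landmark_frames, open_ratio: float = 0.35, min_active: float = 0.3):
    """Mark a face as speaking if its mouth is open in enough of the frames;
    both thresholds are illustrative assumptions."""
    active = sum(mouth_aspect_ratio(lm) > open_ratio for lm in landmark_frames)
    return active / max(len(landmark_frames), 1) >= min_active
```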
In this embodiment, if it is determined that the face in the frame image is not in the speaking state, the frame image may be discarded without recognition of the face feature, so as to avoid interference.
In some cases, the face in a framed image may contain considerable interference, and its face features cannot be extracted accurately. Therefore, to further ensure the accuracy of the recognition result, in one or more embodiments of the present disclosure, before extracting face features from the framed image, the method further includes:
judging whether the face image quality of the framing image reaches a preset image quality condition or not;
if so, the step of extracting the face features from the frame images is entered.
Correspondingly, if the condition is not met, the framed image can be discarded and no face features extracted from it, avoiding interference.
The preset image quality condition may be set according to various quality problems possibly existing in the face image in the actual application scene, which is not limited in this specification.
By judging whether the quality of the face image reaches the preset image quality condition or not, the face features are extracted only when the quality of the face image reaches the preset image quality condition, so that the interference of the framing images with poor quality can be effectively avoided, and the recognition accuracy is improved.
For example, the determining whether the face image quality of the frame image reaches a preset image quality condition includes:
judging whether the face in the framed image is unoccluded, and if it is occluded, determining that the face image quality does not reach the preset image quality condition;
and/or
judging whether the face in the framed image is frontal, and if it is not frontal, determining that the face image quality does not reach the preset image quality condition.
Through face image quality detection, this embodiment ensures that the extracted face features can be trusted, which effectively improves the accuracy of the recognition result.
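A hedged sketch of such a preset image-quality condition, keeping a face only if it is unoccluded and roughly frontal; the occlusion score, yaw estimate and thresholds are assumptions about upstream detectors, not values from the patent.

```python
# Keep a face image only when it passes the quality condition.
def passes_quality(face):
    """face: dict with 'occlusion_score' in [0, 1] (higher = more occluded)
    and 'yaw_degrees' (0 = frontal), both assumed from upstream detectors."""
    unoccluded = face["occlusion_score"] < 0.3
    frontal = abs(face["yaw_degrees"]) < 30.0
    return unoccluded and frontal
```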
For example: FIG. 4 shows a process for correcting a persona set identified from audio data by means of face detection. Recognizing personas from audio data alone depends heavily on audio quality; high background noise or far-field speech reduces recognition accuracy. The method provided in the embodiments of the present specification therefore uses video image information to assist recognition. As shown in FIG. 4, character recognition is performed based on the audio signal (which can be understood as the audio data) to obtain a first persona set, and character correction is then performed based on the image signal (which can be understood as the image data). Face images of poor quality are discarded, and a misrecognized persona that mixes multiple people can be decomposed to obtain a second persona set, correcting the recognition result. In addition, for the second persona set obtained after decomposition, personas highly suspected of being the same person can be merged based on the similarity of their voiceprint features and/or face features to obtain a third persona set, which effectively improves the recognition accuracy of personas.
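Tying the stages of FIG. 4 together, the sketch below reuses the helper functions sketched earlier (segment_audio, decompose_personas, merge_personas); it is an illustrative composition under those same assumptions, not the patent's reference implementation.

```python
# Illustrative end-to-end composition of the decompose-then-merge correction.
# `extract_voiceprint` and `cluster` are assumed callables: `cluster` must
# return {speaker_id: {"voiceprints": array, "face_ids": set}} with the
# overlapping face IDs already attached (e.g. via overlapping_faces above).
def recognize_personas(signal, sample_rate, extract_voiceprint, cluster):
    segments = segment_audio(signal, sample_rate)     # 1.5 s / 0.75 s windows
    feats = [extract_voiceprint(seg) for seg in segments]
    first_set = cluster(feats)                        # first persona set
    second_set = decompose_personas(first_set)        # split mixed-up IDs
    centroids = {sid: d["voiceprints"].mean(axis=0)   # one embedding per ID
                 for sid, d in second_set.items()}
    return merge_personas(centroids)                  # third persona set (ID -> root)
```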
It should be noted that, in the method provided in the embodiment of the present disclosure, a specific manner of classifying the personas based on the voiceprint features is not limited, and may be implemented by clustering or other arbitrary manners; the specific modes of face detection, face feature extraction and speaker face detection are not limited, and can be realized in any mode such as rule judgment, neural network model recognition and the like; the specific manner of voiceprint feature extraction is not limited, as may be achieved by converting voiceprints into mathematical feature vectors.
In one or more embodiments of the present disclosure, the identified persona set is further provided to a user, thereby improving an implementation of a user interaction experience. Specifically, before the capturing the image data and the audio data in the video, the method further includes:
Receiving a character recognition request sent by a user, wherein the character recognition request carries the video;
after the third persona set is obtained, the method further includes:
and sending the third persona set to the user.
Furthermore, in one or more embodiments of the present disclosure, the user may annotate the identity of a persona, and the system, combining the information provided by the user, shares the annotation with the relevant users, further improving the accuracy of persona recognition. Specifically, after the third persona set is sent to the user, the method further includes:
receiving identity information sent by the user aiming at any one or more roles in the third role set;
Establishing a corresponding relation between any one or more personas and the identity information according to the identity information;
And sending the corresponding relation between any one or more personas and the identity information to any one or more users corresponding to the video.
Corresponding to the method embodiment, the present disclosure further provides an embodiment of a role identification method in a live application scenario. Fig. 5 shows a flowchart of a character recognition method according to another embodiment of the present disclosure. As shown in fig. 5, the method includes:
step 502: and receiving the live video, and acquiring image data and audio data in the live video.
Step 504: and classifying the persona by utilizing the voiceprint characteristics of the audio data, and identifying a first persona set corresponding to the audio data.
Step 506: and judging whether any person character in the first person character set corresponds to a plurality of persons or not by utilizing the image characteristics of the image data.
Step 508: if so, decomposing any character into a plurality of character characters to obtain a second character set.
Step 510: for a plurality of personas in the second persona set, determining whether the plurality of personas correspond to the same person based on voiceprint features and/or image features corresponding to the plurality of personas.
Step 512: and if so, combining the multiple personas into the same persona to obtain a third persona set.
According to the method, personas are identified by analyzing the voiceprint features of the audio data in the live video, the image features of the live video are used to further assist in judging whether any persona may mix multiple people, such personas are decomposed, and personas that are likely the same person are merged, improving the accuracy of character recognition in live video.
In one or more embodiments of the present disclosure, in order to enable a user in a live application scene to timely acquire information of a persona in a live video, the method further includes:
And sending the character information corresponding to the third character set to any one or more live users corresponding to the live video.
Furthermore, in one or more embodiments of the present disclosure, the user may further identify the identity of the person, thereby improving the user interaction experience and improving the accuracy of identifying the person. Specifically, the method further comprises:
receiving identity information sent by the live user aiming at any one or more personas in the third persona set;
Establishing a corresponding relation between any one or more personas and the identity information according to the identity information;
And sending the corresponding relation between any one or more personas and the identity information to any one or more live broadcast users corresponding to the live broadcast video.
The live application scenario is further illustrated below in conjunction with FIG. 6. As shown in FIG. 6, each user device participating in the live video conference is both a video sending end and a video receiving end. A user device acting as the sending end sends real-time video data to the cloud-side device, and the cloud-side device forwards the video data in real time to the user devices acting as receiving ends. In addition, a user device may send a video request to the cloud-side device, and this video request is simultaneously treated as a character recognition request. The cloud-side device then classifies personas using the voiceprint features of the audio data in the video and identifies a first persona set. For any persona in that set, it uses the image data to obtain the face features corresponding to that persona's voiceprint features on the time axis; if those face features correspond to multiple people, it re-classifies using that persona's voiceprint features, identifies multiple personas, and obtains a second persona set. It then computes the pairwise voiceprint and/or face feature similarity of the personas in the second persona set, decides based on the similarity whether any two personas should be merged, and after merging obtains a third persona set. Persona information of the third persona set is delivered to the user devices at the sending and receiving ends, which display the personas in the conference and identify, based on the personas, who is currently speaking.
As shown in fig. 6, in the interface where the user device presents the personas, a persona list may be displayed in which the identity information of personas of interest to the user is marked, and the persona currently speaking carries a speaking-state tag. The interface may further provide an identity input box, a confirm control and a cancel control, so that the user can label the identity of a persona. In practical applications, the user may operate these controls by clicking, double-clicking, touch, mouse hovering, sliding, long-pressing, voice control, shaking or any other manner, selected according to the actual situation; the embodiments of the present disclosure place no limitation on this.
It should be noted that the technical solution of the present disclosure is applicable not only to communication scenarios implemented based on Real-Time Communication (RTC) technology, but also to any other application scenario requiring character recognition. RTC is a communication technology for sending and receiving text, audio, video and the like in real time, suitable for live streaming, video on demand, video conferences, online classrooms, online chat rooms, in-game interaction and other scenarios, and enables real-time transport of pure audio data, image data and the like. In particular, the technical solution of the present application can be applied to RTC-based communication scenarios such as live streaming, video on demand, video conferences, online classrooms, online chat rooms and in-game interaction.
Corresponding to the above method embodiments, the present disclosure further provides an embodiment of a role recognition device, and fig. 7 shows a schematic structural diagram of a role recognition device provided in one embodiment of the present disclosure. As shown in fig. 7, the apparatus includes:
the data acquisition module 702 is configured to acquire image data and audio data in a video.
The audio identifying module 704 is configured to classify the characters by utilizing the voiceprint features of the audio data, and identify the first character set corresponding to the audio data.
A decomposition determination module 706 configured to determine whether any of the first set of personas corresponds to a plurality of persons using image features of the image data.
The decomposition execution module 708 is configured to decompose the any persona into a plurality of personas to obtain a second persona set if the decomposition judgment module judges yes.
The merging judgment module 710 is configured to judge, for a plurality of personas in the second persona set, whether the plurality of personas correspond to the same person based on voiceprint features and/or image features corresponding to the plurality of personas.
And a merging execution module 712 configured to merge the plurality of personas into the same persona to obtain a third persona set if the merging judgment module judges yes.
In one or more embodiments of the present disclosure, the decomposition determination module is configured to determine whether a face feature of any character in the first character set corresponds to a plurality of people. Accordingly, the decomposition execution module includes:
A decomposition number determination submodule configured to determine the number of personas to be decomposed based on the face features of any persona;
and the decomposition execution sub-module is configured to re-classify by utilizing the voiceprint features of the any persona, decomposing the persona into the determined number of personas to obtain a second persona set.
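A minimal sketch of this decomposition sub-module follows, assuming segment-level voiceprint embeddings are available as a matrix and that the face features have already yielded the number of distinct people; k-means is used purely as an illustrative re-classification method, not as the disclosed one.

```python
import numpy as np
from sklearn.cluster import KMeans


def split_persona(voiceprints: np.ndarray, num_faces: int) -> np.ndarray:
    """Re-classify one mixed persona's segment-level voiceprints into as many
    sub-personas as there are distinct faces on its timeline.

    voiceprints: array of shape (n_segments, dim); requires n_segments >= num_faces.
    Returns one sub-persona label per audio segment.
    """
    kmeans = KMeans(n_clusters=num_faces, n_init=10, random_state=0)
    return kmeans.fit_predict(voiceprints)
```

For example, if the face features on a persona's timeline cluster into two people, `split_persona(voiceprints, 2)` relabels that persona's audio segments into two sub-personas.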
In one or more embodiments of the present disclosure, the merging judgment module includes:
a similarity calculation submodule configured to calculate, for a plurality of characters in the second character set, similarity of voiceprint features and/or image features corresponding to the plurality of characters;
And the merging determination submodule is configured to determine whether the plurality of personas correspond to the same person by comparing the similarity of the voiceprint features and/or the image features corresponding to the plurality of personas with a preset similarity threshold.
In one or more embodiments of the present specification, the audio recognition module includes:
a segmentation sub-module configured to segment the audio data to obtain an audio segment;
a voiceprint feature extraction submodule configured to extract voiceprint features of the audio segment;
And the voiceprint recognition sub-module is configured to cluster the voiceprint features and recognize a first person character set corresponding to the audio data.
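The three sub-steps of the audio recognition module can be sketched as follows. This is a hedged illustration, not the disclosed pipeline: fixed-length segmentation stands in for voice-activity-based cutting, a normalized magnitude spectrum stands in for a trained speaker-embedding network, and the clustering choice and distance threshold are assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def segment_audio(samples: np.ndarray, rate: int, seconds: float = 1.5) -> list:
    """Cut the track into fixed-length segments (a stand-in for VAD-based cuts)."""
    step = int(rate * seconds)
    return [samples[i:i + step] for i in range(0, len(samples) - step + 1, step)]


def extract_voiceprint(segment: np.ndarray) -> np.ndarray:
    """Placeholder voiceprint: a unit-norm magnitude spectrum; a real system
    would use a speaker-embedding model (e.g. x-vectors)."""
    spectrum = np.abs(np.fft.rfft(segment))
    return spectrum / (np.linalg.norm(spectrum) + 1e-12)


def first_persona_set(samples: np.ndarray, rate: int, distance_threshold: float = 0.6):
    """Cluster segment voiceprints without fixing the speaker count in advance;
    assumes the track is long enough to yield at least two segments."""
    segments = segment_audio(samples, rate)
    features = np.stack([extract_voiceprint(s) for s in segments])
    clusterer = AgglomerativeClustering(n_clusters=None,
                                        distance_threshold=distance_threshold)
    return clusterer.fit_predict(features)  # persona label per segment
```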
The decomposition judging module comprises:
The framing sub-module is configured to frame the image data to obtain a frame-divided image;
the face detection sub-module is configured to detect the face of the frame image;
the face feature extraction sub-module is configured to, where a face is detected, judge whether the face in the framed image is in a speaking state and, if so, extract face features from the framed image to obtain a face feature set;
The persona traversing sub-module is configured to traverse the personas in the first persona set and, for any traversed persona, acquire from the face feature set the face features corresponding on the time axis to the voiceprint features of that persona;
and the decomposition judging sub-module is configured to judge, according to the face features corresponding on the time axis to the voiceprint features of any persona, whether the face features of that persona correspond to multiple people.
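A minimal sketch of the traversal and the multi-person judgment, under the assumption that each persona carries timestamped speech intervals and that face detections from speaking frames are timestamped feature vectors; the clustering-based face count is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def faces_for_persona(voice_segments: list, face_detections: list) -> list:
    """Collect face features whose timestamps fall inside any speech interval
    attributed to the traversed persona.

    voice_segments:  list of (start, end) times of one persona's speech
    face_detections: list of (timestamp, face_feature) from speaking frames
    """
    return [feature for t, feature in face_detections
            if any(start <= t <= end for start, end in voice_segments)]


def count_distinct_faces(face_features: list, distance_threshold: float = 0.5) -> int:
    """Cluster the time-aligned face features; more than one cluster suggests
    the voiceprint-derived persona mixes several people."""
    if len(face_features) < 2:
        return len(face_features)
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold
    ).fit_predict(np.stack(face_features))
    return int(labels.max()) + 1
```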
In one or more embodiments of the present disclosure, the apparatus further includes:
And the speaking detection module is configured to judge, before face features are extracted from the framed image, whether the face in the framed image is in a speaking state and, if so, trigger the face feature extraction sub-module to enter the step of extracting face features from the framed image.
In one or more embodiments of the present disclosure, the apparatus further includes:
the image quality detection module is configured to judge whether the face image quality of the frame images reaches a preset image quality condition before the face features are extracted from the frame images, and if so, the face feature extraction submodule is triggered to enter the step of extracting the face features from the frame images.
In one or more embodiments of the present disclosure, the image quality detection module is configured to judge whether the face image in the framed image is free of occlusion and, if it is not, determine that the face image quality does not reach the preset image quality condition; and/or to judge whether the face image in the framed image shows a frontal face and, if it does not, determine that the face image quality does not reach the preset image quality condition.
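A hedged sketch of such a quality gate follows, assuming an occlusion flag is supplied by an upstream detector and that frontal pose is approximated from three 2-D landmarks; the landmark-midpoint heuristic and the 0.25 tolerance are illustrative assumptions, not the disclosed conditions.

```python
import numpy as np


def is_frontal(left_eye: np.ndarray, right_eye: np.ndarray, nose: np.ndarray,
               max_asymmetry: float = 0.25) -> bool:
    """Crude frontal-pose test: on a frontal face the nose x-coordinate sits
    near the midpoint of the two eyes."""
    eye_span = abs(right_eye[0] - left_eye[0])
    midpoint = (right_eye[0] + left_eye[0]) / 2.0
    return eye_span > 0 and abs(nose[0] - midpoint) / eye_span <= max_asymmetry


def passes_quality_gate(landmarks: dict, occluded: bool) -> bool:
    """Only unoccluded, roughly frontal faces are passed on to feature
    extraction; other frames are skipped."""
    if occluded:  # occlusion flag assumed to come from an upstream detector
        return False
    return is_frontal(landmarks["left_eye"], landmarks["right_eye"],
                      landmarks["nose"])
```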
In one or more embodiments of the present disclosure, the apparatus further includes:
the request receiving module is configured to receive a character recognition request sent by a user, wherein the character recognition request carries the video;
and a role return module configured to send the third persona set to the user.
In one or more embodiments of the present disclosure, the apparatus further includes:
the identity receiving module is configured to receive identity information sent by the user for any one or more personas in the third persona set;
the relation establishing module is configured to establish, according to the identity information, a correspondence between the one or more personas and the identity information;
and the identity sending module is configured to send the correspondence between the one or more personas and the identity information to any one or more users corresponding to the video.
The above is an exemplary scheme of the character recognition apparatus of this embodiment. It should be noted that the technical solution of the character recognition apparatus and the technical solution of the character recognition method belong to the same concept; for details of the apparatus not described here, reference may be made to the description of the character recognition method above.
Correspondingly, the present disclosure also provides an embodiment of a role recognition system, and fig. 8 shows a schematic structural diagram of a role recognition system provided in an embodiment of the present disclosure. As shown in fig. 8, the system includes:
The cloud side device 802 is configured to: receive a role recognition request and obtain a video according to the role recognition request; obtain image data and audio data in the video; classify personas by using the voiceprint features of the audio data and identify a first persona set corresponding to the audio data; judge, by using the image features of the image data, whether any persona in the first persona set corresponds to multiple people and, if so, decompose that persona into multiple personas to obtain a second persona set; and judge, for multiple personas in the second persona set, whether the multiple personas correspond to the same person based on the voiceprint features and/or image features corresponding to the multiple personas and, if so, merge the multiple personas into the same persona to obtain a third persona set;
And the end side device 804 is configured to send the role identification request to the cloud side device, and receive a third persona set returned by the cloud side device.
The cloud-side device in the character recognition system first analyzes voiceprint features to recognize personas, then uses the image features on the time axis corresponding to those voiceprint features to help separate personas that may mix several people, and finally merges personas that are likely the same person, thereby improving the accuracy of persona recognition.
Fig. 9 illustrates a block diagram of a computing device 900 provided in accordance with one embodiment of the present specification. The components of computing device 900 include, but are not limited to, memory 910 and processor 920. Processor 920 is coupled to memory 910 via bus 930, and database 950 is configured to hold data.
Computing device 900 also includes an access device 940 that enables computing device 900 to communicate via one or more networks 960. Examples of such networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the internet. The access device 940 may include one or more of any type of wired or wireless network interface, such as a network interface card (NIC), an IEEE 802.11 wireless local area network (WLAN) interface, a worldwide interoperability for microwave access (WiMAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, or a near-field communication (NFC) interface.
In one embodiment of the present description, the above-described components of computing device 900 and other components not shown in FIG. 9 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 9 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 900 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or personal computer (PC, personal Computer). Computing device 900 may also be a mobile or stationary server.
The processor 920 is configured to execute computer-executable instructions which, when executed by the processor, implement the steps of the character recognition method described above.
The foregoing is a schematic illustration of the computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the character recognition method belong to the same concept; for details of the computing device not described here, reference may be made to the description of the character recognition method above.
An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the character recognition method described above.
The above is an exemplary version of the computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the character recognition method belong to the same concept; for details of the storage medium not described here, reference may be made to the description of the character recognition method above.
An embodiment of the present specification also provides a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the above character recognition method.
The above is an exemplary version of the computer program of this embodiment. It should be noted that the technical solution of the computer program and the technical solution of the character recognition method belong to the same concept; for details of the computer program not described here, reference may be made to the description of the character recognition method above.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, and so on. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so forth. It should be noted that the content of the computer-readable medium may be increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of action combinations, but those skilled in the art should understand that the embodiments are not limited by the order of the actions described, as some steps may be performed in another order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily all required by the embodiments of the specification.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are merely used to help clarify the present specification. Alternative embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. This specification is to be limited only by the claims and the full scope and equivalents thereof.
Claims (15)
1. A character recognition method, comprising:
acquiring image data and audio data in a video;
Classifying the persona by utilizing the voiceprint characteristics of the audio data, and identifying a first persona set corresponding to the audio data;
Judging whether any one person character in the first person character set corresponds to a plurality of persons or not by utilizing the image characteristics of the image data;
If yes, decomposing the any persona into a plurality of personas to obtain a second persona set;
Judging whether the multiple personas correspond to the same person or not according to the voiceprint features and/or the image features corresponding to the multiple personas in the second persona set;
and if so, combining the multiple personas into the same persona to obtain a third persona set.
2. The method of claim 1, the determining whether any person character in the first set of person characters corresponds to multiple persons using image features of the image data, comprising:
Judging whether the face features of any person character in the first person character set correspond to multiple persons or not;
the decomposing the character of any person into a plurality of characters to obtain a second character set includes:
Determining the number of the personas to be decomposed based on the face characteristics of any persona;
and re-classifying by utilizing the voiceprint features of the any persona, decomposing the persona into the determined number of personas to obtain a second persona set.
3. The method of claim 1, wherein the determining, for the plurality of personas in the second set of personas, whether the plurality of personas correspond to the same person based on voiceprint features and/or image features to which the plurality of personas correspond comprises:
Calculating the similarity of voiceprint features and/or image features corresponding to a plurality of personas in the second persona set;
and comparing the similarity of the voiceprint features and/or the image features corresponding to the multiple personas with a preset similarity threshold to determine whether the multiple personas correspond to the same person.
4. The method of claim 1, wherein the classifying the persona using the voiceprint feature of the audio data to identify the first set of personas corresponding to the audio data comprises:
Segmenting the audio data to obtain audio segments;
Extracting voiceprint features of the audio segment;
And clustering the voiceprint features to identify a first person character set corresponding to the audio data.
5. The method of claim 2, the determining whether the face feature of any persona in the first persona set corresponds to multiple people, comprising:
Framing the image data to obtain a framed image;
performing face detection on the frame-divided image;
Under the condition that a human face is detected, judging whether the human face in the framing image is in a speaking state, and if so, extracting human face features from the framing image to obtain a human face feature set;
Traversing the personas in the first persona set, and for any traversed persona, acquiring, from the face feature set, the face features corresponding on a time axis to the voiceprint features of the traversed persona;
and judging, according to the face features corresponding on the time axis to the voiceprint features of the traversed persona, whether the face features of that persona correspond to multiple people.
6. The method of claim 5, further comprising, prior to extracting face features from the framed image:
judging whether the face image quality of the framing image reaches a preset image quality condition or not;
if so, the step of extracting the face features from the frame images is entered.
7. The method of claim 6, wherein determining whether the face image quality of the framed image meets a preset image quality condition comprises:
judging whether the face image of the framed image is free of occlusion, and if not, determining that the quality of the face image does not reach the preset image quality condition;
And/or
Judging whether the face image of the framed image shows a frontal face, and if not, determining that the quality of the face image does not reach the preset image quality condition.
8. The method of claim 1, further comprising, prior to the acquiring the image data and the audio data in the video:
Receiving a character recognition request sent by a user, wherein the character recognition request carries the video;
after the third persona set is obtained, the method further includes:
and sending the third persona set to the user.
9. The method of claim 8, after the sending the third persona set to the user, further comprising:
receiving identity information sent by the user for any one or more personas in the third persona set;
Establishing a corresponding relation between any one or more personas and the identity information according to the identity information;
And sending the corresponding relation between any one or more personas and the identity information to any one or more users corresponding to the video.
10. A character recognition method, comprising:
Receiving live video, and acquiring image data and audio data in the live video;
Classifying the persona by utilizing the voiceprint characteristics of the audio data, and identifying a first persona set corresponding to the audio data;
Judging whether any one person character in the first person character set corresponds to a plurality of persons or not by utilizing the image characteristics of the image data;
If yes, decomposing the any persona into a plurality of personas to obtain a second persona set;
Judging whether the multiple personas correspond to the same person or not according to the voiceprint features and/or the image features corresponding to the multiple personas in the second persona set;
and if so, combining the multiple personas into the same persona to obtain a third persona set.
11. The method of claim 10, further comprising:
And sending the character information corresponding to the third character set to any one or more live users corresponding to the live video.
12. The method of claim 10, further comprising:
receiving identity information sent by the live user for any one or more personas in the third persona set;
Establishing a corresponding relation between any one or more personas and the identity information according to the identity information;
And sending the corresponding relation between any one or more personas and the identity information to any one or more live broadcast users corresponding to the live broadcast video.
13. A character recognition system, comprising:
The cloud side equipment is used for receiving a role identification request, acquiring a video according to the role identification request, acquiring image data and audio data in the video, classifying the roles by utilizing voiceprint features of the audio data, identifying a first role set corresponding to the audio data, judging whether any role in the first role set corresponds to multiple people by utilizing image features of the image data, if so, decomposing any role into multiple roles, acquiring a second role set, judging whether the multiple roles correspond to the same person based on the voiceprint features and/or the image features corresponding to the multiple roles for the multiple roles in the second role set, and if so, merging the multiple roles into the same role to acquire a third role set;
And the end side device is used for sending the role identification request to the cloud side device and receiving a third role set returned by the cloud side device.
14. A computing device, comprising:
a memory and a processor;
The memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions which, when executed by the processor, implement the steps of the character recognition method of any one of claims 1 to 12.
15. A computer readable storage medium storing computer executable instructions which when executed by a processor perform the steps of the character recognition method of any one of claims 1 to 12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410034130.0A CN117939238A (en) | 2024-01-09 | 2024-01-09 | Character recognition method, system, computing device and computer-readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410034130.0A CN117939238A (en) | 2024-01-09 | 2024-01-09 | Character recognition method, system, computing device and computer-readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117939238A true CN117939238A (en) | 2024-04-26 |
Family
ID=90750189
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410034130.0A Pending CN117939238A (en) | 2024-01-09 | 2024-01-09 | Character recognition method, system, computing device and computer-readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117939238A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118245226A (en) * | 2024-05-20 | 2024-06-25 | 国家超级计算天津中心 | Method and system for producing audio-visual data set |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110853646B (en) | Conference speaking role distinguishing method, device, equipment and readable storage medium | |
CN110519636B (en) | Voice information playing method and device, computer equipment and storage medium | |
CN110557659B (en) | Video recommendation method and device, server and storage medium | |
CN112468659B (en) | Quality evaluation method, device, equipment and storage medium applied to telephone customer service | |
CN113850162B (en) | Video auditing method and device and electronic equipment | |
CN112148922A (en) | Conference recording method, conference recording device, data processing device and readable storage medium | |
CN111274372A (en) | Method, electronic device, and computer-readable storage medium for human-computer interaction | |
CN117939238A (en) | Character recognition method, system, computing device and computer-readable storage medium | |
CN111312286A (en) | Age identification method, age identification device, age identification equipment and computer readable storage medium | |
CN117332072B (en) | Dialogue processing, voice abstract extraction and target dialogue model training method | |
CN114138960A (en) | User intention identification method, device, equipment and medium | |
CN113837594A (en) | Quality evaluation method, system, device and medium for customer service in multiple scenes | |
CN116186258A (en) | Text classification method, equipment and storage medium based on multi-mode knowledge graph | |
CN114492579A (en) | Emotion recognition method, camera device, emotion recognition device and storage device | |
CN114065720A (en) | Conference summary generation method and device, storage medium and electronic equipment | |
CN113837907A (en) | Man-machine interaction system and method for English teaching | |
CN117893652A (en) | Video generation method and parameter generation model training method | |
CN113327619A (en) | Conference recording method and system based on cloud-edge collaborative architecture | |
CN110910898A (en) | Voice information processing method and device | |
CN116402392A (en) | Quality detection method and device for online service data and server side | |
Nakamura et al. | LSTM‐based japanese speaker identification using an omnidirectional camera and voice information | |
CN115171021A (en) | GPU-based passenger real-time interactive processing method and system | |
CN115171673A (en) | Role portrait based communication auxiliary method and device and storage medium | |
CN116955601A (en) | Multi-mode emotion recognition method and device | |
JP2022073219A (en) | Program, device and method for estimating causal relation word from image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||