WO2023197979A1 - Data processing method and apparatus, and computer device and storage medium - Google Patents


Info

Publication number
WO2023197979A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
business
multimedia data
picture
frames
Application number
PCT/CN2023/087208
Other languages
French (fr)
Chinese (zh)
Inventor
冯鑫
Original Assignee
腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Publication of WO2023197979A1 publication Critical patent/WO2023197979A1/en


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L 9/32 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L 9/3263 Cryptographic mechanisms or cryptographic arrangements including means for verifying the identity or authority of a user, involving certificates, e.g. public key certificate [PKC] or attribute certificate [AC]; Public key infrastructure [PKI] arrangements
    • H04L 9/3265 Cryptographic mechanisms or cryptographic arrangements involving certificates, using certificate chains, trees or paths; Hierarchical trust model
    • H04L 9/3236 Cryptographic mechanisms or cryptographic arrangements including means for verifying the identity or authority of a user, using cryptographic hash functions
    • H04L 12/00 Data switching networks
    • H04L 12/66 Arrangements for connecting between networks having differing types of switching systems, e.g. gateways
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/04 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L 63/0428 Network architectures or network communication protocols for network security wherein the data content is protected, e.g. by encrypting or encapsulating the payload
    • H04L 63/08 Network architectures or network communication protocols for network security for authentication of entities
    • H04L 63/0823 Network architectures or network communication protocols for network security for authentication of entities using certificates
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/14 Session management
    • H04L 67/141 Setup of application sessions

Definitions

  • the present application relates to the field of computer technology, and in particular, to a data processing method, device, computer equipment and storage medium.
  • Embodiments of the present application provide a data processing method, device, computer equipment and storage medium, which can improve the accuracy, efficiency and applicability of audio character recognition.
  • embodiments of the present application provide a data processing method, including:
  • the picture feature information includes M business objects to which the character pictures in the video frames belong, and M is a positive integer;
  • identify, based on the picture feature information, the M audio clusters and the object role mapping table associated with the multimedia data, the business role corresponding to each of P audio clusters, where P is a positive integer less than or equal to M.
  • the object role mapping table includes business roles that have a mapping relationship with list business objects; there are P overlapping business objects between the list business objects and the M business objects.
  • a data processing device including:
  • the picture information acquisition module is used to identify picture feature information from the video frame of the multimedia data.
  • the picture feature information includes M business objects to which the character pictures in the video frame belong, and M is a positive integer;
  • the clustering processing module is used to locate and separate audio frames containing human voices from the original audio frames of the multimedia data to obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, where N is a positive integer and one audio cluster corresponds to one business object;
  • the audio role recognition module is used to identify the business role corresponding to each of the P audio clusters based on the picture feature information, M audio clusters and the object role mapping table associated with the multimedia data.
  • P is a positive integer less than or equal to M
  • the object role mapping table includes business roles that have a mapping relationship with the list business object; there are P overlapping business objects between the list business object and the M business objects.
  • embodiments of the present application provide a computer device, including: a processor and a memory;
  • the processor is connected to a memory, where the memory is used to store a computer program.
  • when the computer program is executed by the processor, the computer device executes the method provided by the embodiments of the present application.
  • embodiments of the present application provide a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • the computer program is adapted to be loaded and executed by a processor, so that a computer device having the processor executes the method provided by the embodiments of the present application.
  • embodiments of the present application provide a computer program product.
  • the computer program product includes a computer program.
  • the computer program is stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer program from the computer-readable storage medium.
  • the processor executes the computer program, so that the computer device executes the method in the embodiment of the present application.
  • a computer device with an audio character recognition function can associate sounds with characters by combining the picture feature information automatically recognized from video frames with the M adaptively clustered audio clusters, so that the business roles corresponding to the P audio clusters associated with the object role mapping table can be accurately identified.
  • This audio role identification method does not require manual annotation of the business role to which each line of audio belongs; it not only reduces the manpower and time consumed, but also mitigates recognition errors caused by similar timbres, thereby improving the accuracy and efficiency of recognition.
  • the embodiments of the present application can adopt the audio semantic feature clustering method in the audio character recognition process, making the entire audio character recognition system more versatile and applicable to different scenarios of business objects in different multimedia data, thereby effectively improving the recognition applicability.
  • Figure 1 is a schematic structural diagram of a network architecture provided by an embodiment of the present application.
  • Figure 2 is a schematic flow diagram of a system for audio character recognition provided by an embodiment of the present application
  • Figure 3 is a schematic flow chart of a data processing method provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram of an architecture for obtaining image feature information from video frames according to an embodiment of the present application
  • Figure 5 is a model architecture diagram of a key part detection model provided by an embodiment of the present application.
  • Figure 6 is a schematic architectural diagram of an audio semantic feature clustering provided by an embodiment of the present application.
  • Figure 7 is a model architecture diagram of a source separation model provided by an embodiment of the present application.
  • Figure 8 is a schematic diagram of the model architecture of an audio semantic feature extraction model provided by an embodiment of the present application.
  • Figure 9 is a schematic diagram of a scene for audio character recognition provided by an embodiment of the present application.
  • Figure 10 is another schematic flowchart of a data processing method provided by an embodiment of the present application.
  • Figure 11 is a schematic diagram of a scene for displaying multimedia segment data provided by an embodiment of the present application.
  • Figure 12 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • Figure 13 is another structural schematic diagram of a data processing device provided by an embodiment of the present application.
  • Figure 14 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • the embodiment of the present application provides a character recognition method based on audio semantic feature clustering, which can be applied to the field of artificial intelligence.
  • Artificial Intelligence is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive subject that covers a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and other technologies.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, autonomous driving, smart transportation and other major directions.
  • Computer Vision is a science that studies how to make machines "see". More specifically, it refers to machine vision that uses cameras and computers in place of human eyes to identify and measure targets, and further processes the resulting graphics so that the computer output becomes an image more suitable for human observation or for transmission to instruments for detection.
  • computer vision studies related theories and technologies trying to build artificial intelligence systems that can obtain information from images or multi-dimensional data.
  • Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving, smart transportation and other technologies, and also includes common biometric identification technologies such as face recognition and fingerprint recognition.
  • the key technologies of speech technology include automatic speech recognition technology, speech synthesis technology and voiceprint recognition technology. Allowing computers to hear, see, speak, and feel is the future development direction of human-computer interaction, among which voice has become one of the most promising human-computer interaction methods in the future.
  • Natural Language Processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field will involve natural language, that is, the language that people use every day, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, robot question answering, knowledge graph and other technologies.
  • Machine Learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent. Its applications cover all fields of artificial intelligence.
  • Machine learning and deep learning often include technologies such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.
  • Figure 1 is a schematic structural diagram of a network architecture provided by an embodiment of the present application.
  • the network architecture may include a server 10F and a terminal device cluster.
  • the terminal device cluster may include one or more terminal devices, and there will be no limit on the number of terminal devices here.
  • the terminal device cluster may specifically include a terminal device 100a, a terminal device 100b, a terminal device 100c, ..., and a terminal device 100n.
  • the terminal device 100a, the terminal device 100b, the terminal device 100c, ..., the terminal device 100n can each have a network connection with the above-mentioned server 10F, so that each terminal device can perform data interaction with the server 10F through the network connection.
  • the network connection here is not limited to a connection method. It can be connected directly or indirectly through wired communication, or directly or indirectly through wireless communication. It can also be connected through other methods. This application does not limit it here.
  • Each terminal device in the terminal device cluster may include: smart phones, tablets, laptops, desktop computers, smart speakers, smart watches, vehicle-mounted terminals, smart TVs and other smart terminals with audio role recognition functions.
  • Each terminal device in the terminal device cluster as shown in Figure 1 can be installed with a target application (for example, a client). When the client is running in each terminal device, it can perform data interaction with the server 10F shown in FIG. 1 .
  • the client may include a social client, a multimedia client (for example, a video client), an entertainment client (for example, a game client), an information flow client, an education client, a live broadcast client, and other clients.
  • the client can be an independent client or an embedded sub-client integrated in a certain client (for example, a social client, an education client, a multimedia client, etc.), which is not limited here.
  • the server 10F in the embodiment of the present application can be the server corresponding to the client.
  • the server 10F can be an independent physical server, a server cluster or a distributed system composed of multiple physical servers, or a cloud server that provides cloud computing services.
  • the embodiment of this application will not limit the number of servers.
  • one terminal device may be selected as the target terminal device among the multiple terminal devices shown in FIG. 1 .
  • the terminal device 100a shown in FIG. 1 can be used as a target terminal device, and the target terminal device can be integrated with a target application (for example, a client).
  • the target terminal device can realize data interaction with the server 10F through the business data platform corresponding to the client.
  • the client here may have a frame sequence (for example, frame animation sequence) loading and playback function, which is used to play video frames, audio frames and text (for example, lines) in the service playback display interface provided by the client.
  • The service playback display interface here refers to the interface displayed by the terminal device for playing the multimedia data.
  • the data type of the multimedia data may include film and television drama types, animation types, variety show types, etc. The data type of multimedia data will not be limited here.
  • a computer device with an audio character recognition function obtains multimedia data (for example, TV series A), it can identify picture feature information from video frames of the multimedia data.
  • the picture feature information here may include M business objects to which the character pictures in the video frames belong, where M is a positive integer.
  • the picture feature information may indicate which actor plays the character in a certain character picture including key parts of the character (for example, the character's face) in the TV series A.
  • the computer device can also extract corresponding audio semantic feature vectors from the N object audio frames, and then perform clustering processing on the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters.
  • N is a positive integer
  • the N object audio frames here are obtained by the computer device locating and separating the audio frames containing human voices from the original audio frames in the multimedia data.
  • the computer device performs object positioning and separation processing on the original audio frame in order to reduce the interference caused by the silent frames in the environmental audio track and the object audio track (for example, the vocal track) in subsequent clustering processing, so as to improve the clustering accuracy, thereby improving the accuracy of character voice recognition.
  • the computer device can identify the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data.
  • P can be a positive integer less than or equal to M.
  • the object role mapping table here (for example, the cast list of TV series A) may include business roles (roles) that have a mapping relationship with the list business objects (actors). There are P overlapping business objects between the list business objects in the object role mapping table and the M business objects recognized by the computer.
  • the object role mapping table may be an initial object role mapping table, provided by the business editor of the multimedia data (for example, the editing user of TV series A), that the computer device acquires, or it may be a table obtained after a target user accessing the client updates the initial object role mapping table provided by the business editor; this is not limited here.
  • the target user can add a mapping relationship between a certain business role in TV series A (for example, waiter in a restaurant) and a certain business object (for example, actor 1) in the initial object role mapping table, that is, the waiter in the restaurant is played by actor 1.
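  • As a minimal illustration of such a mapping table (the in-memory representation below is an assumption chosen only to mirror the example above, not part of this disclosure):

    # Hypothetical in-memory form of an object role mapping table (business role -> list business object).
    # The initial entries come from the business editor of the multimedia data.
    object_role_mapping = {
        "role 1": "object a",
        "role 2": "object a",   # two roles may be played by the same business object
        "role 3": "object b",
    }

    # A target user adds a mapping: the waiter in the restaurant is played by actor 1.
    object_role_mapping["waiter in the restaurant"] = "actor 1"

    # The "list business objects" referred to above are the distinct objects in the table.
    list_business_objects = set(object_role_mapping.values())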
  • the computer device in the embodiment of the present application can associate sounds with characters by combining the picture feature information (for example, face information) automatically recognized from the video frames with the M adaptively clustered audio clusters, so that the business roles corresponding to the P audio clusters associated with the object role mapping table can be accurately identified.
  • This audio role recognition method does not require manual annotation of the business role to which each line of audio belongs. It not only reduces the manpower and time consumed, but also mitigates recognition errors caused by similar timbres, thereby improving the accuracy and efficiency of recognition.
  • the embodiments of the present application can adopt the audio semantic feature clustering method in the audio character recognition process, making the entire audio character recognition system more versatile and applicable to different scenarios of business objects in different multimedia data, thereby effectively improving the recognition applicability.
  • Figure 2 is a schematic flow chart of a system for audio character recognition provided by an embodiment of the present application.
  • the computer device in the embodiment of the present application may be a computer device with audio character recognition function.
  • the computer device may be any terminal device in the terminal device cluster shown in FIG. 1 , for example, the terminal device 100a, or may be the server 10F shown in FIG. 1 .
  • Computer equipment will not be limited here.
  • the audio character recognition system may include three modules, which may specifically include a first module 201 (for example, a key picture recognition module), a second module 202 (for example, an audio semantic feature clustering module) and a third module 203 (for example, a character recognition module).
  • the multimedia data 20S in the embodiment of the present application may be multimedia data acquired by the computer device that requires audio character recognition.
  • the multimedia data 20S can be multimedia data corresponding to a certain episode in a certain TV series, multimedia data corresponding to a certain movie, or multimedia data corresponding to a certain variety show, which will not be discussed one by one here.
  • the multimedia data 20S is composed of video data including original video frames and audio data including original audio frames.
  • the computer device can obtain video frames from video data including raw video frames.
  • the video frame here may refer to a video frame sequence obtained by deleting the beginning and end of the original video frame in the video data.
  • the computer device can identify picture feature information from the video frames of the multimedia data 20S through the first module 201 shown in FIG. 2 .
  • the first module 201 may include a key part detection model 210w and a picture encoding model 220w.
  • the key part detection model 210w can be used to detect character pictures in video frames.
  • the character picture here refers to a picture including the key parts of the character (for example, the character's face).
  • the picture encoding model 220w can be used to encode each character cut picture in the character picture to obtain picture vector information corresponding to the character cut picture.
  • the computer device may also obtain the information vector database 200K shown in FIG. 2, for example, from its internal memory or from an external source.
  • the information vector database 200K can be an information index database established in advance by the computer device, using the same key picture recognition method, from a large amount of material data (for example, multimedia data of film and television drama types, variety show types, etc.), and specially used for key picture recognition.
  • the information vector database 200K can be used to store object key information vectors respectively corresponding to Y candidate business objects.
  • the object key information vectors here may also be determined through the picture encoding model 220w, and Y is a positive integer greater than or equal to M.
  • the information vector database 200K may also include object information of each candidate business object, for example, the object attribute type of the candidate business object (including singing and dancing singers, modern idol dramas, ancient palace dramas, fairy tale dramas, war-themed dramas, etc. ).
  • the computer device can obtain the picture feature information shown in Figure 2 based on the information vector database 200K and the picture information vector output by the picture coding model 220w.
  • the computer device can also obtain audio clustering results associated with the N object audio frames in the multimedia data 20S through the second module 202 shown in FIG. 2 .
  • the N object audio frames here are obtained by subjecting the original audio frames in the multimedia data to object positioning and separation processing, and N is a positive integer.
  • the second module 202 here may include a source separation model 230w and an audio semantic feature extraction model 240w.
  • the source separation model 230w here can be used to perform source separation on the original audio frame to obtain the object sound segment (or object audio track) (for example, the vocal segment (or vocal track)) and the environmental sound segment (or ambient sound track) (e.g., background sound segment (or backing track)).
  • the audio semantic feature extraction model 240w here can be used to perform frame-level semantic feature extraction on each of the N object audio frames in the object sound segment, so as to obtain the audio semantic feature vector corresponding to each object audio frame. Further, the computer device can perform clustering processing on the N audio semantic feature vectors to obtain M audio clusters, and these M audio clusters can then be used as the audio clustering result output by the second module 202. Here, one audio cluster can correspond to one business object.
  • the computer device can identify the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table 200B associated with the multimedia data 20S shown in FIG. 2.
  • P is a positive integer less than or equal to M.
  • the object role mapping table 200B here may include business roles that have a mapping relationship with the list business objects. There are P overlapping business objects between the list business object and the M business objects.
  • the computer device can perform audio character recognition on the output information of the first two modules through the third module 203.
  • Specifically, the computer device determines the playback time (i.e., the second playback time) of the video frames in which the P overlapping business objects appear in the multimedia data 20S, and the playback time (i.e., the first playback time) of the audio frames contained in each audio cluster. The computer device can then determine the audio clusters corresponding to the P business objects by comparing the two playback times, and further determine the business roles corresponding to each of the P audio clusters.
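  • A minimal sketch of this comparison is given below; the co-occurrence heuristic, timestamp granularity and all names are illustrative assumptions rather than the exact rule of this disclosure:

    # Hypothetical sketch: assign each audio cluster to the business object whose on-screen
    # (second) playback times overlap most with the cluster's audio (first) playback times,
    # then look the object up in the object role mapping table to get its business role.
    def assign_roles(object_video_times, cluster_audio_times, object_to_role):
        # object_video_times: {business_object: set of second-level timestamps where it appears on screen}
        # cluster_audio_times: {cluster_id: set of second-level timestamps of its audio frames}
        # object_to_role: {business_object: business_role}
        result = {}
        for cluster_id, audio_times in cluster_audio_times.items():
            best_object, best_overlap = None, 0
            for obj, video_times in object_video_times.items():
                overlap = len(audio_times & video_times)  # simple co-occurrence count
                if overlap > best_overlap:
                    best_object, best_overlap = obj, overlap
            if best_object is not None and best_object in object_to_role:
                result[cluster_id] = object_to_role[best_object]
        return result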
  • the computer device in the embodiment of the present application can, in the third module 203, combine the picture feature information (for example, face information) output by the first module 201 with the audio clustering result output by the second module 202 to associate audio with business roles, so that the business roles respectively corresponding to the P audio clusters associated with the object role mapping table 200B can be accurately identified.
  • This audio character recognition method not only improves the accuracy and efficiency of recognition, but also improves the applicability of recognition.
  • For the specific implementation in which the computer device with the audio character recognition function combines the picture feature information (for example, face information) automatically recognized from the video frames of the multimedia data with the M adaptively clustered audio clusters to identify the business roles corresponding to the P audio clusters associated with the object role mapping table, please refer to the embodiments corresponding to Figures 3 to 11 below.
  • Figure 3 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • the method can be performed by a computer device with audio character recognition capabilities.
  • the computer device may be a terminal device (for example, any terminal device in the terminal device cluster shown in Figure 1 above, for example, the terminal device 100a), or it may be a server (for example, the server 10F shown in Figure 1 above), No limitation is made here.
  • In the embodiment of the present application, the method is described by taking execution by a server with the audio character recognition function as an example; the method may include at least the following steps S101 to S103:
  • Step S101 Identify picture feature information from video frames of multimedia data.
  • the picture feature information here may include M business objects to which the character pictures in the video frames belong, where M is a positive integer.
  • the computer device can obtain video frames from the multimedia data, and can then perform picture cutting processing on the key parts of the characters in the video frames (that is, cut out the pictures containing the key parts of the characters in the video frames) to obtain the character pictures corresponding to the video frames.
  • the character pictures here may include X character cut pictures, where X is a positive integer greater than or equal to M.
  • the computer device can obtain the character cutting picture Ti among the X character cutting pictures, and encode the character cutting picture Ti to obtain the picture information vector Li corresponding to the character cutting picture Ti .
  • i is a positive integer less than or equal to X.
  • the computer device can determine, from the information vector database associated with the candidate business objects, the object key information vector matching the picture information vector Li, and use the candidate business object corresponding to the matched object key information vector as the business object corresponding to the character cutting picture Ti. Further, the computer device can determine the picture feature information corresponding to the video frames based on the business objects corresponding to the X character cutting pictures.
  • the picture recognition system used by the computer device to detect and recognize the key parts of characters in video frames may be composed of a detection sub-module and a recognition sub-module, or may be an integrated detection and recognition network that both detects and recognizes the key parts of characters; this is not limited here.
  • the computer device when determining the character picture corresponding to the video frame, can detect and locate the key parts of the character in the video frame, thereby determining the position information of the key parts of the character in the video frame. Further, the computer device can cut the key parts of the character in the video frame based on the position information, obtain X character cut pictures containing the key parts of the character, and use the X character cut pictures as the character pictures corresponding to the video frame. Then, the computer device can obtain the character cutting picture Ti among the X character cutting pictures, and encode the character cutting picture Ti to obtain the picture information vector Li corresponding to the character cutting picture Ti . Among them, i here is a positive integer less than or equal to X.
  • the computer device can obtain the information vector database associated with the candidate business object from its internal memory or externally to find the candidate business object that has a matching relationship with the picture information vector Li .
  • the information vector database here can be used to store object key information vectors corresponding to Y candidate business objects, where Y is a positive integer greater than or equal to M.
  • When the computer device obtains the information vector database, it can directly search the information vector database for a candidate business object that has a matching relationship with the picture information vector Li. Specifically, the computer device can determine the vector distance between the picture information vector Li and each of the Y object key information vectors, obtaining Y vector distances. Furthermore, the computer device can obtain, from the Y vector distances, the minimum vector distance that is less than or equal to a distance threshold, determine the candidate business object corresponding to the object key information vector associated with that minimum vector distance, and use the determined candidate business object as the business object corresponding to the character cutting picture Ti.
  • The distance threshold here is a value set in advance by the computer device to ensure that the found candidate business object indeed matches the character cutting picture; it can be adjusted dynamically according to the actual situation and is not limited here.
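  • As a minimal illustrative sketch of this nearest-neighbor match (the Euclidean distance metric, the threshold value and all names below are assumptions for illustration, not a required implementation):

    import numpy as np

    # Hypothetical sketch: match a picture information vector Li against the object key
    # information vectors in the information vector database using a distance threshold.
    def match_business_object(picture_vector, key_vectors, distance_threshold=1.0):
        # picture_vector: (D,) encoding of one character cutting picture
        # key_vectors: {candidate_business_object: (D,) object key information vector}
        best_object, best_dist = None, float("inf")
        for obj, vec in key_vectors.items():
            dist = np.linalg.norm(picture_vector - vec)  # vector distance
            if dist < best_dist:
                best_object, best_dist = obj, dist
        # Return None when even the minimum distance exceeds the threshold (no match).
        return best_object if best_dist <= distance_threshold else None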
  • the computer device can obtain the object role mapping table associated with the multimedia data, and use the object role mapping table and the information vector database to find candidate business objects that have a matching relationship with the picture information vector Li .
  • Table 1 is an object role mapping table associated with multimedia data provided by an embodiment of the present application, as shown in Table 1:

    Table 1
    Business role    List business object
    Role 1           Object a
    Role 2           Object a
    Role 3           Object b
    Role 4           Object c
    Role 5           Object d
  • the business roles in the object role mapping table shown in Table 1 may include H business roles, where H is a positive integer greater than or equal to M.
  • both role 1 and role 2 may have a mapping relationship with the same business object (for example, object a). That is, both role 1 and role 2 are played by object a.
  • Role 3 has a mapping relationship with object b
  • role 4 has a mapping relationship with object c
  • role 5 has a mapping relationship with object d.
  • the computer device can, according to the above Table 1, select from the information vector database the object key information vectors corresponding to the list business objects in the object role mapping table, for example, the object key information vector of object a, the object key information vector of object b, and the object key information vector of object c. Further, the computer device can respectively determine the vector distance between the picture information vector Li and each of the selected three object key information vectors. Furthermore, the computer device can obtain the minimum vector distance that is less than or equal to the distance threshold from the three vector distances, determine the candidate business object corresponding to the object key information vector associated with that minimum vector distance, and use the determined candidate business object as the business object corresponding to the character cutting picture Ti.
  • the computer device does not need to determine the vector distance to every object key information vector in the information vector database; instead, it pre-selects candidates through the object role mapping table, which greatly reduces the matching time and thereby improves the efficiency of finding candidate business objects with a matching relationship from the information vector database.
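  • Continuing the same illustrative assumptions, the pre-selection through the object role mapping table simply restricts the search to the list business objects before matching, for example:

    import numpy as np

    # Hypothetical sketch: compare the picture information vector only against the object key
    # information vectors of the list business objects in the object role mapping table.
    def match_with_cast_list(picture_vector, key_vectors, role_mapping, distance_threshold=1.0):
        cast_objects = set(role_mapping.values())        # list business objects (e.g., objects a-d)
        best_object, best_dist = None, float("inf")
        for obj in cast_objects & key_vectors.keys():    # skip objects not in the cast list
            dist = np.linalg.norm(picture_vector - key_vectors[obj])
            if dist < best_dist:
                best_object, best_dist = obj, dist
        return best_object if best_dist <= distance_threshold else None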
  • FIG. 4 is a schematic diagram of an architecture for obtaining image feature information from video frames according to an embodiment of the present application.
  • the schematic architectural diagram in the embodiment of the present application may be the schematic architectural diagram corresponding to the first module 201 in the embodiment corresponding to FIG. 2 .
  • the video frame 4V shown in Figure 4 may be a video frame in multimedia data (for example, the multimedia data 20S in the embodiment corresponding to Figure 2 described above).
  • the key part detection model 410w shown in Figure 4 can be used to detect key parts in the video frame 4V.
  • the key part detection model 410w may be the key part detection model 210w in the embodiment corresponding to FIG. 2 mentioned above.
  • the picture coding model 420w may be the picture coding model 220w in the embodiment corresponding to FIG. 2 described above.
  • the information vector database 400K shown in Figure 4 may be the information vector database 200K in the embodiment corresponding to Figure 2 described above.
  • the video frame 4V can be input to the key part detection model 410w shown in Figure 4; through the key part detection model 410w, the key parts of the character in the video frame 4V (for example, the facial features of the character) are detected and located to determine the position information of the key parts of the character in the video frame 4V (for example, the facial feature position information marked in the area 40Q shown in Figure 4). Further, the computer device can cut the key parts of the character in the video frame 4V based on the position information marked in the area 40Q, and obtain a character cutting picture containing the key parts of the character (for example, the character cutting picture 400T shown in Figure 4).
  • the key part detection model 410w shown in Figure 4 may be a network structure used to detect and locate the key parts of a character (for example, a character's face), for example, a face detection network (Multi-task Cascaded Convolutional Networks, MTCNN for short).
  • FIG. 5 is a model architecture diagram of a key part detection model provided by an embodiment of the present application.
  • the key part detection model in the embodiment of the present application may be the key part detection model 410w in the embodiment corresponding to Figure 4.
  • This key part detection model can be used to detect key parts in the video frame 5V shown in Figure 5, where the video frame 5V It may be the video frame 4V in the embodiment corresponding to FIG. 4 mentioned above.
  • the key part detection model may include three network layers, which may specifically include a filtering network layer 5W1 (for example, a Proposal Network, P-Net for short), a fine-tuning network layer 5W2 (for example, a Refinement Network, R-Net for short) and an output network layer 5W3 (for example, an Output Network, O-Net for short).
  • the computer device in the embodiment of the present application can adjust the image size of the video frame 5V, so that the image pyramid corresponding to the video frame 5V can be obtained.
  • the computer device can obtain a resize coefficient (for example, 0.7), for example, from its internal memory or from an external source, and adjust the size of the video frame 5V multiple times based on the resize coefficient until the picture size of the adjusted video frame 5V matches the image size threshold associated with the filtering network layer 5W1 (for example, 12*12*3).
  • the computer device can form a picture pyramid corresponding to the video frame 5V based on the video frames 5V with different picture sizes after multiple adjustments.
  • the size adjustment coefficient here may be dynamically set by the computer device according to the distribution of the key parts of the character in the video frame. If the size adjustment coefficient is set too large, it is easy to extend the time for detecting and locating the key parts of the character. If the size adjustment coefficient is set too small, the key parts of the character with a small distribution area in the video frame may be missed (for example, small and medium-sized faces). Based on this, the size adjustment coefficient in the embodiment of the present application can be set between 0.7-0.8.
  • the picture pyramid here may include the original picture (for example, the video frame 5V shown in Figure 5), the first adjusted picture (that is, the picture obtained by adjusting the picture size of the video frame 5V), the second adjusted picture (that is, the picture obtained by adjusting the picture size of the video frame 5V) The picture obtained by adjusting the picture size of the first adjusted picture), ..., and the Nth adjusted picture (that is, the picture obtained by adjusting the picture size of the N-1th adjusted picture).
  • the image size of the Nth adjusted image here may be the image size threshold associated with the filtering network layer 5W 1 (for example, 12*12).
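  • A minimal sketch of building such a picture pyramid is shown below (OpenCV resizing, the coefficient 0.7 and the minimum size 12 are used only to mirror the example values above):

    import cv2

    # Hypothetical sketch: repeatedly shrink the video frame by the resize coefficient until the
    # shorter side would fall below the minimum input size of the filtering network layer.
    def build_picture_pyramid(frame, resize_coefficient=0.7, min_size=12):
        pyramid = [frame]
        height, width = frame.shape[:2]
        while min(height, width) * resize_coefficient >= min_size:
            height = int(height * resize_coefficient)
            width = int(width * resize_coefficient)
            pyramid.append(cv2.resize(pyramid[-1], (width, height)))
        return pyramid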
  • the computer device can input the picture pyramid corresponding to the video frame 5V to the filtering network layer 5W1 shown in Figure 5, so that a large number of candidates can be obtained.
  • The picture obtained by cutting the video frame 5V according to the bounding box position information output by the filtering network layer 5W1 is called the first cut picture.
  • the computer device can input the pictures in the picture pyramid to the filtering network layer 5W 1 to obtain the output features (m, n, 16).
  • m and n here can be used to characterize the length and width of the image, and 16 is the dimension of the channel.
  • the computer device can screen out a large portion of candidates, thereby obtaining one or more first candidates.
  • the computer device then calibrates the bounding box (bbox for short) based on the obtained four offsets, and obtains the position information of the calibrated bounding box (for example, the coordinate information of the upper left and lower right).
  • the computer device can screen these first candidates again according to the Intersection over Union (IoU), that is, filter the first candidates by performing Non-Maximum Suppression (NMS).
  • the computer device can sort the classification scores (for example, in descending order) to obtain a tensor of shape (num_left, 4), that is, the absolute upper-left and lower-right coordinates of num_left bboxes. Further, each time, the computer device can compute the IoU between the bounding box with the maximum score after sorting and the remaining boxes, filter out the boxes whose IoU is greater than the IoU threshold (for example, 0.6; the IoU threshold is preset by the computer device), and move the box with the maximum score into the final result. In this embodiment of the present application, the above operation may be called a filtering operation.
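  • A standard non-maximum suppression routine matching the filtering operation described above might look as follows (the box format and the 0.6 threshold follow the description; the implementation details are illustrative):

    import numpy as np

    # Hypothetical sketch of the filtering operation (NMS): keep the highest-scoring box,
    # drop remaining boxes whose IoU with it exceeds the threshold, and repeat.
    def nms(boxes, scores, iou_threshold=0.6):
        # boxes: (K, 4) array of [x1, y1, x2, y2]; scores: (K,). Returns kept indices.
        x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
        areas = (x2 - x1) * (y2 - y1)
        order = scores.argsort()[::-1]               # descending classification scores
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(int(i))
            xx1 = np.maximum(x1[i], x1[order[1:]])   # intersection with remaining boxes
            yy1 = np.maximum(y1[i], y1[order[1:]])
            xx2 = np.minimum(x2[i], x2[order[1:]])
            yy2 = np.minimum(y2[i], y2[order[1:]])
            inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
            iou = inter / (areas[i] + areas[order[1:]] - inter)
            order = order[1:][iou <= iou_threshold]
        return keep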
  • the computer device repeats this filtering operation to filter out the many bounding boxes with large overlapping areas, and finally obtains (num_left_after_nms, 16) candidates. For these candidates, the video frame 5V is cut according to the bounding box position information and resized to a picture size of 24*24, yielding the pictures to be input to the fine-tuning network layer 5W2 shown in Figure 5 (i.e., the first cut pictures).
  • the first cut picture here may be a square with the maximum side length of the bounding box captured by the computer device in the video frame 5V, thereby effectively ensuring that no deformation occurs during size adjustment and that more details of key parts of the character are retained.
  • the computer device can fine-tune the first cut picture through the fine-tuning network layer 5W2 to obtain the second cut picture shown in Figure 5.
  • the fine-tuning network layer 5W2 can output 2 values corresponding to the two-class one-hot classification, 4 values corresponding to the bounding-box coordinate offsets, and 10 values corresponding to the key points (landmarks).
  • the fine-tuning network layer 5W2 can filter out most candidates that do not include key parts of the character (for example, the character's face) according to the binary classification score. After adjusting the bounding boxes according to the offsets, the filtering operation described for the filtering network layer 5W1 above is repeated to obtain (num_left_after_Rnet, 16) candidates.
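  • A minimal PyTorch-style sketch of an output head with those three output sizes (2 classification values, 4 bounding-box offsets, 10 landmark values) is given below; the shared feature dimension is an assumption:

    import torch
    import torch.nn as nn

    # Hypothetical sketch of an R-Net-style output head: one shared feature vector is mapped
    # to 2 class scores, 4 bounding-box coordinate offsets and 10 landmark coordinates.
    class RefinementHead(nn.Module):
        def __init__(self, feature_dim=128):
            super().__init__()
            self.cls = nn.Linear(feature_dim, 2)        # two-class one-hot scores
            self.bbox = nn.Linear(feature_dim, 4)       # bounding-box coordinate offsets
            self.landmark = nn.Linear(feature_dim, 10)  # 5 landmarks * (x, y)

        def forward(self, features):
            return self.cls(features), self.bbox(features), self.landmark(features)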
  • the computer device can accurately output the position information of the key parts of the character in the video frame 5V through the output network layer 5W3, including the coordinate information of the bounding box and the coordinate information of the key points (landmarks).
  • In the output network layer 5W3, after classification screening, bounding box adjustment and NMS screening, the computer device outputs not only the coordinate information of the bounding box but also the coordinate information of the key points, thereby obtaining the position information of the key parts of the character in the video frame 5V, which is subsequently used to cut the key parts of the character in the video frame 5V and obtain a picture containing the key parts of the character (for example, the character cutting picture 400T shown in Figure 4).
  • the computer device can input the character cutting picture 400T to the picture coding model 420w shown in FIG. 4.
  • the picture coding model 420w is a model based on Residual Network (Resnet).
  • Residual networks of this series are widely used in fields such as object classification and serve as the backbone of classic neural networks for computer vision tasks.
  • typical networks include Resnet50, Resnet101, etc.
  • the picture coding model 420w in the embodiment of this application may be a Resnet50 network model.
  • the Resnet50 network model can include 5 stages, which may specifically include the first stage (for example, Stage 0), the second stage (for example, Stage 1), the third stage (for example, Stage 2), the fourth stage (for example, Stage 3) and the fifth stage (for example, Stage 4).
  • the structure of Stage 0 is relatively simple. It can be regarded as the preprocessing of the character cutting image 400T.
  • the last four stages are all composed of bottleneck layers (Bottleneck), and the structures are relatively similar. Among them, Stage 1 can contain 3 Bottlenecks, Stage 2 can contain 4 Bottlenecks, Stage 3 can contain 6 Bottlenecks, and Stage 4 can contain 3 Bottlenecks.
  • the computer device inputs the character cutting picture 400T into the picture encoding model 420w.
  • the character cutting picture 400T can be converted into a picture information vector with 2048 dimensions.
  • the picture information vector can be used to represent the semantic feature information of the key parts of the character (for example, the face).
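  • A minimal torchvision-based sketch of a picture encoding step that yields a 2048-dimensional picture information vector is shown below (the untrained weights and the 224*224 input are assumptions; the model actually used would be trained for key part recognition):

    import torch
    import torch.nn as nn
    import torchvision.models as models

    # Hypothetical sketch: a ResNet-50 with its final fully-connected layer removed acts as the
    # picture encoding model; the pooled output is a 2048-dimensional picture information vector.
    backbone = models.resnet50()                                # weights omitted; load/train as needed
    encoder = nn.Sequential(*list(backbone.children())[:-1])    # drop the classification layer
    encoder.eval()

    with torch.no_grad():
        cut_picture = torch.randn(1, 3, 224, 224)               # stand-in for a preprocessed character cutting picture
        picture_vector = encoder(cut_picture).flatten(1)        # shape: (1, 2048)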
  • the computer device may obtain the information vector database 400K associated with the candidate business object shown in FIG. 4 .
  • the information vector database 400K here can be used to store object key information vectors corresponding to Y candidate business objects, where Y is a positive integer greater than or equal to M.
  • each object key information vector in the information vector database 400K can be extracted by the computer device using the same encoding processing method as the character cutting picture 400T.
  • An object key information vector can be used to represent the key part identifier (for example, a Face ID) corresponding to a candidate business object.
  • the computer device can respectively determine the vector distance between the picture information vector corresponding to the character cutting picture 400T and each of the Y object key information vectors, thereby obtaining Y vector distances.
  • the computer device can set a distance threshold in advance. If the minimum vector distance determined by the computer device is greater than the distance threshold, it can be considered that the computer device has not matched, in the information vector database 400K, an object key information vector corresponding to the character cutting picture 400T, that is, it has not matched a business object corresponding to the character cutting picture 400T. If the minimum vector distance determined by the computer device is less than or equal to the distance threshold, it can be considered that the computer device has matched, in the information vector database 400K, the object key information vector corresponding to the character cutting picture 400T, that is, it can successfully match the business object corresponding to the character cutting picture 400T.
  • When the computer device obtains the minimum vector distance that is less than or equal to the distance threshold from the Y vector distances, it can determine the candidate business object corresponding to the object key information vector associated with that minimum vector distance, and the determined candidate business object can then be used as the business object corresponding to the character cutting picture 400T.
  • When the computer device performs picture recognition on each video frame in the multimedia data, it can refer to the specific implementation of key part recognition for the video frame 5V shown in Figure 5 to obtain the X character cutting pictures containing the key parts of characters, which will not be described further here. If a video frame includes the key parts of multiple different characters, the computer device can cut out a corresponding number of character key part pictures from the video frame.
  • the computer device can refer to the specific implementation of object matching for the character cutting picture 400T in the embodiment corresponding to FIG. 4, perform object matching on each of the X character cutting pictures, and then determine the picture feature information corresponding to the video frames in the multimedia data based on the business objects corresponding to the obtained character cutting pictures.
  • Step S102: Locate and separate audio frames containing human voices from the original audio frames of the multimedia data to obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters.
  • N object audio frames are obtained after the computer device performs object positioning and separation processing on the original audio frames in the multimedia data, where N is a positive integer.
  • An audio cluster can correspond to a business object.
  • the computer device can obtain original audio frames from multimedia data, and can then perform object positioning and separation processing on the original audio frames to obtain N object audio frames.
  • the computer device can perform semantic feature extraction on each of the N object audio frames to obtain an audio semantic feature vector corresponding to each object audio frame.
  • the computer device can determine M as the number of cluster centers for clustering and, based on this number of cluster centers, perform clustering processing on the audio semantic feature vector corresponding to each acquired object audio frame, so that M audio clusters can be obtained. The audio semantic features can be understood as characteristics of the speaker's voiceprint.
  • the embodiment of the present application innovatively uses the number M of business objects indicated by the picture feature information as the number of cluster centers. This method of using the picture feature information as prior knowledge lets the system know the number of business objects in the multimedia data and thereby gives audio clustering a prior setting of the number of cluster centers; automatically setting the number of cluster centers in this way improves the convergence speed of the entire system and the overall recognition performance, and saves computing resources.
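  • A minimal scikit-learn sketch of this clustering step is shown below (KMeans is only one possible clustering choice, and the vector dimensions are illustrative; the disclosure does not mandate a specific algorithm):

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical sketch: cluster the N audio semantic feature vectors into M audio clusters,
    # where M is the number of business objects indicated by the picture feature information.
    N, dim, M = 500, 256, 4                      # illustrative sizes
    audio_vectors = np.random.randn(N, dim)      # stand-in for extracted audio semantic feature vectors

    kmeans = KMeans(n_clusters=M, n_init=10, random_state=0).fit(audio_vectors)
    cluster_labels = kmeans.labels_              # one audio cluster id per object audio frame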
  • FIG. 6 is a schematic architectural diagram of audio semantic feature clustering provided by an embodiment of the present application.
  • the schematic architectural diagram in the embodiment of the present application may be the schematic architectural diagram corresponding to the second module 202 in the embodiment corresponding to FIG. 2 .
  • the original audio frame shown in FIG. 6 may be an original audio frame in multimedia data (for example, the multimedia data 20S in the embodiment corresponding to FIG. 2 mentioned above).
  • the source separation model 630w shown in Figure 6 can be used to perform source separation on the original audio frame.
  • the information source separation model 630w may be the information source separation model 230w in the embodiment corresponding to FIG. 2 described above.
  • the audio semantic feature extraction model 640w shown in Figure 6 can be used to extract semantic features for each object audio frame.
  • the audio semantic feature extraction model 640w may be the audio semantic feature extraction model 240w in the embodiment corresponding to FIG. 2 described above.
  • the architectural schematic diagram in the embodiment of the present application may include three nodes, namely an audio paragraph cutting node, an audio semantic feature extraction node, and a clustering processing node.
  • when the computer device is at the audio paragraph cutting node, it can obtain the original audio frame from the multimedia data and perform source separation on the original audio frame, thereby obtaining the audio frame to be processed that contains the business object's voice. Further, based on the audio boundary detection strategy for eliminating silent frames, the computer device can locate and cut the non-silent segments in the audio signal of the audio frame to be processed, so that N object audio frames can be obtained.
  • source separation refers to separating a mixed audio signal that contains multiple audio signals through signal processing or other algorithms, extracting the specified types of audio signal sequences from the mixed signal, and finally generating separate audio files.
  • the audio frame to be processed for the business object, that is, the object segment, is extracted from the original audio frame.
  • the source separation model 630w can be used to perform source separation on the original audio frame to obtain the object segment (or object track) and the ambience segment (or ambience track). Since there may be a large number of silent segments in the object segment, and these silent segments would interfere with the audio clustering results of the subsequent clustering processing and also waste computing resources, the computer device can at this point determine the object segment as the audio frame to be processed for the business object. The computer device can then obtain the audio boundary detection strategy.
  • the audio boundary detection strategy here can be the VAD (Voice Activity Detection) algorithm.
  • the VAD algorithm here can be widely used in speech coding, noise reduction and ASR scenarios.
  • a VAD system can usually include two parts: feature extraction and speech/non-speech decision. Further, based on the audio boundary detection strategy, the computer device can locate and cut the audio signal in the audio frame to be processed, that is, accurately locate the non-silent segments, so that the N object audio frames shown in Figure 6 can be obtained, where N is a positive integer.
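  • A minimal energy-based sketch of the speech/non-speech decision described above is given below; the frame length and energy threshold are illustrative assumptions and not part of the original disclosure (a production VAD such as the WebRTC VAD would typically be used instead):

```python
import numpy as np

def locate_non_silent_segments(samples, sample_rate, frame_ms=30, energy_threshold=1e-4):
    """Toy voice-activity detection: mark frames whose mean energy exceeds a
    threshold as non-silent, and merge consecutive non-silent frames into segments."""
    frame_len = int(sample_rate * frame_ms / 1000)
    segments, start = [], None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        is_speech = np.mean(frame.astype(np.float64) ** 2) > energy_threshold
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start / sample_rate, i / sample_rate))
            start = None
    if start is not None:
        segments.append((start / sample_rate, len(samples) / sample_rate))
    return segments  # list of (start_s, end_s) boundaries of object audio frames

# Hypothetical usage on one second of low-amplitude random audio at 16 kHz
print(locate_non_silent_segments(np.random.randn(16000) * 0.01, 16000))
```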
  • the computer device may input the N object audio frames to the audio semantic feature extraction model 640w shown in FIG. 6 .
  • the audio semantic feature extraction model 640w can be an audio neural network (for example, PANNS network) based on a large audio data set and training, which is usually used for audio pattern recognition or audio frame level embedding, and serves as the front end of many models. Coding network.
  • the computer device can extract semantic features from each of the N object audio frames through the audio semantic feature extraction model 640w, and obtain the audio semantic feature vector corresponding to each object audio frame. As shown in Figure 6, these may specifically include audio semantic feature vector 1, audio semantic feature vector 2, ..., and audio semantic feature vector N.
  • the clustering strategy used for the clustering processing in the embodiment of the present application may be the k-means clustering algorithm.
  • the k-means clustering algorithm is an iterative clustering analysis algorithm.
  • the computer device may divide the N audio semantic feature vectors into M initial clusters in advance. Furthermore, the computer device can randomly select M audio semantic feature vectors as the initial cluster centers of the M initial clusters. Then, for each audio semantic feature vector (i.e., vector to be attributed) in the audio semantic feature vector set other than the M audio semantic feature vectors selected as cluster centers, the computer device may determine the vector distance between that vector to be attributed and the cluster center of each initial cluster, and assign the vector to be attributed to the initial cluster with the minimum vector distance. At this time, the computer device can update the cluster centers of the clusters after the assignment. By analogy, iterating in this way, the computer device can determine the M audio clusters shown in FIG. 6. The M audio clusters may specifically include audio cluster C 1 , audio cluster C 2 , ..., and audio cluster C M .
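  • The clustering step described above can be sketched as follows; this is an illustrative from-scratch k-means over the N audio semantic feature vectors, with the number of cluster centers fixed to M taken from the picture feature information, and the vector dimension and iteration count are assumptions:

```python
import numpy as np

def kmeans_audio_clusters(feature_vectors, M, iterations=50, seed=0):
    """Cluster N audio semantic feature vectors into M audio clusters.

    feature_vectors: array of shape (N, D); M: number of cluster centers,
    taken from the picture feature information as prior knowledge.
    Returns the cluster index assigned to each object audio frame.
    """
    rng = np.random.default_rng(seed)
    centers = feature_vectors[rng.choice(len(feature_vectors), M, replace=False)]
    for _ in range(iterations):
        # distance of every vector to every cluster center, shape (N, M)
        dists = np.linalg.norm(feature_vectors[:, None, :] - centers[None, :, :], axis=-1)
        assignments = dists.argmin(axis=1)        # attribute each vector to its nearest center
        new_centers = np.array([
            feature_vectors[assignments == m].mean(axis=0)
            if np.any(assignments == m) else centers[m]
            for m in range(M)
        ])
        if np.allclose(new_centers, centers):     # cluster centers have stabilized
            break
        centers = new_centers
    return assignments

# Hypothetical usage: 6 object audio frames with 2048-dimensional semantic vectors, M = 3
vectors = np.random.rand(6, 2048)
print(kmeans_audio_clusters(vectors, M=3))
```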
  • the embodiment of this application uses the audio semantic feature clustering method to classify N audio semantic feature vectors instead of training voiceprint classification through neural networks, thereby getting rid of the dependence on the actor's voiceprint ID and avoiding privacy violations.
  • the embodiments of this application can directly use the object audio frames in the multimedia data to extract the audio semantic feature vector corresponding to each object audio frame. This is deeply decoupled from the personal voiceprint ID of the business object and is instead correlated with the voiceprint information of the character's voice itself, so that business characters voiced by professional voice actors can also be identified. That is to say, the embodiment of the present application can still accurately identify the character to which a line belongs even when the business character is not dubbed by the business object himself, thus improving the accuracy of audio character recognition.
  • the embodiment of the present application uses the audio semantic feature clustering method to cluster the N audio semantic feature vectors for audio character recognition, which makes the entire audio character recognition system portable and more versatile, so that it can be applied to business objects in different scenarios of different multimedia data, thus effectively improving the applicability of the identification.
  • FIG. 7 is a model architecture diagram of a source separation model provided by an embodiment of the present application.
  • the information source separation model in the embodiment of the present application may be the information source separation model 630w in the embodiment corresponding to Figure 6.
  • the source separation model may include a split network layer 7W 1 (ie, a first split network layer, for example, VACAL-Unet) and a split network layer 7W 2 (ie, a second split network layer, for example, BGM-Unet).
  • Unet is one of the algorithms that uses a fully convolutional network for semantic segmentation, using a symmetric U-shaped structure containing a compression path and an expansion path.
  • the typical feature of the Unet network is its U-shaped symmetrical structure, which can contain 4 convolutional layers and 4 corresponding upsampling layers. Therefore, when implementing it, one can either build the network from scratch, initialize the weights and train the model, or borrow the convolutional layer structure of an existing network together with its trained weight file, add the subsequent upsampling layers, and then perform the training computations. Since pre-trained weight files can be reused in deep learning model training, the speed of Unet training is greatly accelerated.
  • Another feature is that the feature map obtained by each convolutional layer of the Unet network will be connected to the corresponding upsampling layer, so that the feature map of each layer can be effectively used in subsequent calculations, that is, skip connection (skip-connection). It can effectively solve the problem of gradient dissipation and improve the efficiency of model training.
  • Unet avoids performing supervision and loss calculation directly on high-level feature maps alone; instead, it combines the features in low-level feature maps, so that the finally obtained feature map contains both first-level features (i.e., high-level features) and many second-level features (i.e., low-level features), achieving feature fusion at different levels and thereby improving the accuracy of the model's results.
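  • A minimal sketch of a U-shaped network with skip connections, as described above, is shown below; the channel sizes and depth (2 levels rather than 4) are simplified assumptions, and this is not the actual segmentation network layer of the disclosure:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy 2-level Unet over a spectrogram-like input: compression path,
    expansion path, and a skip connection between corresponding levels."""
    def __init__(self, in_ch=1):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        # decoder takes upsampled features concatenated with the skip connection
        self.dec1 = nn.Sequential(nn.Conv2d(32 + 16, 16, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(16, 1, 1)

    def forward(self, x):
        e1 = self.enc1(x)                            # low-level feature map (skip source)
        e2 = self.enc2(self.pool(e1))                # high-level feature map
        d1 = self.up(e2)
        d1 = self.dec1(torch.cat([d1, e1], dim=1))   # skip connection: fuse the two levels
        return torch.sigmoid(self.out(d1))           # mask-like output in [0, 1]

# Hypothetical usage on a 1 x 128 x 128 spectrum amplitude spectrum
print(TinyUNet()(torch.randn(1, 1, 128, 128)).shape)
```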
  • when the computer device inputs the original audio frame into the source separation model, it can generate the spectrum amplitude spectrum corresponding to the original audio frame through the source separation model shown in Figure 7. For example, the computer device can perform spectrum conversion on the audio track of the original audio frame to obtain the audio track spectrum corresponding to the original audio frame, and can then generate the spectrum amplitude spectrum corresponding to the original audio frame by eliminating the phase of the audio track spectrum.
  • the computer device can input the spectrum amplitude spectrum into the segmentation network layer 7W 1 and the segmentation network layer 7W 2 respectively, so as to generate the first type of features (for example, object track features) corresponding to the spectrum amplitude spectrum through the segmentation network layer 7W 1 , and to generate the second type of features (for example, ambience track features) corresponding to the spectrum amplitude spectrum through the segmentation network layer 7W 2 .
  • the computer device can perform merge mask processing on the first type features and the second type features to obtain a target mask map corresponding to the first type features (i.e., the first mask map). Furthermore, the computer device can generate a target type audio frame (i.e., an audio frame in the object segment) based on the target mask map and the spectrum amplitude spectrum, and use the target type audio frame as the audio frame to be processed, containing the business object's voice, that is output by the source separation model. For example, when the computer device generates the first type features and the second type features shown in Figure 7, it can perform splicing processing on the first type features and the second type features to obtain spliced type features.
  • the computer device performs two types of mask calculations on the splicing type features, so that a first mask image corresponding to the first type feature and a second mask image corresponding to the second type feature can be obtained.
  • the mask calculation is performed, for example, by comparing the feature value at each point with the merged value obtained after the splicing process.
  • the computer device can perform a corresponding-position calculation (for example, multiplication) between the first mask image and the spectrum amplitude spectrum corresponding to the original audio frame, and then generate the first type audio frame (i.e., the audio frame in the object segment) through inverse spectrum transformation.
  • similarly, the computer device can perform a corresponding-position calculation between the second mask image and the spectrum amplitude spectrum corresponding to the original audio frame, and then generate the second type audio frame (i.e., the audio frame in the ambience segment) through inverse spectrum transformation. Since the above mask and amplitude spectrum calculations yield the amplitude spectra corresponding to the first type features and the second type features, the inverse spectrum transformation yields the one-dimensional sampling-point sequences of the first type and second type audio, that is, the audio signals themselves.
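  • The masking and reconstruction flow described above can be sketched as follows; this is an illustrative STFT-based spectral masking pipeline rather than the disclosed Unet-based model, and the frame sizes and feature inputs are assumptions:

```python
import numpy as np
from scipy.signal import stft, istft

def separate_with_masks(mixture, vocal_feat, bgm_feat, fs=16000, nperseg=512):
    """Toy source separation: build soft masks from two (precomputed) feature
    magnitudes, apply them to the mixture's spectrum amplitude spectrum, and
    reconstruct object and ambience audio via inverse spectrum transformation."""
    _, _, spec = stft(mixture, fs=fs, nperseg=nperseg)
    amplitude, phase = np.abs(spec), np.angle(spec)      # phase is set aside for the masks
    total = vocal_feat + bgm_feat + 1e-8
    vocal_mask, bgm_mask = vocal_feat / total, bgm_feat / total   # merged mask maps
    # corresponding-position (element-wise) multiplication, then inverse transform
    _, vocal = istft(vocal_mask * amplitude * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    _, bgm = istft(bgm_mask * amplitude * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return vocal, bgm  # object segment and ambience segment sample sequences

# Hypothetical usage with random "features" of the same shape as the spectrogram
mix = np.random.randn(16000)
shape = stft(mix, fs=16000, nperseg=512)[2].shape
voice, ambience = separate_with_masks(mix, np.random.rand(*shape), np.random.rand(*shape))
print(voice.shape, ambience.shape)
```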
  • the computer device can separate environmental sounds (for example, BGM sounds) from the original audio frames of the multimedia data through the source separation model shown in Figure 7 to eliminate the impact of environmental sounds on subsequent clustering, thereby improving clustering accuracy.
  • FIG. 8 is a schematic diagram of the model architecture of an audio semantic feature extraction model provided by an embodiment of the present application.
  • the audio semantic feature extraction model in the embodiment of the present application may be the audio semantic feature extraction model 640w in the embodiment corresponding to Figure 6.
  • the audio semantic feature extraction model shown in Figure 8 can be the Wavegram_Logmel128_Cnn14 model.
  • the biggest feature of this audio semantic feature extraction model is that it takes the original audio sampling-point sequence as input, that is, the input of the entire network is the N object audio frames of the audio signal. This eliminates the need to extract basic audio features in advance. Since extracting basic audio features is very time-consuming, and using basic audio features as input occupies a particularly large amount of hardware resources, using this audio semantic feature extraction model to process the N object audio frames of the input audio signal can save computer resources and improve computing efficiency.
  • the audio semantic feature extraction model may include a time domain branch network layer, a frequency domain branch network layer and a convolution network layer.
  • the time domain branch network layer here may include a convolution layer 801w (for example, a one-dimensional convolution layer with a convolution size of 1 and a stride of 5), a convolution layer 802w (for example, a basic block including a one-dimensional convolution layer), a max-pooling layer 803w (for example, a max-pooling layer with a stride of 4), a convolution layer 804w (for example, a basic block including a one-dimensional convolution layer), a max-pooling layer 805w (for example, a max-pooling layer with a stride of 4), and a convolution layer 806w (for example, a basic block including a one-dimensional convolution layer).
  • through these large one-dimensional convolution layers, the computer device can directly learn the time domain characteristics of the audio signal, especially information such as audio loudness and sampling-point amplitude. After a large number of one-dimensional convolutional layers, a two-dimensional wavegram is obtained to represent the learned time domain feature map, so that the outputs of the time domain branch and the frequency domain branch can be combined.
  • the computer device can also perform feature learning on N object audio frames through the frequency domain branch network layer to obtain the learned frequency domain feature map (frequency domain learning feature).
  • the frequency domain branch network layer here may include a convolution layer 809w (for example, a two-dimensional convolution layer including a basic block).
  • the computer device can input N object audio frames to the frequency domain branch network layer and generate frequency domain spectra corresponding to the N object audio frames (for example, using Mel frequency to generate a log-mel spectrum).
  • the computer device inputs the frequency domain spectrum to the convolution layer 809w shown in Figure 8, so as to obtain, through the multiple two-dimensional convolution layers in the convolution layer 809w, a learned frequency domain feature map with the same feature dimensions as the learned time domain feature map.
  • the computer device can superimpose (for example, splice) the learned frequency domain feature map and the learned time domain feature map, so that the superimposed feature can be obtained.
  • the computer device then inputs the superimposed features into the convolutional network layer, performs maximum and average processing on the superimposed features, and outputs the audio semantic feature vector corresponding to each object audio frame.
  • the convolutional network layer here may include a convolutional layer 810w (for example, a two-dimensional convolutional layer) and an activation layer 811w.
  • the computer device can splice the feature map representing the learned frequency domain feature map with the feature map representing the learned time domain feature map, forming a set of two-dimensional feature maps that represent the superimposed features.
  • the computer device can input the two-dimensional feature map representing the superimposed features into the convolution layer 810w shown in Figure 8, and then separately apply two-dimensional pooling to the features output by the convolution layer 810w, performing maximum processing and average processing to extract the maximum representation and average representation of the current feature. Furthermore, the computer device can determine the maximum-processed feature as the first sub-feature and the average-processed feature as the second sub-feature. At this time, the computer device can merge the first sub-feature and the second sub-feature, and then input the merged feature to the activation layer 811w shown in Figure 8 to finally generate an audio semantic feature vector set with 2048 dimensions.
  • the audio semantic feature vector set may include an audio semantic feature vector corresponding to each of the N object audio frames.
  • the computer device can quickly perform audio semantic feature extraction on each of the N object audio frames through the audio semantic feature extraction model shown in Figure 8, so as to obtain each object more quickly and accurately The audio semantic feature vectors corresponding to the audio frames respectively.
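  • A heavily simplified sketch of such a two-branch extractor is given below; the layer sizes, pooling shapes, and output dimension are illustrative assumptions and this is not the actual Wavegram_Logmel128_Cnn14 architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWavegramLogmelCNN(nn.Module):
    """Toy two-branch audio semantic feature extractor: a time-domain branch of
    1-D convolutions over the raw sampling points, a frequency-domain branch over
    a log spectrogram, feature superposition, then max + average pooling into one
    semantic vector per object audio frame."""
    def __init__(self, n_fft=512, hop=256, embed_dim=256):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.time_branch = nn.Sequential(            # large 1-D convolutions on raw audio
            nn.Conv1d(1, 32, kernel_size=11, stride=5, padding=5), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.freq_branch = nn.Sequential(             # 2-D convolutions on the spectrogram
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.post = nn.Conv2d(128, embed_dim, kernel_size=3, padding=1)

    def forward(self, wav):                           # wav: (batch, samples)
        t = self.time_branch(wav.unsqueeze(1))        # learned time domain features
        t = t.unsqueeze(2)                            # treat as a 2-D "wavegram"
        spec = torch.stft(wav, self.n_fft, self.hop,
                          window=torch.hann_window(self.n_fft), return_complex=True)
        logspec = torch.log1p(spec.abs()).unsqueeze(1)            # (batch, 1, F, T)
        f = self.freq_branch(logspec)                 # learned frequency domain features
        # superimpose: resize both maps to a common grid, then concatenate channels
        f = F.adaptive_avg_pool2d(f, (8, 32))
        t = F.adaptive_avg_pool2d(t, (8, 32))
        x = self.post(torch.cat([t, f], dim=1))
        maximum = x.amax(dim=(2, 3))                  # maximum representation
        average = x.mean(dim=(2, 3))                  # average representation
        return maximum + average                      # audio semantic feature vector

# Hypothetical usage: 2 object audio frames of 1 second at 16 kHz
print(TinyWavegramLogmelCNN()(torch.randn(2, 16000)).shape)  # torch.Size([2, 256])
```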
  • Step S103 Identify the business role corresponding to each of the P audio clusters based on the picture feature information, M audio clusters and the object role mapping table associated with the multimedia data.
  • the object role mapping table (for example, the object role mapping table shown in Table 1 above) may include business roles that have a mapping relationship with the list business object, and there are P overlapping business objects between the list business object and the M business objects.
  • the computer device may obtain the audio cluster C k from the M audio clusters.
  • the computer device can extract the first playing time of the audio cluster C k in the multimedia data, where k is a positive integer less than or equal to M.
  • the first playback time of the audio cluster C k in the multimedia data is one or more playback times in the multimedia data of the object audio frame corresponding to the audio semantic feature vector included in the audio cluster C k .
  • the computer device can obtain P business objects that overlap with the M business objects from the list of business objects in the object role mapping table associated with the multimedia data. Furthermore, the computer device can extract the second playback time of each of the P business objects in the multimedia data based on the picture feature information. The second playback time of each of the P business objects in the multimedia data is one or more playback times in the multimedia data of the video frame in which each of the P business objects is located. At this time, the computer device can respectively determine the time overlap between the first playback time and each second playback time of the audio cluster C k . Furthermore, the computer device can use the business object corresponding to the second playback time with the highest degree of time overlap as the business object corresponding to the audio cluster C k . Further, the computer device can obtain the business role corresponding to the business object corresponding to the audio cluster C k from the object role mapping table, and use the obtained business role as the business role corresponding to the audio cluster C k .
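  • A minimal sketch of the time-overlap matching between an audio cluster's first playback time and each business object's second playback time is shown below; representing the playback times as (start, end) pairs in seconds is an assumption for illustration:

```python
def overlap_seconds(times_a, times_b):
    """Total overlap, in seconds, between two lists of (start, end) playback times."""
    return sum(max(0.0, min(e1, e2) - max(s1, s2))
               for s1, e1 in times_a for s2, e2 in times_b)

def match_cluster_to_object(first_playback_time, second_playback_times):
    """Pick the business object whose second playback time overlaps most with the
    audio cluster's first playback time."""
    return max(second_playback_times,
               key=lambda obj: overlap_seconds(first_playback_time,
                                               second_playback_times[obj]))

# Hypothetical usage: cluster C_k speaks around 00:30-10:10 and 35:08-40:52
cluster_time = [(30, 610), (2108, 2452)]
object_times = {"object_a": [(0, 600), (1845, 2280)], "object_b": [(605, 1713)]}
print(match_cluster_to_object(cluster_time, object_times))  # object_a
```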
  • the embodiments of this application start from the audio perspective, identify characters in multimedia data, and attribute each audio line to a role. This can supplement accurate line-to-role information in shots and scenes where no key-part information of the character is available in the picture, thus improving the accuracy of role recognition.
  • FIG. 9 is a schematic diagram of a scenario for audio character recognition provided by an embodiment of the present application.
  • the computer device executes step S101, it can determine through the image feature information recognized by the first module 201 that the number M of business objects to which the character pictures in the video frames of the multimedia data belong is 3. Specifically, it can include objects a, object b and object c.
  • the computer device executes step S102, it can determine that there are three audio clusters through the audio processing results clustered by the second module 202. Specifically, it may include audio clustering cluster C 1 , audio clustering cluster C 2 and audio clustering cluster C 3 shown in FIG. 9 .
  • the N object audio frames in the embodiment of the present application may include segment 1, segment 2, segment 3, segment 4, segment 5, and segment 6 shown in FIG. 9 .
  • these 6 segments are arranged according to playing time.
  • the object audio frames corresponding to audio cluster C 1 may include object audio frames in segment 1 and segment 3 .
  • the object audio frames corresponding to audio cluster C 2 may include object audio frames in segment 2, segment 4, and segment 6.
  • the object audio frame corresponding to audio cluster C 3 may include the object audio frame in segment 5 .
  • the computer device can obtain, from the list of business objects in the object role mapping table shown in Table 1, business objects that overlap with the M business objects obtained by the computer device in the first module.
  • the list business objects in Table 1 above may include four business objects: object a, object b, object c, and object d.
  • the M business objects obtained by the computer device in the embodiment of the present application may include object a, object b, and object c.
  • the computer device can extract the playback time (ie, the second playback time) of each of the three overlapping business objects in the multimedia data based on the picture feature information.
  • the second playback time of object a in the multimedia data is playback time T 1 (for example, 00:00-10:00) and playback time T 3 (for example, 30:45-38:00); the second playback time of object b in the multimedia data is playback time T 2 (for example, 10:05-28:33), playback time T 4 (for example, 40:05-55:39), and playback time T 6 (for example, 100:03-113:57); the second playback time of object c in the multimedia data is playback time T 5 (for example, 80:30-88:50).
  • the computer device can obtain the audio cluster C 1 from these three audio clusters, and then can extract the playback time of the audio cluster C 1 in the multimedia data (ie, the first playback time of the audio cluster C 1 ).
  • the first playback time of the audio cluster C 1 in the multimedia data may include the playback time t 1 corresponding to segment 1 (for example, 00:30-10:10) and the playback time t 3 corresponding to segment 3 (for example, 35:08-40:52).
  • the computer device can respectively determine the time overlap between the audio cluster C 1 and the second playback time corresponding to each business object.
  • the time overlap between the first playback time of audio cluster C 1 and the second playback time of object a is 98%
  • the time overlap with the second playback time of object b is 5%
  • the time overlap with the second playback time of object c is 1%.
  • the computer device can determine the second playback time with the highest time overlap degree from the three time overlap degrees, that is, the second playback time of object a. Further, the computer device can use object a as the business object corresponding to the audio cluster C 1 , and obtain the business roles that have a mapping relationship with object a (i.e., role 1 and role 2) from the above Table 1 as the business roles corresponding to the audio cluster C 1 .
  • the computer device can refer to the audio role identification method of the business role corresponding to the audio cluster C 1 and determine that the business role corresponding to the audio cluster C 2 can be the role 3 that has a mapping relationship with the object b.
  • similarly, the business role corresponding to the audio cluster C 3 may be role 4, which has a mapping relationship with object c.
  • a computer device with an audio character recognition function can associate sounds with characters by combining the picture feature information automatically recognized from video frames with the M adaptively clustered audio clusters, so that the business roles corresponding to the P audio clusters associated with the object role mapping table can be accurately identified. This audio role identification method does not require manual annotation of the business role to which each audio line belongs; it can not only reduce the consumption of manpower and time, but also solve the problem of recognition errors caused by similar timbres, thereby improving the accuracy and efficiency of recognition.
  • the embodiments of the present application can adopt the audio semantic feature clustering method in the audio character recognition process, making the entire audio character recognition system more versatile and applicable to different scenarios of business objects in different multimedia data, thereby effectively improving the recognition applicability.
  • FIG. 10 is another schematic flowchart of a data processing method provided by an embodiment of the present application.
  • This method can be executed by a terminal device with an audio role recognition function (for example, any terminal device in the terminal device cluster shown in FIG. 1 above, such as the terminal device 100a), by a server with an audio role recognition function (for example, the server 10F shown in FIG. 1), or interactively by a target terminal device with a multimedia data playback function and a server with an audio role recognition function, which is not limited here.
  • the method may at least include the following steps S201 to S205:
  • Step S201 Identify picture feature information from video frames of multimedia data.
  • Step S202: Locate and separate the audio frames containing human voices from the original audio frames of the multimedia data to obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters.
  • Step S203 Identify the business role corresponding to each of the P audio clusters based on the picture feature information, M audio clusters and the object role mapping table associated with the multimedia data.
  • step S201 to step S203 please refer to the description of step S101 to step S103 in the embodiment corresponding to FIG. 3, which will not be described again here.
  • Step S204: Based on the first playback time in the multimedia data of the P audio clusters (specifically, of the object audio frames corresponding to the P audio clusters) and the second playback time in the multimedia data of the business objects corresponding to the P audio clusters (specifically, of the video frames in which those business objects appear), determine the service playback time in the multimedia data of each of the P business objects.
  • the computer device can obtain a target audio cluster from the P audio clusters, and can further determine the first playback time of the target audio cluster in the multimedia data and the second playback time in the multimedia data of the business object corresponding to the target audio cluster. Further, the computer device can determine the time intersection or time union of the first playback time and the second playback time of the target audio cluster, and then use the determined time intersection or time union as the service playback time in the multimedia data of the business object corresponding to the target audio cluster, until the service playback time in the multimedia data of each of the P business objects is obtained.
  • the embodiment of the present application uses the audio semantic feature clustering method to perform audio character recognition, which can make up for the situation where some video frames contain no character facial information or object information, so that the character cannot be recognized from the picture even though its audio is present. It can automatically determine, based on the semantic features of the object audio frames, the business role corresponding to the current audio cluster, thus filling the shortcomings of role recognition by image recognition alone and ensuring the integrity of the role's time positioning information across the entire multimedia data.
  • the first playback time of audio cluster C 1 in the multimedia data may include the playback time t 1 corresponding to segment 1 (for example, 00:30-10:10) and the playback time corresponding to segment 3 t 3 (e.g., 35:08-40:52).
  • the second playback time of the business object (for example, object a) corresponding to the audio cluster C 1 in the multimedia data is the playback time T 1 (for example, 00:00-10:00) and the playback time T 3 (for example, 30 :45-38:00).
  • if the computer device uses the time intersection method to determine the service playback time, the service playback time of object a determined by the computer device can be 00:30-10:00 and 35:08-38:00; if the computer device uses the time union method, the service playback time of object a determined by the computer device can be 00:00-10:10 and 30:45-40:52.
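  • A minimal sketch of the intersection and union calculations over the two playback-time lists is shown below; representing the times as seconds is an assumption for illustration:

```python
def intersect(first_times, second_times):
    """Pairwise time intersection of two lists of (start, end) playback times."""
    out = []
    for s1, e1 in first_times:
        for s2, e2 in second_times:
            s, e = max(s1, s2), min(e1, e2)
            if s < e:
                out.append((s, e))
    return out

def union(first_times, second_times):
    """Time union: merge all overlapping (start, end) playback times."""
    merged = []
    for s, e in sorted(first_times + second_times):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

# Hypothetical usage with the example times above (mm:ss converted to seconds)
t = [(30, 610), (2108, 2452)]   # first playback time of audio cluster C1
T = [(0, 600), (1845, 2280)]    # second playback time of object a
print(intersect(t, T))  # [(30, 600), (2108, 2280)] -> 00:30-10:00, 35:08-38:00
print(union(t, T))      # [(0, 610), (1845, 2452)]  -> 00:00-10:10, 30:45-40:52
```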
  • Step S205 Based on the service playback time corresponding to each of the P business objects, obtain the multimedia segment data corresponding to the P business objects from the multimedia data.
  • the multimedia segment data here may include the audio frames associated with the corresponding business object and the video frames associated with the corresponding business object.
  • when the computer device obtains the service playback time of object a, the service playback time of object b, and the service playback time of object c, it can respectively obtain the multimedia segment data corresponding to these three business objects.
  • for example, the computer device can obtain, from the multimedia data, the multimedia segment data that matches the service playback time of object a (that is, the data including the video frames associated with object a and the audio frames associated with object a) as the multimedia segment data corresponding to object a (for example, multimedia segment data 1).
  • similarly, the computer device can obtain the multimedia segment data that matches the service playback time of object b (that is, the data including the video frames associated with object b and the audio frames associated with object b) as the multimedia segment data corresponding to object b (for example, multimedia segment data 2), and obtain the multimedia segment data that matches the service playback time of object c (that is, the data including the video frames associated with object c and the audio frames associated with object c) as the multimedia segment data corresponding to object c (for example, multimedia segment data 3).
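  • As an illustrative sketch that is not part of the original disclosure, the multimedia segment data for a business object could be cut out of the source file with a standard tool such as ffmpeg, one clip per service playback time; the file names and times below are hypothetical:

```python
import subprocess

def cut_segments(source_path, service_playback_times, prefix):
    """Cut one clip per (start, end) service playback time (in seconds) from the
    source video. Assumes ffmpeg is installed; stream-copies without re-encoding."""
    for i, (start, end) in enumerate(service_playback_times):
        subprocess.run([
            "ffmpeg", "-y", "-i", source_path,
            "-ss", str(start), "-to", str(end),
            "-c", "copy", f"{prefix}_{i}.mp4",
        ], check=True)

# Hypothetical usage for object a's service playback time
# cut_segments("multimedia_data.mp4", [(30, 600), (2108, 2280)], "object_a_segment")
```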
  • the fully automatic audio character recognition solution based on the audio semantic feature clustering method provided by the embodiments of the present application can automatically combine picture feature information (for example, character facial information) to identify business characters in multimedia data, which can save a lot of manual annotation costs and time costs and accelerate the implementation of video applications.
  • when the computer device obtains the multimedia segment data corresponding to each business object, this can be applied to the "watch TA only" user-specific service in the multimedia data playback scenario: storyboards can be selected for a business object (or business role) in the multimedia data, so that when the target user triggers this user-specific service, the multimedia segment data not selected by the user is automatically skipped, and the computer device can more precisely locate the multimedia segment data of the business objects that the user likes.
  • the computer device can play multimedia data in a business playback display interface.
  • the service playback display interface may include a playback selection control for triggering a target video data selection function.
  • the computer device may display the object playlist in response to the triggering operation.
  • the object playlist here can be displayed in the bottom area of the business playback display interface in a floating window form, a masked form, or a translucent form, or it can be displayed on a shrinkable sub-interface whose display size can be changed through drag-and-drop operations and whose size is smaller than the service playback display interface.
  • the object playlist here may include object cover data corresponding to Z business objects respectively; and Z is a positive integer less than or equal to P.
  • the target multimedia segment data here may be the multimedia segment data corresponding to the business object corresponding to the target object cover data, and the business object corresponding to the target object cover data belongs to P business objects.
  • the triggering operations here may include contact operations such as clicks and long presses, and may also include non-contact operations such as voice and gestures, which will not be limited here.
  • FIG. 11 is a schematic diagram of a scene for displaying multimedia segment data according to an embodiment of the present application.
  • the computer device in the embodiment of the present application may be a target terminal device used by the target user.
  • the target terminal device may be any terminal device in the terminal device cluster in the embodiment corresponding to Figure 1, for example, the terminal device 100a.
  • the interface 1101J and the interface 1102J shown in Figure 11 are both service playback display interfaces at different times provided by a client with a multimedia data playback function.
  • the target terminal device used by the target user can display multimedia data in the interface 1101J.
  • the multimedia data here can be the multimedia data 20S in the embodiment corresponding to Figure 2.
  • the interface 1101J may include a control 11U, which is a playback selection control used to trigger the target video data selection function.
  • the target terminal device may display the object playlist 11B shown in FIG. 11 in response to the triggering operation.
  • the object playlist 11B here may include object cover data corresponding to the Z business objects and cover data corresponding to the multimedia data (for example, "watch the complete video").
  • the object playlist 11B may specifically include the object cover data 1 corresponding to object a (for example, "watch only the clips of object a"), the object cover data 2 corresponding to object b (for example, "watch only the clips of object b"), and the object cover data 3 corresponding to object c (for example, "watch only the clips of object c").
  • object a, object b, and object c here all belong to the P business objects obtained by the target terminal device by performing audio role recognition on the multimedia data.
  • the target user can perform a triggering operation on the target object cover data (for example, the object cover data 1 corresponding to object a) among the Z pieces of object cover data.
  • the target terminal device can play the multimedia segment data corresponding to the object a corresponding to the object cover data 1 in the interface 1102J shown in FIG. 11 .
  • the target terminal device can also highlight, in the playback progress bar of the multimedia data displayed on the interface 1102J, the playback progress corresponding to the multimedia segment data of object a, so that the target user can more quickly and accurately find the next piece of multimedia segment data corresponding to object a that they are interested in.
  • when the computer device obtains the multimedia segment data corresponding to each business object, it can also apply it in merged editing scenarios. For example, the computer device classifies the audio data in the multimedia data, distinguishes the business role corresponding to each audio line, and organizes the line voice collection (i.e., audio cluster) corresponding to each business role across the entire multimedia data, using it as production material and providing it to an intelligent video production team as candidate information for editing. For example, the computer device can perform mixed cutting on multiple pieces of multimedia segment data of the same business object from different multimedia data. For another example, the computer device can merge and edit the corresponding multimedia segment data of different business objects.
  • the multimedia data here may include first multimedia data and second multimedia data. Both the first multimedia data and the second multimedia data include objects to be edited.
  • the objects to be edited belong to the P business objects obtained through audio role recognition by the computer equipment.
  • for example, the first multimedia data here can be a war-themed TV series in which the object to be edited participates, and the second multimedia data here can be a fairy-tale-themed TV series in which the object to be edited participates.
  • the computer device can obtain the first target business role corresponding to the object to be edited based on the object role mapping table associated with the first multimedia data, and can further obtain, from the first multimedia data, the first multimedia segment data associated with the first target business role. The first multimedia segment data here is determined by the computer device based on the service playback time of the object to be edited in the first multimedia data. Similarly, the computer device can obtain the second target business role corresponding to the object to be edited based on the object role mapping table associated with the second multimedia data, and obtain, from the second multimedia data, the second multimedia segment data associated with the second target business role. The second multimedia segment data here may be determined by the computer device based on the service playback time of the object to be edited in the second multimedia data.
  • the computer device can perform merge and edit processing on the first multimedia segment data and the second multimedia segment data to obtain merged editing data corresponding to the object to be edited.
  • the merged clip data here can be used to upload to the business data platform where the client is located, so that objects accessing the client can check it on the corresponding terminal device.
  • a computer device with an audio character recognition function can associate sounds with characters by combining the picture feature information automatically recognized from video frames with the M adaptively clustered audio clusters, so that the business roles corresponding to the P audio clusters associated with the object role mapping table can be accurately identified.
  • This audio role recognition method does not require manual annotation of the business role to which each audio line belongs. Instead, it can automatically identify and write the business role and audio line information before the multimedia data is put on the shelf, so that downstream services (for example, the user-specific playback service, the merged editing service, etc.) can be quickly empowered.
  • the embodiment of the present application adopts the audio semantic feature clustering method in the audio character recognition process, which can not only reduce the manpower time consumed, but also solve the problem of similar timbre recognition errors, so as to improve the accuracy and efficiency of recognition.
  • the entire audio character recognition system is more versatile and can be applied to different scenarios of business objects in different multimedia data, thus effectively improving the applicability of recognition.
  • FIG. 12 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • the data processing device 1 may include: a picture information acquisition module 100, a clustering processing module 200, and an audio character recognition module 300.
  • the picture information acquisition module 100 is used to identify picture feature information from video frames of multimedia data.
  • the picture feature information includes M business objects to which the character pictures in the video frame belong, and M is a positive integer.
  • the clustering processing module 200 is used to locate and separate audio frames containing human voices from the original audio frames of multimedia data, obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames, and The audio semantic feature vectors corresponding to N object audio frames are clustered to obtain M audio clusters, where N is a positive integer, and one audio cluster corresponds to one business object.
  • the audio role recognition module 300 is used to identify the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data, where P is a positive integer less than or equal to M.
  • the object role mapping table includes business roles that have a mapping relationship with the list business object. There are P overlapping business objects between the list business object and M business objects.
  • FIG. 13 is another schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • the data processing device 2 may include: a picture information acquisition module 11, a clustering processing module 12, an audio role recognition module 13, a business time determination module 14, a segment data determination module 15, a multimedia data playback module 16, Object list display module 17, segment data playback module 18, first segment data acquisition module 19, second segment data acquisition module 20 and merge editing module 21.
  • the picture information acquisition module 11 is used to identify picture feature information from video frames of multimedia data, where the picture feature information includes M business objects to which character pictures in the video frames belong, and M is a positive integer.
  • the picture information acquisition module 11 includes: a video frame acquisition unit 111, a picture cutting unit 112, a picture encoding unit 113, a vector matching unit 114 and a picture information acquisition unit 115.
  • the video frame acquisition unit 111 is used to acquire video frames from multimedia data.
  • the picture cutting unit 112 is used to cut pictures containing key parts of the character in the video frame to obtain the character picture corresponding to the video frame.
  • the character pictures include X character cut pictures, where X is a positive integer greater than or equal to M.
  • the picture cutting unit 112 includes: a position determining subunit 1121 and a cutting subunit 1122.
  • the position determination subunit 1121 is used to detect and locate the key parts of the character in the video frame to determine the position information of the key parts of the character in the video frame.
  • the cutting subunit 1122 is used to cut the key parts of the character in the video frame based on the position information, obtain X character cut pictures containing the key parts of the character, and use the X character cut pictures as the character pictures corresponding to the video frame.
  • the picture encoding unit 113 is used to obtain the character cut picture T i among the X character cut pictures and perform picture encoding on the character cut picture T i to obtain the picture information vector L i corresponding to the character cut picture T i , where i is a positive integer less than or equal to X.
  • the vector matching unit 114 is used to determine the object key information vector that matches the picture information vector Li from the information vector database associated with the candidate business object, and use the candidate business object corresponding to the matched object key information vector as the role Cut the business object corresponding to picture T i .
  • the vector matching unit 114 includes: a database acquisition subunit 1141, a vector distance determination subunit 1142, and an object matching subunit 1143.
  • the database acquisition subunit 1141 is used to acquire an information vector database associated with candidate business objects, where the information vector database is used to store object key information vectors corresponding to Y candidate business objects, where Y is a positive integer greater than or equal to M.
  • the vector distance determination subunit 1142 is used to respectively determine the vector distance between the picture information vector Li and each object key information vector in the Y object key information vectors, to obtain Y vector distances.
  • the object matching subunit 1143 is used to obtain the minimum vector distance that is less than or equal to the distance threshold from Y vector distances, determine the candidate business object corresponding to the object key information vector corresponding to the minimum vector distance, and use the determined candidate business object as a role Cut the business object corresponding to picture T i .
  • the picture information acquisition unit 115 is configured to determine the picture feature information corresponding to the video frame based on the obtained business objects corresponding to the character cut pictures.
  • for the specific implementation of the video frame acquisition unit 111, the picture cutting unit 112, the picture encoding unit 113, the vector matching unit 114, and the picture information acquisition unit 115, please refer to the description of step S101 in the embodiment corresponding to Figure 3 above, which will not be described further here.
  • the clustering processing module 12 is used to locate and separate audio frames containing human voices from the original audio frames of multimedia data, obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames, and The audio semantic feature vectors corresponding to N object audio frames are clustered to obtain M audio clusters, where N is a positive integer, and one audio cluster corresponds to one business object.
  • the clustering processing module 12 includes: an object audio frame determination unit 121, a semantic feature extraction unit 122, and a clustering processing unit 123.
  • the object audio frame determining unit 121 is used to locate and separate audio frames containing human voices from original audio frames of multimedia data to obtain N object audio frames.
  • the object audio frame determination unit 121 includes: an original audio frame acquisition subunit 1211, a source separation subunit 1212, and an object audio frame determination subunit 1213.
  • the original audio frame acquisition subunit 1211 is used to acquire original audio frames from multimedia data.
  • the source separation subunit 1212 is used to perform source separation on the original audio frame to obtain an audio frame to be processed that contains human voice.
  • the source separation sub-unit 1212 includes: an amplitude spectrum generating sub-unit 12121, a type feature generating sub-unit 12122, a merging mask sub-unit 12123 and an audio frame to be processed determining sub-unit 12124.
  • the amplitude spectrum generation subunit 12121 is used to input the original audio frame to the source separation model, and generate the spectrum amplitude spectrum corresponding to the original audio frame through the source separation model.
  • the source separation model includes a first segmentation network layer and a second segmentation network layer.
  • the type feature generation subunit 12122 is used to input the spectrum amplitude spectrum into the first segmentation network layer and the second segmentation network layer respectively, generate the first type feature corresponding to the spectrum amplitude spectrum through the first segmentation network layer, and generate the first type feature corresponding to the spectrum amplitude spectrum through the second segmentation network layer. Generate second type features corresponding to the spectral amplitude spectrum.
  • the merge mask subunit 12123 is used to perform merge mask processing on the first type features and the second type features to obtain a target mask map corresponding to the first type features.
  • the audio frame determination subunit 12124 to be processed is used to generate a target type audio frame through spectrum inverse transformation based on the corresponding position of the target mask map and the spectrum amplitude spectrum, and use the target type audio frame as the source separation model to output the audio frame containing the human voice. of audio frames to be processed.
  • for the specific implementation of the amplitude spectrum generation subunit 12121, the type feature generation subunit 12122, the merging mask subunit 12123, and the to-be-processed audio frame determination subunit 12124, please refer to the description of the audio frame to be processed in the embodiment corresponding to Figure 7, which will not be described further here.
  • the object audio frame determination subunit 1213 is used to locate and cut the non-silent segments in the audio impact signal frame in the audio frame to be processed based on the audio boundary detection strategy for eliminating silent frames, to obtain N object audio frames.
  • for the specific implementation of the source separation subunit 1212 and the object audio frame determination subunit 1213, please refer to the description of the object positioning and separation processing of the original audio frame in the embodiment corresponding to Figure 3, which will not be described further here.
  • the semantic feature extraction unit 122 is used to extract semantic features from each of the N object audio frames, and obtain an audio semantic feature vector corresponding to each object audio frame.
  • the semantic feature extraction unit 122 includes: an audio frame input subunit 1221, a frequency domain feature determination subunit 1222, a time domain feature determination subunit 1223, and an audio feature vector determination subunit 1224.
  • the audio frame input subunit 1221 is used to input N object audio frames to the audio semantic feature extraction model.
  • the audio semantic feature extraction model includes frequency domain branch network layer, time domain branch network layer and convolution network layer.
  • the frequency domain feature determination subunit 1222 is used to perform feature learning on N object audio frames through the frequency domain branch network layer to obtain a learned frequency domain feature map.
  • the time domain feature determination subunit 1223 is used to perform feature learning on N object audio frames through the time domain branch network layer to obtain a learned time domain feature map.
  • the feature dimensions between the learned frequency domain feature map and the learned time domain feature map are the same.
  • the audio feature vector determination subunit 1224 is used to superimpose the learned frequency domain feature map and the learned time domain feature map to obtain superimposed features, input the superimposed features to the convolution network layer, and perform maximum average processing on the superimposed features. Output the audio semantic feature vector corresponding to each object audio frame.
  • for the specific implementation of the audio frame input subunit 1221, the frequency domain feature determination subunit 1222, the time domain feature determination subunit 1223, and the audio feature vector determination subunit 1224, please refer to the description of the semantic feature extraction of the object audio frames in the embodiment corresponding to FIG. 8, which will not be described again here.
  • the clustering processing unit 123 is used to determine M as the number of cluster centers to be clustered, and perform clustering processing on the audio semantic feature vector corresponding to each obtained object audio frame based on the number of cluster centers to obtain M audio clusters. Class cluster.
  • for the specific implementation of the object audio frame determination unit 121, the semantic feature extraction unit 122, and the clustering processing unit 123, please refer to the description of step S102 in the embodiment corresponding to Figure 3 above, which will not be described again here.
  • the audio role recognition module 13 is used to identify the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data, where P is a positive integer less than or equal to M.
  • the object role mapping table includes business roles that have a mapping relationship with the list business object. There are P overlapping business objects between the list business object and M business objects.
  • the audio character recognition module 13 includes: a first time extraction unit 131, a second time extraction unit 132, a time overlap determination unit 133, and an audio character recognition unit 134.
  • the first time extraction unit 131 is used to obtain the audio cluster C k from the M audio clusters, and to extract the one or more playback times in the multimedia data of the object audio frames corresponding to the audio semantic feature vectors included in the audio cluster C k as the first playback time of the audio cluster C k , where k is a positive integer less than or equal to M.
  • the second time extraction unit 132 is used to obtain, from the list business objects in the object role mapping table, P business objects that overlap with the M business objects, and to extract, based on the picture feature information, the one or more playback times in the multimedia data of the video frames in which each of the P business objects is located as the second playback time of each business object.
  • the time overlap determination unit 133 is used to respectively determine the time overlap between the first playback time of audio cluster C k and the second playback time corresponding to each business object, and to take the business object corresponding to the second playback time with the highest time overlap as the business object corresponding to audio cluster C k .
  • the audio role identification unit 134 is used to obtain, from the object role mapping table, the business role corresponding to the business object of audio cluster C k , and to use the obtained business role as the business role corresponding to audio cluster C k .
  • for the specific implementation of the first time extraction unit 131, the second time extraction unit 132, the time overlap determination unit 133 and the audio character recognition unit 134, please refer to the description of step S103 in the embodiment corresponding to Figure 3 above, which will not be repeated here.
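The matching performed by units 131 to 134 can be pictured with the hedged sketch below; representing playback times as sets of whole seconds and the dictionary layout are illustrative assumptions, not the data structures of the embodiments:

```python
def assign_business_roles(cluster_times, object_times, object_role_map):
    # cluster_times:   {cluster_id: set of playback seconds of its object audio frames}   (first playback time)
    # object_times:    {business_object: set of playback seconds of frames showing it}    (second playback time)
    # object_role_map: {business_object: business_role} from the object role mapping table
    result = {}
    for cid, audio_secs in cluster_times.items():
        # choose the business object whose on-screen time overlaps the cluster's speech time the most
        best_obj = max(object_times, key=lambda obj: len(audio_secs & object_times[obj]))
        result[cid] = (best_obj, object_role_map.get(best_obj))
    return result
```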
  • the service time determination module 14 is configured to determine the service playback time of each of the P business objects in the multimedia data based on the first playback time of the P audio clusters in the multimedia data and the second playback time, in the multimedia data, of the business objects corresponding to the P audio clusters.
  • the segment data determination module 15 is used to obtain multimedia segment data corresponding to P business objects from the multimedia data based on the service playback time corresponding to each business object.
  • the multimedia segment data includes audio frames associated with the corresponding business object and video frames associated with the corresponding business object.
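One way to picture how the service playback time of module 14 can be turned into the multimedia segment data of module 15 is to merge the two kinds of playback times of a business object into intervals and cut the clips along those intervals; the gap threshold below is an illustrative assumption:

```python
def service_playback_intervals(first_playback_secs, second_playback_secs, max_gap=2.0):
    # Union of the audio-based and picture-based playback times of one business object,
    # merged into (start, end) intervals whenever consecutive timestamps are close enough.
    ts = sorted(set(first_playback_secs) | set(second_playback_secs))
    if not ts:
        return []
    intervals, start, prev = [], ts[0], ts[0]
    for t in ts[1:]:
        if t - prev > max_gap:
            intervals.append((start, prev))
            start = t
        prev = t
    intervals.append((start, prev))
    return intervals
```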
  • the multimedia data playing module 16 is used to play multimedia data in the service playing display interface.
  • the service playback display interface includes a playback selection control used to trigger the object video data selection function.
  • the object list display module 17 is used to display an object playlist in response to a trigger operation on the playback selection control, where the object playlist includes object cover data corresponding to Z business objects, and Z is a positive integer less than or equal to P;
  • the segment data playback module 18 is used to play target multimedia segment data in the service playback display interface in response to a trigger operation on target object cover data among the Z pieces of object cover data, where the target multimedia segment data is the multimedia segment data corresponding to the business object associated with the target object cover data, and that business object belongs to the P business objects.
  • the multimedia data includes first multimedia data and second multimedia data; both the first multimedia data and the second multimedia data include an object to be edited, and the object to be edited belongs to the P business objects.
  • the first segment data acquisition module 19 is configured to obtain the first target business role corresponding to the object to be edited based on the object role mapping table associated with the first multimedia data, and to obtain, from the first multimedia data, the first multimedia segment data associated with the first target business role; the first multimedia segment data is determined based on the service playback time of the object to be edited in the first multimedia data.
  • the second segment data acquisition module 20 is configured to obtain the second target business role corresponding to the object to be edited based on the object role mapping table associated with the second multimedia data, and to obtain, from the second multimedia data, the second multimedia segment data associated with the second target business role; the second multimedia segment data is determined based on the service playback time of the object to be edited in the second multimedia data.
  • the merging and editing module 21 is used to perform merging and editing processing on the first multimedia segment data and the second multimedia segment data to obtain merged editing data corresponding to the object to be edited.
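A hedged sketch of how modules 19 to 21 could be realised with the moviepy 1.x package (naming this library is an assumption; the embodiments do not prescribe an editing tool): cut the first and second multimedia segment data by their service playback intervals and concatenate them into the merged editing data.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def merge_character_clips(first_path, first_interval, second_path, second_interval, out_path):
    # first_interval / second_interval: (start_sec, end_sec) service playback times of the
    # object to be edited in the first and second multimedia data
    clip1 = VideoFileClip(first_path).subclip(*first_interval)
    clip2 = VideoFileClip(second_path).subclip(*second_interval)
    concatenate_videoclips([clip1, clip2]).write_videofile(out_path)
```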
  • for the specific implementation of the picture information acquisition module 11, the clustering processing module 12, the audio role recognition module 13, the service time determination module 14, the segment data determination module 15, the multimedia data playback module 16, the object list display module 17, the segment data playback module 18, the first segment data acquisition module 19, the second segment data acquisition module 20 and the merge editing module 21, please refer to the description of steps S201 to S205 in the embodiment corresponding to Figure 10 above, which will not be repeated here; the description of the corresponding beneficial effects will likewise not be repeated.
  • the computer device 1000 may be a computer device with an audio character recognition function.
  • the computer device 1000 may include: at least one processor 1001 (for example, a CPU), at least one network interface 1004, a memory 1005, and at least one communication bus 1002.
  • the communication bus 1002 is used to realize connection communication between these components.
  • the network interface 1004 may include standard wired interfaces and wireless interfaces (such as WI-FI interfaces).
  • the memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory.
  • the memory 1005 may also be at least one storage device located remotely from the aforementioned processor 1001.
  • the memory 1005, which is a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
  • the computer device may also include the user interface 1003 shown in Figure 14.
  • the user interface 1003 may include a display screen (Display), a keyboard (Keyboard), etc.
  • the network interface 1004 is mainly used for network communication.
  • the user interface 1003 is mainly used to provide an input interface for the user.
  • the processor 1001 can be used to call the device control application program stored in the memory 1005 to implement the following:
  • identify picture feature information from the video frames of the multimedia data, where the picture feature information includes M business objects to which the character pictures in the video frames belong, and M is a positive integer;
  • locate and separate the audio frames containing human voices from the original audio frames of the multimedia data to obtain N object audio frames, extract the corresponding audio semantic feature vectors from the N object audio frames, and perform clustering processing on the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, where N is a positive integer and one audio cluster corresponds to one business object;
  • identify, based on the picture feature information, the M audio clusters and the object role mapping table associated with the multimedia data, the business role corresponding to each of the P audio clusters, where P is a positive integer less than or equal to M;
  • the object role mapping table includes business roles that have a mapping relationship with the list business objects, and there are P overlapping business objects between the list business objects and the M business objects.
  • the computer device 1000 described in the embodiments of the present application can execute the data processing method described in the embodiments corresponding to FIG. 3 and FIG. 10, and can also execute the functions of the data processing device 1 in the embodiment corresponding to FIG. 12 and of the data processing device 2 in the embodiment corresponding to FIG. 13, which will not be repeated here.
  • the description of the beneficial effects achieved by using the same method will likewise not be repeated.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • the computer program includes program instructions; when the program instructions are executed by a processor, the steps of the data processing method shown in Figure 3 and Figure 10 are implemented.
  • for details of the data processing method provided, please refer to the implementation of each step in Figure 3 and Figure 10, which will not be described again here.
  • the computer-readable storage medium may be the data processing apparatus provided in any of the foregoing embodiments or an internal storage unit of the computer device, such as a hard disk or memory of the computer device.
  • the computer-readable storage medium can also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.
  • the computer-readable storage medium may also include both an internal storage unit of the computer device and an external storage device.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by the computer device.
  • the computer-readable storage medium can also be used to temporarily store data that has been output or is to be output.
  • the present application provides a computer program product, which includes a computer program stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program so that the computer device can execute the data processing method described in the embodiment corresponding to Figure 3 or Figure 10, which will not be repeated here.
  • the description of the beneficial effects achieved by using the same method will likewise not be repeated.

Abstract

Disclosed in the embodiments of the present application are a data processing method and apparatus, and a computer device and a storage medium, which can be applied to an artificial intelligence scene. The method comprises: identifying picture feature information from a video frame of multimedia data, wherein the picture feature information comprises M service objects to which role pictures in the video frame belong; positioning and separating, from an original audio frame of the multimedia data, audio frames that include human voice, so as to obtain N object audio frames, respectively extracting corresponding audio semantic feature vectors from the N object audio frames, and performing clustering processing on the audio semantic feature vectors corresponding to the N object audio frames, so as to obtain M audio clusters; and on the basis of the picture feature information, the M audio clusters and an object role mapping table associated with the multimedia data, identifying a service role corresponding to each of P audio clusters. By means of the embodiments of the present application, the precision, efficiency and applicability of audio role identification can be improved.

Description

A data processing method, device, computer equipment and storage medium

This application claims priority to the Chinese patent application No. 202210383918.3, entitled "A data processing method, device, computer equipment and storage medium", filed on April 13, 2022.
Technical Field

This application relates to the field of computer technology, and in particular to a data processing method, device, computer equipment and storage medium.
Background

Many video content platforms offer a service that clips out, separately, all the segments of a given character in multimedia data (for example, a film or TV series) for users to watch. Character recognition is required during this editing process. Current character recognition solutions usually rely on manual work: the characters behind the dialogue lines in a film or TV series are annotated by hand, for example by manually determining how many characters appear and labeling every line of dialogue, which takes a great deal of time and effort.
Summary of the Invention

Embodiments of the present application provide a data processing method, device, computer equipment and storage medium, which can improve the accuracy, efficiency and applicability of audio character recognition.
In one aspect, embodiments of the present application provide a data processing method, including:

identifying picture feature information from video frames of multimedia data, the picture feature information including M business objects to which the character pictures in the video frames belong, M being a positive integer;

locating and separating the audio frames containing human voices from the original audio frames of the multimedia data to obtain N object audio frames, extracting corresponding audio semantic feature vectors from the N object audio frames, and clustering the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, where N is a positive integer and one audio cluster corresponds to one business object;

identifying, based on the picture feature information, the M audio clusters and an object role mapping table associated with the multimedia data, the business role corresponding to each of P audio clusters, where P is a positive integer less than or equal to M, the object role mapping table includes business roles that have a mapping relationship with list business objects, and there are P overlapping business objects between the list business objects and the M business objects.
In one aspect, embodiments of the present application provide a data processing device, including:

a picture information acquisition module, configured to identify picture feature information from video frames of multimedia data, the picture feature information including M business objects to which the character pictures in the video frames belong, M being a positive integer;

a clustering processing module, configured to locate and separate audio frames containing human voices from the original audio frames of the multimedia data to obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, where N is a positive integer and one audio cluster corresponds to one business object;

an audio role recognition module, configured to identify, based on the picture feature information, the M audio clusters and an object role mapping table associated with the multimedia data, the business role corresponding to each of P audio clusters, where P is a positive integer less than or equal to M, the object role mapping table includes business roles that have a mapping relationship with list business objects, and there are P overlapping business objects between the list business objects and the M business objects.
In one aspect, embodiments of the present application provide a computer device, including a processor and a memory. The processor is connected to the memory, and the memory is used to store a computer program; when the computer program is executed by the processor, the computer device executes the method provided by the embodiments of the present application.

In one aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program; the computer program is adapted to be loaded and executed by a processor, so that a computer device having the processor executes the method provided by the embodiments of the present application.

In one aspect, embodiments of the present application provide a computer program product, which includes a computer program stored in a computer-readable storage medium; a processor of a computer device reads the computer program from the computer-readable storage medium and executes it, so that the computer device executes the method in the embodiments of the present application.
In the embodiments of the present application, a computer device with an audio character recognition function associates voices with characters by combining picture feature information automatically recognized from video frames with M adaptively clustered audio clusters, and can therefore accurately identify the business roles corresponding to the P audio clusters associated with the object role mapping table. This way of recognizing audio roles does not require manually labeling the business role to which every line of dialogue belongs, which reduces the manpower and time consumed and also avoids misrecognition of similar timbres, improving the accuracy and efficiency of recognition. In addition, because audio semantic features are clustered during audio role recognition, the whole audio role recognition system is more general and can be applied to scenarios in which different multimedia data involve different business objects, which effectively improves the applicability of the recognition.
Brief Description of the Drawings

To explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic structural diagram of a network architecture provided by an embodiment of the present application;

Figure 2 is a schematic flow diagram of a system for audio character recognition provided by an embodiment of the present application;

Figure 3 is a schematic flowchart of a data processing method provided by an embodiment of the present application;

Figure 4 is a schematic architectural diagram of obtaining picture feature information from video frames provided by an embodiment of the present application;

Figure 5 is a model architecture diagram of a key part detection model provided by an embodiment of the present application;

Figure 6 is a schematic architectural diagram of audio semantic feature clustering provided by an embodiment of the present application;

Figure 7 is a model architecture diagram of a source separation model provided by an embodiment of the present application;

Figure 8 is a schematic diagram of the model architecture of an audio semantic feature extraction model provided by an embodiment of the present application;

Figure 9 is a schematic diagram of a scene of audio character recognition provided by an embodiment of the present application;

Figure 10 is another schematic flowchart of a data processing method provided by an embodiment of the present application;

Figure 11 is a schematic diagram of a scene of displaying multimedia segment data provided by an embodiment of the present application;

Figure 12 is a schematic structural diagram of a data processing device provided by an embodiment of the present application;

Figure 13 is another schematic structural diagram of a data processing device provided by an embodiment of the present application;

Figure 14 is a schematic diagram of a computer device provided by an embodiment of the present application.
Detailed Description of the Embodiments

The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative effort fall within the scope of protection of this application.
The embodiments of the present application provide a character recognition method based on audio semantic feature clustering, which can be applied to the field of artificial intelligence. Artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce intelligent machines that can respond in a manner similar to human intelligence, by studying the design principles and implementation methods of such machines so that they can perceive, reason and make decisions.

Artificial intelligence is a broad discipline involving both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics; AI software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving and intelligent transportation.

Computer vision (CV) studies how to make machines "see": using cameras and computers instead of human eyes to recognize and measure targets, with further graphics processing to produce images better suited for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies theories and technologies for building artificial intelligence systems that can obtain information from images or multi-dimensional data. It usually covers image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving and intelligent transportation, as well as common biometric technologies such as face recognition and fingerprint recognition.

The key technologies of speech technology include automatic speech recognition, speech synthesis and voiceprint recognition. Enabling computers to hear, see, speak and feel is the development direction of future human-computer interaction, and speech is one of the most promising modes of such interaction.

Natural language processing (NLP) is an important direction in computer science and artificial intelligence. It studies theories and methods for effective communication between humans and computers in natural language, integrating linguistics, computer science and mathematics; research in this field involves the language people use every day, so it is closely related to linguistics. NLP technologies usually include text processing, semantic understanding, machine translation, question answering and knowledge graphs.

Machine learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis and algorithmic complexity theory. It studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications cover all fields of AI. Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.
Please refer to Figure 1, which is a schematic structural diagram of a network architecture provided by an embodiment of the present application. As shown in Figure 1, the network architecture may include a server 10F and a terminal device cluster. The terminal device cluster may include one or more terminal devices, and the number of terminal devices is not limited here; it may specifically include terminal device 100a, terminal device 100b, terminal device 100c, ..., terminal device 100n. Each of these terminal devices can establish a network connection with the server 10F so as to exchange data with it through that connection. The connection method is not limited: it may be a direct or indirect connection through wired communication, a direct or indirect connection through wireless communication, or another method, which is not limited in this application.

Each terminal device in the terminal device cluster may be a smart phone, tablet computer, notebook computer, desktop computer, smart speaker, smart watch, vehicle-mounted terminal, smart TV or other intelligent terminal with an audio character recognition function. Each terminal device shown in Figure 1 may be installed with a target application (for example, a client); when the client runs on a terminal device, it can exchange data with the server 10F shown in Figure 1. The client may be a social client, a multimedia client (for example, a video client), an entertainment client (for example, a game client), an information feed client, an education client, a live-streaming client, and so on, and it may be an independent client or an embedded sub-client integrated in another client (for example, a social, education or multimedia client), which is not limited here.

As shown in Figure 1, the server 10F in the embodiments of the present application may be the server corresponding to the client. The server 10F may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services; the number of servers is not limited in the embodiments of the present application.

For ease of understanding, one terminal device may be selected from the terminal devices shown in Figure 1 as the target terminal device. For example, the terminal device 100a shown in Figure 1 may be the target terminal device, in which the target application (the client) may be integrated; the target terminal device can then exchange data with the server 10F through the business data platform corresponding to the client. The client may have a frame-sequence (for example, frame animation sequence) loading and playback function, used to play, in the service playback display interface provided by the client, multimedia data that includes video frames, audio frames and text (for example, dialogue lines). The service playback display interface refers to the interface displayed by the terminal device for playing multimedia data. The data type of the multimedia data may include film and TV drama, animation, variety show, and so on, and is not limited here.
When a computer device with an audio character recognition function (for example, the above server 10F) obtains multimedia data (for example, TV series A), it can identify picture feature information from the video frames of the multimedia data. The picture feature information may include the M business objects to which the character pictures in the video frames belong, where M is a positive integer; for example, it may indicate which actor plays the character in a character picture of TV series A that contains a character key part (for example, the character's face). At the same time, the computer device can extract a corresponding audio semantic feature vector from each of N object audio frames and cluster these vectors to obtain M audio clusters, where N is a positive integer and the N object audio frames are obtained by locating and separating the audio frames containing human voices from the original audio frames of the multimedia data. This object location and separation is performed on the original audio frames to reduce the interference caused, during subsequent clustering, by the environmental track and by silent frames in the object track (for example, the vocal track), so as to improve the accuracy of clustering and, in turn, of character voice recognition.

The computer device can then identify, based on the picture feature information, the M audio clusters and the object role mapping table associated with the multimedia data, the business role corresponding to each of P audio clusters, where P can be a positive integer less than or equal to M. The object role mapping table (for example, the cast list of TV series A) may include business roles (characters) that have a mapping relationship with the list business objects (actors), and there are P overlapping business objects between the list business objects in the object role mapping table and the M business objects recognized by the computer device. The object role mapping table may be the initial object role mapping table provided by the business editor of the acquired multimedia data (for example, the editing user of TV series A), or that initial table as updated by a target user accessing the client, which is not limited here. For example, the target user may add to the initial object role mapping table a mapping relationship between a certain business role in TV series A (for example, a waiter in a restaurant) and a certain business object (for example, actor 1), meaning that the waiter in the restaurant is played by actor 1.

It can be seen that the computer device in the embodiments of the present application can associate voices with characters by combining the picture feature information (for example, face information) automatically recognized from the video frames with the M adaptively clustered audio clusters, and can therefore accurately identify the business roles corresponding to the P audio clusters associated with the object role mapping table. This way of recognizing audio roles does not require manually labeling the business role to which every line of dialogue belongs, which reduces the manpower and time consumed and avoids misrecognition of similar timbres, improving the accuracy and efficiency of recognition. In addition, because audio semantic features are clustered during audio role recognition, the whole audio role recognition system is more general and can be applied to scenarios in which different multimedia data involve different business objects, which effectively improves the applicability of the recognition.
For ease of understanding, please further refer to Figure 2, which is a schematic flow diagram of a system for audio character recognition provided by an embodiment of the present application. As shown in Figure 2, the computer device in the embodiments of the present application may be a computer device with an audio character recognition function; it may be any terminal device in the terminal device cluster shown in Figure 1 (for example, terminal device 100a) or the server 10F shown in Figure 1, which is not limited here.

As shown in Figure 2, the audio character recognition system provided by the embodiments of the present application may include three modules: a first module 201 (for example, a key image recognition module), a second module 202 (for example, an audio semantic feature clustering module) and a third module 203 (for example, a character recognition module). The multimedia data 20S in the embodiments of the present application may be multimedia data acquired by the computer device on which audio character recognition needs to be performed; it may correspond to an episode of a TV series, to a movie, to a variety show, and so on, which are not listed one by one here. The multimedia data 20S is composed of video data including original video frames and audio data including original audio frames.

The computer device can obtain video frames from the video data including the original video frames; the video frames here may refer to the video frame sequence obtained after deleting the opening and closing credits of the original video frames. Further, the computer device can identify picture feature information from the video frames of the multimedia data 20S through the first module 201 shown in Figure 2. The first module 201 may include a key part detection model 210w and a picture encoding model 220w: the key part detection model 210w can be used to detect character pictures in the video frames, a character picture being a picture that contains a character key part (for example, the character's face); the picture encoding model 220w can be used to encode each character cut picture in the character pictures to obtain the picture vector information corresponding to that character cut picture. The computer device can also obtain the information vector database 200K shown in Figure 2, for example from its internal memory or from an external source. The information vector database 200K may be an information index library dedicated to key image recognition that the computer device has established in advance, using the same key image recognition method, on a large amount of material data (for example, multimedia data of film and TV drama or variety show types). The information vector database 200K can store the object key information vectors corresponding to Y candidate business objects, where these vectors may also be determined through the picture encoding model 220w and Y is a positive integer greater than or equal to M. In addition, the information vector database 200K may also include the object information of each candidate business object, for example the object attribute type of the candidate business object (such as singer-dancer, modern idol drama, ancient palace drama, fantasy drama, war drama, etc.). Based on the information vector database 200K and the picture information vectors output by the picture encoding model 220w, the computer device can obtain the picture feature information shown in Figure 2.
At the same time, the computer device can obtain, through the second module 202 shown in Figure 2, the audio clustering result associated with the N object audio frames in the multimedia data 20S, where the N object audio frames are obtained by performing object location and separation on the original audio frames of the multimedia data and N is a positive integer. As shown in Figure 2, the second module 202 may include a source separation model 230w and an audio semantic feature extraction model 240w. The source separation model 230w can be used to perform source separation on the original audio frames to obtain an object sound segment (or object track, for example a vocal segment or vocal track) and an environmental sound segment (or environmental track, for example a background sound segment or background track). The audio semantic feature extraction model 240w can be used, once the N object audio frames in the object sound segment are obtained, to perform frame-level semantic feature extraction on each object audio frame and obtain the audio semantic feature vector corresponding to each object audio frame. Further, the computer device can cluster the N audio semantic feature vectors to obtain M audio clusters, which are taken as the audio clustering result of the second module 202; one audio cluster can correspond to one business object.
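The separation into an object (vocal) track and an environmental track is performed by the trained source separation model 230w; purely to illustrate the subsequent locating step, the sketch below keeps only those frames of the separated vocal track whose RMS energy exceeds a threshold, so that silent frames do not disturb the clustering (the frame length and threshold are assumptions):

```python
import numpy as np

def locate_object_audio_frames(vocal_track, sr=16000, frame_len=0.5, energy_thresh=0.01):
    # vocal_track: mono waveform of the separated object (vocal) track
    # returns (start_sec, end_sec) spans of the object audio frames that actually contain voice
    hop = int(sr * frame_len)
    spans = []
    for idx in range(0, len(vocal_track), hop):
        frame = vocal_track[idx:idx + hop]
        if len(frame) and np.sqrt(np.mean(frame ** 2)) > energy_thresh:  # RMS energy gate
            t0 = idx / sr
            spans.append((t0, t0 + len(frame) / sr))
    return spans
```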
Further, the computer device can identify, based on the picture feature information, the M audio clusters and the object role mapping table 200B associated with the multimedia data 20S shown in Figure 2, the business role corresponding to each of P audio clusters, where P is a positive integer less than or equal to M. The object role mapping table 200B may include business roles that have a mapping relationship with the list business objects, and there are P overlapping business objects between the list business objects and the M business objects. The computer device can perform audio role recognition on the outputs of the first two modules through the third module 203. For example, based on the picture feature information output by the first module 201, the audio clustering result output by the second module 202 and the object role mapping table 200B, the computer device determines the playback times in the multimedia data 20S of the video frames in which the P overlapping business objects appear (the second playback time) and the playback times in the multimedia data 20S of the object audio frames corresponding to each audio cluster (the first playback time). By comparing these two playback times, the computer device can determine the audio cluster corresponding to each of the P business objects, and thereby the business role corresponding to each of the P audio clusters.

It can be seen that the computer device in the embodiments of the present application can associate audio with business roles in the third module 203 by combining the picture feature information (for example, face information) output by the first module 201 with the audio clustering result output by the second module 202, and can therefore accurately identify the business roles corresponding to the P audio clusters associated with the object role mapping table 200B. This way of recognizing audio roles improves not only the accuracy and efficiency of recognition but also its applicability.

For the specific implementation in which the computer device with the audio character recognition function identifies, by combining the picture feature information (for example, face information) automatically recognized from the video frames of the multimedia data with the M adaptively clustered audio clusters, the business roles corresponding to the P audio clusters associated with the object role mapping table, reference may be made to the embodiments corresponding to Figures 3 to 11 below.
Further, please refer to Figure 3, which is a schematic flowchart of a data processing method provided by an embodiment of the present application. As shown in Figure 3, the method may be executed by a computer device with an audio character recognition function; the computer device may be a terminal device (for example, any terminal device in the terminal device cluster shown in Figure 1, such as terminal device 100a) or a server (for example, the server 10F shown in Figure 1), which is not limited here. For ease of understanding, the embodiment of the present application is described by taking the method being executed by a server with an audio character recognition function as an example. The method may include at least the following steps S101 to S103:

Step S101: identify picture feature information from the video frames of the multimedia data.
The picture feature information may include the M business objects to which the character pictures in the video frames belong, where M is a positive integer. Specifically, the computer device can obtain video frames from the multimedia data and perform picture cutting on the character key parts in the video frames (that is, cut out the pictures containing character key parts from the video frames) to obtain the character pictures corresponding to the video frames. The character pictures may include X character cut pictures, where X is a positive integer greater than or equal to M. Further, the computer device can take the character cut picture Ti among the X character cut pictures and encode it to obtain the picture information vector Li corresponding to the character cut picture Ti, where i is a positive integer less than or equal to X. The computer device can then determine, from the information vector database associated with the candidate business objects, the object key information vector that matches the picture information vector Li, and take the candidate business object corresponding to the matched object key information vector as the business object corresponding to the character cut picture Ti. Further, the computer device can determine the picture feature information corresponding to the video frames based on the business objects corresponding to the X character cut pictures.

The picture recognition system used by the computer device to detect and recognize the character key parts in the video frames may be composed of a detection sub-module and a recognition sub-module, or may be an integrated detection-and-recognition network for character key parts, which is not limited here.
For example, when determining the character pictures corresponding to a video frame, the computer device can detect and locate the character key parts in the video frame to determine the position information of the character key parts in the video frame. Based on the position information, the computer device can cut the character key parts out of the video frame to obtain X character cut pictures containing character key parts, which are taken as the character pictures corresponding to the video frame. The computer device can then take the character cut picture Ti among the X character cut pictures and encode it to obtain the picture information vector Li corresponding to the character cut picture Ti, where i is a positive integer less than or equal to X. The computer device can then obtain, from its internal memory or from an external source, the information vector database associated with the candidate business objects in order to look up the candidate business object that matches the picture information vector Li. The information vector database can store the object key information vectors corresponding to Y candidate business objects, where Y is a positive integer greater than or equal to M.

When the information vector database is obtained, the computer device can look up the candidate business object that matches the picture information vector Li directly in the database. The computer device can determine the vector distance between the picture information vector Li and each of the Y object key information vectors, obtaining Y vector distances; it can then take, among the Y vector distances, the smallest vector distance that is less than or equal to a distance threshold, determine the candidate business object corresponding to the object key information vector with that smallest distance, and take that candidate business object as the business object corresponding to the character cut picture Ti. The distance threshold is a value set in advance by the computer device to ensure that the found candidate business object genuinely matches the character cut picture; it can be adjusted dynamically according to the actual situation and is not limited here.
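A minimal sketch of this lookup, assuming Euclidean distance over the stored object key information vectors; the threshold value and the flat array layout are illustrative:

```python
import numpy as np

def match_business_object(picture_vec, key_vectors, candidate_objects, dist_threshold=0.8):
    # key_vectors: (Y, d) object key information vectors from the information vector database
    # candidate_objects: the Y candidate business objects, in the same order
    dists = np.linalg.norm(key_vectors - picture_vec, axis=1)   # the Y vector distances
    k = int(dists.argmin())
    # accept the closest candidate only if it is within the distance threshold
    return candidate_objects[k] if dists[k] <= dist_threshold else None
```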
To improve matching efficiency, the computer device can obtain the object role mapping table associated with the multimedia data and use the object role mapping table together with the information vector database to look up the candidate business object that matches the picture information vector Li. For ease of understanding, please further refer to Table 1, which is an object role mapping table associated with multimedia data provided by an embodiment of the present application. As shown in Table 1:
Table 1

  Business role    Business object with a mapping relationship
  Role 1           Object a
  Role 2           Object a
  Role 3           Object b
  Role 4           Object c
  Role 5           Object d
其中,为便于理解,表1所示的对象角色映射表中的业务角色可以包括H个,H为大于或者等于M的正整数。这里以5个业务角色为例,具体可以包括角色1、角色2、角色3、角色4以及角色5。其中,该角色1与角色2均可以与同一业务对象(例如,对象a)具有映射关系。即该角色1与角色2均由对象a所饰演。角色3与对象b具有映射关系,角色4与对象c具有映射关系,角色5与对象d具有映射关系。For ease of understanding, the business roles in the object role mapping table shown in Table 1 may include H, where H is a positive integer greater than or equal to M. Here we take five business roles as an example, which may include role 1, role 2, role 3, role 4 and role 5. Wherein, both role 1 and role 2 may have a mapping relationship with the same business object (for example, object a). That is, both role 1 and role 2 are played by object a. Role 3 has a mapping relationship with object b, role 4 has a mapping relationship with object c, and role 5 has a mapping relationship with object d.
According to Table 1, the computer device may select from the information vector database only the object key information vectors corresponding to the listed business objects in the object role mapping table, for example the object key information vectors of object a, object b and object c. The computer device may then determine the vector distance between the picture information vector Li and each of these three selected object key information vectors, take the minimum vector distance that is less than or equal to the distance threshold, determine the candidate business object corresponding to the object key information vector with that minimum distance, and use the determined candidate business object as the business object corresponding to the character cut picture Ti. In this way, the computer device does not need to compute the vector distance to every object key information vector in the information vector database when matching candidate business objects; by selecting through the object role mapping table it greatly reduces the matching time and thus improves the efficiency of finding matching candidate business objects in the information vector database.
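The mapping-table-restricted lookup described above can be pictured with a minimal Python sketch. It assumes the object key information vectors are kept as a dictionary of NumPy arrays and uses Euclidean distance as the vector distance; the function name and the threshold value are illustrative only and not prescribed by the embodiment.

```python
import numpy as np

def match_business_object(picture_vec, vector_db, candidate_ids=None, dist_threshold=1.0):
    """Return the candidate business object whose key information vector is closest
    to picture_vec, or None if the smallest distance exceeds the distance threshold.

    vector_db: dict mapping business-object id -> 1-D np.ndarray (object key information vector)
    candidate_ids: optional ids taken from the object role mapping table; when given,
                   only these entries of the database are compared.
    """
    ids = list(candidate_ids) if candidate_ids is not None else list(vector_db.keys())
    best_id, best_dist = None, float("inf")
    for obj_id in ids:
        dist = np.linalg.norm(vector_db[obj_id] - picture_vec)  # Euclidean vector distance
        if dist < best_dist:
            best_id, best_dist = obj_id, dist
    return best_id if best_dist <= dist_threshold else None

# Example: restrict the search to the objects listed in Table 1 (a, b, c)
# obj = match_business_object(L_i, vector_db, candidate_ids=["a", "b", "c"], dist_threshold=0.8)
```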
For ease of understanding, refer to Figure 4, which is a schematic architecture diagram for obtaining picture feature information from video frames provided by an embodiment of the present application. As shown in Figure 4, this architecture may correspond to the first module 201 in the embodiment corresponding to Figure 2 above. The video frame 4V shown in Figure 4 may be a video frame of the multimedia data (for example, the multimedia data 20S in the embodiment corresponding to Figure 2). The key part detection model 410w shown in Figure 4 may be used to detect key parts in the video frame 4V and may be the key part detection model 210w in the embodiment corresponding to Figure 2. The picture encoding model 420w shown in Figure 4 may be used to encode the character cut picture 400S and may correspond to the picture encoding model in the embodiment corresponding to Figure 2. The information vector database 400K shown in Figure 4 may be the information vector database 200K in the embodiment corresponding to Figure 2.
As shown in Figure 4, when performing image recognition on the video frame 4V, the computer device in this embodiment of the present application may input the video frame 4V into the key part detection model 410w shown in Figure 4 and, through the key part detection model 410w, detect and locate the key character parts in the video frame 4V (for example, the character's facial features), so as to determine the position information of the key character parts in the video frame 4V (for example, the facial feature positions marked in the region 40Q shown in Figure 4). Further, based on the position information marked in the region 40Q, the computer device may cut the key character parts out of the video frame 4V to obtain a character cut picture containing the key character parts as shown in Figure 4 (for example, the character cut picture 400T shown in Figure 4).
The key part detection model 410w shown in Figure 4 may be a network structure used to detect and locate key character parts (for example, a character's face), such as a face detection model based on Multi-task Cascaded Convolutional Networks (MTCNN). For ease of understanding, refer to Figure 5, which is a model architecture diagram of a key part detection model provided by an embodiment of the present application. As shown in Figure 5, the key part detection model in this embodiment may be the key part detection model 410w in the embodiment corresponding to Figure 4. The key part detection model may be used to detect key parts in the video frame 5V shown in Figure 5, where the video frame 5V may be the video frame 4V in the embodiment corresponding to Figure 4 above.
As shown in Figure 5, the key part detection model may include three network layers: a screening network layer 5W1 (for example, a Proposal Network, P-Net for short), a refinement network layer 5W2 (for example, a Refinement Network, R-Net for short) and an output network layer 5W3 (for example, an Output Network, O-Net for short).
When the computer device in this embodiment obtains the video frame 5V, it may resize the picture of the video frame 5V to obtain an image pyramid corresponding to the video frame 5V. For example, the computer device may obtain a resizing coefficient (for example, 0.7) from its internal memory or from an external source and resize the video frame 5V repeatedly based on the resizing coefficient until the picture size of the resized video frame 5V matches the picture size threshold associated with the screening network layer 5W1 (for example, 12*12*3). The computer device may then form the image pyramid corresponding to the video frame 5V from the versions of the video frame 5V with different picture sizes obtained by the repeated resizing. The resizing coefficient may be set dynamically by the computer device according to how the key character parts are distributed across the video frame: if the resizing coefficient is set too large, the time needed to detect and locate the key character parts is prolonged; if it is set too small, key character parts that occupy a small area of the video frame (for example, small and medium-sized faces) may be missed. For this reason, the resizing coefficient in this embodiment of the present application may be set between 0.7 and 0.8. The image pyramid may include the original picture (for example, the video frame 5V shown in Figure 5), a first resized picture (the picture obtained by resizing the video frame 5V), a second resized picture (the picture obtained by resizing the first resized picture), ..., and an Nth resized picture (the picture obtained by resizing the (N-1)th resized picture), where the picture size of the Nth resized picture may be the picture size threshold associated with the screening network layer 5W1 (for example, 12*12).
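The construction of the image pyramid can be sketched as follows, assuming OpenCV is used for resizing; the scale factor of 0.7 and the minimum size of 12 follow the example values given above and are not prescribed by the embodiment.

```python
import cv2

def build_image_pyramid(frame, scale=0.7, min_size=12):
    """Repeatedly shrink the frame by `scale` until the shorter side would fall
    below `min_size` (the input size associated with P-Net), keeping every level."""
    pyramid = [frame]
    h, w = frame.shape[:2]
    while min(int(h * scale), int(w * scale)) >= min_size:
        h, w = int(h * scale), int(w * scale)
        pyramid.append(cv2.resize(pyramid[-1], (w, h)))  # cv2.resize takes (width, height)
    return pyramid
```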
Further, the computer device may input the image pyramid corresponding to the video frame 5V into the screening network layer 5W1 shown in Figure 5, thereby obtaining a large number of candidates. In this embodiment of the present application, a picture obtained by cutting the video frame 5V according to the bounding box position information produced by the screening network layer 5W1 is called a first cut picture. The computer device may feed the pictures of the image pyramid into the screening network layer 5W1 to obtain output features of shape (m, n, 16), where m and n characterize the length and width of the picture and 16 is the channel dimension. According to the classification scores produced by the screening network layer 5W1, the computer device may filter out a large proportion of the candidates, obtaining one or more first candidates. The computer device then calibrates the bounding boxes (bbox for short) according to the four predicted offsets, obtaining the position information of the calibrated bounding boxes (for example, the coordinates of the upper-left and lower-right corners). The computer device may then screen the first candidates again according to the intersection over union (IoU), that is, filter out a large proportion of the first candidates by non-maximum suppression (the NMS algorithm) to obtain second candidates. In other words, the computer device may sort the classification scores (for example, in descending order) to obtain a tensor of shape (num_left, 4), namely the absolute upper-left and lower-right coordinates of num_left bounding boxes. In each round, the computer device computes the IoU between the bounding box with the highest remaining score and the remaining boxes, filters out the boxes whose IoU exceeds an IoU threshold (for example, 0.6, a value set in advance by the computer device), and moves the highest-scoring box into the final result; this embodiment of the present application refers to this as the filtering operation. By repeating the filtering operation, the computer device filters out many heavily overlapping bounding boxes and finally obtains (num_left_after_nms, 16) candidates. The video frame 5V is then cut according to the position information of these candidate bounding boxes, yielding pictures of size 24*24 that are input to the refinement network layer 5W2 shown in Figure 5 (that is, the first cut pictures). Here, a first cut picture may be a square cut from the video frame 5V whose side length equals the longer side of the bounding box, which effectively ensures that no deformation is introduced during resizing and that more details of the key character parts are retained.
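The filtering operation (non-maximum suppression with an IoU threshold) can be illustrated with a minimal NumPy sketch; boxes are assumed to be given as [x1, y1, x2, y2] corner coordinates with one score per box, and the 0.6 threshold follows the example above.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes; all boxes are [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.6):
    """Keep the highest-scoring boxes, discarding boxes that overlap a kept box
    by more than iou_threshold (the filtering operation described above)."""
    order = np.argsort(scores)[::-1]             # sort scores in descending order
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        if order.size == 1:
            break
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]  # drop heavily overlapping boxes
    return keep
```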
The computer device may then perform refinement on the first cut pictures through the refinement network layer 5W2 to obtain the second cut pictures shown in Figure 5. The refinement network layer 5W2 may produce 2 outputs corresponding to the binary one-hot classification, 4 outputs corresponding to the coordinate offsets of the bounding box, and 10 outputs corresponding to the landmarks. Based on the binary classification scores, the refinement network layer 5W2 may filter out most of the candidates that do not contain key character parts (for example, the character's face). After the bounding boxes are adjusted according to the offsets, the filtering operation described for the screening network layer 5W1 is repeated to obtain (num_left_after_Rnet, 16) candidates. The video frame 5V is then cut according to the position information of these adjusted bounding boxes, yielding pictures of size 48*48 that are input to the output network layer 5W3 shown in Figure 5 (that is, the second cut pictures). The specific way the computer device obtains the second cut pictures may follow the way it obtains the first cut pictures, so as to avoid deformation and retain more detail.
Further, through the output network layer 5W3, the computer device may accurately output the position information of the key character parts in the video frame 5V, including the coordinate information of the bounding boxes and the coordinate information of the landmarks. In the output network layer 5W3, after classification screening and NMS screening of the adjusted bounding boxes, the computer device outputs not only the bounding box coordinates but also the landmark coordinates, thereby obtaining the position information of the key character parts in the video frame 5V, so that the key character parts can subsequently be cut out of the video frame 5V to obtain a picture containing the key character parts (for example, the character cut picture 400T shown in Figure 4).
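As one possible realization of this P-Net/R-Net/O-Net cascade, an off-the-shelf MTCNN implementation could be used; the sketch below relies on the third-party facenet-pytorch package and a hypothetical frame path, and the square-crop step mirrors the max-side-length cropping described above rather than the exact procedure of the embodiment.

```python
from PIL import Image
from facenet_pytorch import MTCNN  # off-the-shelf P-Net/R-Net/O-Net cascade

detector = MTCNN(keep_all=True)           # keep every detected face in the frame
frame = Image.open("video_frame.jpg")     # hypothetical path to one decoded video frame

# boxes: (num_faces, 4) corner coordinates; points: (num_faces, 5, 2) landmark coordinates
boxes, probs, points = detector.detect(frame, landmarks=True)

crops = []
if boxes is not None:
    for (x1, y1, x2, y2) in boxes:
        side = max(x2 - x1, y2 - y1)                 # square crop with the longer box side,
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2        # so resizing does not deform the face
        crops.append(frame.crop((cx - side / 2, cy - side / 2,
                                 cx + side / 2, cy + side / 2)))
```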
Further, the computer device may input the character cut picture 400T into the picture encoding model 420w shown in Figure 4 and encode the character cut picture 400T through the picture encoding model 420w to obtain the picture information vector corresponding to the character cut picture 400T. The picture encoding model 420w in this embodiment of the present application is a model based on a residual network (ResNet); this family of networks is widely used in fields such as object classification and as part of the classic backbone networks for computer vision tasks, with typical examples including ResNet50 and ResNet101. For example, the picture encoding model 420w in this embodiment may be a ResNet50 network model. As shown in Figure 4, the ResNet50 network model may include five stages: a first stage (for example, Stage 0), a second stage (for example, Stage 1), a third stage (for example, Stage 2), a fourth stage (for example, Stage 3) and a fifth stage (for example, Stage 4). Stage 0 has a relatively simple structure and can be regarded as preprocessing of the character cut picture 400T; the remaining four stages are all composed of bottleneck layers (Bottleneck) and have similar structures. Stage 1 may contain 3 bottlenecks, Stage 2 may contain 4 bottlenecks, Stage 3 may contain 6 bottlenecks, and Stage 4 may contain 3 bottlenecks. The computer device inputs the character cut picture 400T into the picture encoding model 420w and, through the five stages of the picture encoding model 420w, converts the character cut picture 400T into a 2048-dimensional picture information vector, which can be used to characterize the semantic feature information of the key character parts (for example, the face).
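A minimal sketch of such a picture encoding step follows, assuming a torchvision ResNet-50 backbone with its classification head removed so that each character cut picture yields a 2048-dimensional vector; this is one plausible instantiation under those assumptions, not necessarily the encoder used in the embodiment.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# ResNet-50 backbone; dropping the final fully connected layer leaves the
# 2048-dimensional pooled feature as the picture information vector.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
encoder = nn.Sequential(*list(backbone.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def encode_cut_picture(path):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        vec = encoder(img)                 # shape (1, 2048, 1, 1)
    return vec.flatten()                   # 2048-dimensional picture information vector
```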
Further, the computer device may obtain the information vector database 400K associated with the candidate business objects shown in Figure 4. The information vector database 400K may be used to store the object key information vectors corresponding to Y candidate business objects, where Y is a positive integer greater than or equal to M. Each object key information vector in the information vector database 400K may have been extracted by the computer device using the same encoding process as that applied to the character cut picture 400T, and one object key information vector may be used to characterize the key part identifier corresponding to one candidate business object (for example, a face ID). The computer device may then determine the vector distance between the picture information vector corresponding to the character cut picture 400T and each of the Y object key information vectors, obtaining Y vector distances. Further, in order to ensure that the computer device can accurately match the corresponding candidate business object in the information vector database 400K, the computer device may set a distance threshold in advance. If the minimum vector distance determined by the computer device is greater than the distance threshold, it can be considered that the computer device has not matched, in the information vector database 400K, the object key information vector corresponding to the character cut picture 400T, that is, it has not matched the business object corresponding to the character cut picture 400T. If the minimum vector distance determined by the computer device is less than or equal to the distance threshold, it can be considered that the computer device can match, in the information vector database 400K, the object key information vector corresponding to the character cut picture 400T, that is, it can successfully match the business object corresponding to the character cut picture 400T.
Therefore, when the computer device obtains, from the Y vector distances, the minimum vector distance that is less than or equal to the distance threshold, it may determine the candidate business object corresponding to the object key information vector with that minimum distance and use the determined candidate business object as the business object corresponding to the character cut picture 400T. When performing image recognition on each video frame of the multimedia data, the computer device may follow the key part recognition of the video frame 5V shown in Figure 5 to obtain X character cut pictures containing key character parts, which will not be repeated here. If a video frame contains key parts of several different characters, the computer device may cut a corresponding number of key character parts out of that video frame. Further, following the object matching performed on the character cut picture 400T in the embodiment corresponding to Figure 4, the computer device may perform object matching on each of the X character cut pictures and then determine the picture feature information corresponding to the video frames of the multimedia data based on the business objects respectively corresponding to the obtained character cut pictures.
Step S102: locate and separate the audio frames containing human voices from the original audio frames of the multimedia data to obtain N object audio frames, extract the corresponding audio semantic feature vector from each of the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters.
The N object audio frames are obtained after the computer device performs object locating and separation processing on the original audio frames of the multimedia data, where N is a positive integer, and one audio cluster may correspond to one business object. Specifically, the computer device may obtain the original audio frames from the multimedia data and perform object locating and separation processing on them to obtain the N object audio frames. Further, the computer device may perform semantic feature extraction on each of the N object audio frames to obtain the audio semantic feature vector corresponding to each object audio frame. The computer device may then set M as the number of cluster centers for clustering and, based on this number of cluster centers, cluster the audio semantic feature vectors corresponding to the object audio frames to obtain M audio clusters. The audio semantic features can be understood as characteristics of the speaker's voiceprint.
During clustering, the embodiments of the present application innovatively use the number M of business objects indicated by the picture feature information to choose the number of cluster centers. Using the picture feature information as prior knowledge in this way lets the system know how many business objects appear in the multimedia data, which gives the audio clustering a prior setting for the number of cluster centers and allows that number to be set automatically, thereby speeding up the convergence of the whole system, improving the overall recognition performance and saving computing resources.
For ease of understanding, refer to Figure 6, which is a schematic architecture diagram of audio semantic feature clustering provided by an embodiment of the present application. As shown in Figure 6, this architecture may correspond to the second module 202 in the embodiment corresponding to Figure 2 above. The original audio frames shown in Figure 6 may be the original audio frames of the multimedia data (for example, the multimedia data 20S in the embodiment corresponding to Figure 2). The source separation model 630w shown in Figure 6 may be used to perform source separation on the original audio frames and may be the source separation model 230w in the embodiment corresponding to Figure 2. The audio semantic feature extraction model 640w shown in Figure 6 may be used to perform semantic feature extraction on each object audio frame and may be the audio semantic feature extraction model 240w in the embodiment corresponding to Figure 2.
As shown in Figure 6, the architecture in this embodiment of the present application may include three nodes: an audio segment cutting node, an audio semantic feature extraction node and a clustering node. At the audio segment cutting node, the computer device may obtain the original audio frames from the multimedia data and perform source separation on them, thereby obtaining the to-be-processed audio frames that contain human voices and are directed at the business objects. Further, based on an audio boundary detection strategy for removing silent frames, the computer device may locate and cut the non-silent segments among the audio impulse signal frames of the to-be-processed audio frames, thereby obtaining the N object audio frames. Source separation here refers to separating, by signal processing or other algorithms, a mixed audio signal in which multiple audio signals are blended, extracting the specified type of audio signal sequence from the mixture and finally generating a separate audio file; for example, extracting the to-be-processed audio frames directed at the business objects (that is, the object sound segments) from the original audio frames.
After the computer device shown in Figure 6 inputs the original audio frames into the source separation model 630w, it may perform source separation on the original audio frames through the source separation model 630w to obtain the object sound segments (or object track) and the ambient sound segments (or ambient track) shown in Figure 6. Since the object sound segments may contain a large number of silent segments, which would interfere with the audio clustering results of the subsequent clustering and waste computing resources, the computer device may treat the object sound segments as the to-be-processed audio frames directed at the business objects and obtain an audio boundary detection strategy. For example, the audio boundary detection strategy here may be a VAD (Voice Activity Detection) algorithm, which is widely used in speech coding, noise reduction and ASR scenarios; what is meant here is speech/non-speech (non-speech/silence) detection, and a VAD system usually includes two parts: feature extraction and a speech/non-speech decision. Further, based on the audio boundary detection strategy, the computer device may locate and cut the audio impulse signal frames in the to-be-processed audio frames, that is, precisely locate the non-silent segments, thereby obtaining the N object audio frames shown in Figure 6, where N is a positive integer.
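A minimal energy-based sketch of this locate-and-cut step is given below; production VAD algorithms (for example WebRTC VAD) use more elaborate features, so the frame length and energy ratio here are illustrative assumptions only.

```python
import numpy as np

def cut_non_silent_segments(samples, sr, frame_ms=30, energy_ratio=0.1):
    """Split a mono float waveform (the separated object track) into non-silent segments.

    A frame is kept when its RMS energy exceeds energy_ratio * overall RMS;
    consecutive kept frames are merged into one object audio segment.
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    rms = np.array([
        np.sqrt(np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2))
        for i in range(n_frames)
    ])
    voiced = rms > energy_ratio * np.sqrt(np.mean(samples ** 2))

    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                                   # segment begins
        elif not v and start is not None:
            segments.append(samples[start * frame_len:i * frame_len])
            start = None                                # segment ends at a silent frame
    if start is not None:
        segments.append(samples[start * frame_len:])    # trailing voiced segment
    return segments
```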
At the audio semantic feature extraction node, the computer device may input the N object audio frames into the audio semantic feature extraction model 640w shown in Figure 6. For example, the audio semantic feature extraction model 640w may be an audio neural network pretrained on a large audio dataset (for example, a PANNs network), which is typically used for audio pattern recognition or frame-level audio embedding and serves as the front-end encoding network of many models. Further, through the audio semantic feature extraction model 640w, the computer device may perform semantic feature extraction on each of the N object audio frames to obtain the audio semantic feature vector corresponding to each object audio frame; as shown in Figure 6, these may include audio semantic feature vector 1, audio semantic feature vector 2, ..., and audio semantic feature vector N.
Further, as shown in Figure 6, at the clustering node the computer device may use the number M of business objects to which the character pictures in the video frames belong, as indicated by the picture feature information, as prior information, that is, set M as the number of cluster centers. Based on this number of cluster centers, the computer device may then cluster the audio semantic feature vectors corresponding to the object audio frames to obtain M audio clusters. The clustering strategy used in this embodiment of the present application may be the k-means clustering algorithm, an iteratively solved cluster analysis algorithm. For example, the computer device may first divide the N audio semantic feature vectors into M initial clusters and randomly select M audio semantic feature vectors as the initial cluster centers of the M initial clusters. Then, for every audio semantic feature vector in the set other than the M vectors selected as cluster centers (that is, every vector to be assigned), the computer device may determine the vector distance between that vector and the cluster center of each initial cluster and assign the vector to the initial cluster with the smallest vector distance, after which it updates the cluster center of the cluster that received the vector. Proceeding in this way, the computer device can determine the M audio clusters shown in Figure 6, which may include audio cluster C1, audio cluster C2, ..., and audio cluster CM.
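A minimal sketch of this clustering step using scikit-learn's KMeans follows, where the number of cluster centers is the M supplied by the picture feature information; the shape of the feature matrix follows the 2048-dimensional vectors described above, and the parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_audio_vectors(audio_vectors, M):
    """Cluster the audio semantic feature vectors into M audio clusters.

    audio_vectors: (N, 2048) array, one audio semantic feature vector per object audio frame
    M: number of business objects indicated by the picture feature information (the prior)
    """
    kmeans = KMeans(n_clusters=M, n_init=10, random_state=0).fit(audio_vectors)
    # clusters[k] holds the indices of the object audio frames assigned to cluster C_(k+1)
    clusters = [np.where(kmeans.labels_ == k)[0] for k in range(M)]
    return clusters
```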
The embodiments of the present application classify the N audio semantic feature vectors by audio semantic feature clustering rather than by training a voiceprint classifier with a neural network, thereby removing the dependence on actors' voiceprint IDs and avoiding privacy violations. At the same time, the embodiments of the present application work directly with the object audio frames of the multimedia data and extract the audio semantic feature vector corresponding to each object audio frame, which deeply decouples the method from the personal voiceprint ID of a business object and instead correlates it with the voiceprint information of the character itself, so that business characters dubbed by professional voice actors can also be recognized. In other words, even when a business character is not voiced by the business object themselves, the embodiments of the present application can still accurately identify the character information of the spoken lines, thereby improving the accuracy of audio character recognition. In addition, performing audio character recognition by clustering the N audio semantic feature vectors makes the whole system portable: the audio character recognition system becomes more general and can be applied to scenarios in which different multimedia data involve different business objects, thereby effectively improving the applicability of the recognition.
For ease of understanding, refer to Figure 7, which is a model architecture diagram of a source separation model provided by an embodiment of the present application. As shown in Figure 7, the source separation model in this embodiment may be the source separation model 630w in the embodiment corresponding to Figure 6. The source separation model may include a segmentation network layer 7W1 (that is, a first segmentation network layer, for example VACAL-Unet) and a segmentation network layer 7W2 (that is, a second segmentation network layer, for example BGM-Unet).
Unet is one of the algorithms that use a fully convolutional network for semantic segmentation; it uses a symmetric U-shaped structure containing a contracting path and an expanding path. A typical Unet has a U-shaped symmetric structure that may contain 4 convolutional layers and 4 corresponding upsampling layers. When implementing it, one can either build the network from scratch, initialize the weights and train the model, or reuse the convolutional layer structure of an existing network together with its trained weight file and add the subsequent upsampling layers before training. Because trained weight files can be reused during deep learning model training, Unet training is greatly accelerated. Another characteristic is that the feature map obtained by each convolutional layer of the Unet is connected to the corresponding upsampling layer, so that the feature map of every layer is effectively used in subsequent computation; these skip connections help alleviate the vanishing gradient problem and improve the efficiency of model training. Compared with some other network structures (for example, the fully convolutional network FCN), Unet thus avoids performing supervision and loss computation directly on high-level feature maps alone and instead combines the features of low-level feature maps, so that the resulting feature maps contain both first-level features (high-level features) and many second-level features (low-level features), achieving feature fusion across different levels and improving the accuracy of the model's results.
When the computer device inputs the original audio frames into the source separation model, it may generate the spectral magnitude spectrum corresponding to the original audio frames through the source separation model shown in Figure 7. For example, the computer device may perform a spectral transform on the audio track of the original audio frames to obtain the track spectrum corresponding to the original audio frames, and then generate the spectral magnitude spectrum corresponding to the original audio frames by discarding the phase of the track spectrum. Further, the computer device may input the spectral magnitude spectrum into the segmentation network layer 7W1 and the segmentation network layer 7W2 respectively, so as to generate the first-type features corresponding to the spectral magnitude spectrum (for example, object track features) through the segmentation network layer 7W1 and the second-type features corresponding to the spectral magnitude spectrum (for example, ambient track features) through the segmentation network layer 7W2.
Further, the computer device may merge and mask the first-type features and the second-type features to obtain the target mask map corresponding to the first-type features (that is, the first mask map). Based on the target mask map and the spectral magnitude spectrum, the computer device may then generate the target-type audio frames (that is, the audio frames of the object sound segment) and use them as the to-be-processed audio frames containing human voices that the source separation model outputs for the business objects. For example, after generating the first-type features and second-type features shown in Figure 7, the computer device may splice the first-type features and the second-type features to obtain spliced features, and then perform two kinds of mask computation on the spliced features to obtain the first mask map corresponding to the first-type features and the second mask map corresponding to the second-type features; the mask computation may, for example, compare the feature value at each position with the merged value after splicing. Further, the computer device may perform a position-wise computation (for example, multiplication) between the first mask map and the spectral magnitude spectrum corresponding to the original audio frames and then apply an inverse spectral transform to generate the first-type audio frames (that is, the audio frames of the object sound segment). At the same time, the computer device may perform the corresponding position-wise computation between the second mask map and the spectral magnitude spectrum of the original audio frames and apply the inverse spectral transform to generate the second-type audio frames (that is, the audio frames of the ambient sound segment). Since the masking and magnitude-spectrum computation above yields the magnitude spectra corresponding to the first-type and second-type features, the inverse spectral transform yields the one-dimensional sample points of the first-type and second-type features, that is, the audio signals.
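A minimal sketch of the mask-and-reconstruct step is given below, assuming librosa for the spectral transform and stubbing the two segmentation network layers as callables; note that the mixture phase is reused for the inverse transform, which is an assumption of this sketch rather than something the embodiment specifies.

```python
import numpy as np
import librosa

def separate(samples, vocal_net, bgm_net, n_fft=2048, hop=512):
    """Split a mixture waveform into an object (vocal) track and an ambient track.

    vocal_net / bgm_net: callables mapping a magnitude spectrogram to a
    same-shaped feature map (stand-ins for the two U-Net branches).
    """
    spec = librosa.stft(samples, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(spec), np.angle(spec)      # drop the phase, keep the magnitude spectrum

    f_vocal = vocal_net(mag)                       # first-type features (object track)
    f_bgm = bgm_net(mag)                           # second-type features (ambient track)

    # Soft masks: each feature map is compared against the merged value per bin.
    total = f_vocal + f_bgm + 1e-8
    mask_vocal, mask_bgm = f_vocal / total, f_bgm / total

    # Apply the masks to the mixture magnitude, reuse the mixture phase, invert the STFT.
    vocal = librosa.istft(mask_vocal * mag * np.exp(1j * phase), hop_length=hop)
    ambient = librosa.istft(mask_bgm * mag * np.exp(1j * phase), hop_length=hop)
    return vocal, ambient
```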
It can thus be seen that, through the source separation model shown in Figure 7, the computer device can separate the ambient sound (for example, BGM) from the original audio frames of the multimedia data, eliminating the influence of the ambient sound on the subsequent clustering and thereby improving the clustering accuracy.
For ease of understanding, refer to Figure 8, which is a schematic model architecture diagram of an audio semantic feature extraction model provided by an embodiment of the present application. As shown in Figure 8, the audio semantic feature extraction model in this embodiment may be the audio semantic feature extraction model 640w in the embodiment corresponding to Figure 6. For example, the audio semantic feature extraction model shown in Figure 8 may be a Wavegram_Logmel128_Cnn14 model, whose most distinctive characteristic is that it takes the raw audio sample sequence as input, that is, the input of the whole network is the N object audio frames of the audio signal, so that basic audio features do not have to be extracted in advance. Since extracting basic audio features is quite time-consuming and using them as input would occupy a particularly large amount of hardware resources, processing the N object audio frames of the input audio signal with this audio semantic feature extraction model saves computing resources and improves computational efficiency.
As shown in Figure 8, the audio semantic feature extraction model may include a time-domain branch network layer, a frequency-domain branch network layer and a convolution network layer.
The computer device may input the N object audio frames into the audio semantic feature extraction model shown in Figure 8 and perform feature learning on the N object audio frames through the time-domain branch network layer to obtain a learned time-domain feature map (time-domain learned features). As shown in Figure 8, the time-domain branch network layer may include a convolutional layer 801w (for example, a one-dimensional convolutional layer with kernel size 1 and stride 5), a convolutional layer 802w (for example, a one-dimensional convolutional layer composed of basic blocks), a max-pooling layer 803w (for example, a max-pooling layer with stride 4), a convolutional layer 804w (for example, a one-dimensional convolutional layer composed of basic blocks), a max-pooling layer 805w (for example, a max-pooling layer with stride 4), a convolutional layer 806w (for example, a one-dimensional convolutional layer composed of basic blocks), a max-pooling layer 807w (for example, a max-pooling layer with stride 4) and a reshaping layer 808w. Through this stack of one-dimensional convolutional layers, the computer device can learn the time-domain characteristics of the audio signal directly from the time-domain signal, in particular information such as audio loudness and sample amplitude. After the stack of one-dimensional convolutional layers, a two-dimensional map (wavegram) representing the learned time-domain feature map is obtained, so that the output of the time-domain branch can be combined with that of the frequency-domain branch.
At the same time, the computer device may also perform feature learning on the N object audio frames through the frequency-domain branch network layer to obtain a learned frequency-domain feature map (frequency-domain learned features) whose feature dimensions are the same as those of the learned time-domain feature map. As shown in Figure 8, the frequency-domain branch network layer may include a convolutional layer 809w (for example, a two-dimensional convolutional layer composed of basic blocks). The computer device may input the N object audio frames into the frequency-domain branch network layer to generate the frequency-domain spectrum corresponding to the N object audio frames (for example, using the mel frequency scale to generate a log-mel spectrum). Further, the computer device inputs the frequency-domain spectrum into the convolutional layer 809w shown in Figure 8, so as to obtain, through the multiple two-dimensional convolutional layers in the convolutional layer 809w, a learned frequency-domain feature map with the same feature dimensions as the learned time-domain feature map.
Further, the computer device may superimpose (for example, concatenate) the learned frequency-domain feature map and the learned time-domain feature map to obtain superimposed features, input the superimposed features into the convolution network layer, perform max and average processing on them, and output the audio semantic feature vector corresponding to each object audio frame. As shown in Figure 8, the convolution network layer may include a convolutional layer 810w (for example, a two-dimensional convolutional layer) and an activation layer 811w. The computer device may concatenate the feature map representing the learned frequency-domain feature map with the feature map representing the learned time-domain feature map to jointly form a set of two-dimensional feature maps identifying the superimposed features. Further, the computer device may input the two-dimensional feature maps representing the superimposed features into the convolutional layer 810w shown in Figure 8, then apply two-dimensional pooling to the features output by the convolutional layer 810w to perform max processing and average processing respectively, so as to extract the maximum representation and the average representation of the current features. The computer device may determine the max-processed features as first sub-features and the average-processed features as second sub-features, merge the first sub-features and the second sub-features, and input the merged features into the activation layer 811w shown in Figure 8, finally generating an audio semantic feature vector set with 2048 dimensions, which may include the audio semantic feature vector corresponding to each of the N object audio frames.
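A minimal PyTorch sketch of the final max/average pooling and merge is given below, assuming the output of the convolutional layer 810w is a (batch, channels, time, frequency) tensor; the exact pooling order used by the embodiment may differ, so this only illustrates the idea of combining a maximum representation and an average representation.

```python
import torch

def pool_embedding(features):
    """Combine the maximum and average representations of the convolutional features.

    features: tensor of shape (batch, channels, time, freq) produced by the
    convolution network layer; returns one embedding of shape (batch, channels).
    """
    x = torch.mean(features, dim=3)         # collapse the frequency axis
    x_max = torch.max(x, dim=2).values      # maximum representation over time (first sub-feature)
    x_avg = torch.mean(x, dim=2)            # average representation over time (second sub-feature)
    return x_max + x_avg                    # merged embedding fed to the activation layer
```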
It can thus be seen that, through the audio semantic feature extraction model shown in Figure 8, the computer device can quickly perform audio semantic feature extraction on each of the N object audio frames, obtaining the audio semantic feature vector corresponding to each object audio frame more quickly and accurately.
Step S103: identify the business role corresponding to each of P audio clusters based on the picture feature information, the M audio clusters and the object role mapping table associated with the multimedia data.
Here P may be a positive integer less than or equal to M. The object role mapping table (for example, the one shown in Table 1 above) may include business roles that have mapping relationships with listed business objects, and there are P business objects that the listed business objects have in common with the M business objects. Specifically, the computer device may obtain an audio cluster Ck from the M audio clusters and extract the first playback time of the audio cluster Ck in the multimedia data, where k is a positive integer less than or equal to M; the first playback time of the audio cluster Ck in the multimedia data is the one or more playback times, in the multimedia data, of the object audio frames corresponding to the audio semantic feature vectors included in the audio cluster Ck. Further, the computer device may obtain, from the listed business objects in the object role mapping table associated with the multimedia data, the P business objects that overlap with the M business objects and then, based on the picture feature information, extract the second playback time of each of the P business objects in the multimedia data; the second playback time of a business object is the one or more playback times, in the multimedia data, of the video frames in which that business object appears. The computer device may then determine the time overlap degree between the first playback time of the audio cluster Ck and each second playback time, and take the business object corresponding to the second playback time with the highest time overlap degree as the business object corresponding to the audio cluster Ck. Further, the computer device may obtain, from the object role mapping table, the business role corresponding to the business object that corresponds to the audio cluster Ck and use the obtained business role as the business role corresponding to the audio cluster Ck.
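A minimal sketch of this time-overlap matching follows, with playback times represented as lists of (start, end) intervals in seconds; it computes the raw overlapping duration rather than a normalized overlap degree, and all names and example values are illustrative only.

```python
def overlap_seconds(intervals_a, intervals_b):
    """Total overlap (in seconds) between two lists of (start, end) play-time intervals."""
    total = 0.0
    for a_start, a_end in intervals_a:
        for b_start, b_end in intervals_b:
            total += max(0.0, min(a_end, b_end) - max(a_start, b_start))
    return total

def assign_cluster_to_object(cluster_times, object_times):
    """Pick the business object whose on-screen time overlaps most with the cluster's audio time.

    cluster_times: (start, end) intervals of the object audio frames in cluster C_k
    object_times:  dict mapping business-object id -> list of (start, end) intervals
    """
    return max(object_times, key=lambda obj: overlap_seconds(cluster_times, object_times[obj]))

# Illustrative values loosely following the Figure 9 scenario (times in seconds):
# c1_times = [(30, 610), (2108, 2452)]                                  # segments 1 and 3
# obj_times = {"a": [(0, 600), (1845, 2280)], "b": [(605, 1713)], "c": [(4830, 5330)]}
# assign_cluster_to_object(c1_times, obj_times)                         # -> "a"
```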
Starting from the audio perspective, the embodiments of the present application identify the characters in the multimedia data and assign each spoken audio line to a character, which makes it possible to supply accurate line-to-character information even for shots and scenes of other characters in which no key character part information is available, thereby improving the precision of character recognition.
For ease of understanding, refer to Figure 9, which is a schematic diagram of an audio character recognition scenario provided by an embodiment of the present application. As shown in Figure 9, after the computer device executes step S101, the picture feature information recognized by the first module 201 indicates that the number M of business objects to which the character pictures in the video frames of the multimedia data belong is 3, namely object a, object b and object c. After the computer device executes step S102, the audio processing result clustered by the second module 202 yields 3 audio clusters, namely the audio cluster C1, audio cluster C2 and audio cluster C3 shown in Figure 9.
The N object audio frames in this embodiment of the present application may include sound segment 1, sound segment 2, sound segment 3, sound segment 4, sound segment 5 and sound segment 6 shown in Figure 9, arranged in playback order. The object audio frames corresponding to the audio cluster C1 may include those in sound segment 1 and sound segment 3; the object audio frames corresponding to the audio cluster C2 may include those in sound segment 2, sound segment 4 and sound segment 6; and the object audio frames corresponding to the audio cluster C3 may include those in sound segment 5.
The computer device may obtain, from the listed business objects in the object role mapping table shown in Table 1 above, the business objects that overlap with the M business objects obtained by the first module. For example, the listed business objects in Table 1 include the four business objects object a, object b, object c and object d, while the M business objects obtained by the computer device in this embodiment include the three business objects object a, object b and object c; therefore, the computer device can determine from Table 1 that the number of overlapping business objects is 3, namely object a, object b and object c. The computer device may then, based on the picture feature information, extract the playback time of each of these three overlapping business objects in the multimedia data (that is, their second playback times).
例如,对象a在多媒体数据中的第二播放时间为播放时间T1(例如,00:00-10:00)以及播放时间T3(例如,30:45-38:00);对象b在多媒体数据中的第二播放时间为播放时间T2(例如,10:05-28:33),播放时间T4(例如,40:05-55:39)以及播放时间T6(例如,100:03-113:57);对象c在多媒体数据中的第二播放时间为播放时间T5(例如,80:30-88:50)。For example, the second playback time of object a in the multimedia data is playback time T 1 (for example, 00:00-10:00) and playback time T 3 (for example, 30:45-38:00); object b is in the multimedia data The second playback time in the data is playback time T 2 (for example, 10:05-28:33), playback time T 4 (for example, 40:05-55:39), and playback time T 6 (for example, 100:03 -113:57); the second playback time of object c in the multimedia data is the playback time T 5 (for example, 80:30-88:50).
计算机设备可以从这3个音频聚类簇中获取音频聚类簇C1,进而可以提取音频聚类簇C1在多媒体数据中的播放时间(即音频聚类簇C1的第一播放时间)。其中,该音频聚类簇C1在多媒体数据中的第一播放时间可以包括音段1对应的播放时间t1(例如,00:30-10:10)和音段3对应的播放时间t3(例如,35:08-40:52)。此时,该计算机设备可以分别确定音频聚类簇C1与每个业务对象对应的第二播放时间之间的时间重叠度。例如,音频聚类簇C1的第一播放时间与对象a的第二播放时间之间的时间重叠度为98%,与对象b的第二播放时间之间的时间重叠度为5%,与对象c的第二播放时间之间的时间重叠度为1%。然后,该计算机设备可以从这3个时间重叠度中确定具有最高时间重叠度的第二播放时间,即对象a的第二播放时间,进一步地,该计算机设备可以将对象a作为音频聚类簇C1对应的业务对象,且从上述表1中获取与对象a具有映射关系的业务角色(即角色1与角色2)作为该音频聚类簇C1对应的业务角色。这意味着该计算机设备可以识别出音频聚类簇C1中的每句音频台词均是由角色1或角色2所说出的。The computer device can obtain the audio cluster C 1 from these three audio clusters, and then can extract the playback time of the audio cluster C 1 in the multimedia data (ie, the first playback time of the audio cluster C 1 ). . The first playback time of the audio cluster C 1 in the multimedia data may include the playback time t 1 corresponding to the sound segment 1 (for example, 00:30-10:10) and the playback time t 3 (for example, 00:30-10:10) corresponding to the sound segment 3 ( For example, 35:08-40:52). At this time, the computer device can respectively determine the time overlap between the audio cluster C 1 and the second playback time corresponding to each business object. For example, the time overlap between the first playback time of audio cluster C 1 and the second playback time of object a is 98%, the time overlap with the second playback time of object b is 5%, and the time overlap with the second playback time of object b is 5%. The temporal overlap between the second playback times of object c is 1%. Then, the computer device can determine the second playback time with the highest time overlap degree from the three time overlap degrees, that is, the second playback time of object a. Further, the computer device can use object a as an audio clustering cluster. The business object corresponding to C 1 , and the business roles (ie, role 1 and role 2) that have a mapping relationship with object a are obtained from the above Table 1 as the business role corresponding to the audio cluster C 1 . This means that the computer device can identify that each audio line in audio cluster C 1 is spoken by either character 1 or character 2.
以此类推,该计算机设备可以参见音频聚类簇C1对应的业务角色的音频角色识别方式,确定音频聚类簇C2对应的业务角色可以为与对象b具有映射关系的角色3,音频聚类簇C3对应的业务角色可以为与对象c具有映射关系的角色4。By analogy, the computer device can refer to the audio role identification method of the business role corresponding to the audio cluster C 1 and determine that the business role corresponding to the audio cluster C 2 can be the role 3 that has a mapping relationship with the object b. The audio cluster The business role corresponding to cluster C 3 may be role 4 that has a mapping relationship with object c.
在本申请实施例中,具有音频角色识别功能的计算机设备可以通过结合从视频帧中自动识别出的图片特征信息以及自适应聚类的M个音频聚类簇,将声音与角色关联识别,从而可以准确识别出与对象角色映射表相关联的P个音频聚类簇分别对应的业务角色,这种音频角色识别方式无需人工标注每一句音频台词所归属的业务角色,不仅可以减少消耗的人力时间,还能够解决相似音色识别错误的情况,以至于提高了识别的精确度以及效率。此外,本申请实施例在音频角色识别过程中可以采用音频语义特征聚类的方法,使得整个音频角色识别系统更具通用性,可适用不同多媒体数据中业务对象不同的场景,从而有效提高了识别的适用性。 In the embodiment of the present application, a computer device with an audio character recognition function can associate sounds with characters by combining picture feature information automatically recognized from video frames and M audio clusters of adaptive clustering, thereby The business roles corresponding to the P audio clusters associated with the object role mapping table can be accurately identified. This audio role identification method does not require manual annotation of the business role to which each audio line belongs, and can not only reduce the consumption of manpower and time , can also solve the problem of similar timbre recognition errors, so as to improve the accuracy and efficiency of recognition. In addition, the embodiments of the present application can adopt the audio semantic feature clustering method in the audio character recognition process, making the entire audio character recognition system more versatile and applicable to different scenarios of business objects in different multimedia data, thereby effectively improving the recognition applicability.
Further, refer to Figure 10, which is another schematic flowchart of a data processing method provided by an embodiment of this application. The method may be executed by a terminal device with an audio character recognition function (for example, any terminal device in the terminal device cluster shown in Figure 1 above, such as terminal device 100a), by a server with an audio character recognition function (for example, server 10F shown in Figure 1 above), or interactively by a target terminal device with a multimedia data playback function and a server with an audio character recognition function, which is not limited here. The method may include at least the following steps S201-S205:
Step S201: Identify picture feature information from the video frames of the multimedia data.
Step S202: Locate and separate the audio frames containing human voice from the original audio frames of the multimedia data to obtain N object audio frames, extract a corresponding audio semantic feature vector from each of the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters.
Step S203: Identify the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data.
For the specific implementation of steps S201-S203, refer to the description of steps S101-S103 in the embodiment corresponding to Figure 3 above, which will not be repeated here.
Step S204: Determine the service playback time of each of the P business objects in the multimedia data based on the first playback times of the P audio clusters in the multimedia data (specifically, of the object audio frames corresponding to the P audio clusters) and the second playback times, in the multimedia data, of the business objects corresponding to the P audio clusters (specifically, of the video frames in which those business objects appear).
Specifically, the computer device may obtain a target audio cluster from the P audio clusters and then determine the first playback time of the target audio cluster in the multimedia data and the second playback time, in the multimedia data, of the business object corresponding to the target audio cluster. Further, the computer device may determine the time intersection or the time union of the first playback time and the second playback time of the target audio cluster and take the determined time intersection or time union as the service playback time, in the multimedia data, of the business object corresponding to the target audio cluster, until the service playback time of each of the P business objects in the multimedia data is obtained.
The embodiments of this application use an audio semantic feature clustering method for audio character recognition, which compensates for the case where some video frames contain no facial or object information for a character yet audio is present, so that the character cannot be recognized from the picture. The business role corresponding to the current audio cluster can be derived automatically from the semantic features of the object audio frames, filling the gap left by image-based character recognition and ensuring the completeness of the character time-positioning information across the entire multimedia data.
As shown in Figure 9, the first playback time of audio cluster C_1 in the multimedia data may include playback time t_1 corresponding to segment 1 (for example, 00:30-10:10) and playback time t_3 corresponding to segment 3 (for example, 35:08-40:52), and the second playback time in the multimedia data of the business object corresponding to audio cluster C_1 (for example, object a) is playback time T_1 (for example, 00:00-10:00) and playback time T_3 (for example, 30:45-38:00). If the computer device determines the service playback time by time intersection, the service playback time of object a determined by the computer device may be 00:30-10:00 and 35:08-38:00. If the computer device determines the service playback time by time union, the service playback time of object a may be 00:00-10:10 and 30:45-40:52.
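The following sketch illustrates the time-intersection and time-union options using the Figure 9 values above. Converting the times to seconds and the helper functions are assumptions made for illustration only; the embodiment specifies merely that either set operation may be used.

```python
# A sketch of deriving the business playback time from the first and second
# playback times by interval intersection or union.
from typing import List, Tuple

Interval = Tuple[float, float]


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    out = []
    for s1, e1 in a:
        for s2, e2 in b:
            s, e = max(s1, s2), min(e1, e2)
            if s < e:
                out.append((s, e))
    return sorted(out)


def union(a: List[Interval], b: List[Interval]) -> List[Interval]:
    merged: List[Interval] = []
    for s, e in sorted(a + b):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


# Figure 9 example, with times converted to seconds:
t_cluster = [(30, 610), (2108, 2452)]    # t_1 = 00:30-10:10, t_3 = 35:08-40:52
t_object_a = [(0, 600), (1845, 2280)]    # T_1 = 00:00-10:00, T_3 = 30:45-38:00
print(intersect(t_cluster, t_object_a))  # [(30, 600), (2108, 2280)] -> 00:30-10:00, 35:08-38:00
print(union(t_cluster, t_object_a))      # [(0, 610), (1845, 2452)]  -> 00:00-10:10, 30:45-40:52
```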
Step S205: Obtain the multimedia segment data corresponding to each of the P business objects from the multimedia data based on the service playback time of each of the P business objects.
The multimedia segment data here may include the audio frames associated with the corresponding business object and the video frames associated with the corresponding business object.
As shown in Figure 9, after obtaining the service playback times of object a, object b, and object c, the computer device can obtain the multimedia segment data corresponding to each of these three business objects. For example, the computer device can obtain from the multimedia data the multimedia segment data matching the service playback time of object a (i.e., including the video frames associated with object a and the audio frames associated with object a) as the multimedia segment data corresponding to object a (for example, multimedia segment data 1). Similarly, the computer device can obtain the multimedia segment data matching the service playback time of object b (i.e., including the video frames and audio frames associated with object b) as the multimedia segment data corresponding to object b (for example, multimedia segment data 2), and obtain the multimedia segment data matching the service playback time of object c (i.e., including the video frames and audio frames associated with object c) as the multimedia segment data corresponding to object c (for example, multimedia segment data 3).
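One possible way to realize the segment extraction described above is sketched below; the use of ffmpeg, the command-line flags, and the file naming are illustrative assumptions, not part of the embodiment.

```python
# A sketch: cut per-object multimedia segment data (audio + video) out of the
# source media once each business object's service playback time is known.
import subprocess
from typing import List, Tuple

Interval = Tuple[float, float]


def extract_segments(source: str, obj_id: str, play_time: List[Interval]) -> List[str]:
    """Cut one clip per interval of the object's service playback time."""
    clips = []
    for idx, (start, end) in enumerate(play_time):
        out = f"{obj_id}_segment_{idx}.mp4"
        subprocess.run(
            ["ffmpeg", "-y", "-i", source, "-ss", str(start), "-to", str(end),
             "-c", "copy", out],
            check=True,
        )
        clips.append(out)
    return clips


# e.g. the clip set for object a (hypothetical file name):
# extract_segments("episode.mp4", "object_a", [(30, 600), (2108, 2280)])
```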
The fully automatic audio character recognition scheme based on audio semantic feature clustering provided by the embodiments of this application can automatically combine picture feature information (for example, character facial information) to identify the business roles in multimedia data, which saves a large amount of manual labeling cost and time and accelerates the realization of video applications. When the computer device has obtained the multimedia segment data corresponding to each business object, it can apply the data to the "只看TA" (watch only this character) user-specific service in multimedia playback scenarios, selecting shots for a particular business object (or business role) in the multimedia data. When the target user triggers this service, multimedia segment data not selected by the user is skipped automatically, so the computer device can locate the multimedia segment data of the user's preferred business object more precisely.
The computer device can play the multimedia data in a service playback display interface. The service playback display interface may include a playback selection control for triggering an object video data selection function. Further, when the target user performs a trigger operation on the playback selection control, the computer device may display an object playlist in response to the trigger operation. For example, the object playlist here may be presented in the bottom area of the service playback display interface as a floating window, an overlay layer, or in semi-transparent form, or it may be displayed in a collapsible interface whose display size can be changed by a drag operation and whose size is smaller than the service playback display interface. The object playlist may include object cover data corresponding to each of Z business objects, where Z is a positive integer less than or equal to P.
When the target user performs a trigger operation on target object cover data among the Z pieces of object cover data, the computer device may play the target multimedia segment data in the service playback interface in response to the trigger operation. The target multimedia segment data here may be the multimedia segment data corresponding to the business object to which the target object cover data corresponds, and that business object belongs to the P business objects. The trigger operation here may include contact operations such as taps and long presses, or contactless operations such as voice and gestures, which is not limited here.
For ease of understanding, refer further to Figure 11, which is a schematic diagram of a scenario for displaying multimedia segment data provided by an embodiment of this application. As shown in Figure 11, the computer device in this embodiment may be the target terminal device used by the target user. The target terminal device may be any terminal device in the terminal device cluster in the embodiment corresponding to Figure 1 above, for example, terminal device 100a. The interface 1101J and the interface 1102J shown in Figure 11 are both service playback display interfaces, at different times, provided by a client with a multimedia data playback function.
The target terminal device used by the target user can display multimedia data in interface 1101J; the multimedia data here may be the multimedia data 20S in the embodiment corresponding to Figure 2 above. Interface 1101J may include control 11U, which is the playback selection control for triggering the object video data selection function.
When the target user performs a trigger operation (for example, a tap) on control 11U, the target terminal device may display the object playlist 11B shown in Figure 11 in response to the trigger operation. The object playlist 11B here may include the object cover data corresponding to each of the Z business objects and the cover data corresponding to the complete multimedia data (for example, "watch the complete video"). Taking three business objects as an example, the object playlist 11B may specifically include object cover data 1 corresponding to object a (for example, "watch only the clips of object a"), object cover data 2 corresponding to object b (for example, "watch only the clips of object b"), and object cover data 3 corresponding to object c (for example, "watch only the clips of object c"). Object a, object b, and object c here all belong to the P business objects obtained by the target terminal device after performing audio character recognition on the multimedia data.
At this point, the target user can perform a trigger operation on target object cover data among the Z pieces of object cover data (for example, object cover data 1 corresponding to object a). In response to the trigger operation, the target terminal device can play, in interface 1102J shown in Figure 11, the multimedia segment data corresponding to object a, to which object cover data 1 corresponds. As shown in Figure 11, the target terminal device can also highlight, in the playback progress bar of the multimedia data displayed in interface 1102J, the playback progress corresponding to the multimedia segment data of object a, so that the target user can find the next segment of multimedia segment data for the object a they are interested in more quickly and accurately.
It should be noted that the interfaces and controls shown in Figure 11 are merely reference presentations. In actual business scenarios, developers can design them according to product requirements, and the embodiments of this application do not limit the specific forms of the interfaces and controls involved.
Further, when the computer device has obtained the multimedia segment data corresponding to each business object, it can also apply the data to merge-editing scenarios. For example, by classifying the audio data in the multimedia data, the computer device distinguishes the business role corresponding to each audio line and compiles the collection of line recordings corresponding to each business role across the entire multimedia data (i.e., the audio cluster) as production material, providing it to an intelligent video production team as candidate material for editing. For instance, the computer device can mix and cut multiple pieces of multimedia segment data of the same business object taken from different multimedia data, or it can merge and edit the multimedia segment data corresponding to different business objects.
The multimedia data here may include first multimedia data and second multimedia data, both of which include the object to be edited. The object to be edited belongs to the P business objects obtained by the computer device through audio character recognition. For example, the first multimedia data may be a war-themed television series in which the object to be edited appears, and the second multimedia data may be a xianxia (fantasy) television series in which the object to be edited appears.
The computer device can obtain, based on the object role mapping table associated with the first multimedia data, the first target business role corresponding to the object to be edited, and then obtain, from the first multimedia data, the first multimedia segment data associated with the first target business role. The first multimedia segment data here is determined based on the service playback time of the object to be edited in the first multimedia data. Similarly, the computer device can obtain, based on the object role mapping table associated with the second multimedia data, the second target business role corresponding to the object to be edited and obtain, from the second multimedia data, the second multimedia segment data associated with the second target business role, where the second multimedia segment data is determined based on the service playback time of the object to be edited in the second multimedia data. At this point, the computer device can merge and edit the first multimedia segment data and the second multimedia segment data to obtain the merged clip data corresponding to the object to be edited. The merged clip data here can be uploaded to the business data platform where the client resides, so that objects accessing the client can view it on their corresponding terminal devices.
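The merge-editing flow described above could, for example, be realized as sketched below; the use of ffmpeg's concat demuxer, the re-encoding choice, and the helper names are assumptions for illustration only.

```python
# A sketch: concatenate the clip lists produced for the same object to be edited
# from the first and second multimedia data into one piece of merged clip data.
import subprocess
from typing import List


def merge_clips(clips_from_first: List[str], clips_from_second: List[str],
                output: str = "merged_clip.mp4") -> str:
    playlist = "concat_list.txt"
    with open(playlist, "w") as f:
        for clip in clips_from_first + clips_from_second:
            f.write(f"file '{clip}'\n")
    # Re-encode so clips from different sources (resolution, codec) can be joined.
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", playlist, output],
        check=True,
    )
    return output
```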
In this embodiment of the application, a computer device with an audio character recognition function can associate voices with characters by combining the picture feature information automatically recognized from the video frames with the M adaptively clustered audio clusters, so that the business roles corresponding to the P audio clusters associated with the object role mapping table can be identified accurately. This audio character recognition approach does not require manually labeling the business role to which each audio line belongs; instead, the business roles and audio line information can be recognized and written automatically before the multimedia data goes online, quickly empowering downstream services (for example, user-specific playback services and merge-editing services). By using audio semantic feature clustering during audio character recognition, this embodiment not only reduces the labor and time consumed but also resolves misrecognition of similar timbres, improving recognition accuracy and efficiency, and at the same time makes the whole audio character recognition system more general and applicable to scenarios in which different multimedia data contain different business objects, effectively improving the applicability of the recognition.
Further, refer to Figure 12, which is a schematic structural diagram of a data processing apparatus provided by an embodiment of this application. As shown in Figure 12, the data processing apparatus 1 may include a picture information acquisition module 100, a clustering processing module 200, and an audio character recognition module 300.
The picture information acquisition module 100 is configured to identify picture feature information from the video frames of the multimedia data. The picture feature information includes the M business objects to which the character pictures in the video frames belong, where M is a positive integer.
The clustering processing module 200 is configured to locate and separate the audio frames containing human voice from the original audio frames of the multimedia data to obtain N object audio frames, extract a corresponding audio semantic feature vector from each of the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, where N is a positive integer and one audio cluster corresponds to one business object.
The audio character recognition module 300 is configured to identify the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data, where P is a positive integer less than or equal to M, the object role mapping table includes the business roles that have a mapping relationship with the listed business objects, and there are P business objects that overlap between the listed business objects and the M business objects.
For the specific implementation of the picture information acquisition module 100, the clustering processing module 200, and the audio character recognition module 300, refer to the description of steps S101-S103 in the embodiment corresponding to Figure 3 above, which will not be repeated here. Likewise, the description of the beneficial effects obtained by using the same method will not be repeated.
Further, refer to Figure 13, which is another schematic structural diagram of a data processing apparatus provided by an embodiment of this application. As shown in Figure 13, the data processing apparatus 2 may include a picture information acquisition module 11, a clustering processing module 12, an audio character recognition module 13, a service time determination module 14, a segment data determination module 15, a multimedia data playback module 16, an object list display module 17, a segment data playback module 18, a first segment data acquisition module 19, a second segment data acquisition module 20, and a merge editing module 21.
The picture information acquisition module 11 is configured to identify picture feature information from the video frames of the multimedia data, where the picture feature information includes the M business objects to which the character pictures in the video frames belong, and M is a positive integer.
The picture information acquisition module 11 includes a video frame acquisition unit 111, a picture cutting unit 112, a picture encoding unit 113, a vector matching unit 114, and a picture information acquisition unit 115.
The video frame acquisition unit 111 is configured to acquire video frames from the multimedia data.
The picture cutting unit 112 is configured to cut the pictures containing key parts of a character in a video frame to obtain the character pictures corresponding to the video frame. The character pictures include X character cut pictures, where X is a positive integer greater than or equal to M.
The picture cutting unit 112 includes a position determination subunit 1121 and a cutting subunit 1122.
The position determination subunit 1121 is configured to detect and locate the key parts of a character in a video frame to determine the position information of the key parts of the character in the video frame.
The cutting subunit 1122 is configured to cut the key parts of the character out of the video frame based on the position information to obtain X character cut pictures containing the key parts of the character, and to use the X character cut pictures as the character pictures corresponding to the video frame.
For the specific implementation of the position determination subunit 1121 and the cutting subunit 1122, refer to the description of the character cut pictures in the embodiment corresponding to Figure 5 above, which will not be repeated here.
The picture encoding unit 113 is configured to obtain a character cut picture T_i among the X character cut pictures and encode the character cut picture T_i to obtain the picture information vector L_i corresponding to the character cut picture T_i, where i is a positive integer less than or equal to X.
The vector matching unit 114 is configured to determine, from an information vector database associated with candidate business objects, the object key information vector that matches the picture information vector L_i, and to use the candidate business object corresponding to the matched object key information vector as the business object corresponding to the character cut picture T_i.
The vector matching unit 114 includes a database acquisition subunit 1141, a vector distance determination subunit 1142, and an object matching subunit 1143.
The database acquisition subunit 1141 is configured to acquire the information vector database associated with the candidate business objects, where the information vector database is used to store the object key information vectors corresponding to each of Y candidate business objects, and Y is a positive integer greater than or equal to M.
The vector distance determination subunit 1142 is configured to determine the vector distance between the picture information vector L_i and each of the Y object key information vectors, obtaining Y vector distances.
The object matching subunit 1143 is configured to obtain, from the Y vector distances, the minimum vector distance that is less than or equal to a distance threshold, determine the candidate business object corresponding to the object key information vector associated with the minimum vector distance, and use the determined candidate business object as the business object corresponding to the character cut picture T_i.
For the specific implementation of the database acquisition subunit 1141, the vector distance determination subunit 1142, and the object matching subunit 1143, refer to the description of object matching of character cut pictures in the embodiment corresponding to Figure 4 above, which will not be repeated here.
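A minimal sketch of the vector matching performed by subunits 1141-1143 might look as follows; the Euclidean distance and the example threshold value are assumptions, since the embodiment only requires some vector distance and a distance threshold.

```python
# A sketch: match the picture information vector L_i of a character cut picture
# against the Y object key information vectors in the information vector database.
import numpy as np


def match_business_object(picture_vec: np.ndarray,
                          key_vectors: dict,          # candidate business object -> key info vector
                          distance_threshold: float = 0.8):
    """Return the candidate business object whose key information vector has the
    smallest vector distance not exceeding the threshold, or None if none qualifies."""
    distances = {
        obj: float(np.linalg.norm(picture_vec - vec))
        for obj, vec in key_vectors.items()
    }
    best_obj = min(distances, key=distances.get)      # minimum vector distance
    return best_obj if distances[best_obj] <= distance_threshold else None
```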
The picture information acquisition unit 115 is configured to determine the picture feature information corresponding to the video frame based on the obtained business objects corresponding to the character cut pictures.
For the specific implementation of the video frame acquisition unit 111, the picture cutting unit 112, the picture encoding unit 113, the vector matching unit 114, and the picture information acquisition unit 115, refer to the description of step S101 in the embodiment corresponding to Figure 3 above, which will not be repeated here.
The clustering processing module 12 is configured to locate and separate the audio frames containing human voice from the original audio frames of the multimedia data to obtain N object audio frames, extract a corresponding audio semantic feature vector from each of the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, where N is a positive integer and one audio cluster corresponds to one business object.
The clustering processing module 12 includes an object audio frame determination unit 121, a semantic feature extraction unit 122, and a clustering processing unit 123.
The object audio frame determination unit 121 is configured to locate and separate the audio frames containing human voice from the original audio frames of the multimedia data to obtain the N object audio frames.
The object audio frame determination unit 121 includes an original audio frame acquisition subunit 1211, a source separation subunit 1212, and an object audio frame determination subunit 1213.
The original audio frame acquisition subunit 1211 is configured to acquire the original audio frames from the multimedia data.
The source separation subunit 1212 is configured to perform source separation on the original audio frames to obtain the to-be-processed audio frames containing human voice.
The source separation subunit 1212 includes an amplitude spectrum generation subunit 12121, a type feature generation subunit 12122, a merge mask subunit 12123, and a to-be-processed audio frame determination subunit 12124.
The amplitude spectrum generation subunit 12121 is configured to input the original audio frames into a source separation model and generate, through the source separation model, the spectral amplitude spectrum corresponding to the original audio frames. The source separation model includes a first segmentation network layer and a second segmentation network layer.
The type feature generation subunit 12122 is configured to input the spectral amplitude spectrum into the first segmentation network layer and the second segmentation network layer respectively, generate the first type feature corresponding to the spectral amplitude spectrum through the first segmentation network layer, and generate the second type feature corresponding to the spectral amplitude spectrum through the second segmentation network layer.
The merge mask subunit 12123 is configured to perform merge mask processing on the first type feature and the second type feature to obtain a target mask map corresponding to the first type feature.
The to-be-processed audio frame determination subunit 12124 is configured to generate a target type audio frame through an inverse spectral transform based on the corresponding positions of the target mask map and the spectral amplitude spectrum, and to use the target type audio frame as the to-be-processed audio frame containing human voice output by the source separation model.
For the specific implementation of the amplitude spectrum generation subunit 12121, the type feature generation subunit 12122, the merge mask subunit 12123, and the to-be-processed audio frame determination subunit 12124, refer to the description of the to-be-processed audio frames in the embodiment corresponding to Figure 7 above, which will not be repeated here.
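A rough sketch of the mask-based source separation flow performed by subunits 12121-12124 is given below. The two branch networks stand in for the first and second segmentation network layers, and the ratio-style merge mask, the STFT parameters, and the use of librosa and PyTorch are all assumptions made purely for illustration.

```python
# A sketch: magnitude spectrogram -> two branch feature maps -> merged (soft) mask
# for the vocal component -> masked spectrum -> inverse spectral transform.
import numpy as np
import librosa
import torch


def separate_vocals(waveform: np.ndarray, first_branch, second_branch) -> np.ndarray:
    stft = librosa.stft(waveform, n_fft=2048, hop_length=512)
    magnitude, phase = np.abs(stft), np.angle(stft)       # spectral amplitude spectrum

    mag = torch.from_numpy(magnitude)[None, None].float()
    with torch.no_grad():
        vocal_feat = first_branch(mag)                    # first type feature (vocal branch)
        other_feat = second_branch(mag)                   # second type feature (other branch)
        # Merge-mask processing: ratio of the vocal feature over the sum of both.
        mask = (vocal_feat / (vocal_feat + other_feat + 1e-8)).squeeze().numpy()

    vocal_spec = magnitude * mask * np.exp(1j * phase)    # apply mask position-wise
    return librosa.istft(vocal_spec, hop_length=512)      # inverse spectral transform
```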
The object audio frame determination subunit 1213 is configured to locate and cut the non-silent segments in the audio impulse signal frames of the to-be-processed audio frames based on an audio boundary detection strategy for removing silent frames, obtaining the N object audio frames.
For the specific implementation of the original audio frame acquisition subunit 1211, the source separation subunit 1212, and the object audio frame determination subunit 1213, refer to the description of object locating and separation of the original audio frames in the embodiment corresponding to Figure 3 above, which will not be repeated here.
The semantic feature extraction unit 122 is configured to perform semantic feature extraction on each of the N object audio frames to obtain the audio semantic feature vector corresponding to each object audio frame.
The semantic feature extraction unit 122 includes an audio frame input subunit 1221, a frequency-domain feature determination subunit 1222, a time-domain feature determination subunit 1223, and an audio feature vector determination subunit 1224.
The audio frame input subunit 1221 is configured to input the N object audio frames into an audio semantic feature extraction model. The audio semantic feature extraction model includes a frequency-domain branch network layer, a time-domain branch network layer, and a convolutional network layer.
The frequency-domain feature determination subunit 1222 is configured to perform feature learning on the N object audio frames through the frequency-domain branch network layer to obtain a learned frequency-domain feature map.
The time-domain feature determination subunit 1223 is configured to perform feature learning on the N object audio frames through the time-domain branch network layer to obtain a learned time-domain feature map. The learned frequency-domain feature map and the learned time-domain feature map have the same feature dimensions.
The audio feature vector determination subunit 1224 is configured to superimpose the learned frequency-domain feature map and the learned time-domain feature map to obtain a superimposed feature, input the superimposed feature into the convolutional network layer, perform max-average processing on the superimposed feature, and output the audio semantic feature vector corresponding to each object audio frame.
For the specific implementation of the audio frame input subunit 1221, the frequency-domain feature determination subunit 1222, the time-domain feature determination subunit 1223, and the audio feature vector determination subunit 1224, refer to the description of semantic feature extraction from object audio frames in the embodiment corresponding to Figure 8 above, which will not be repeated here.
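The two-branch audio semantic feature extraction model could be sketched as follows; the layer sizes, the shape of the time-domain input, and the interpretation of "max-average processing" as the mean of max pooling and average pooling are assumptions made purely for illustration.

```python
# A sketch of the audio semantic feature extraction model: a frequency-domain
# branch and a time-domain branch produce feature maps of the same dimensions,
# the maps are superimposed, and a convolutional layer plus pooling yields one
# semantic feature vector per object audio frame.
import torch
import torch.nn as nn


class AudioSemanticExtractor(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.freq_branch = nn.Sequential(            # frequency-domain branch network layer
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.time_branch = nn.Sequential(             # time-domain branch network layer
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.conv = nn.Conv2d(64, embed_dim, kernel_size=3, padding=1)

    def forward(self, freq_input: torch.Tensor, time_input: torch.Tensor) -> torch.Tensor:
        freq_map = self.freq_branch(freq_input)       # learned frequency-domain feature map
        time_map = self.time_branch(time_input)       # learned time-domain feature map (same dims)
        stacked = freq_map + time_map                 # superimpose the two feature maps
        feat = self.conv(stacked)
        # "Max-average" processing, interpreted here as averaging max and mean pooling.
        pooled = 0.5 * (feat.amax(dim=(2, 3)) + feat.mean(dim=(2, 3)))
        return pooled                                 # one semantic vector per audio frame
```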
The clustering processing unit 123 is configured to determine M as the number of cluster centers to be clustered and, based on the number of cluster centers, cluster the obtained audio semantic feature vector corresponding to each object audio frame to obtain the M audio clusters.
For the specific implementation of the object audio frame determination unit 121, the semantic feature extraction unit 122, and the clustering processing unit 123, refer to the description of step S102 in the embodiment corresponding to Figure 3 above, which will not be repeated here.
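A minimal sketch of the clustering step follows; k-means is used here only as an illustrative choice, since the embodiment does not mandate a particular clustering algorithm, and the number of business objects M found in the video frames is used directly as the number of cluster centers.

```python
# A sketch: cluster the N audio semantic feature vectors into M audio clusters.
import numpy as np
from sklearn.cluster import KMeans


def cluster_audio_features(feature_vectors: np.ndarray, m_objects: int):
    """feature_vectors: shape (N, D), one semantic vector per object audio frame."""
    kmeans = KMeans(n_clusters=m_objects, n_init=10, random_state=0)
    labels = kmeans.fit_predict(feature_vectors)
    # Group frame indices by cluster, giving the M audio clusters C_1 ... C_M.
    return {k: np.where(labels == k)[0].tolist() for k in range(m_objects)}
```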
The audio character recognition module 13 is configured to identify the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data, where P is a positive integer less than or equal to M, the object role mapping table includes the business roles that have a mapping relationship with the listed business objects, and there are P business objects that overlap between the listed business objects and the M business objects.
The audio character recognition module 13 includes a first time extraction unit 131, a second time extraction unit 132, a time overlap determination unit 133, and an audio character recognition unit 134.
The first time extraction unit 131 is configured to obtain an audio cluster C_k from the M audio clusters and extract the one or more playback times, in the multimedia data, of the object audio frames corresponding to the audio semantic feature vectors included in the audio cluster C_k as the first playback time of the audio cluster C_k, where k is a positive integer less than or equal to M.
The second time extraction unit 132 is configured to obtain, from the listed business objects of the object role mapping table, the P business objects that overlap with the M business objects, and to extract, based on the picture feature information, the one or more playback times, in the multimedia data, of the video frames in which each of the P business objects appears as the second playback time of that business object.
The time overlap determination unit 133 is configured to determine the degree of temporal overlap between the first playback time of the audio cluster C_k and the second playback time corresponding to each business object, and to take the business object corresponding to the second playback time with the highest degree of temporal overlap as the business object corresponding to the audio cluster C_k.
The audio character recognition unit 134 is configured to obtain, from the object role mapping table, the business role corresponding to the business object associated with the audio cluster C_k, and to use the obtained business role as the business role corresponding to the audio cluster C_k.
For the specific implementation of the first time extraction unit 131, the second time extraction unit 132, the time overlap determination unit 133, and the audio character recognition unit 134, refer to the description of step S103 in the embodiment corresponding to Figure 3 above, which will not be repeated here.
The service time determination module 14 is configured to determine the service playback time of each of the P business objects in the multimedia data based on the first playback times of the P audio clusters in the multimedia data and the second playback times, in the multimedia data, of the business objects corresponding to the P audio clusters.
The segment data determination module 15 is configured to obtain, from the multimedia data, the multimedia segment data corresponding to each of the P business objects based on the service playback time corresponding to each business object. The multimedia segment data includes the audio frames associated with the corresponding business object and the video frames associated with the corresponding business object.
The multimedia data playback module 16 is configured to play the multimedia data in a service playback display interface. The service playback display interface includes a playback selection control for triggering an object video data selection function.
The object list display module 17 is configured to display an object playlist in response to a trigger operation on the playback selection control, where the object playlist includes the object cover data corresponding to each of Z business objects, and Z is a positive integer less than or equal to P.
The segment data playback module 18 is configured to play target multimedia segment data in the service playback interface in response to a trigger operation on target object cover data among the Z pieces of object cover data, where the target multimedia segment data is the multimedia segment data corresponding to the business object to which the target object cover data corresponds, and that business object belongs to the P business objects.
The multimedia data includes first multimedia data and second multimedia data, both of which include the object to be edited. The object to be edited belongs to the P business objects.
The first segment data acquisition module 19 is configured to obtain, based on the object role mapping table associated with the first multimedia data, the first target business role corresponding to the object to be edited, and to obtain, from the first multimedia data, the first multimedia segment data associated with the first target business role. The first multimedia segment data is determined based on the service playback time of the object to be edited in the first multimedia data.
The second segment data acquisition module 20 is configured to obtain, based on the object role mapping table associated with the second multimedia data, the second target business role corresponding to the object to be edited, and to obtain, from the second multimedia data, the second multimedia segment data associated with the second target business role. The second multimedia segment data is determined based on the service playback time of the object to be edited in the second multimedia data.
The merge editing module 21 is configured to merge and edit the first multimedia segment data and the second multimedia segment data to obtain the merged clip data corresponding to the object to be edited.
For the specific implementation of the picture information acquisition module 11, the clustering processing module 12, the audio character recognition module 13, the service time determination module 14, the segment data determination module 15, the multimedia data playback module 16, the object list display module 17, the segment data playback module 18, the first segment data acquisition module 19, the second segment data acquisition module 20, and the merge editing module 21, refer to the description of steps S201-S205 in the embodiment corresponding to Figure 10 above, which will not be repeated here. Likewise, the description of the beneficial effects obtained by using the same method will not be repeated.
Further, refer to Figure 14, which is a schematic diagram of a computer device provided by an embodiment of this application. As shown in Figure 14, the computer device 1000 may be a computer device with an audio character recognition function and may include at least one processor 1001 (for example, a CPU), at least one network interface 1004, a memory 1005, and at least one communication bus 1002. The communication bus 1002 is used to implement connection and communication among these components. The network interface 1004 may include a standard wired interface or a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, for example, at least one magnetic disk memory; the memory 1005 may also be at least one storage apparatus located remotely from the aforementioned processor 1001. As shown in Figure 14, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application. In some embodiments, the computer device may also include the user interface 1003 shown in Figure 14. For example, if the computer device is the terminal device with the audio character recognition function shown in Figure 1 (for example, terminal device 100a), the computer device may also include the user interface 1003, which may include a display screen (Display), a keyboard (Keyboard), and the like.
In the computer device 1000 shown in Figure 14, the network interface 1004 is mainly used for network communication, the user interface 1003 is mainly used to provide an input interface for the user, and the processor 1001 can be used to invoke the device control application stored in the memory 1005 to implement the following:
identifying picture feature information from the video frames of the multimedia data, where the picture feature information includes the M business objects to which the character pictures in the video frames belong, and M is a positive integer;
locating and separating the audio frames containing human voice from the original audio frames of the multimedia data to obtain N object audio frames, extracting a corresponding audio semantic feature vector from each of the N object audio frames, and clustering the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, where N is a positive integer and one audio cluster corresponds to one business object;
identifying the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data, where P is a positive integer less than or equal to M, the object role mapping table includes the business roles that have a mapping relationship with the listed business objects, and there are P business objects that overlap between the listed business objects and the M business objects.
The computer device 1000 described in this embodiment of the application can perform the description of the data processing method in the embodiments corresponding to Figure 3 and Figure 10 above, and can also perform the description of the data processing apparatus 1 in the embodiment corresponding to Figure 12 above and of the data processing apparatus 2 in the embodiment corresponding to Figure 13 above, which will not be repeated here. Likewise, the description of the beneficial effects obtained by using the same method will not be repeated.
本申请实施例还提供一种计算机可读存储介质,该计算机可读存储介质存储有计算机程序,该计算机程序包括程序指令,该程序指令被处理器执行时实现图3和图10中各个步骤所提供的数据处理方法,具体可参见图3以及图10各个步骤所提供的实现方式,在此不再赘述。Embodiments of the present application also provide a computer-readable storage medium. The computer-readable storage medium stores a computer program. The computer program includes program instructions. When the program instructions are executed by a processor, the steps in Figure 3 and Figure 10 are implemented. For the data processing method provided, please refer to Figure 3 and the implementation provided in each step of Figure 10 for details, which will not be described again here.
The computer-readable storage medium may be the data transmission apparatus provided in any of the foregoing embodiments, or an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device. Further, the computer-readable storage medium may include both an internal storage unit of the computer device and an external storage device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device, and may also be used to temporarily store data that has been output or is to be output.
In one aspect, the present application provides a computer program product. The computer program product includes a computer program stored in a computer-readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium and executes it, so that the computer device can perform the description of the data processing method in the embodiment corresponding to FIG. 3 or FIG. 10, which will not be repeated here. In addition, the description of the beneficial effects of using the same method will not be repeated either.
The terms "first", "second", and the like in the specification, claims, and drawings of the embodiments of this application are used to distinguish different objects rather than to describe a specific order. Furthermore, the term "include" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, product, or device that includes a series of steps or units is not limited to the listed steps or modules, but may further include steps or modules that are not listed, or other step units inherent to the process, method, apparatus, product, or device.
A person of ordinary skill in the art may realize that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of functions. Whether these functions are performed by hardware or by software depends on the specific application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
What is disclosed above is merely a preferred embodiment of the present application and certainly cannot be used to limit the scope of rights of the present application. Therefore, equivalent changes made according to the claims of the present application still fall within the scope covered by the present application.

Claims (15)

  1. A data processing method, comprising:
    identifying picture feature information from video frames of multimedia data, wherein the picture feature information comprises M business objects to which character pictures in the video frames belong, and M is a positive integer;
    locating and separating audio frames containing human voice from original audio frames of the multimedia data to obtain N object audio frames, extracting corresponding audio semantic feature vectors from the N object audio frames respectively, and clustering the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, wherein N is a positive integer and one audio cluster corresponds to one business object;
    identifying, based on the picture feature information, the M audio clusters, and an object-role mapping table associated with the multimedia data, a business role corresponding to each of P audio clusters, wherein P is a positive integer less than or equal to M, the object-role mapping table comprises business roles that have a mapping relationship with listed business objects, and there are P overlapping business objects between the listed business objects and the M business objects.
  2. The method according to claim 1, wherein the identifying picture feature information from video frames of multimedia data comprises:
    obtaining video frames from the multimedia data;
    cutting pictures containing key character parts in the video frames to obtain character pictures corresponding to the video frames, wherein the character pictures comprise X character cut pictures, and X is a positive integer greater than or equal to M;
    obtaining a character cut picture Ti from the X character cut pictures, and encoding the character cut picture Ti to obtain a picture information vector Li corresponding to the character cut picture Ti, wherein i is a positive integer less than or equal to X;
    determining, from an information vector database associated with candidate business objects, an object key information vector matching the picture information vector Li, and taking the candidate business object corresponding to the matched object key information vector as the business object corresponding to the character cut picture Ti;
    determining, based on the business objects respectively corresponding to the X character cut pictures, the picture feature information corresponding to the video frames.
  3. The method according to claim 2, wherein the cutting pictures containing key character parts in the video frames to obtain character pictures corresponding to the video frames comprises:
    detecting and locating the key character parts in the video frames to determine position information of the key character parts in the video frames;
    cutting the key character parts in the video frames based on the position information to obtain X character cut pictures containing the key character parts, and taking the X character cut pictures as the character pictures corresponding to the video frames.
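A minimal sketch of the cutting step in claim 3, assuming a detector has already returned (x, y, w, h) position boxes for the key character parts; the detector itself is outside this sketch.

# Cut character key parts (for example, faces) out of a video frame given
# detected bounding boxes; each crop is one character cut picture.
import numpy as np

def cut_character_pictures(frame, boxes):
    """frame: H x W x 3 uint8 array; boxes: list of (x, y, w, h) positions."""
    crops = []
    h, w = frame.shape[:2]
    for x, y, bw, bh in boxes:
        x0, y0 = max(0, x), max(0, y)
        x1, y1 = min(w, x + bw), min(h, y + bh)
        crops.append(frame[y0:y1, x0:x1].copy())   # one character cut picture
    return crops                                   # the X cut pictures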
  4. The method according to claim 2, wherein the determining, from an information vector database associated with candidate business objects, an object key information vector matching the picture information vector Li, and taking the candidate business object corresponding to the matched object key information vector as the business object corresponding to the character cut picture Ti comprises:
    obtaining the information vector database associated with candidate business objects, wherein the information vector database is used to store object key information vectors respectively corresponding to Y candidate objects, and Y is a positive integer greater than or equal to M;
    determining a vector distance between the picture information vector Li and each of the Y object key information vectors to obtain Y vector distances;
    obtaining, from the Y vector distances, a minimum vector distance that is less than or equal to a distance threshold, determining the candidate business object corresponding to the object key information vector corresponding to the minimum vector distance, and taking the determined candidate business object as the business object corresponding to the character cut picture Ti.
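A minimal sketch of the matching step in claim 4, assuming Euclidean distance as the vector distance; the claim itself does not fix a particular distance metric, and the threshold value is left to the caller.

# Match a picture information vector L_i against the key information vectors
# of Y candidate objects and keep the match only if it is close enough.
import numpy as np

def match_business_object(l_i, key_vectors, candidate_ids, distance_threshold):
    """l_i: (D,) query vector; key_vectors: (Y, D); candidate_ids: length-Y list."""
    distances = np.linalg.norm(key_vectors - l_i, axis=1)   # Y vector distances
    best = int(np.argmin(distances))
    if distances[best] <= distance_threshold:
        return candidate_ids[best]     # business object for cut picture T_i
    return None                        # no candidate is close enough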
  5. The method according to any one of claims 1 to 4, wherein the extracting corresponding audio semantic feature vectors from the N object audio frames respectively and clustering the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters comprises:
    performing semantic feature extraction on each of the N object audio frames to obtain an audio semantic feature vector corresponding to each object audio frame;
    determining M as the number of cluster centers to be clustered, and clustering, based on the number of cluster centers, the obtained audio semantic feature vector corresponding to each object audio frame to obtain M audio clusters.
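A minimal sketch of claim 5, assuming scikit-learn's KMeans as the clustering algorithm; any clustering method that accepts M as the number of cluster centers would fit the claim equally well.

# Cluster the per-frame audio semantic vectors into M clusters, where M is
# the number of business objects identified in the video frames.
import numpy as np
from sklearn.cluster import KMeans

def cluster_audio_vectors(audio_vectors, m):
    """audio_vectors: (N, D) array of audio semantic feature vectors."""
    labels = KMeans(n_clusters=m, n_init=10).fit_predict(audio_vectors)
    clusters = [np.flatnonzero(labels == c) for c in range(m)]  # M clusters
    return clusters   # each entry holds the frame indices of one cluster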
  6. The method according to claim 5, wherein the locating and separating audio frames containing human voice from the original audio frames of the multimedia data to obtain N object audio frames comprises:
    obtaining the original audio frames from the multimedia data;
    performing source separation on the original audio frames to obtain to-be-processed audio frames containing human voice;
    locating and cutting, based on an audio boundary detection strategy for removing silent frames, non-silent segments in audio impact signal frames in the to-be-processed audio frames to obtain the N object audio frames.
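A minimal sketch of the boundary-detection step in claim 6, assuming a simple energy threshold as the audio boundary detection strategy; production systems would typically use a trained voice activity detector, and the frame length and threshold values here are only illustrative.

# Drop silent stretches from the separated vocal track and keep the
# non-silent segments as object audio frames.
import numpy as np

def split_non_silent(vocals, sample_rate, frame_ms=30, silence_db=-40.0):
    """vocals: 1-D float array of the separated human-voice signal."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(vocals) // frame_len
    frames = vocals[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    loud = 20 * np.log10(rms + 1e-12) > silence_db       # non-silent frames
    segments, start = [], None
    for i, keep in enumerate(loud):
        if keep and start is None:
            start = i
        elif not keep and start is not None:
            segments.append(frames[start:i].reshape(-1))
            start = None
    if start is not None:
        segments.append(frames[start:].reshape(-1))
    return segments            # the N object audio frames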
  7. The method according to claim 6, wherein the performing source separation on the original audio frames to obtain to-be-processed audio frames containing human voice comprises:
    inputting the original audio frames into a source separation model, and generating, by the source separation model, a spectral magnitude spectrum corresponding to the original audio frames, wherein the source separation model comprises a first segmentation network layer and a second segmentation network layer;
    inputting the spectral magnitude spectrum into the first segmentation network layer and the second segmentation network layer respectively, generating, by the first segmentation network layer, first-type features corresponding to the spectral magnitude spectrum, and generating, by the second segmentation network layer, second-type features corresponding to the spectral magnitude spectrum;
    merging and masking the first-type features and the second-type features to obtain a target mask map corresponding to the first-type features;
    generating target-type audio frames through inverse spectral transformation based on corresponding positions of the target mask map and the spectral magnitude spectrum, and taking the target-type audio frames as the to-be-processed audio frames containing human voice output by the source separation model.
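A minimal PyTorch sketch of the two-branch masking separator in claim 7, assuming small convolutional stacks for the two segmentation network layers and a sigmoid mask head; the concrete layer sizes are placeholders and are not taken from the disclosure.

# Magnitude spectrum -> two segmentation branches -> merged mask -> masked
# spectrum -> inverse STFT back to a vocal waveform.
import torch
import torch.nn as nn

class TwoBranchSeparator(nn.Module):
    def __init__(self, n_fft=1024, hop=256):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.branch1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
        self.branch2 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
        self.mask_head = nn.Conv2d(32, 1, 1)            # merge -> mask logits

    def forward(self, waveform):                        # waveform: (batch, samples)
        window = torch.hann_window(self.n_fft, device=waveform.device)
        spec = torch.stft(waveform, self.n_fft, self.hop,
                          window=window, return_complex=True)
        mag, phase = spec.abs(), torch.angle(spec)      # magnitude spectrum
        x = mag.unsqueeze(1)                            # (B, 1, F, T)
        merged = torch.cat([self.branch1(x), self.branch2(x)], dim=1)
        mask = torch.sigmoid(self.mask_head(merged)).squeeze(1)   # target mask
        vocal_spec = torch.polar(mask * mag, phase)                # apply mask
        return torch.istft(vocal_spec, self.n_fft, self.hop,
                           window=window, length=waveform.shape[-1])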
  8. The method according to claim 5, wherein the performing semantic feature extraction on each of the N object audio frames to obtain an audio semantic feature vector corresponding to each object audio frame comprises:
    inputting the N object audio frames into an audio semantic feature extraction model, wherein the audio semantic feature extraction model comprises a frequency-domain branch network layer, a time-domain branch network layer, and a convolutional network layer;
    performing feature learning on the N object audio frames through the frequency-domain branch network layer to obtain a learned frequency-domain feature map;
    performing feature learning on the N object audio frames through the time-domain branch network layer to obtain a learned time-domain feature map, wherein the learned frequency-domain feature map and the learned time-domain feature map have the same feature dimensions;
    superimposing the learned frequency-domain feature map and the learned time-domain feature map to obtain superimposed features, inputting the superimposed features into the convolutional network layer, performing max-average processing on the superimposed features, and outputting the audio semantic feature vector corresponding to each object audio frame.
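A minimal PyTorch sketch of the dual-branch extractor in claim 8, assuming an STFT magnitude for the frequency-domain branch, a strided convolution for the time-domain branch, and interpreting "max-average processing" as the mean of max pooling and average pooling; these choices are assumptions of the sketch rather than the disclosed architecture.

# Frequency-domain and time-domain branches produce feature maps of the same
# size; the maps are added, refined by a convolutional head, and pooled into
# one semantic vector per object audio frame.
import torch
import torch.nn as nn

class AudioSemanticExtractor(nn.Module):
    def __init__(self, n_fft=512, hop=128, embed_dim=128):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.freq_branch = nn.Sequential(
            nn.Conv1d(n_fft // 2 + 1, embed_dim, 3, padding=1), nn.ReLU())
        self.time_branch = nn.Sequential(
            nn.Conv1d(1, embed_dim, 256, stride=hop, padding=128), nn.ReLU())
        self.head = nn.Sequential(
            nn.Conv1d(embed_dim, embed_dim, 3, padding=1), nn.ReLU())

    def forward(self, wave):                             # wave: (batch, samples)
        window = torch.hann_window(self.n_fft, device=wave.device)
        spec = torch.stft(wave, self.n_fft, self.hop,
                          window=window, return_complex=True).abs()  # (B, F, T)
        f = self.freq_branch(spec)                        # (B, D, T)
        t = self.time_branch(wave.unsqueeze(1))           # (B, D, T')
        t = nn.functional.interpolate(t, size=f.shape[-1])  # align time steps
        fused = self.head(f + t)                           # superimpose maps
        # "max-average" pooling taken here as mean of max and average pooling
        pooled = 0.5 * (fused.max(dim=-1).values + fused.mean(dim=-1))
        return pooled                                      # (B, D) embedding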
  9. The method according to any one of claims 1 to 4, wherein the identifying, based on the picture feature information, the M audio clusters, and the object-role mapping table associated with the multimedia data, the business role corresponding to each of the P audio clusters comprises:
    obtaining an audio cluster Ck from the M audio clusters, and extracting one or more playback times, in the multimedia data, of the object audio frames corresponding to the audio semantic feature vectors included in the audio cluster Ck, as a first playback time of the audio cluster Ck, wherein k is a positive integer less than or equal to M;
    obtaining, from the listed business objects of the object-role mapping table, P business objects that overlap with the M business objects, and extracting, based on the picture feature information, one or more playback times, in the multimedia data, of the video frames in which each of the P business objects appears, as a second playback time of each business object;
    determining a time overlap degree between the first playback time of the audio cluster Ck and each second playback time respectively, and taking the business object corresponding to the second playback time with the highest time overlap degree as the business object corresponding to the audio cluster Ck;
    obtaining, from the object-role mapping table, the business role corresponding to the business object corresponding to the audio cluster Ck, and taking the obtained business role as the business role corresponding to the audio cluster Ck.
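A minimal sketch of claim 9, assuming playback times are represented as (start, end) intervals in seconds; the overlap measure and the interval representation are assumptions of this sketch.

# Measure how much the playback times of audio cluster C_k overlap the
# on-screen times of each candidate business object, pick the object with
# the highest overlap, then read its role from the mapping table.
def overlap_seconds(intervals_a, intervals_b):
    total = 0.0
    for a0, a1 in intervals_a:
        for b0, b1 in intervals_b:
            total += max(0.0, min(a1, b1) - max(a0, b0))
    return total

def role_for_cluster(cluster_times, object_times, role_map):
    """cluster_times: intervals of C_k; object_times: dict object -> intervals;
    role_map: dict object -> business role (the object-role mapping table)."""
    best_object = max(object_times,
                      key=lambda o: overlap_seconds(cluster_times, object_times[o]))
    return role_map.get(best_object)      # business role of audio cluster C_k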
  10. The method according to any one of claims 1 to 4, further comprising:
    determining a business playback time, in the multimedia data, of each of the P business objects based on first playback times of the P audio clusters in the multimedia data and second playback times, in the multimedia data, of the business objects respectively corresponding to the P audio clusters;
    obtaining, from the multimedia data based on the business playback time corresponding to each of the P business objects, multimedia segment data respectively corresponding to the P business objects, wherein the multimedia segment data comprises audio frames associated with the corresponding business object and video frames associated with the corresponding business object.
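A minimal sketch of claim 10, assuming the business playback time of an object is formed by merging its audio-cluster intervals (first playback time) with its on-screen intervals (second playback time); the merge rule is an assumption, since the claim only requires that both playback times are used.

# Merge the two sets of intervals for one business object; the merged
# intervals define which spans of the multimedia data to cut out as that
# object's multimedia segment data.
def merge_intervals(intervals):
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def business_segments(first_play_time, second_play_time):
    """first_play_time: audio-cluster intervals; second_play_time: frame intervals."""
    return merge_intervals(list(first_play_time) + list(second_play_time))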
  11. The method according to claim 10, further comprising:
    playing the multimedia data in a business playback display interface, wherein the business playback display interface comprises a playback selection control for triggering an object video data selection function;
    displaying an object playlist in response to a trigger operation on the playback selection control, wherein the object playlist comprises object cover data respectively corresponding to Z business objects, and Z is a positive integer less than or equal to P;
    playing target multimedia segment data in the business playback interface in response to a trigger operation on target object cover data among the Z pieces of object cover data, wherein the target multimedia segment data is the multimedia segment data corresponding to the business object corresponding to the target object cover data, and the business object corresponding to the target object cover data belongs to the P business objects.
  12. The method according to claim 10, wherein the multimedia data comprises first multimedia data and second multimedia data, both the first multimedia data and the second multimedia data comprise a to-be-edited object, and the to-be-edited object belongs to the P business objects;
    the method further comprises:
    obtaining, based on an object-role mapping table associated with the first multimedia data, a first target business role corresponding to the to-be-edited object, and obtaining, from the first multimedia data, first multimedia segment data associated with the first target business role, wherein the first multimedia segment data is determined based on the business playback time of the to-be-edited object in the first multimedia data;
    obtaining, based on an object-role mapping table associated with the second multimedia data, a second target business role corresponding to the to-be-edited object, and obtaining, from the second multimedia data, second multimedia segment data associated with the second target business role, wherein the second multimedia segment data is determined based on the business playback time of the to-be-edited object in the second multimedia data;
    merging and editing the first multimedia segment data and the second multimedia segment data to obtain merged edited data corresponding to the to-be-edited object.
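A minimal sketch of claim 12, assuming each multimedia source exposes its object-role mapping table and its per-role segment intervals as plain dictionaries; the clip representation (source id plus time interval) is hypothetical and only illustrates the cross-source merge.

# Collect the clips of one character from two multimedia sources by looking
# up the character's role in each source's mapping table, then splice the
# clips into one merged cut.
def merged_edit(clip_object, sources):
    """sources: list of dicts with keys 'id', 'role_map' (object -> role)
    and 'segments' (role -> list of (start, end) intervals)."""
    timeline = []
    for source in sources:
        role = source['role_map'].get(clip_object)       # target business role
        for start, end in source['segments'].get(role, []):
            timeline.append((source['id'], start, end))  # clip to concatenate
    return timeline          # play these clips back to back as the merged cut

# Usage example: clips of the same actor from two dramas, spliced together.
example = merged_edit(
    'actor_a',
    [{'id': 'drama_1', 'role_map': {'actor_a': 'role_x'},
      'segments': {'role_x': [(10.0, 25.0)]}},
     {'id': 'drama_2', 'role_map': {'actor_a': 'role_y'},
      'segments': {'role_y': [(5.0, 12.0)]}}])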
  13. A data processing apparatus, comprising:
    a picture information obtaining module, configured to identify picture feature information from video frames of multimedia data, wherein the picture feature information comprises M business objects to which character pictures in the video frames belong, and M is a positive integer;
    a clustering processing module, configured to locate and separate audio frames containing human voice from original audio frames of the multimedia data to obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames respectively, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, wherein N is a positive integer and one audio cluster corresponds to one business object;
    an audio role recognition module, configured to identify, based on the picture feature information, the M audio clusters, and an object-role mapping table associated with the multimedia data, a business role corresponding to each of P audio clusters, wherein P is a positive integer less than or equal to M, the object-role mapping table comprises business roles that have a mapping relationship with listed business objects, and there are P overlapping business objects between the listed business objects and the M business objects.
  14. A computer device, comprising: a processor and a memory;
    wherein the processor is connected to the memory, the memory is configured to store a computer program, and the processor is configured to invoke the computer program, so that the computer device performs the method according to any one of claims 1 to 12.
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program is adapted to be loaded and executed by a processor, so that a computer device having the processor performs the method according to any one of claims 1 to 12.
PCT/CN2023/087208 2022-04-13 2023-04-10 Data processing method and apparatus, and computer device and storage medium WO2023197979A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210383918.3 2022-04-13
CN202210383918.3A CN114465737B (en) 2022-04-13 2022-04-13 Data processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023197979A1 true WO2023197979A1 (en) 2023-10-19

Family

ID=81418551

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/087208 WO2023197979A1 (en) 2022-04-13 2023-04-10 Data processing method and apparatus, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN114465737B (en)
WO (1) WO2023197979A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114465737B (en) * 2022-04-13 2022-06-24 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium
CN114895817B (en) * 2022-05-24 2023-08-04 北京百度网讯科技有限公司 Interactive information processing method, network model training method and device
CN115083435B (en) * 2022-07-28 2022-11-04 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium
CN115033734B (en) * 2022-08-11 2022-11-11 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium
CN116597828B (en) * 2023-07-06 2023-10-03 腾讯科技(深圳)有限公司 Model determination method, model application method and related device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108337532A (en) * 2018-02-13 2018-07-27 腾讯科技(深圳)有限公司 Perform mask method, video broadcasting method, the apparatus and system of segment
WO2020010338A1 (en) * 2018-07-05 2020-01-09 Dts, Inc. Hybrid audio synthesis using neural networks
WO2020119508A1 (en) * 2018-12-14 2020-06-18 深圳壹账通智能科技有限公司 Video cutting method and apparatus, computer device and storage medium
CN111400543A (en) * 2020-03-20 2020-07-10 腾讯科技(深圳)有限公司 Audio segment matching method, device, equipment and storage medium
WO2021073416A1 (en) * 2019-10-18 2021-04-22 平安科技(深圳)有限公司 Method for generating virtual character video on the basis of neural network, and related device
CN113573161A (en) * 2021-09-22 2021-10-29 腾讯科技(深圳)有限公司 Multimedia data processing method, device, equipment and storage medium
CN114465737A (en) * 2022-04-13 2022-05-10 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0406504D0 (en) * 2004-03-23 2004-04-28 British Telecomm Method and system for detecting audio and video scene changes
CN102521340B (en) * 2011-12-08 2014-09-03 中国科学院自动化研究所 Method for analyzing TV video based on role
US9047376B2 (en) * 2012-05-01 2015-06-02 Hulu, LLC Augmenting video with facial recognition
US10185917B2 (en) * 2013-01-31 2019-01-22 Lf Technology Development Corporation Limited Computer-aided decision systems
CN106683661B (en) * 2015-11-05 2021-02-05 阿里巴巴集团控股有限公司 Role separation method and device based on voice
CN106021496A (en) * 2016-05-19 2016-10-12 海信集团有限公司 Video search method and video search device
US11417343B2 (en) * 2017-05-24 2022-08-16 Zoominfo Converse Llc Automatic speaker identification in calls using multiple speaker-identification parameters
CN109376603A (en) * 2018-09-25 2019-02-22 北京周同科技有限公司 A kind of video frequency identifying method, device, computer equipment and storage medium
CN110166818B (en) * 2018-11-30 2021-08-17 腾讯科技(深圳)有限公司 Method for generating audio/video to be matched, computer equipment and storage medium
CN110691258A (en) * 2019-10-30 2020-01-14 中央电视台 Program material manufacturing method and device, computer storage medium and electronic equipment
CN111462758A (en) * 2020-03-02 2020-07-28 深圳壹账通智能科技有限公司 Method, device and equipment for intelligent conference role classification and storage medium
CN113744742B (en) * 2020-05-29 2024-01-30 中国电信股份有限公司 Role identification method, device and system under dialogue scene
CN112565825B (en) * 2020-12-02 2022-05-13 腾讯科技(深圳)有限公司 Video data processing method, device, equipment and medium
CN113192516A (en) * 2021-04-22 2021-07-30 平安科技(深圳)有限公司 Voice role segmentation method and device, computer equipment and storage medium
CN113157965B (en) * 2021-05-07 2022-05-20 杭州网易云音乐科技有限公司 Audio visual model training and audio visual method, device and equipment
CN113327628B (en) * 2021-05-27 2023-12-22 抖音视界有限公司 Audio processing method, device, readable medium and electronic equipment
CN113822142A (en) * 2021-07-28 2021-12-21 腾讯科技(深圳)有限公司 Role recognition method and device, computer equipment and storage medium
CN113808578B (en) * 2021-11-16 2022-04-15 阿里巴巴达摩院(杭州)科技有限公司 Audio signal processing method, device, equipment and storage medium
CN113923521B (en) * 2021-12-14 2022-03-08 深圳市大头兄弟科技有限公司 Video scripting method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108337532A (en) * 2018-02-13 2018-07-27 腾讯科技(深圳)有限公司 Perform mask method, video broadcasting method, the apparatus and system of segment
WO2020010338A1 (en) * 2018-07-05 2020-01-09 Dts, Inc. Hybrid audio synthesis using neural networks
WO2020119508A1 (en) * 2018-12-14 2020-06-18 深圳壹账通智能科技有限公司 Video cutting method and apparatus, computer device and storage medium
WO2021073416A1 (en) * 2019-10-18 2021-04-22 平安科技(深圳)有限公司 Method for generating virtual character video on the basis of neural network, and related device
CN111400543A (en) * 2020-03-20 2020-07-10 腾讯科技(深圳)有限公司 Audio segment matching method, device, equipment and storage medium
CN113573161A (en) * 2021-09-22 2021-10-29 腾讯科技(深圳)有限公司 Multimedia data processing method, device, equipment and storage medium
CN114465737A (en) * 2022-04-13 2022-05-10 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN114465737A (en) 2022-05-10
CN114465737B (en) 2022-06-24

Similar Documents

Publication Publication Date Title
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
CN109117777B (en) Method and device for generating information
KR102148392B1 (en) Video metadata tagging system and method thereof
CN113709561B (en) Video editing method, device, equipment and storage medium
CN108307229B (en) Video and audio data processing method and device
JP2022020647A (en) Video processing method, apparatus, electronic device, storage medium, and program
CN111681678B (en) Method, system, device and storage medium for automatically generating sound effects and matching videos
CN113395578A (en) Method, device and equipment for extracting video theme text and storage medium
CN112738557A (en) Video processing method and device
WO2023197749A9 (en) Background music insertion time point determining method and apparatus, device, and storage medium
CN110750996A (en) Multimedia information generation method and device and readable storage medium
WO2023071578A1 (en) Text-voice alignment method and apparatus, device and medium
Soares et al. An optimization model for temporal video lecture segmentation using word2vec and acoustic features
CN113923521B (en) Video scripting method
CN111488813A (en) Video emotion marking method and device, electronic equipment and storage medium
CN113936236A (en) Video entity relationship and interaction identification method based on multi-modal characteristics
CN113301382A (en) Video processing method, device, medium, and program product
WO2023142590A1 (en) Sign language video generation method and apparatus, computer device, and storage medium
CN116958342A (en) Method for generating actions of virtual image, method and device for constructing action library
CN116708055A (en) Intelligent multimedia audiovisual image processing method, system and storage medium
CN113642536B (en) Data processing method, computer device and readable storage medium
CN111681680B (en) Method, system, device and readable storage medium for acquiring audio frequency by video recognition object
CN111681676B (en) Method, system, device and readable storage medium for constructing audio frequency by video object identification
CN113762056A (en) Singing video recognition method, device, equipment and storage medium
CN114495946A (en) Voiceprint clustering method, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23787623

Country of ref document: EP

Kind code of ref document: A1