WO2023197979A1 - Data processing method and apparatus, and computer device and storage medium - Google Patents


Info

Publication number
WO2023197979A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
business
multimedia data
picture
frames
Application number
PCT/CN2023/087208
Other languages
French (fr)
Chinese (zh)
Inventor
冯鑫
Original Assignee
腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Publication of WO2023197979A1 publication Critical patent/WO2023197979A1/en


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L 9/32 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L 9/3263 Cryptographic mechanisms or cryptographic arrangements including means for verifying the identity or authority of a user, involving certificates, e.g. public key certificate [PKC] or attribute certificate [AC]; Public key infrastructure [PKI] arrangements
    • H04L 9/3265 Cryptographic mechanisms or cryptographic arrangements involving certificates, using certificate chains, trees or paths; Hierarchical trust model
    • H04L 9/3236 Cryptographic mechanisms or cryptographic arrangements including means for verifying the identity or authority of a user, using cryptographic hash functions
    • H04L 12/00 Data switching networks
    • H04L 12/66 Arrangements for connecting between networks having differing types of switching systems, e.g. gateways
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/04 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L 63/0428 Network architectures or network communication protocols for network security wherein the data content is protected, e.g. by encrypting or encapsulating the payload
    • H04L 63/08 Network architectures or network communication protocols for network security for authentication of entities
    • H04L 63/0823 Network architectures or network communication protocols for network security for authentication of entities using certificates
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/14 Session management
    • H04L 67/141 Setup of application sessions

Definitions

  • the present application relates to the field of computer technology, and in particular, to a data processing method, device, computer equipment and storage medium.
  • Embodiments of the present application provide a data processing method, device, computer equipment and storage medium, which can improve the accuracy, efficiency and applicability of audio character recognition.
  • embodiments of the present application provide a data processing method, including:
  • the picture feature information includes M business objects to which the character pictures in the video frames belong, and M is a positive integer;
  • identify, based on the picture feature information, the M audio clusters and the object role mapping table associated with the multimedia data, the business role corresponding to each of P audio clusters, where P is a positive integer less than or equal to M.
  • the object role mapping table includes business roles that have a mapping relationship with list business objects; there are P overlapping business objects between the list business objects and the M business objects.
  • a data processing device including:
  • the picture information acquisition module is used to identify picture feature information from the video frame of the multimedia data.
  • the picture feature information includes M business objects to which the character pictures in the video frame belong, and M is a positive integer;
  • the clustering processing module is used to locate and separate audio frames containing human voices from the original audio frames of the multimedia data to obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, where N is a positive integer and one audio cluster corresponds to one business object;
  • the audio role recognition module is used to identify the business role corresponding to each of the P audio clusters based on the picture feature information, M audio clusters and the object role mapping table associated with the multimedia data.
  • P is a positive integer less than or equal to M
  • the object role mapping table includes business roles that have a mapping relationship with the list business object; there are P overlapping business objects between the list business object and the M business objects.
  • embodiments of the present application provide a computer device, including: a processor and a memory;
  • the processor is connected to a memory, where the memory is used to store a computer program.
  • when the computer program is executed by the processor, the computer device executes the method provided by the embodiments of the present application.
  • embodiments of the present application provide a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • the computer program is adapted to be loaded and executed by a processor, so that a computer device having the processor executes the method provided by the embodiments of the present application.
  • embodiments of the present application provide a computer program product.
  • the computer program product includes a computer program.
  • the computer program is stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer program from the computer-readable storage medium.
  • the processor executes the computer program, so that the computer device executes the method in the embodiment of the present application.
  • a computer device with an audio character recognition function can associate sounds with characters by combining the picture feature information automatically recognized from video frames with the M adaptively clustered audio clusters, so that the business roles corresponding to the P audio clusters associated with the object role mapping table can be accurately identified.
  • This audio role identification method does not require manual annotation of the business role to which each line of audio belongs; it not only reduces the manpower and time consumed, but also mitigates recognition errors caused by similar timbres, thereby improving the accuracy and efficiency of recognition.
  • the embodiments of the present application can adopt the audio semantic feature clustering method in the audio character recognition process, making the entire audio character recognition system more versatile and applicable to different scenarios of business objects in different multimedia data, thereby effectively improving the recognition applicability.
  • Figure 1 is a schematic structural diagram of a network architecture provided by an embodiment of the present application.
  • Figure 2 is a schematic flow diagram of a system for audio character recognition provided by an embodiment of the present application
  • Figure 3 is a schematic flow chart of a data processing method provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram of an architecture for obtaining image feature information from video frames according to an embodiment of the present application
  • Figure 5 is a model architecture diagram of a key part detection model provided by an embodiment of the present application.
  • Figure 6 is a schematic architectural diagram of an audio semantic feature clustering provided by an embodiment of the present application.
  • Figure 7 is a model architecture diagram of a source separation model provided by an embodiment of the present application.
  • Figure 8 is a schematic diagram of the model architecture of an audio semantic feature extraction model provided by an embodiment of the present application.
  • Figure 9 is a schematic diagram of a scene for audio character recognition provided by an embodiment of the present application.
  • Figure 10 is another schematic flowchart of a data processing method provided by an embodiment of the present application.
  • Figure 11 is a schematic diagram of a scene for displaying multimedia segment data provided by an embodiment of the present application.
  • Figure 12 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • Figure 13 is another structural schematic diagram of a data processing device provided by an embodiment of the present application.
  • Figure 14 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • the embodiment of the present application provides a character recognition method based on audio semantic feature clustering, which can be applied to the field of artificial intelligence.
  • Artificial Intelligence is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive subject that covers a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and other technologies.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, autonomous driving, smart transportation and other major directions.
  • Computer Vision is a science that studies how to make machines "see". More specifically, it refers to machine vision that uses cameras and computers in place of human eyes to identify and measure targets, and further processes the resulting graphics so that the computer output becomes an image more suitable for human observation or for transmission to instruments for detection.
  • computer vision studies related theories and technologies trying to build artificial intelligence systems that can obtain information from images or multi-dimensional data.
  • Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving, smart transportation and other technologies, and also includes common biometric identification technologies such as face recognition and fingerprint recognition.
  • the key technologies of speech technology include automatic speech recognition technology, speech synthesis technology and voiceprint recognition technology. Allowing computers to hear, see, speak, and feel is the future development direction of human-computer interaction, among which voice has become one of the most promising human-computer interaction methods in the future.
  • Natural Language Processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field will involve natural language, that is, the language that people use every day, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, robot question answering, knowledge graph and other technologies.
  • Machine Learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent. Its applications cover all fields of artificial intelligence.
  • Machine learning and deep learning often include technologies such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.
  • Figure 1 is a schematic structural diagram of a network architecture provided by an embodiment of the present application.
  • the network architecture may include a server 10F and a terminal device cluster.
  • the terminal device cluster may include one or more terminal devices, and there will be no limit on the number of terminal devices here.
  • the terminal device cluster may specifically include a terminal device 100a, a terminal device 100b, a terminal device 100c, ..., and a terminal device 100n.
  • the terminal device 100a, the terminal device 100b, the terminal device 100c, ..., the terminal device 100n can each have a network connection with the above-mentioned server 10F, so that each terminal device can perform data interaction with the server 10F through the network connection.
  • the network connection here is not limited to a connection method. It can be connected directly or indirectly through wired communication, or directly or indirectly through wireless communication. It can also be connected through other methods. This application does not limit it here.
  • Each terminal device in the terminal device cluster may include: smart phones, tablets, laptops, desktop computers, smart speakers, smart watches, vehicle-mounted terminals, smart TVs and other smart terminals with audio role recognition functions.
  • Each terminal device in the terminal device cluster as shown in Figure 1 can be installed with a target application (for example, a client). When the client is running in each terminal device, it can perform data interaction with the server 10F shown in FIG. 1 .
  • the client may include a social client, a multimedia client (for example, a video client), an entertainment client (for example, a game client), an information flow client, an education client, a live broadcast client, and other clients.
  • the client can be an independent client or an embedded sub-client integrated in a certain client (for example, a social client, an education client, a multimedia client, etc.), which is not limited here.
  • the server 10F in the embodiment of the present application can be the server corresponding to the client.
  • the server 10F can be an independent physical server, a server cluster or a distributed system composed of multiple physical servers, or a cloud server that provides cloud computing services.
  • the embodiment of this application will not limit the number of servers.
  • one terminal device may be selected as the target terminal device among the multiple terminal devices shown in FIG. 1 .
  • the terminal device 100a shown in FIG. 1 can be used as a target terminal device, and the target terminal device can be integrated with a target application (for example, a client).
  • the target terminal device can realize data interaction with the server 10F through the business data platform corresponding to the client.
  • the client here may have a frame sequence (for example, frame animation sequence) loading and playback function, which is used to play video frames, audio frames and text (for example, lines) in the service playback display interface provided by the client.
  • The service playback display interface here refers to the interface displayed by the terminal device for playing the multimedia data.
  • the data type of the multimedia data may include film and television drama types, animation types, variety show types, etc. The data type of multimedia data will not be limited here.
  • a computer device with an audio character recognition function obtains multimedia data (for example, TV series A), it can identify picture feature information from video frames of the multimedia data.
  • the picture feature information here may include M business objects to which the character pictures in the video frames belong, where M is a positive integer.
  • the picture feature information may indicate which actor plays the character in a certain character picture including key parts of the character (for example, the character's face) in the TV series A.
  • the computer device can also extract corresponding audio semantic feature vectors from the N object audio frames, and then perform clustering processing on the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters.
  • N is a positive integer
  • the N object audio frames here are obtained by the computer device locating and separating the audio frames containing human voices from the original audio frames in the multimedia data.
  • the computer device performs object positioning and separation processing on the original audio frame in order to reduce the interference caused by the silent frames in the environmental audio track and the object audio track (for example, the vocal track) in subsequent clustering processing, so as to improve the clustering accuracy, thereby improving the accuracy of character voice recognition.
  • the computer device can identify the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data.
  • P can be a positive integer less than or equal to M.
  • the object role mapping table here (for example, the cast list of TV series A) may include business roles (roles) that have a mapping relationship with the list business objects (actors). There are P overlapping business objects between the list business objects in the object role mapping table and the M business objects recognized by the computer.
  • the object role mapping table may be an initial object role mapping table, provided by the business editor of the multimedia data (for example, the editing user of TV series A), that the computer device acquires, or it may be a table obtained after a target user accessing the client updates the initial object role mapping table provided by the business editor; this is not limited here.
  • the target user can add a mapping relationship between a certain business role in TV series A (for example, waiter in a restaurant) and a certain business object (for example, actor 1) in the initial object role mapping table, that is, the waiter in the restaurant is played by actor 1.
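  • As a minimal illustration of such a mapping table (the in-memory representation below is an assumption chosen only to mirror the example above, not part of this disclosure):

    # Hypothetical in-memory form of an object role mapping table (business role -> list business object).
    # The initial entries come from the business editor of the multimedia data.
    object_role_mapping = {
        "role 1": "object a",
        "role 2": "object a",   # two roles may be played by the same business object
        "role 3": "object b",
    }

    # A target user adds a mapping: the waiter in the restaurant is played by actor 1.
    object_role_mapping["waiter in the restaurant"] = "actor 1"

    # The "list business objects" referred to above are the distinct objects in the table.
    list_business_objects = set(object_role_mapping.values())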
  • the computer device in the embodiment of the present application can associate sounds with characters by combining the picture feature information (for example, face information) automatically recognized from the video frames with the M adaptively clustered audio clusters, so that the business roles corresponding to the P audio clusters associated with the object role mapping table can be accurately identified.
  • This audio role recognition method does not require manual annotation of the business role to which each line of audio belongs. It not only reduces the manpower and time consumed, but also mitigates recognition errors caused by similar timbres, thereby improving the accuracy and efficiency of recognition.
  • the embodiments of the present application can adopt the audio semantic feature clustering method in the audio character recognition process, making the entire audio character recognition system more versatile and applicable to different scenarios of business objects in different multimedia data, thereby effectively improving the recognition applicability.
  • Figure 2 is a schematic flow chart of a system for audio character recognition provided by an embodiment of the present application.
  • the computer device in the embodiment of the present application may be a computer device with audio character recognition function.
  • the computer device may be any terminal device in the terminal device cluster shown in FIG. 1 , for example, the terminal device 100a, or may be the server 10F shown in FIG. 1 .
  • Computer equipment will not be limited here.
  • the audio character recognition system may include three modules, which may specifically include a first module 201 (for example, a key picture recognition module), a second module 202 (for example, an audio semantic feature clustering module) and a third module 203 (for example, a character recognition module).
  • the multimedia data 20S in the embodiment of the present application may be multimedia data acquired by the computer device that requires audio character recognition.
  • the multimedia data 20S can be multimedia data corresponding to a certain episode in a certain TV series, multimedia data corresponding to a certain movie, or multimedia data corresponding to a certain variety show, which will not be discussed one by one here.
  • the multimedia data 20S is composed of video data including original video frames and audio data including original audio frames.
  • the computer device can obtain video frames from video data including raw video frames.
  • the video frame here may refer to a video frame sequence obtained by deleting the beginning and end of the original video frame in the video data.
  • the computer device can identify picture feature information from the video frames of the multimedia data 20S through the first module 201 shown in FIG. 2 .
  • the first module 201 may include a key part detection model 210w and a picture encoding model 220w.
  • the key part detection model 210w can be used to detect character pictures in video frames.
  • the character picture here refers to a picture including the key parts of the character (for example, the character's face).
  • the picture encoding model 220w can be used to encode each character cut picture in the character picture to obtain picture vector information corresponding to the character cut picture.
  • the computer device may also obtain the information vector database 200K shown in FIG. 2, for example, from its internal memory or from an external source.
  • the information vector database 200K can be an information index database established in advance by the computer device, using the same key picture recognition method, from a large amount of material data (for example, multimedia data of film and television drama types, variety show types, etc.), and specially used for key picture recognition.
  • the information vector database 200K can be used to store object key information vectors respectively corresponding to Y candidate business objects.
  • the object key information vectors here may also be determined through the picture encoding model 220w, and Y is a positive integer greater than or equal to M.
  • the information vector database 200K may also include object information of each candidate business object, for example, the object attribute type of the candidate business object (including singing and dancing singers, modern idol dramas, ancient palace dramas, fairy tale dramas, war-themed dramas, etc. ).
  • the computer device can obtain the picture feature information shown in Figure 2 based on the information vector database 200K and the picture information vector output by the picture coding model 220w.
  • the computer device can also obtain audio clustering results associated with the N object audio frames in the multimedia data 20S through the second module 202 shown in FIG. 2 .
  • the N object audio frames here are obtained by subjecting the original audio frames in the multimedia data to object positioning and separation processing, and N is a positive integer.
  • the second module 202 here may include a source separation model 230w and an audio semantic feature extraction model 240w.
  • the source separation model 230w here can be used to perform source separation on the original audio frame to obtain the object sound segment (or object audio track) (for example, the vocal segment (or vocal track)) and the environmental sound segment (or ambient sound track) (e.g., background sound segment (or backing track)).
  • the audio semantic feature extraction model 240w here can be used to perform frame-level semantic feature extraction on each of the N object audio frames in the object sound segment, so as to obtain the audio semantic feature vector corresponding to each object audio frame. Further, the computer device can perform clustering processing on the N audio semantic feature vectors to obtain M audio clusters, and these M audio clusters can then be used as the audio clustering result output by the second module 202. Here, one audio cluster can correspond to one business object.
  • the computer device can identify the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table 200B associated with the multimedia data 20S shown in FIG. 2.
  • P is a positive integer less than or equal to M.
  • the object role mapping table 200B here may include business roles that have a mapping relationship with the list business objects. There are P overlapping business objects between the list business object and the M business objects.
  • the computer device can perform audio character recognition on the output information of the first two modules through the third module 203.
  • Specifically, the computer device determines the playback time (i.e., the second playback time) of the video frames in which the P overlapping business objects appear in the multimedia data 20S, and the playback time (i.e., the first playback time) of the audio frames contained in each audio cluster. The computer device can then determine the audio clusters corresponding to the P business objects by comparing the two playback times, and further determine the business roles corresponding to each of the P audio clusters.
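  • A minimal sketch of this comparison is given below; the co-occurrence heuristic, timestamp granularity and all names are illustrative assumptions rather than the exact rule of this disclosure:

    # Hypothetical sketch: assign each audio cluster to the business object whose on-screen
    # (second) playback times overlap most with the cluster's audio (first) playback times,
    # then look the object up in the object role mapping table to get its business role.
    def assign_roles(object_video_times, cluster_audio_times, object_to_role):
        # object_video_times: {business_object: set of second-level timestamps where it appears on screen}
        # cluster_audio_times: {cluster_id: set of second-level timestamps of its audio frames}
        # object_to_role: {business_object: business_role}
        result = {}
        for cluster_id, audio_times in cluster_audio_times.items():
            best_object, best_overlap = None, 0
            for obj, video_times in object_video_times.items():
                overlap = len(audio_times & video_times)  # simple co-occurrence count
                if overlap > best_overlap:
                    best_object, best_overlap = obj, overlap
            if best_object is not None and best_object in object_to_role:
                result[cluster_id] = object_to_role[best_object]
        return result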
  • the computer device in the embodiment of the present application can, in the third module 203, combine the picture feature information (for example, face information) output by the first module 201 with the audio clustering result output by the second module 202 to associate audio with business roles, so that the business roles respectively corresponding to the P audio clusters associated with the object role mapping table 200B can be accurately identified.
  • This audio character recognition method not only improves the accuracy and efficiency of recognition, but also improves the applicability of recognition.
  • For the specific implementation in which the computer device with the audio character recognition function combines the picture feature information (for example, face information) automatically recognized from the video frames of the multimedia data with the M adaptively clustered audio clusters to identify the business roles corresponding to the P audio clusters associated with the object role mapping table, please refer to the embodiments corresponding to Figures 3 to 11 below.
  • Figure 3 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • the method can be performed by a computer device with audio character recognition capabilities.
  • the computer device may be a terminal device (for example, any terminal device in the terminal device cluster shown in Figure 1 above, for example, the terminal device 100a), or it may be a server (for example, the server 10F shown in Figure 1 above), No limitation is made here.
  • In the embodiment of the present application, the method is described by taking execution by a server with the audio character recognition function as an example; the method may include at least the following steps S101 to S103:
  • Step S101 Identify picture feature information from video frames of multimedia data.
  • the picture feature information here may include M business objects to which the character pictures in the video frames belong, where M is a positive integer.
  • the computer device can obtain video frames from the multimedia data, and can then perform picture cutting processing on the key parts of the characters in the video frames (that is, cut out the pictures containing the key parts of the characters in the video frames) to obtain the character pictures corresponding to the video frames.
  • the character pictures here may include X character cut pictures, where X is a positive integer greater than or equal to M.
  • the computer device can obtain the character cutting picture Ti among the X character cutting pictures, and encode the character cutting picture Ti to obtain the picture information vector Li corresponding to the character cutting picture Ti .
  • i is a positive integer less than or equal to X.
  • the computer device can determine, from the information vector database associated with the candidate business objects, the object key information vector matching the picture information vector Li, and use the candidate business object corresponding to the matched object key information vector as the business object corresponding to the character cutting picture Ti. Further, the computer device can determine the picture feature information corresponding to the video frames based on the business objects corresponding to the X character cutting pictures.
  • the picture recognition system used by the computer device to detect and recognize the key parts of characters in video frames may be composed of a detection sub-module and a recognition sub-module, or may be an integrated detection and recognition network that both detects and recognizes the key parts of characters; this is not limited here.
  • the computer device when determining the character picture corresponding to the video frame, can detect and locate the key parts of the character in the video frame, thereby determining the position information of the key parts of the character in the video frame. Further, the computer device can cut the key parts of the character in the video frame based on the position information, obtain X character cut pictures containing the key parts of the character, and use the X character cut pictures as the character pictures corresponding to the video frame. Then, the computer device can obtain the character cutting picture Ti among the X character cutting pictures, and encode the character cutting picture Ti to obtain the picture information vector Li corresponding to the character cutting picture Ti . Among them, i here is a positive integer less than or equal to X.
  • the computer device can obtain the information vector database associated with the candidate business object from its internal memory or externally to find the candidate business object that has a matching relationship with the picture information vector Li .
  • the information vector database here can be used to store object key information vectors corresponding to Y candidate business objects, where Y is a positive integer greater than or equal to M.
  • When the computer device obtains the information vector database, it can directly search the information vector database for a candidate business object that has a matching relationship with the picture information vector Li. Specifically, the computer device can determine the vector distance between the picture information vector Li and each of the Y object key information vectors, obtaining Y vector distances. Furthermore, the computer device can obtain, from the Y vector distances, the minimum vector distance that is less than or equal to a distance threshold, determine the candidate business object corresponding to the object key information vector associated with that minimum vector distance, and use the determined candidate business object as the business object corresponding to the character cutting picture Ti.
  • The distance threshold here is a value set in advance by the computer device to ensure that the found candidate business object indeed matches the character cutting picture; it can be adjusted dynamically according to the actual situation and is not limited here.
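  • As a minimal illustrative sketch of this nearest-neighbor match (the Euclidean distance metric, the threshold value and all names below are assumptions for illustration, not a required implementation):

    import numpy as np

    # Hypothetical sketch: match a picture information vector Li against the object key
    # information vectors in the information vector database using a distance threshold.
    def match_business_object(picture_vector, key_vectors, distance_threshold=1.0):
        # picture_vector: (D,) encoding of one character cutting picture
        # key_vectors: {candidate_business_object: (D,) object key information vector}
        best_object, best_dist = None, float("inf")
        for obj, vec in key_vectors.items():
            dist = np.linalg.norm(picture_vector - vec)  # vector distance
            if dist < best_dist:
                best_object, best_dist = obj, dist
        # Return None when even the minimum distance exceeds the threshold (no match).
        return best_object if best_dist <= distance_threshold else None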
  • the computer device can obtain the object role mapping table associated with the multimedia data, and use the object role mapping table and the information vector database to find candidate business objects that have a matching relationship with the picture information vector Li .
  • Table 1 is an object role mapping table associated with multimedia data provided by an embodiment of the present application, as shown in Table 1:

    Table 1
    Business role    List business object
    Role 1           Object a
    Role 2           Object a
    Role 3           Object b
    Role 4           Object c
    Role 5           Object d
  • the business roles in the object role mapping table shown in Table 1 may include H business roles, where H is a positive integer greater than or equal to M.
  • both role 1 and role 2 may have a mapping relationship with the same business object (for example, object a). That is, both role 1 and role 2 are played by object a.
  • Role 3 has a mapping relationship with object b
  • role 4 has a mapping relationship with object c
  • role 5 has a mapping relationship with object d.
  • the computer device can, according to the above Table 1, select from the information vector database the object key information vectors corresponding to the list business objects in the object role mapping table, for example, the object key information vector of object a, the object key information vector of object b, and the object key information vector of object c. Further, the computer device can respectively determine the vector distance between the picture information vector Li and each of the selected three object key information vectors. Furthermore, the computer device can obtain the minimum vector distance that is less than or equal to the distance threshold from the three vector distances, determine the candidate business object corresponding to the object key information vector associated with that minimum vector distance, and use the determined candidate business object as the business object corresponding to the character cutting picture Ti.
  • the computer device does not need to determine the vector distance to every object key information vector in the information vector database; instead, it pre-selects candidates through the object role mapping table, which greatly reduces the matching time and thereby improves the efficiency of finding candidate business objects with a matching relationship from the information vector database.
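  • Continuing the same illustrative assumptions, the pre-selection through the object role mapping table simply restricts the search to the list business objects before matching, for example:

    import numpy as np

    # Hypothetical sketch: compare the picture information vector only against the object key
    # information vectors of the list business objects in the object role mapping table.
    def match_with_cast_list(picture_vector, key_vectors, role_mapping, distance_threshold=1.0):
        cast_objects = set(role_mapping.values())        # list business objects (e.g., objects a-d)
        best_object, best_dist = None, float("inf")
        for obj in cast_objects & key_vectors.keys():    # skip objects not in the cast list
            dist = np.linalg.norm(picture_vector - key_vectors[obj])
            if dist < best_dist:
                best_object, best_dist = obj, dist
        return best_object if best_dist <= distance_threshold else None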
  • FIG. 4 is a schematic diagram of an architecture for obtaining image feature information from video frames according to an embodiment of the present application.
  • the schematic architectural diagram in the embodiment of the present application may be the schematic architectural diagram corresponding to the first module 201 in the embodiment corresponding to FIG. 2 .
  • the video frame 4V shown in Figure 4 may be a video frame in multimedia data (for example, the multimedia data 20S in the embodiment corresponding to Figure 2 described above).
  • the key part detection model 410w shown in Figure 4 can be used to detect key parts in the video frame 4V.
  • the key part detection model 410w may be the key part detection model 210w in the embodiment corresponding to FIG. 2 mentioned above.
  • the picture coding model 420w may be the picture coding model 220w in the embodiment corresponding to FIG. 2 described above.
  • the information vector database 400K shown in Figure 4 may be the information vector database 200K in the embodiment corresponding to Figure 2 described above.
  • the video frame 4V can be input to the key part detection model 410w shown in Figure 4; through the key part detection model 410w, the key parts of the character in the video frame 4V (for example, the facial features of the character) are detected and located to determine the position information of the key parts of the character in the video frame 4V (for example, the facial feature position information marked in the area 40Q shown in Figure 4). Further, the computer device can cut the key parts of the character in the video frame 4V based on the position information marked in the area 40Q, and obtain a character cutting picture containing the key parts of the character (for example, the character cutting picture 400T shown in Figure 4).
  • the key part detection model 410w shown in Figure 4 may be a network structure used to detect and locate the key parts of a character (for example, a character's face), for example, a face detection network (Multi-task Cascaded Convolutional Networks, MTCNN for short).
  • FIG. 5 is a model architecture diagram of a key part detection model provided by an embodiment of the present application.
  • the key part detection model in the embodiment of the present application may be the key part detection model 410w in the embodiment corresponding to Figure 4.
  • This key part detection model can be used to detect key parts in the video frame 5V shown in Figure 5, where the video frame 5V It may be the video frame 4V in the embodiment corresponding to FIG. 4 mentioned above.
  • the key part detection model may include three network layers, which may specifically include a filtering network layer 5W1 (for example, a Proposal Network, P-Net for short), a fine-tuning network layer 5W2 (for example, a Refinement Network, R-Net for short) and an output network layer 5W3 (for example, an Output Network, O-Net for short).
  • the computer device in the embodiment of the present application can adjust the image size of the video frame 5V, so that the image pyramid corresponding to the video frame 5V can be obtained.
  • the computer device can obtain a resize coefficient (for example, 0.7), for example, from its internal memory or from an external source, and adjust the size of the video frame 5V multiple times based on the resize coefficient until the picture size of the adjusted video frame 5V matches the image size threshold associated with the filtering network layer 5W1 (for example, 12*12*3).
  • the computer device can form a picture pyramid corresponding to the video frame 5V based on the video frames 5V with different picture sizes after multiple adjustments.
  • the size adjustment coefficient here may be dynamically set by the computer device according to the distribution of the key parts of the character in the video frame. If the size adjustment coefficient is set too large, it is easy to extend the time for detecting and locating the key parts of the character. If the size adjustment coefficient is set too small, the key parts of the character with a small distribution area in the video frame may be missed (for example, small and medium-sized faces). Based on this, the size adjustment coefficient in the embodiment of the present application can be set between 0.7-0.8.
  • the picture pyramid here may include the original picture (for example, the video frame 5V shown in Figure 5), the first adjusted picture (that is, the picture obtained by adjusting the picture size of the video frame 5V), the second adjusted picture (that is, the picture obtained by adjusting the picture size of the video frame 5V) The picture obtained by adjusting the picture size of the first adjusted picture), ..., and the Nth adjusted picture (that is, the picture obtained by adjusting the picture size of the N-1th adjusted picture).
  • the image size of the Nth adjusted image here may be the image size threshold associated with the filtering network layer 5W 1 (for example, 12*12).
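  • A minimal sketch of building such a picture pyramid is shown below (OpenCV resizing, the coefficient 0.7 and the minimum size 12 are used only to mirror the example values above):

    import cv2

    # Hypothetical sketch: repeatedly shrink the video frame by the resize coefficient until the
    # shorter side would fall below the minimum input size of the filtering network layer.
    def build_picture_pyramid(frame, resize_coefficient=0.7, min_size=12):
        pyramid = [frame]
        height, width = frame.shape[:2]
        while min(height, width) * resize_coefficient >= min_size:
            height = int(height * resize_coefficient)
            width = int(width * resize_coefficient)
            pyramid.append(cv2.resize(pyramid[-1], (width, height)))
        return pyramid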
  • the computer device can input the picture pyramid corresponding to the video frame 5V to the filtering network layer 5W1 shown in Figure 5, so that a large number of candidates can be obtained.
  • The picture obtained by cutting the video frame 5V according to the bounding box position information output by the filtering network layer 5W1 is called the first cut picture.
  • the computer device can input the pictures in the picture pyramid to the filtering network layer 5W 1 to obtain the output features (m, n, 16).
  • m and n here can be used to characterize the length and width of the image, and 16 is the dimension of the channel.
  • the computer device can screen out a large portion of candidates, thereby obtaining one or more first candidates.
  • the computer device then calibrates the bounding box (bbox for short) based on the obtained four offsets, and obtains the position information of the calibrated bounding box (for example, the coordinate information of the upper left and lower right).
  • the computer device can screen these first candidates again according to the Intersection over Union (IoU), that is, filter the first candidates by performing Non-Maximum Suppression (NMS).
  • the computer device can sort the classification scores (for example, in descending order) to obtain a tensor of shape (num_left, 4), that is, the absolute upper-left and lower-right coordinates of num_left bboxes. Further, each time, the computer device can compute the IoU between the bounding box with the maximum score after sorting and the remaining boxes, filter out the boxes whose IoU is greater than the IoU threshold (for example, 0.6; the IoU threshold is preset by the computer device), and move the box with the maximum score into the final result. In this embodiment of the present application, the above operation may be called a filtering operation.
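  • A standard non-maximum suppression routine matching the filtering operation described above might look as follows (the box format and the 0.6 threshold follow the description; the implementation details are illustrative):

    import numpy as np

    # Hypothetical sketch of the filtering operation (NMS): keep the highest-scoring box,
    # drop remaining boxes whose IoU with it exceeds the threshold, and repeat.
    def nms(boxes, scores, iou_threshold=0.6):
        # boxes: (K, 4) array of [x1, y1, x2, y2]; scores: (K,). Returns kept indices.
        x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
        areas = (x2 - x1) * (y2 - y1)
        order = scores.argsort()[::-1]               # descending classification scores
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(int(i))
            xx1 = np.maximum(x1[i], x1[order[1:]])   # intersection with remaining boxes
            yy1 = np.maximum(y1[i], y1[order[1:]])
            xx2 = np.minimum(x2[i], x2[order[1:]])
            yy2 = np.minimum(y2[i], y2[order[1:]])
            inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
            iou = inter / (areas[i] + areas[order[1:]] - inter)
            order = order[1:][iou <= iou_threshold]
        return keep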
  • the computer device repeats this filtering operation to filter out the many bounding boxes with large overlapping areas, and finally obtains (num_left_after_nms, 16) candidates. For these candidates, the video frame 5V is cut according to the bounding box position information and resized to a picture size of 24*24, yielding the pictures to be input to the fine-tuning network layer 5W2 shown in Figure 5 (i.e., the first cut pictures).
  • the first cut picture here may be a square with the maximum side length of the bounding box captured by the computer device in the video frame 5V, thereby effectively ensuring that no deformation occurs during size adjustment and that more details of key parts of the character are retained.
  • the computer device can fine-tune the first cut picture through the fine-tuning network layer 5W2 to obtain the second cut picture shown in Figure 5.
  • the fine-tuning network layer 5W2 can output 2 values corresponding to the two-class one-hot classification, 4 values corresponding to the bounding-box coordinate offsets, and 10 values corresponding to the key points (landmarks).
  • the fine-tuning network layer 5W2 can filter out most candidates that do not include key parts of the character (for example, the character's face) according to the binary classification score. After adjusting the bounding boxes according to the offsets, the filtering operation described for the filtering network layer 5W1 above is repeated to obtain (num_left_after_Rnet, 16) candidates.
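  • A minimal PyTorch-style sketch of an output head with those three output sizes (2 classification values, 4 bounding-box offsets, 10 landmark values) is given below; the shared feature dimension is an assumption:

    import torch
    import torch.nn as nn

    # Hypothetical sketch of an R-Net-style output head: one shared feature vector is mapped
    # to 2 class scores, 4 bounding-box coordinate offsets and 10 landmark coordinates.
    class RefinementHead(nn.Module):
        def __init__(self, feature_dim=128):
            super().__init__()
            self.cls = nn.Linear(feature_dim, 2)        # two-class one-hot scores
            self.bbox = nn.Linear(feature_dim, 4)       # bounding-box coordinate offsets
            self.landmark = nn.Linear(feature_dim, 10)  # 5 landmarks * (x, y)

        def forward(self, features):
            return self.cls(features), self.bbox(features), self.landmark(features)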
  • the computer device can accurately output the position information of the key parts of the character in the video frame 5V through the output network layer 5W3, including the coordinate information of the bounding box and the coordinate information of the key points (landmarks).
  • In the output network layer 5W3, after classification screening, bounding box adjustment and NMS screening, the computer device outputs not only the coordinate information of the bounding box but also the coordinate information of the key points, thereby obtaining the position information of the key parts of the character in the video frame 5V, which is subsequently used to cut the key parts of the character in the video frame 5V and obtain a picture containing the key parts of the character (for example, the character cutting picture 400T shown in Figure 4).
  • the computer device can input the character cutting picture 400T to the picture coding model 420w shown in FIG. 4.
  • the picture coding model 420w is a model based on Residual Network (Resnet).
  • Residual networks of this series are widely used in fields such as object classification and serve as the backbone of classic neural networks for computer vision tasks.
  • typical networks include Resnet50, Resnet101, etc.
  • the picture coding model 420w in the embodiment of this application may be a Resnet50 network model.
  • the Resnet50 network model can include 5 stages, which may specifically include the first stage (for example, Stage 0), the second stage (for example, Stage 1), the third stage (for example, Stage 2), the fourth stage (for example, Stage 3) and the fifth stage (for example, Stage 4).
  • the structure of Stage 0 is relatively simple. It can be regarded as the preprocessing of the character cutting image 400T.
  • the last four stages are all composed of bottleneck layers (Bottleneck), and the structures are relatively similar. Among them, Stage 1 can contain 3 Bottlenecks, Stage 2 can contain 4 Bottlenecks, Stage 3 can contain 6 Bottlenecks, and Stage 4 can contain 3 Bottlenecks.
  • the computer device inputs the character cutting picture 400T into the picture encoding model 420w.
  • the character cutting picture 400T can be converted into a picture information vector with 2048 dimensions.
  • the picture information vector can be used to represent the semantic feature information of the key parts of the character (for example, the face).
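  • A minimal torchvision-based sketch of a picture encoding step that yields a 2048-dimensional picture information vector is shown below (the untrained weights and the 224*224 input are assumptions; the model actually used would be trained for key part recognition):

    import torch
    import torch.nn as nn
    import torchvision.models as models

    # Hypothetical sketch: a ResNet-50 with its final fully-connected layer removed acts as the
    # picture encoding model; the pooled output is a 2048-dimensional picture information vector.
    backbone = models.resnet50()                                # weights omitted; load/train as needed
    encoder = nn.Sequential(*list(backbone.children())[:-1])    # drop the classification layer
    encoder.eval()

    with torch.no_grad():
        cut_picture = torch.randn(1, 3, 224, 224)               # stand-in for a preprocessed character cutting picture
        picture_vector = encoder(cut_picture).flatten(1)        # shape: (1, 2048)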
  • the computer device may obtain the information vector database 400K associated with the candidate business object shown in FIG. 4 .
  • the information vector database 400K here can be used to store object key information vectors corresponding to Y candidate business objects, where Y is a positive integer greater than or equal to M.
  • each object key information vector in the information vector database 400K can be extracted by the computer device using the same encoding processing method as the character cutting picture 400T.
  • An object key information vector can be used to represent the key part identifier (for example, a Face ID) corresponding to a candidate business object.
  • the computer device can respectively determine the vector distance between the picture information vector corresponding to the character cutting picture 400T and each of the Y object key information vectors, thereby obtaining Y vector distances.
  • the computer device can set a distance threshold in advance. If the minimum vector distance determined by the computer device is greater than the distance threshold, it can be considered that the computer device has not matched, in the information vector database 400K, an object key information vector corresponding to the character cutting picture 400T, that is, it has not matched a business object corresponding to the character cutting picture 400T. If the minimum vector distance determined by the computer device is less than or equal to the distance threshold, it can be considered that the computer device has matched, in the information vector database 400K, the object key information vector corresponding to the character cutting picture 400T, that is, it can successfully match the business object corresponding to the character cutting picture 400T.
  • When the computer device obtains the minimum vector distance that is less than or equal to the distance threshold from the Y vector distances, it can determine the candidate business object corresponding to the object key information vector associated with that minimum vector distance, and the determined candidate business object can then be used as the business object corresponding to the character cutting picture 400T.
  • When the computer device performs picture recognition on each video frame in the multimedia data, it can refer to the specific implementation of key part recognition for the video frame 5V shown in Figure 5 to obtain the X character cutting pictures containing the key parts of characters, which will not be described further here. If a video frame includes the key parts of multiple different characters, the computer device can cut out a corresponding number of character key part pictures from the video frame.
  • the computer device can refer to the specific implementation of object matching for the character cutting picture 400T in the embodiment corresponding to FIG. 4, perform object matching on each of the X character cutting pictures, and then determine the picture feature information corresponding to the video frames in the multimedia data based on the business objects corresponding to the obtained character cutting pictures.
  • Step S102: Locate and separate audio frames containing human voices from the original audio frames of the multimedia data to obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters.
  • N object audio frames are obtained after the computer device performs object positioning and separation processing on the original audio frames in the multimedia data, where N is a positive integer.
  • An audio cluster can correspond to a business object.
  • the computer device can obtain original audio frames from multimedia data, and can then perform object positioning and separation processing on the original audio frames to obtain N object audio frames.
  • the computer device can perform semantic feature extraction on each of the N object audio frames to obtain an audio semantic feature vector corresponding to each object audio frame.
  • the computer device can determine M as the number of cluster centers for clustering and, based on this number of cluster centers, perform clustering processing on the audio semantic feature vector corresponding to each acquired object audio frame, so that M audio clusters can be obtained. The audio semantic features can be understood as characteristics of the speaker's voiceprint.
  • the embodiment of the present application innovatively uses the number M of business objects indicated by the picture feature information as the number of cluster centers. This method of using the picture feature information as prior knowledge lets the system know the number of business objects in the multimedia data and thereby gives audio clustering a prior setting of the number of cluster centers; automatically setting the number of cluster centers in this way improves the convergence speed of the entire system and the overall recognition performance, and saves computing resources.
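  • A minimal scikit-learn sketch of this clustering step is shown below (KMeans is only one possible clustering choice, and the vector dimensions are illustrative; the disclosure does not mandate a specific algorithm):

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical sketch: cluster the N audio semantic feature vectors into M audio clusters,
    # where M is the number of business objects indicated by the picture feature information.
    N, dim, M = 500, 256, 4                      # illustrative sizes
    audio_vectors = np.random.randn(N, dim)      # stand-in for extracted audio semantic feature vectors

    kmeans = KMeans(n_clusters=M, n_init=10, random_state=0).fit(audio_vectors)
    cluster_labels = kmeans.labels_              # one audio cluster id per object audio frame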
  • FIG. 6 is a schematic architectural diagram of audio semantic feature clustering provided by an embodiment of the present application.
  • the schematic architectural diagram in the embodiment of the present application may be the schematic architectural diagram corresponding to the second module 202 in the embodiment corresponding to FIG. 2 .
  • the original audio frame shown in FIG. 6 may be an original audio frame in multimedia data (for example, the multimedia data 20S in the embodiment corresponding to FIG. 2 mentioned above).
  • the source separation model 630w shown in Figure 6 can be used to perform source separation on the original audio frame.
  • the information source separation model 630w may be the information source separation model 230w in the embodiment corresponding to FIG. 2 described above.
  • the audio semantic feature extraction model 640w shown in Figure 6 can be used to extract semantic features for each object audio frame.
  • the audio semantic feature extraction model 640w may be the audio semantic feature extraction model 240w in the embodiment corresponding to FIG. 2 described above.
  • the architectural schematic diagram in the embodiment of the present application may include three nodes, namely an audio paragraph cutting node, an audio semantic feature extraction node, and a clustering processing node.
  • when the computer device is at the audio paragraph cutting node, it can obtain the original audio frame from the multimedia data and perform source separation on the original audio frame, thereby obtaining the audio frame to be processed that contains the business object's voice. Further, based on the audio boundary detection strategy for eliminating silent frames, the computer device can locate and cut the non-silent segments in the audio signal of the audio frame to be processed, so that N object audio frames can be obtained.
  • source separation refers to separating a mixed audio signal that contains multiple audio signals through signal processing or other algorithms, extracting the specified types of audio signal sequences from the mixed signal, and finally generating separate audio files.
  • the audio frame to be processed for the business object, that is, the object segment, is extracted from the original audio frame.
  • the source separation model 630w can be used to perform source separation on the original audio frame to obtain the object segment (or object track) and the ambience segment (or ambience track). Since there may be a large number of silent segments in the object segment, and these silent segments would interfere with the audio clustering results of the subsequent clustering processing and also waste computing resources, the computer device can at this point determine the object segment as the audio frame to be processed for the business object. The computer device can then obtain the audio boundary detection strategy.
  • the audio boundary detection strategy here can be the VAD (Voice Activity Detection) algorithm.
  • the VAD algorithm here can be widely used in speech coding, noise reduction and ASR scenarios.
  • a VAD system can usually include two parts: feature extraction and speech/non-speech decision. Further, based on the audio boundary detection strategy, the computer device can locate and cut the audio signal in the audio frame to be processed, that is, accurately locate the non-silent segments, so that the N object audio frames shown in Figure 6 can be obtained, where N is a positive integer.
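  • A minimal energy-based sketch of the speech/non-speech decision described above is given below; the frame length and energy threshold are illustrative assumptions and not part of the original disclosure (a production VAD such as the WebRTC VAD would typically be used instead):

```python
import numpy as np

def locate_non_silent_segments(samples, sample_rate, frame_ms=30, energy_threshold=1e-4):
    """Toy voice-activity detection: mark frames whose mean energy exceeds a
    threshold as non-silent, and merge consecutive non-silent frames into segments."""
    frame_len = int(sample_rate * frame_ms / 1000)
    segments, start = [], None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        is_speech = np.mean(frame.astype(np.float64) ** 2) > energy_threshold
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start / sample_rate, i / sample_rate))
            start = None
    if start is not None:
        segments.append((start / sample_rate, len(samples) / sample_rate))
    return segments  # list of (start_s, end_s) boundaries of object audio frames

# Hypothetical usage on one second of low-amplitude random audio at 16 kHz
print(locate_non_silent_segments(np.random.randn(16000) * 0.01, 16000))
```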
  • the computer device may input the N object audio frames to the audio semantic feature extraction model 640w shown in FIG. 6 .
  • the audio semantic feature extraction model 640w can be an audio neural network (for example, PANNS network) based on a large audio data set and training, which is usually used for audio pattern recognition or audio frame level embedding, and serves as the front end of many models. Coding network.
  • the computer device can extract semantic features from each of the N object audio frames through the audio semantic feature extraction model 640w, and obtain the audio semantic feature vector corresponding to each object audio frame. As shown in Figure 6, these may specifically include audio semantic feature vector 1, audio semantic feature vector 2, ..., and audio semantic feature vector N.
  • the clustering strategy used for the clustering processing in the embodiment of the present application may be the k-means clustering algorithm.
  • the k-means clustering algorithm is an iterative clustering analysis algorithm.
  • the computer device may divide the N audio semantic feature vectors into M initial clusters in advance. Furthermore, the computer device can randomly select M audio semantic feature vectors as the initial cluster centers of the M initial clusters. Then, for each audio semantic feature vector (i.e., vector to be attributed) in the audio semantic feature vector set other than the M audio semantic feature vectors selected as cluster centers, the computer device may determine the vector distance between that vector to be attributed and the cluster center of each initial cluster, and assign the vector to be attributed to the initial cluster with the minimum vector distance. At this time, the computer device can update the cluster centers of the clusters after the assignment. By analogy, iterating in this way, the computer device can determine the M audio clusters shown in FIG. 6. The M audio clusters may specifically include audio cluster C 1 , audio cluster C 2 , ..., and audio cluster C M .
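  • The clustering step described above can be sketched as follows; this is an illustrative from-scratch k-means over the N audio semantic feature vectors, with the number of cluster centers fixed to M taken from the picture feature information, and the vector dimension and iteration count are assumptions:

```python
import numpy as np

def kmeans_audio_clusters(feature_vectors, M, iterations=50, seed=0):
    """Cluster N audio semantic feature vectors into M audio clusters.

    feature_vectors: array of shape (N, D); M: number of cluster centers,
    taken from the picture feature information as prior knowledge.
    Returns the cluster index assigned to each object audio frame.
    """
    rng = np.random.default_rng(seed)
    centers = feature_vectors[rng.choice(len(feature_vectors), M, replace=False)]
    for _ in range(iterations):
        # distance of every vector to every cluster center, shape (N, M)
        dists = np.linalg.norm(feature_vectors[:, None, :] - centers[None, :, :], axis=-1)
        assignments = dists.argmin(axis=1)        # attribute each vector to its nearest center
        new_centers = np.array([
            feature_vectors[assignments == m].mean(axis=0)
            if np.any(assignments == m) else centers[m]
            for m in range(M)
        ])
        if np.allclose(new_centers, centers):     # cluster centers have stabilized
            break
        centers = new_centers
    return assignments

# Hypothetical usage: 6 object audio frames with 2048-dimensional semantic vectors, M = 3
vectors = np.random.rand(6, 2048)
print(kmeans_audio_clusters(vectors, M=3))
```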
  • the embodiment of this application uses the audio semantic feature clustering method to classify N audio semantic feature vectors instead of training voiceprint classification through neural networks, thereby getting rid of the dependence on the actor's voiceprint ID and avoiding privacy violations.
  • the embodiments of this application can directly use the object audio frames in the multimedia data to extract the audio semantic feature vector corresponding to each object audio frame. This is deeply decoupled from the personal voiceprint ID of the business object and is instead correlated with the voiceprint information of the character's voice itself, so that business characters voiced by professional voice actors can also be identified. That is to say, the embodiment of the present application can still accurately identify the character to which a line belongs even when the business character is not dubbed by the business object himself, thus improving the accuracy of audio character recognition.
  • the embodiment of the present application uses the audio semantic feature clustering method to cluster the N audio semantic feature vectors for audio character recognition, which makes the entire audio character recognition system portable and more versatile, so that it can be applied to business objects in different scenarios of different multimedia data, thus effectively improving the applicability of the identification.
  • FIG. 7 is a model architecture diagram of a source separation model provided by an embodiment of the present application.
  • the information source separation model in the embodiment of the present application may be the information source separation model 630w in the embodiment corresponding to Figure 6.
  • the source separation model may include a split network layer 7W 1 (ie, a first split network layer, for example, VACAL-Unet) and a split network layer 7W 2 (ie, a second split network layer, for example, BGM-Unet).
  • Unet is one of the algorithms that uses a fully convolutional network for semantic segmentation, using a symmetric U-shaped structure containing a compression path and an expansion path.
  • the typical feature of the Unet network is its U-shaped symmetrical structure, which can contain 4 convolutional layers and 4 corresponding upsampling layers. Therefore, when implementing it, one can either build the network from scratch, initialize the weights and train the model, or borrow the convolutional layer structure of an existing network together with its trained weight file, add the subsequent upsampling layers, and then perform the training computations. Since pre-trained weight files can be reused in deep learning model training, the speed of Unet training is greatly accelerated.
  • Another feature is that the feature map obtained by each convolutional layer of the Unet network will be connected to the corresponding upsampling layer, so that the feature map of each layer can be effectively used in subsequent calculations, that is, skip connection (skip-connection). It can effectively solve the problem of gradient dissipation and improve the efficiency of model training.
  • Unet avoids performing supervision and loss calculation directly on high-level feature maps alone; instead, it combines the features in low-level feature maps, so that the finally obtained feature map contains both first-level features (i.e., high-level features) and many second-level features (i.e., low-level features), achieving feature fusion at different levels and thereby improving the accuracy of the model's results.
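  • A minimal sketch of a U-shaped network with skip connections, as described above, is shown below; the channel sizes and depth (2 levels rather than 4) are simplified assumptions, and this is not the actual segmentation network layer of the disclosure:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy 2-level Unet over a spectrogram-like input: compression path,
    expansion path, and a skip connection between corresponding levels."""
    def __init__(self, in_ch=1):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        # decoder takes upsampled features concatenated with the skip connection
        self.dec1 = nn.Sequential(nn.Conv2d(32 + 16, 16, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(16, 1, 1)

    def forward(self, x):
        e1 = self.enc1(x)                            # low-level feature map (skip source)
        e2 = self.enc2(self.pool(e1))                # high-level feature map
        d1 = self.up(e2)
        d1 = self.dec1(torch.cat([d1, e1], dim=1))   # skip connection: fuse the two levels
        return torch.sigmoid(self.out(d1))           # mask-like output in [0, 1]

# Hypothetical usage on a 1 x 128 x 128 spectrum amplitude spectrum
print(TinyUNet()(torch.randn(1, 1, 128, 128)).shape)
```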
  • when the computer device inputs the original audio frame into the source separation model, it can generate the spectrum amplitude spectrum corresponding to the original audio frame through the source separation model shown in Figure 7. For example, the computer device can perform spectrum conversion on the audio track of the original audio frame to obtain the audio track spectrum corresponding to the original audio frame, and can then generate the spectrum amplitude spectrum corresponding to the original audio frame by eliminating the phase of the audio track spectrum.
  • the computer device can input the spectrum amplitude spectrum into the segmentation network layer 7W 1 and the segmentation network layer 7W 2 respectively, so as to generate the first type of features (for example, object track features) corresponding to the spectrum amplitude spectrum through the segmentation network layer 7W 1 , and to generate the second type of features (for example, ambience track features) corresponding to the spectrum amplitude spectrum through the segmentation network layer 7W 2 .
  • the computer device can perform merge mask processing on the first type features and the second type features to obtain a target mask map corresponding to the first type features (i.e., the first mask map). Furthermore, the computer device can generate a target type audio frame (i.e., an audio frame in the object segment) based on the target mask map and the spectrum amplitude spectrum, and use the target type audio frame as the audio frame to be processed, containing the business object's voice, that is output by the source separation model. For example, when the computer device generates the first type features and the second type features shown in Figure 7, it can perform splicing processing on the first type features and the second type features to obtain spliced type features.
  • the computer device performs two types of mask calculations on the splicing type features, so that a first mask image corresponding to the first type feature and a second mask image corresponding to the second type feature can be obtained.
  • the mask calculation is performed, for example, by comparing the feature value at each point with the merged value obtained after the splicing process.
  • the computer device can perform a corresponding-position calculation (for example, multiplication) between the first mask image and the spectrum amplitude spectrum corresponding to the original audio frame, and then generate the first type audio frame (i.e., the audio frame in the object segment) through inverse spectrum transformation.
  • similarly, the computer device can perform a corresponding-position calculation between the second mask image and the spectrum amplitude spectrum corresponding to the original audio frame, and then generate the second type audio frame (i.e., the audio frame in the ambience segment) through inverse spectrum transformation. Since the above mask and amplitude spectrum calculations yield the amplitude spectra corresponding to the first type features and the second type features, the inverse spectrum transformation yields the one-dimensional sampling-point sequences of the first type and second type audio, that is, the audio signals themselves.
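  • The masking and reconstruction flow described above can be sketched as follows; this is an illustrative STFT-based spectral masking pipeline rather than the disclosed Unet-based model, and the frame sizes and feature inputs are assumptions:

```python
import numpy as np
from scipy.signal import stft, istft

def separate_with_masks(mixture, vocal_feat, bgm_feat, fs=16000, nperseg=512):
    """Toy source separation: build soft masks from two (precomputed) feature
    magnitudes, apply them to the mixture's spectrum amplitude spectrum, and
    reconstruct object and ambience audio via inverse spectrum transformation."""
    _, _, spec = stft(mixture, fs=fs, nperseg=nperseg)
    amplitude, phase = np.abs(spec), np.angle(spec)      # phase is set aside for the masks
    total = vocal_feat + bgm_feat + 1e-8
    vocal_mask, bgm_mask = vocal_feat / total, bgm_feat / total   # merged mask maps
    # corresponding-position (element-wise) multiplication, then inverse transform
    _, vocal = istft(vocal_mask * amplitude * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    _, bgm = istft(bgm_mask * amplitude * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return vocal, bgm  # object segment and ambience segment sample sequences

# Hypothetical usage with random "features" of the same shape as the spectrogram
mix = np.random.randn(16000)
shape = stft(mix, fs=16000, nperseg=512)[2].shape
voice, ambience = separate_with_masks(mix, np.random.rand(*shape), np.random.rand(*shape))
print(voice.shape, ambience.shape)
```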
  • the computer device can separate environmental sounds (for example, BGM sounds) from the original audio frames of the multimedia data through the source separation model shown in Figure 7 to eliminate the impact of environmental sounds on subsequent clustering, thereby improving clustering accuracy.
  • FIG. 8 is a schematic diagram of the model architecture of an audio semantic feature extraction model provided by an embodiment of the present application.
  • the audio semantic feature extraction model in the embodiment of the present application may be the audio semantic feature extraction model 640w in the embodiment corresponding to Figure 6.
  • the audio semantic feature extraction model shown in Figure 8 can be the Wavegram_Logmel128_Cnn14 model.
  • the biggest feature of this audio semantic feature extraction model is that it takes the original audio sampling-point sequence as input, that is, the input of the entire network is the N object audio frames of the audio signal. This eliminates the need to extract basic audio features in advance. Since extracting basic audio features is very time-consuming, and using basic audio features as input occupies a particularly large amount of hardware resources, using this audio semantic feature extraction model to process the N object audio frames of the input audio signal can save computer resources and improve computing efficiency.
  • the audio semantic feature extraction model may include a time domain branch network layer, a frequency domain branch network layer and a convolution network layer.
  • the time domain branch network layer here may include a convolution layer 801w (for example, a one-dimensional convolution layer with a convolution size of 1 and a stride of 5), a convolution layer 802w (for example, a basic block including a one-dimensional convolution layer), a max-pooling layer 803w (for example, a max-pooling layer with a stride of 4), a convolution layer 804w (for example, a basic block including a one-dimensional convolution layer), a max-pooling layer 805w (for example, a max-pooling layer with a stride of 4), and a convolution layer 806w (for example, a basic block including a one-dimensional convolution layer).
  • through these large one-dimensional convolution layers, the computer device can directly learn the time domain characteristics of the audio signal, especially information such as audio loudness and sampling-point amplitude. After a large number of one-dimensional convolutional layers, a two-dimensional wavegram is obtained to represent the learned time domain feature map, so that the outputs of the time domain branch and the frequency domain branch can be combined.
  • the computer device can also perform feature learning on N object audio frames through the frequency domain branch network layer to obtain the learned frequency domain feature map (frequency domain learning feature).
  • the frequency domain branch network layer here may include a convolution layer 809w (for example, a two-dimensional convolution layer including a basic block).
  • the computer device can input N object audio frames to the frequency domain branch network layer and generate frequency domain spectra corresponding to the N object audio frames (for example, using Mel frequency to generate a log-mel spectrum).
  • the computer device inputs the frequency domain spectrum to the convolution layer 809w shown in Figure 8, so as to obtain, through the multiple two-dimensional convolution layers in the convolution layer 809w, a learned frequency domain feature map with the same feature dimensions as the learned time domain feature map.
  • the computer device can superimpose (for example, splice) the learned frequency domain feature map and the learned time domain feature map, so that the superimposed feature can be obtained.
  • the computer device then inputs the superimposed features into the convolutional network layer, performs maximum and average processing on the superimposed features, and outputs the audio semantic feature vector corresponding to each object audio frame.
  • the convolutional network layer here may include a convolutional layer 810w (for example, a two-dimensional convolutional layer) and an activation layer 811w.
  • the computer device can splice the feature map representing the learned frequency domain feature map with the feature map representing the learned time domain feature map, forming a set of two-dimensional feature maps that represent the superimposed features.
  • the computer device can input the two-dimensional feature map representing the superimposed features into the convolution layer 810w shown in Figure 8, and then separately apply two-dimensional pooling to the features output by the convolution layer 810w, performing maximum processing and average processing to extract the maximum representation and average representation of the current feature. Furthermore, the computer device can determine the maximum-processed feature as the first sub-feature and the average-processed feature as the second sub-feature. At this time, the computer device can merge the first sub-feature and the second sub-feature, and then input the merged feature to the activation layer 811w shown in Figure 8 to finally generate an audio semantic feature vector set with 2048 dimensions.
  • the audio semantic feature vector set may include an audio semantic feature vector corresponding to each of the N object audio frames.
  • the computer device can quickly perform audio semantic feature extraction on each of the N object audio frames through the audio semantic feature extraction model shown in Figure 8, so as to obtain each object more quickly and accurately The audio semantic feature vectors corresponding to the audio frames respectively.
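  • A heavily simplified sketch of such a two-branch extractor is given below; the layer sizes, pooling shapes, and output dimension are illustrative assumptions and this is not the actual Wavegram_Logmel128_Cnn14 architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWavegramLogmelCNN(nn.Module):
    """Toy two-branch audio semantic feature extractor: a time-domain branch of
    1-D convolutions over the raw sampling points, a frequency-domain branch over
    a log spectrogram, feature superposition, then max + average pooling into one
    semantic vector per object audio frame."""
    def __init__(self, n_fft=512, hop=256, embed_dim=256):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.time_branch = nn.Sequential(            # large 1-D convolutions on raw audio
            nn.Conv1d(1, 32, kernel_size=11, stride=5, padding=5), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.freq_branch = nn.Sequential(             # 2-D convolutions on the spectrogram
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.post = nn.Conv2d(128, embed_dim, kernel_size=3, padding=1)

    def forward(self, wav):                           # wav: (batch, samples)
        t = self.time_branch(wav.unsqueeze(1))        # learned time domain features
        t = t.unsqueeze(2)                            # treat as a 2-D "wavegram"
        spec = torch.stft(wav, self.n_fft, self.hop,
                          window=torch.hann_window(self.n_fft), return_complex=True)
        logspec = torch.log1p(spec.abs()).unsqueeze(1)            # (batch, 1, F, T)
        f = self.freq_branch(logspec)                 # learned frequency domain features
        # superimpose: resize both maps to a common grid, then concatenate channels
        f = F.adaptive_avg_pool2d(f, (8, 32))
        t = F.adaptive_avg_pool2d(t, (8, 32))
        x = self.post(torch.cat([t, f], dim=1))
        maximum = x.amax(dim=(2, 3))                  # maximum representation
        average = x.mean(dim=(2, 3))                  # average representation
        return maximum + average                      # audio semantic feature vector

# Hypothetical usage: 2 object audio frames of 1 second at 16 kHz
print(TinyWavegramLogmelCNN()(torch.randn(2, 16000)).shape)  # torch.Size([2, 256])
```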
  • Step S103 Identify the business role corresponding to each of the P audio clusters based on the picture feature information, M audio clusters and the object role mapping table associated with the multimedia data.
  • the object role mapping table (for example, the object role mapping table shown in Table 1 above) may include business roles that have a mapping relationship with the list business object, and there are P overlapping business objects between the list business object and the M business objects.
  • the computer device may obtain the audio cluster C k from the M audio clusters.
  • the computer device can extract the first playing time of the audio cluster C k in the multimedia data, where k is a positive integer less than or equal to M.
  • the first playback time of the audio cluster C k in the multimedia data is one or more playback times in the multimedia data of the object audio frame corresponding to the audio semantic feature vector included in the audio cluster C k .
  • the computer device can obtain P business objects that overlap with the M business objects from the list of business objects in the object role mapping table associated with the multimedia data. Furthermore, the computer device can extract the second playback time of each of the P business objects in the multimedia data based on the picture feature information. The second playback time of each of the P business objects in the multimedia data is one or more playback times in the multimedia data of the video frame in which each of the P business objects is located. At this time, the computer device can respectively determine the time overlap between the first playback time and each second playback time of the audio cluster C k . Furthermore, the computer device can use the business object corresponding to the second playback time with the highest degree of time overlap as the business object corresponding to the audio cluster C k . Further, the computer device can obtain the business role corresponding to the business object corresponding to the audio cluster C k from the object role mapping table, and use the obtained business role as the business role corresponding to the audio cluster C k .
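  • A minimal sketch of the time-overlap matching between an audio cluster's first playback time and each business object's second playback time is shown below; representing the playback times as (start, end) pairs in seconds is an assumption for illustration:

```python
def overlap_seconds(times_a, times_b):
    """Total overlap, in seconds, between two lists of (start, end) playback times."""
    return sum(max(0.0, min(e1, e2) - max(s1, s2))
               for s1, e1 in times_a for s2, e2 in times_b)

def match_cluster_to_object(first_playback_time, second_playback_times):
    """Pick the business object whose second playback time overlaps most with the
    audio cluster's first playback time."""
    return max(second_playback_times,
               key=lambda obj: overlap_seconds(first_playback_time,
                                               second_playback_times[obj]))

# Hypothetical usage: cluster C_k speaks around 00:30-10:10 and 35:08-40:52
cluster_time = [(30, 610), (2108, 2452)]
object_times = {"object_a": [(0, 600), (1845, 2280)], "object_b": [(605, 1713)]}
print(match_cluster_to_object(cluster_time, object_times))  # object_a
```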
  • the embodiments of this application start from the audio perspective, identify characters in multimedia data, and attribute each audio line to a role. This can supplement accurate line-to-role information in shots and scenes where no key-part information of the character is available in the picture, thus improving the accuracy of role recognition.
  • FIG. 9 is a schematic diagram of a scenario for audio character recognition provided by an embodiment of the present application.
  • the computer device executes step S101, it can determine through the image feature information recognized by the first module 201 that the number M of business objects to which the character pictures in the video frames of the multimedia data belong is 3. Specifically, it can include objects a, object b and object c.
  • the computer device executes step S102, it can determine that there are three audio clusters through the audio processing results clustered by the second module 202. Specifically, it may include audio clustering cluster C 1 , audio clustering cluster C 2 and audio clustering cluster C 3 shown in FIG. 9 .
  • the N object audio frames in the embodiment of the present application may include segment 1, segment 2, segment 3, segment 4, segment 5, and segment 6 shown in FIG. 9 .
  • these 6 segments are arranged according to playing time.
  • the object audio frames corresponding to audio cluster C 1 may include object audio frames in segment 1 and segment 3 .
  • the object audio frames corresponding to audio cluster C 2 may include object audio frames in segment 2, segment 4, and segment 6.
  • the object audio frame corresponding to audio cluster C 3 may include the object audio frame in segment 5 .
  • the computer device can obtain, from the list of business objects in the object role mapping table shown in Table 1, business objects that overlap with the M business objects obtained by the computer device in the first module.
  • the list business objects in Table 1 above may include four business objects: object a, object b, object c, and object d.
  • the M business objects obtained by the computer device in the embodiment of the present application may include object a, object b, and object c.
  • the computer device can extract the playback time (ie, the second playback time) of each of the three overlapping business objects in the multimedia data based on the picture feature information.
  • the second playback time of object a in the multimedia data is playback time T 1 (for example, 00:00-10:00) and playback time T 3 (for example, 30:45-38:00); the second playback time of object b in the multimedia data is playback time T 2 (for example, 10:05-28:33), playback time T 4 (for example, 40:05-55:39), and playback time T 6 (for example, 100:03-113:57); the second playback time of object c in the multimedia data is playback time T 5 (for example, 80:30-88:50).
  • the computer device can obtain the audio cluster C 1 from these three audio clusters, and then can extract the playback time of the audio cluster C 1 in the multimedia data (ie, the first playback time of the audio cluster C 1 ).
  • the first playback time of the audio cluster C 1 in the multimedia data may include the playback time t 1 corresponding to segment 1 (for example, 00:30-10:10) and the playback time t 3 corresponding to segment 3 (for example, 35:08-40:52).
  • the computer device can respectively determine the time overlap between the audio cluster C 1 and the second playback time corresponding to each business object.
  • the time overlap between the first playback time of audio cluster C 1 and the second playback time of object a is 98%
  • the time overlap with the second playback time of object b is 5%
  • the time overlap with the second playback time of object c is 1%.
  • the computer device can determine the second playback time with the highest time overlap degree from the three time overlap degrees, that is, the second playback time of object a. Further, the computer device can use object a as the business object corresponding to the audio cluster C 1 , and obtain the business roles that have a mapping relationship with object a (i.e., role 1 and role 2) from the above Table 1 as the business roles corresponding to the audio cluster C 1 .
  • the computer device can refer to the audio role identification method of the business role corresponding to the audio cluster C 1 and determine that the business role corresponding to the audio cluster C 2 can be the role 3 that has a mapping relationship with the object b.
  • similarly, the business role corresponding to the audio cluster C 3 may be role 4, which has a mapping relationship with object c.
  • a computer device with an audio character recognition function can associate sounds with characters by combining the picture feature information automatically recognized from video frames with the M adaptively clustered audio clusters, so that the business roles corresponding to the P audio clusters associated with the object role mapping table can be accurately identified. This audio role identification method does not require manual annotation of the business role to which each audio line belongs; it can not only reduce the consumption of manpower and time, but also solve the problem of recognition errors caused by similar timbres, thereby improving the accuracy and efficiency of recognition.
  • the embodiments of the present application can adopt the audio semantic feature clustering method in the audio character recognition process, making the entire audio character recognition system more versatile and applicable to different scenarios of business objects in different multimedia data, thereby effectively improving the recognition applicability.
  • FIG. 10 is another schematic flowchart of a data processing method provided by an embodiment of the present application.
  • This method can be executed by a terminal device with an audio role recognition function (for example, any terminal device in the terminal device cluster shown in FIG. 1 above, such as the terminal device 100a), by a server with an audio role recognition function (for example, the server 10F shown in FIG. 1), or interactively by a target terminal device with a multimedia data playback function and a server with an audio role recognition function, which is not limited here.
  • the method may at least include the following steps S201 to S205:
  • Step S201 Identify picture feature information from video frames of multimedia data.
  • Step S202: Locate and separate the audio frames containing human voices from the original audio frames of the multimedia data to obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters.
  • Step S203 Identify the business role corresponding to each of the P audio clusters based on the picture feature information, M audio clusters and the object role mapping table associated with the multimedia data.
  • step S201 to step S203 please refer to the description of step S101 to step S103 in the embodiment corresponding to FIG. 3, which will not be described again here.
  • Step S204: Based on the first playback time in the multimedia data of the P audio clusters (specifically, of the object audio frames corresponding to the P audio clusters) and the second playback time in the multimedia data of the business objects corresponding to the P audio clusters (specifically, of the video frames in which those business objects appear), determine the service playback time in the multimedia data of each of the P business objects.
  • the computer device can obtain a target audio cluster from the P audio clusters, and can further determine the first playback time of the target audio cluster in the multimedia data and the second playback time in the multimedia data of the business object corresponding to the target audio cluster. Further, the computer device can determine the time intersection or time union of the first playback time and the second playback time of the target audio cluster, and then use the determined time intersection or time union as the service playback time in the multimedia data of the business object corresponding to the target audio cluster, until the service playback time in the multimedia data of each of the P business objects is obtained.
  • the embodiment of the present application uses the audio semantic feature clustering method to perform audio character recognition, which can make up for the situation where some video frames contain no character facial information or object information, so that the character cannot be recognized from the picture even though its audio is present. It can automatically determine, based on the semantic features of the object audio frames, the business role corresponding to the current audio cluster, thus filling the shortcomings of role recognition by image recognition alone and ensuring the integrity of the role's time positioning information across the entire multimedia data.
  • the first playback time of audio cluster C 1 in the multimedia data may include the playback time t 1 corresponding to segment 1 (for example, 00:30-10:10) and the playback time corresponding to segment 3 t 3 (e.g., 35:08-40:52).
  • the second playback time of the business object (for example, object a) corresponding to the audio cluster C 1 in the multimedia data is the playback time T 1 (for example, 00:00-10:00) and the playback time T 3 (for example, 30 :45-38:00).
  • if the computer device uses the time intersection method to determine the service playback time, the service playback time of object a determined by the computer device can be 00:30-10:00 and 35:08-38:00; if the computer device uses the time union method, the service playback time of object a determined by the computer device can be 00:00-10:10 and 30:45-40:52.
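  • A minimal sketch of the intersection and union calculations over the two playback-time lists is shown below; representing the times as seconds is an assumption for illustration:

```python
def intersect(first_times, second_times):
    """Pairwise time intersection of two lists of (start, end) playback times."""
    out = []
    for s1, e1 in first_times:
        for s2, e2 in second_times:
            s, e = max(s1, s2), min(e1, e2)
            if s < e:
                out.append((s, e))
    return out

def union(first_times, second_times):
    """Time union: merge all overlapping (start, end) playback times."""
    merged = []
    for s, e in sorted(first_times + second_times):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

# Hypothetical usage with the example times above (mm:ss converted to seconds)
t = [(30, 610), (2108, 2452)]   # first playback time of audio cluster C1
T = [(0, 600), (1845, 2280)]    # second playback time of object a
print(intersect(t, T))  # [(30, 600), (2108, 2280)] -> 00:30-10:00, 35:08-38:00
print(union(t, T))      # [(0, 610), (1845, 2452)]  -> 00:00-10:10, 30:45-40:52
```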
  • Step S205 Based on the service playback time corresponding to each of the P business objects, obtain the multimedia segment data corresponding to the P business objects from the multimedia data.
  • the multimedia segment data here may include the audio frames associated with the corresponding business object and the video frames associated with the corresponding business object.
  • when the computer device obtains the service playback time of object a, the service playback time of object b, and the service playback time of object c, it can respectively obtain the multimedia segment data corresponding to these three business objects.
  • for example, the computer device can obtain, from the multimedia data, the multimedia segment data that matches the service playback time of object a (that is, the data including the video frames associated with object a and the audio frames associated with object a) as the multimedia segment data corresponding to object a (for example, multimedia segment data 1).
  • similarly, the computer device can obtain the multimedia segment data that matches the service playback time of object b (that is, the data including the video frames associated with object b and the audio frames associated with object b) as the multimedia segment data corresponding to object b (for example, multimedia segment data 2), and obtain the multimedia segment data that matches the service playback time of object c (that is, the data including the video frames associated with object c and the audio frames associated with object c) as the multimedia segment data corresponding to object c (for example, multimedia segment data 3).
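  • As an illustrative sketch that is not part of the original disclosure, the multimedia segment data for a business object could be cut out of the source file with a standard tool such as ffmpeg, one clip per service playback time; the file names and times below are hypothetical:

```python
import subprocess

def cut_segments(source_path, service_playback_times, prefix):
    """Cut one clip per (start, end) service playback time (in seconds) from the
    source video. Assumes ffmpeg is installed; stream-copies without re-encoding."""
    for i, (start, end) in enumerate(service_playback_times):
        subprocess.run([
            "ffmpeg", "-y", "-i", source_path,
            "-ss", str(start), "-to", str(end),
            "-c", "copy", f"{prefix}_{i}.mp4",
        ], check=True)

# Hypothetical usage for object a's service playback time
# cut_segments("multimedia_data.mp4", [(30, 600), (2108, 2280)], "object_a_segment")
```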
  • the fully automatic audio character recognition solution based on the audio semantic feature clustering method provided by the embodiments of the present application can automatically combine picture feature information (for example, character facial information) to identify business characters in multimedia data, which can save a lot of manual annotation costs and time costs and accelerate the implementation of video applications.
  • when the computer device obtains the multimedia segment data corresponding to each business object, this can be applied to the "watch TA only" user-specific service in the multimedia data playback scenario: storyboards can be selected for a business object (or business role) in the multimedia data, so that when the target user triggers this user-specific service, the multimedia segment data not selected by the user is automatically skipped, and the computer device can more precisely locate the multimedia segment data of the business objects that the user likes.
  • the computer device can play multimedia data in a business playback display interface.
  • the service playback display interface may include a playback selection control for triggering a target video data selection function.
  • the computer device may display the object playlist in response to the triggering operation.
  • the object playlist here can be displayed in the bottom area of the business playback display interface in a floating window form, a masked form, or a translucent form, or it can be displayed on a shrinkable sub-interface whose display size can be changed through drag-and-drop operations and whose size is smaller than the service playback display interface.
  • the object playlist here may include object cover data corresponding to Z business objects respectively; and Z is a positive integer less than or equal to P.
  • the target multimedia segment data here may be the multimedia segment data corresponding to the business object corresponding to the target object cover data, and the business object corresponding to the target object cover data belongs to P business objects.
  • the triggering operations here may include contact operations such as clicks and long presses, and may also include non-contact operations such as voice and gestures, which will not be limited here.
  • FIG. 11 is a schematic diagram of a scene for displaying multimedia segment data according to an embodiment of the present application.
  • the computer device in the embodiment of the present application may be a target terminal device used by the target user.
  • the target terminal device may be any terminal device in the terminal device cluster in the embodiment corresponding to Figure 1, for example, the terminal device 100a.
  • the interface 1101J and the interface 1102J shown in Figure 11 are both service playback display interfaces at different times provided by a client with a multimedia data playback function.
  • the target terminal device used by the target user can display multimedia data in the interface 1101J.
  • the multimedia data here can be the multimedia data 20S in the embodiment corresponding to Figure 2.
  • the interface 1101J may include a control 11U, which is a playback selection control used to trigger the target video data selection function.
  • the target terminal device may display the object playlist 11B shown in FIG. 11 in response to the triggering operation.
  • the object playlist 11B here may include object cover data corresponding to the Z business objects and cover data corresponding to the multimedia data (for example, "watch the complete video").
  • the object playlist 11B may specifically include the object cover data 1 corresponding to object a (for example, "watch only the clips of object a"), the object cover data 2 corresponding to object b (for example, "watch only the clips of object b"), and the object cover data 3 corresponding to object c (for example, "watch only the clips of object c").
  • object a, object b, and object c here all belong to the P business objects obtained by the target terminal device by performing audio role recognition on the multimedia data.
  • the target user can perform a triggering operation on the target object cover data (for example, the object cover data 1 corresponding to object a) among the Z pieces of object cover data.
  • the target terminal device can play the multimedia segment data corresponding to the object a corresponding to the object cover data 1 in the interface 1102J shown in FIG. 11 .
  • the target terminal device can also highlight, in the playback progress bar of the multimedia data displayed on the interface 1102J, the playback progress corresponding to the multimedia segment data of object a, so that the target user can more quickly and accurately find the next piece of multimedia segment data corresponding to object a that they are interested in.
  • when the computer device obtains the multimedia segment data corresponding to each business object, it can also apply it in merged editing scenarios. For example, the computer device classifies the audio data in the multimedia data, distinguishes the business role corresponding to each audio line, and organizes the line voice collection (i.e., audio cluster) corresponding to each business role across the entire multimedia data, using it as production material and providing it to an intelligent video production team as candidate information for editing. For example, the computer device can perform mixed cutting on multiple pieces of multimedia segment data of the same business object from different multimedia data. For another example, the computer device can merge and edit the corresponding multimedia segment data of different business objects.
  • the multimedia data here may include first multimedia data and second multimedia data. Both the first multimedia data and the second multimedia data include objects to be edited.
  • the objects to be edited belong to the P business objects obtained through audio role recognition by the computer equipment.
  • for example, the first multimedia data here can be a war-themed TV series in which the object to be edited participates, and the second multimedia data here can be a fairy-tale-themed TV series in which the object to be edited participates.
  • the computer device can obtain the first target business role corresponding to the object to be edited based on the object role mapping table associated with the first multimedia data, and can further obtain, from the first multimedia data, the first multimedia segment data associated with the first target business role. The first multimedia segment data here is determined by the computer device based on the service playback time of the object to be edited in the first multimedia data. Similarly, the computer device can obtain the second target business role corresponding to the object to be edited based on the object role mapping table associated with the second multimedia data, and obtain, from the second multimedia data, the second multimedia segment data associated with the second target business role. The second multimedia segment data here may be determined by the computer device based on the service playback time of the object to be edited in the second multimedia data.
  • the computer device can perform merge and edit processing on the first multimedia segment data and the second multimedia segment data to obtain merged editing data corresponding to the object to be edited.
  • the merged clip data here can be used to upload to the business data platform where the client is located, so that objects accessing the client can check it on the corresponding terminal device.
  • a computer device with an audio character recognition function can associate sounds with characters by combining the picture feature information automatically recognized from video frames with the M adaptively clustered audio clusters, so that the business roles corresponding to the P audio clusters associated with the object role mapping table can be accurately identified.
  • This audio role recognition method does not require manual annotation of the business role to which each audio line belongs. Instead, it can automatically identify and write the business role and audio line information before the multimedia data is put on the shelf, so that downstream services (for example, the user-specific playback service, the merged editing service, etc.) can be quickly empowered.
  • the embodiment of the present application adopts the audio semantic feature clustering method in the audio character recognition process, which can not only reduce the manpower time consumed, but also solve the problem of similar timbre recognition errors, so as to improve the accuracy and efficiency of recognition.
  • the entire audio character recognition system is more versatile and can be applied to different scenarios of business objects in different multimedia data, thus effectively improving the applicability of recognition.
  • FIG. 12 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • the data processing device 1 may include: a picture information acquisition module 100, a clustering processing module 200, and an audio character recognition module 300.
  • the picture information acquisition module 100 is used to identify picture feature information from video frames of multimedia data.
  • the picture feature information includes M business objects to which the character pictures in the video frame belong, and M is a positive integer.
  • the clustering processing module 200 is used to locate and separate audio frames containing human voices from the original audio frames of multimedia data, obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames, and The audio semantic feature vectors corresponding to N object audio frames are clustered to obtain M audio clusters, where N is a positive integer, and one audio cluster corresponds to one business object.
  • the audio role recognition module 300 is used to identify the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data, where P is a positive integer less than or equal to M.
  • the object role mapping table includes business roles that have a mapping relationship with the list business object. There are P overlapping business objects between the list business object and M business objects.
  • FIG. 13 is another schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • the data processing device 2 may include: a picture information acquisition module 11, a clustering processing module 12, an audio role recognition module 13, a business time determination module 14, a segment data determination module 15, a multimedia data playback module 16, Object list display module 17, segment data playback module 18, first segment data acquisition module 19, second segment data acquisition module 20 and merge editing module 21.
  • the picture information acquisition module 11 is used to identify picture feature information from video frames of multimedia data, where the picture feature information includes M business objects to which character pictures in the video frames belong, and M is a positive integer.
  • the picture information acquisition module 11 includes: a video frame acquisition unit 111, a picture cutting unit 112, a picture encoding unit 113, a vector matching unit 114 and a picture information acquisition unit 115.
  • the video frame acquisition unit 111 is used to acquire video frames from multimedia data.
  • the picture cutting unit 112 is used to cut pictures containing key parts of the character in the video frame to obtain the character picture corresponding to the video frame.
  • the character pictures include X character cut pictures, where X is a positive integer greater than or equal to M.
  • the picture cutting unit 112 includes: a position determining subunit 1121 and a cutting subunit 1122.
  • the position determination subunit 1121 is used to detect and locate the key parts of the character in the video frame to determine the position information of the key parts of the character in the video frame.
  • the cutting subunit 1122 is used to cut the key parts of the character in the video frame based on the position information, obtain X character cut pictures containing the key parts of the character, and use the X character cut pictures as the character pictures corresponding to the video frame.
  • the picture encoding unit 113 is used to obtain the character cut picture T i among the X character cut pictures and perform picture encoding on the character cut picture T i to obtain the picture information vector L i corresponding to the character cut picture T i , where i is a positive integer less than or equal to X.
  • the vector matching unit 114 is used to determine the object key information vector that matches the picture information vector Li from the information vector database associated with the candidate business object, and use the candidate business object corresponding to the matched object key information vector as the role Cut the business object corresponding to picture T i .
  • the vector matching unit 114 includes: a database acquisition subunit 1141, a vector distance determination subunit 1142, and an object matching subunit 1143.
  • the database acquisition subunit 1141 is used to acquire an information vector database associated with candidate business objects, where the information vector database is used to store object key information vectors corresponding to Y candidate business objects, where Y is a positive integer greater than or equal to M.
  • the vector distance determination subunit 1142 is used to respectively determine the vector distance between the picture information vector Li and each object key information vector in the Y object key information vectors, to obtain Y vector distances.
  • the object matching subunit 1143 is used to obtain the minimum vector distance that is less than or equal to the distance threshold from Y vector distances, determine the candidate business object corresponding to the object key information vector corresponding to the minimum vector distance, and use the determined candidate business object as a role Cut the business object corresponding to picture T i .
  • the picture information acquisition unit 115 is configured to determine the picture feature information corresponding to the video frame based on the obtained business objects corresponding to the character cut pictures.
  • for the specific implementation of the video frame acquisition unit 111, the picture cutting unit 112, the picture encoding unit 113, the vector matching unit 114, and the picture information acquisition unit 115, please refer to the description of step S101 in the embodiment corresponding to Figure 3 above, which will not be described further here.
  • the clustering processing module 12 is used to locate and separate audio frames containing human voices from the original audio frames of multimedia data, obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames, and The audio semantic feature vectors corresponding to N object audio frames are clustered to obtain M audio clusters, where N is a positive integer, and one audio cluster corresponds to one business object.
  • the clustering processing module 12 includes: an object audio frame determination unit 121, a semantic feature extraction unit 122, and a clustering processing unit 123.
  • the object audio frame determining unit 121 is used to locate and separate audio frames containing human voices from original audio frames of multimedia data to obtain N object audio frames.
  • the object audio frame determination unit 121 includes: an original audio frame acquisition subunit 1211, a source separation subunit 1212, and an object audio frame determination subunit 1213.
  • the original audio frame acquisition subunit 1211 is used to acquire original audio frames from multimedia data.
  • the source separation subunit 1212 is used to perform source separation on the original audio frame to obtain an audio frame to be processed that contains human voice.
  • the source separation sub-unit 1212 includes: an amplitude spectrum generating sub-unit 12121, a type feature generating sub-unit 12122, a merging mask sub-unit 12123 and an audio frame to be processed determining sub-unit 12124.
  • the amplitude spectrum generation subunit 12121 is used to input the original audio frame to the source separation model, and generate the spectrum amplitude spectrum corresponding to the original audio frame through the source separation model.
  • the source separation model includes a first segmentation network layer and a second segmentation network layer.
  • the type feature generation subunit 12122 is used to input the spectrum amplitude spectrum into the first segmentation network layer and the second segmentation network layer respectively, generate the first type feature corresponding to the spectrum amplitude spectrum through the first segmentation network layer, and generate the first type feature corresponding to the spectrum amplitude spectrum through the second segmentation network layer. Generate second type features corresponding to the spectral amplitude spectrum.
  • the merge mask subunit 12123 is used to perform merge mask processing on the first type features and the second type features to obtain a target mask map corresponding to the first type features.
  • the audio frame determination subunit 12124 to be processed is used to generate a target type audio frame through spectrum inverse transformation based on the corresponding position of the target mask map and the spectrum amplitude spectrum, and use the target type audio frame as the source separation model to output the audio frame containing the human voice. of audio frames to be processed.
  • for the specific implementation of the amplitude spectrum generation subunit 12121, the type feature generation subunit 12122, the merging mask subunit 12123, and the to-be-processed audio frame determination subunit 12124, please refer to the description of the audio frame to be processed in the embodiment corresponding to Figure 7, which will not be described further here.
  • the object audio frame determination subunit 1213 is used to locate and cut the non-silent segments in the audio impact signal frame in the audio frame to be processed based on the audio boundary detection strategy for eliminating silent frames, to obtain N object audio frames.
  • for the specific implementation of the source separation subunit 1212 and the object audio frame determination subunit 1213, please refer to the description of the object positioning and separation processing of the original audio frame in the embodiment corresponding to Figure 3, which will not be described further here.
  • the semantic feature extraction unit 122 is used to extract semantic features from each of the N object audio frames, and obtain an audio semantic feature vector corresponding to each object audio frame.
  • the semantic feature extraction unit 122 includes: an audio frame input subunit 1221, a frequency domain feature determination subunit 1222, a time domain feature determination subunit 1223, and an audio feature vector determination subunit 1224.
  • the audio frame input subunit 1221 is used to input N object audio frames to the audio semantic feature extraction model.
  • the audio semantic feature extraction model includes frequency domain branch network layer, time domain branch network layer and convolution network layer.
  • the frequency domain feature determination subunit 1222 is used to perform feature learning on N object audio frames through the frequency domain branch network layer to obtain a learned frequency domain feature map.
  • the time domain feature determination subunit 1223 is used to perform feature learning on N object audio frames through the time domain branch network layer to obtain a learned time domain feature map.
  • the feature dimensions between the learned frequency domain feature map and the learned time domain feature map are the same.
  • the audio feature vector determination subunit 1224 is used to superimpose the learned frequency domain feature map and the learned time domain feature map to obtain superimposed features, input the superimposed features to the convolution network layer, and perform maximum average processing on the superimposed features. Output the audio semantic feature vector corresponding to each object audio frame.
  • for the specific implementation of the audio frame input subunit 1221, the frequency domain feature determination subunit 1222, the time domain feature determination subunit 1223, and the audio feature vector determination subunit 1224, please refer to the description of the semantic feature extraction of the object audio frames in the embodiment corresponding to FIG. 8, which will not be described again here.
  • the clustering processing unit 123 is used to determine M as the number of cluster centers to be clustered, and perform clustering processing on the audio semantic feature vector corresponding to each obtained object audio frame based on the number of cluster centers to obtain M audio clusters. Class cluster.
  • for the specific implementation of the object audio frame determination unit 121, the semantic feature extraction unit 122, and the clustering processing unit 123, please refer to the description of step S102 in the embodiment corresponding to Figure 3 above, which will not be described again here.
  • the audio role recognition module 13 is used to identify the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data, where P is a positive integer less than or equal to M.
  • the object role mapping table includes business roles that have a mapping relationship with the list business object. There are P overlapping business objects between the list business object and M business objects.
  • the audio character recognition module 13 includes: a first time extraction unit 131, a second time extraction unit 132, a time overlap determination unit 133, and an audio character recognition unit 134.
  • the first time extraction unit 131 is used to obtain the audio cluster C k from the M audio clusters, and to extract the one or more playback times in the multimedia data of the object audio frames corresponding to the audio semantic feature vectors included in the audio cluster C k as the first playback time of the audio cluster C k , where k is a positive integer less than or equal to M.
  • the second time extraction unit 132 is used to obtain, from the list business objects in the object role mapping table, P business objects that overlap with the M business objects, and to extract, based on the picture feature information, the one or more playback times in the multimedia data of the video frames in which each of the P business objects is located as the second playback time of each business object.
  • the time overlap determination unit 133 is used to respectively determine the time overlap between the first playback time of audio cluster C k and the second playback time corresponding to each business object, and to take the business object corresponding to the second playback time with the highest time overlap as the business object corresponding to audio cluster C k .
  • the audio role identification unit 134 is used to obtain, from the object role mapping table, the business role corresponding to the business object of audio cluster C k , and to use the obtained business role as the business role corresponding to audio cluster C k .
  • for the specific implementation of the first time extraction unit 131, the second time extraction unit 132, the time overlap determination unit 133 and the audio character recognition unit 134, please refer to the description of step S103 in the embodiment corresponding to Figure 3 above, which will not be repeated here.
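The matching performed by units 131 to 134 can be pictured with the hedged sketch below; representing playback times as sets of whole seconds and the dictionary layout are illustrative assumptions, not the data structures of the embodiments:

```python
def assign_business_roles(cluster_times, object_times, object_role_map):
    # cluster_times:   {cluster_id: set of playback seconds of its object audio frames}   (first playback time)
    # object_times:    {business_object: set of playback seconds of frames showing it}    (second playback time)
    # object_role_map: {business_object: business_role} from the object role mapping table
    result = {}
    for cid, audio_secs in cluster_times.items():
        # choose the business object whose on-screen time overlaps the cluster's speech time the most
        best_obj = max(object_times, key=lambda obj: len(audio_secs & object_times[obj]))
        result[cid] = (best_obj, object_role_map.get(best_obj))
    return result
```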
  • the service time determination module 14 is configured to determine the service playback time of each of the P business objects in the multimedia data based on the first playback time of the P audio clusters in the multimedia data and the second playback time, in the multimedia data, of the business objects corresponding to the P audio clusters.
  • the segment data determination module 15 is used to obtain multimedia segment data corresponding to P business objects from the multimedia data based on the service playback time corresponding to each business object.
  • the multimedia segment data includes audio frames associated with the corresponding business object and video frames associated with the corresponding business object.
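One way to picture how the service playback time of module 14 can be turned into the multimedia segment data of module 15 is to merge the two kinds of playback times of a business object into intervals and cut the clips along those intervals; the gap threshold below is an illustrative assumption:

```python
def service_playback_intervals(first_playback_secs, second_playback_secs, max_gap=2.0):
    # Union of the audio-based and picture-based playback times of one business object,
    # merged into (start, end) intervals whenever consecutive timestamps are close enough.
    ts = sorted(set(first_playback_secs) | set(second_playback_secs))
    if not ts:
        return []
    intervals, start, prev = [], ts[0], ts[0]
    for t in ts[1:]:
        if t - prev > max_gap:
            intervals.append((start, prev))
            start = t
        prev = t
    intervals.append((start, prev))
    return intervals
```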
  • the multimedia data playing module 16 is used to play multimedia data in the service playing display interface.
  • the service playback display interface includes a playback selection control used to trigger the object video data selection function.
  • the object list display module 17 is used to display an object playlist in response to a trigger operation on the playback selection control, where the object playlist includes object cover data corresponding to Z business objects, and Z is a positive integer less than or equal to P;
  • the segment data playback module 18 is used to play target multimedia segment data in the service playback display interface in response to a trigger operation on target object cover data among the Z pieces of object cover data, where the target multimedia segment data is the multimedia segment data corresponding to the business object associated with the target object cover data, and that business object belongs to the P business objects.
  • the multimedia data includes first multimedia data and second multimedia data; both the first multimedia data and the second multimedia data include an object to be edited, and the object to be edited belongs to the P business objects.
  • the first segment data acquisition module 19 is configured to obtain the first target business role corresponding to the object to be edited based on the object role mapping table associated with the first multimedia data, and to obtain, from the first multimedia data, the first multimedia segment data associated with the first target business role; the first multimedia segment data is determined based on the service playback time of the object to be edited in the first multimedia data.
  • the second segment data acquisition module 20 is configured to obtain the second target business role corresponding to the object to be edited based on the object role mapping table associated with the second multimedia data, and to obtain, from the second multimedia data, the second multimedia segment data associated with the second target business role; the second multimedia segment data is determined based on the service playback time of the object to be edited in the second multimedia data.
  • the merging and editing module 21 is used to perform merging and editing processing on the first multimedia segment data and the second multimedia segment data to obtain merged editing data corresponding to the object to be edited.
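A hedged sketch of how modules 19 to 21 could be realised with the moviepy 1.x package (naming this library is an assumption; the embodiments do not prescribe an editing tool): cut the first and second multimedia segment data by their service playback intervals and concatenate them into the merged editing data.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def merge_character_clips(first_path, first_interval, second_path, second_interval, out_path):
    # first_interval / second_interval: (start_sec, end_sec) service playback times of the
    # object to be edited in the first and second multimedia data
    clip1 = VideoFileClip(first_path).subclip(*first_interval)
    clip2 = VideoFileClip(second_path).subclip(*second_interval)
    concatenate_videoclips([clip1, clip2]).write_videofile(out_path)
```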
  • for the specific implementation of the picture information acquisition module 11, the clustering processing module 12, the audio role recognition module 13, the service time determination module 14, the segment data determination module 15, the multimedia data playback module 16, the object list display module 17, the segment data playback module 18, the first segment data acquisition module 19, the second segment data acquisition module 20 and the merge editing module 21, please refer to the description of steps S201 to S205 in the embodiment corresponding to Figure 10 above, which will not be repeated here; the description of the corresponding beneficial effects will likewise not be repeated.
  • the computer device 1000 may be a computer device with an audio character recognition function.
  • the computer device 1000 may include: at least one processor 1001 (for example, a CPU), at least one network interface 1004, a memory 1005, and at least one communication bus 1002.
  • the communication bus 1002 is used to realize connection communication between these components.
  • the network interface 1004 may include standard wired interfaces and wireless interfaces (such as WI-FI interfaces).
  • the memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory.
  • the memory 1005 may also be at least one storage device located remotely from the aforementioned processor 1001.
  • the memory 1005, which is a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
  • the computer device may also include the user interface 1003 shown in Figure 14.
  • the user interface 1003 may include a display screen (Display), a keyboard (Keyboard), etc.
  • the network interface 1004 is mainly used for network communication.
  • the user interface 1003 is mainly used to provide an input interface for the user.
  • the processor 1001 can be used to call the device control application program stored in the memory 1005 to implement the following:
  • identify picture feature information from the video frames of the multimedia data, where the picture feature information includes M business objects to which the character pictures in the video frames belong, and M is a positive integer;
  • locate and separate the audio frames containing human voices from the original audio frames of the multimedia data to obtain N object audio frames, extract the corresponding audio semantic feature vectors from the N object audio frames, and perform clustering processing on the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, where N is a positive integer and one audio cluster corresponds to one business object;
  • identify, based on the picture feature information, the M audio clusters and the object role mapping table associated with the multimedia data, the business role corresponding to each of the P audio clusters, where P is a positive integer less than or equal to M;
  • the object role mapping table includes business roles that have a mapping relationship with the list business objects, and there are P overlapping business objects between the list business objects and the M business objects.
  • the computer device 1000 described in the embodiments of the present application can execute the data processing method described in the embodiments corresponding to FIG. 3 and FIG. 10, and can also execute the functions of the data processing device 1 in the embodiment corresponding to FIG. 12 and of the data processing device 2 in the embodiment corresponding to FIG. 13, which will not be repeated here.
  • the description of the beneficial effects achieved by using the same method will likewise not be repeated.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • the computer program includes program instructions; when the program instructions are executed by a processor, the steps of the data processing method shown in Figure 3 and Figure 10 are implemented.
  • for details of the data processing method provided, please refer to the implementation of each step in Figure 3 and Figure 10, which will not be described again here.
  • the computer-readable storage medium may be the data processing apparatus provided in any of the foregoing embodiments or an internal storage unit of the computer device, such as a hard disk or memory of the computer device.
  • the computer-readable storage medium can also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.
  • the computer-readable storage medium may also include both an internal storage unit of the computer device and an external storage device.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by the computer device.
  • the computer-readable storage medium can also be used to temporarily store data that has been output or is to be output.
  • the present application provides a computer program product, which includes a computer program stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program so that the computer device can execute the data processing method described in the embodiment corresponding to Figure 3 or Figure 10, which will not be repeated here.
  • the description of the beneficial effects achieved by using the same method will likewise not be repeated.

Abstract

Disclosed in the embodiments of the present application are a data processing method and apparatus, and a computer device and a storage medium, which can be applied to an artificial intelligence scene. The method comprises: identifying picture feature information from a video frame of multimedia data, wherein the picture feature information comprises M service objects to which role pictures in the video frame belong; positioning and separating, from an original audio frame of the multimedia data, audio frames that include human voice, so as to obtain N object audio frames, respectively extracting corresponding audio semantic feature vectors from the N object audio frames, and performing clustering processing on the audio semantic feature vectors corresponding to the N object audio frames, so as to obtain M audio clusters; and on the basis of the picture feature information, the M audio clusters and an object role mapping table associated with the multimedia data, identifying a service role corresponding to each of P audio clusters. By means of the embodiments of the present application, the precision, efficiency and applicability of audio role identification can be improved.

Description

A data processing method, device, computer equipment and storage medium

This application claims priority to the Chinese patent application No. 202210383918.3, entitled "A data processing method, device, computer equipment and storage medium", filed on April 13, 2022.
Technical Field

This application relates to the field of computer technology, and in particular to a data processing method, device, computer equipment and storage medium.
Background

Many video content platforms offer a service that clips out, separately, all the segments of a given character in multimedia data (for example, a film or TV series) for users to watch. Character recognition is required during this editing process. Current character recognition solutions usually rely on manual work: the characters behind the dialogue lines in a film or TV series are annotated by hand, for example by manually determining how many characters appear and labeling every line of dialogue, which takes a great deal of time and effort.
Summary of the Invention

Embodiments of the present application provide a data processing method, device, computer equipment and storage medium, which can improve the accuracy, efficiency and applicability of audio character recognition.
In one aspect, embodiments of the present application provide a data processing method, including:

identifying picture feature information from video frames of multimedia data, the picture feature information including M business objects to which the character pictures in the video frames belong, M being a positive integer;

locating and separating the audio frames containing human voices from the original audio frames of the multimedia data to obtain N object audio frames, extracting corresponding audio semantic feature vectors from the N object audio frames, and clustering the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, where N is a positive integer and one audio cluster corresponds to one business object;

identifying, based on the picture feature information, the M audio clusters and an object role mapping table associated with the multimedia data, the business role corresponding to each of P audio clusters, where P is a positive integer less than or equal to M, the object role mapping table includes business roles that have a mapping relationship with list business objects, and there are P overlapping business objects between the list business objects and the M business objects.
In one aspect, embodiments of the present application provide a data processing device, including:

a picture information acquisition module, configured to identify picture feature information from video frames of multimedia data, the picture feature information including M business objects to which the character pictures in the video frames belong, M being a positive integer;

a clustering processing module, configured to locate and separate audio frames containing human voices from the original audio frames of the multimedia data to obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, where N is a positive integer and one audio cluster corresponds to one business object;

an audio role recognition module, configured to identify, based on the picture feature information, the M audio clusters and an object role mapping table associated with the multimedia data, the business role corresponding to each of P audio clusters, where P is a positive integer less than or equal to M, the object role mapping table includes business roles that have a mapping relationship with list business objects, and there are P overlapping business objects between the list business objects and the M business objects.
In one aspect, embodiments of the present application provide a computer device, including a processor and a memory. The processor is connected to the memory, and the memory is used to store a computer program; when the computer program is executed by the processor, the computer device executes the method provided by the embodiments of the present application.

In one aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program; the computer program is adapted to be loaded and executed by a processor, so that a computer device having the processor executes the method provided by the embodiments of the present application.

In one aspect, embodiments of the present application provide a computer program product, which includes a computer program stored in a computer-readable storage medium; a processor of a computer device reads the computer program from the computer-readable storage medium and executes it, so that the computer device executes the method in the embodiments of the present application.
In the embodiments of the present application, a computer device with an audio character recognition function associates voices with characters by combining picture feature information automatically recognized from video frames with M adaptively clustered audio clusters, and can therefore accurately identify the business roles corresponding to the P audio clusters associated with the object role mapping table. This way of recognizing audio roles does not require manually labeling the business role to which every line of dialogue belongs, which reduces the manpower and time consumed and also avoids misrecognition of similar timbres, improving the accuracy and efficiency of recognition. In addition, because audio semantic features are clustered during audio role recognition, the whole audio role recognition system is more general and can be applied to scenarios in which different multimedia data involve different business objects, which effectively improves the applicability of the recognition.
Brief Description of the Drawings

To explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic structural diagram of a network architecture provided by an embodiment of the present application;

Figure 2 is a schematic flow diagram of a system for audio character recognition provided by an embodiment of the present application;

Figure 3 is a schematic flowchart of a data processing method provided by an embodiment of the present application;

Figure 4 is a schematic architectural diagram of obtaining picture feature information from video frames provided by an embodiment of the present application;

Figure 5 is a model architecture diagram of a key part detection model provided by an embodiment of the present application;

Figure 6 is a schematic architectural diagram of audio semantic feature clustering provided by an embodiment of the present application;

Figure 7 is a model architecture diagram of a source separation model provided by an embodiment of the present application;

Figure 8 is a schematic diagram of the model architecture of an audio semantic feature extraction model provided by an embodiment of the present application;

Figure 9 is a schematic diagram of a scene of audio character recognition provided by an embodiment of the present application;

Figure 10 is another schematic flowchart of a data processing method provided by an embodiment of the present application;

Figure 11 is a schematic diagram of a scene of displaying multimedia segment data provided by an embodiment of the present application;

Figure 12 is a schematic structural diagram of a data processing device provided by an embodiment of the present application;

Figure 13 is another schematic structural diagram of a data processing device provided by an embodiment of the present application;

Figure 14 is a schematic diagram of a computer device provided by an embodiment of the present application.
Detailed Description of the Embodiments

The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative effort fall within the scope of protection of this application.
The embodiments of the present application provide a character recognition method based on audio semantic feature clustering, which can be applied to the field of artificial intelligence. Artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce intelligent machines that can respond in a manner similar to human intelligence, by studying the design principles and implementation methods of such machines so that they can perceive, reason and make decisions.

Artificial intelligence is a broad discipline involving both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics; AI software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving and intelligent transportation.

Computer vision (CV) studies how to make machines "see": using cameras and computers instead of human eyes to recognize and measure targets, with further graphics processing to produce images better suited for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies theories and technologies for building artificial intelligence systems that can obtain information from images or multi-dimensional data. It usually covers image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving and intelligent transportation, as well as common biometric technologies such as face recognition and fingerprint recognition.

The key technologies of speech technology include automatic speech recognition, speech synthesis and voiceprint recognition. Enabling computers to hear, see, speak and feel is the development direction of future human-computer interaction, and speech is one of the most promising modes of such interaction.

Natural language processing (NLP) is an important direction in computer science and artificial intelligence. It studies theories and methods for effective communication between humans and computers in natural language, integrating linguistics, computer science and mathematics; research in this field involves the language people use every day, so it is closely related to linguistics. NLP technologies usually include text processing, semantic understanding, machine translation, question answering and knowledge graphs.

Machine learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis and algorithmic complexity theory. It studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications cover all fields of AI. Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.
Please refer to Figure 1, which is a schematic structural diagram of a network architecture provided by an embodiment of the present application. As shown in Figure 1, the network architecture may include a server 10F and a terminal device cluster. The terminal device cluster may include one or more terminal devices, and the number of terminal devices is not limited here; it may specifically include terminal device 100a, terminal device 100b, terminal device 100c, ..., terminal device 100n. Each of these terminal devices can establish a network connection with the server 10F so as to exchange data with it through that connection. The connection method is not limited: it may be a direct or indirect connection through wired communication, a direct or indirect connection through wireless communication, or another method, which is not limited in this application.

Each terminal device in the terminal device cluster may be a smart phone, tablet computer, notebook computer, desktop computer, smart speaker, smart watch, vehicle-mounted terminal, smart TV or other intelligent terminal with an audio character recognition function. Each terminal device shown in Figure 1 may be installed with a target application (for example, a client); when the client runs on a terminal device, it can exchange data with the server 10F shown in Figure 1. The client may be a social client, a multimedia client (for example, a video client), an entertainment client (for example, a game client), an information feed client, an education client, a live-streaming client, and so on, and it may be an independent client or an embedded sub-client integrated in another client (for example, a social, education or multimedia client), which is not limited here.

As shown in Figure 1, the server 10F in the embodiments of the present application may be the server corresponding to the client. The server 10F may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services; the number of servers is not limited in the embodiments of the present application.

For ease of understanding, one terminal device may be selected from the terminal devices shown in Figure 1 as the target terminal device. For example, the terminal device 100a shown in Figure 1 may be the target terminal device, in which the target application (the client) may be integrated; the target terminal device can then exchange data with the server 10F through the business data platform corresponding to the client. The client may have a frame-sequence (for example, frame animation sequence) loading and playback function, used to play, in the service playback display interface provided by the client, multimedia data that includes video frames, audio frames and text (for example, dialogue lines). The service playback display interface refers to the interface displayed by the terminal device for playing multimedia data. The data type of the multimedia data may include film and TV drama, animation, variety show, and so on, and is not limited here.
When a computer device with an audio character recognition function (for example, the above server 10F) obtains multimedia data (for example, TV series A), it can identify picture feature information from the video frames of the multimedia data. The picture feature information may include the M business objects to which the character pictures in the video frames belong, where M is a positive integer; for example, it may indicate which actor plays the character in a character picture of TV series A that contains a character key part (for example, the character's face). At the same time, the computer device can extract a corresponding audio semantic feature vector from each of N object audio frames and cluster these vectors to obtain M audio clusters, where N is a positive integer and the N object audio frames are obtained by locating and separating the audio frames containing human voices from the original audio frames of the multimedia data. This object location and separation is performed on the original audio frames to reduce the interference caused, during subsequent clustering, by the environmental track and by silent frames in the object track (for example, the vocal track), so as to improve the accuracy of clustering and, in turn, of character voice recognition.

The computer device can then identify, based on the picture feature information, the M audio clusters and the object role mapping table associated with the multimedia data, the business role corresponding to each of P audio clusters, where P can be a positive integer less than or equal to M. The object role mapping table (for example, the cast list of TV series A) may include business roles (characters) that have a mapping relationship with the list business objects (actors), and there are P overlapping business objects between the list business objects in the object role mapping table and the M business objects recognized by the computer device. The object role mapping table may be the initial object role mapping table provided by the business editor of the acquired multimedia data (for example, the editing user of TV series A), or that initial table as updated by a target user accessing the client, which is not limited here. For example, the target user may add to the initial object role mapping table a mapping relationship between a certain business role in TV series A (for example, a waiter in a restaurant) and a certain business object (for example, actor 1), meaning that the waiter in the restaurant is played by actor 1.

It can be seen that the computer device in the embodiments of the present application can associate voices with characters by combining the picture feature information (for example, face information) automatically recognized from the video frames with the M adaptively clustered audio clusters, and can therefore accurately identify the business roles corresponding to the P audio clusters associated with the object role mapping table. This way of recognizing audio roles does not require manually labeling the business role to which every line of dialogue belongs, which reduces the manpower and time consumed and avoids misrecognition of similar timbres, improving the accuracy and efficiency of recognition. In addition, because audio semantic features are clustered during audio role recognition, the whole audio role recognition system is more general and can be applied to scenarios in which different multimedia data involve different business objects, which effectively improves the applicability of the recognition.
For ease of understanding, please further refer to Figure 2, which is a schematic flow diagram of a system for audio character recognition provided by an embodiment of the present application. As shown in Figure 2, the computer device in the embodiments of the present application may be a computer device with an audio character recognition function; it may be any terminal device in the terminal device cluster shown in Figure 1 (for example, terminal device 100a) or the server 10F shown in Figure 1, which is not limited here.

As shown in Figure 2, the audio character recognition system provided by the embodiments of the present application may include three modules: a first module 201 (for example, a key image recognition module), a second module 202 (for example, an audio semantic feature clustering module) and a third module 203 (for example, a character recognition module). The multimedia data 20S in the embodiments of the present application may be multimedia data acquired by the computer device on which audio character recognition needs to be performed; it may correspond to an episode of a TV series, to a movie, to a variety show, and so on, which are not listed one by one here. The multimedia data 20S is composed of video data including original video frames and audio data including original audio frames.

The computer device can obtain video frames from the video data including the original video frames; the video frames here may refer to the video frame sequence obtained after deleting the opening and closing credits of the original video frames. Further, the computer device can identify picture feature information from the video frames of the multimedia data 20S through the first module 201 shown in Figure 2. The first module 201 may include a key part detection model 210w and a picture encoding model 220w: the key part detection model 210w can be used to detect character pictures in the video frames, a character picture being a picture that contains a character key part (for example, the character's face); the picture encoding model 220w can be used to encode each character cut picture in the character pictures to obtain the picture vector information corresponding to that character cut picture. The computer device can also obtain the information vector database 200K shown in Figure 2, for example from its internal memory or from an external source. The information vector database 200K may be an information index library dedicated to key image recognition that the computer device has established in advance, using the same key image recognition method, on a large amount of material data (for example, multimedia data of film and TV drama or variety show types). The information vector database 200K can store the object key information vectors corresponding to Y candidate business objects, where these vectors may also be determined through the picture encoding model 220w and Y is a positive integer greater than or equal to M. In addition, the information vector database 200K may also include the object information of each candidate business object, for example the object attribute type of the candidate business object (such as singer-dancer, modern idol drama, ancient palace drama, fantasy drama, war drama, etc.). Based on the information vector database 200K and the picture information vectors output by the picture encoding model 220w, the computer device can obtain the picture feature information shown in Figure 2.
At the same time, the computer device can obtain, through the second module 202 shown in Figure 2, the audio clustering result associated with the N object audio frames in the multimedia data 20S, where the N object audio frames are obtained by performing object location and separation on the original audio frames of the multimedia data and N is a positive integer. As shown in Figure 2, the second module 202 may include a source separation model 230w and an audio semantic feature extraction model 240w. The source separation model 230w can be used to perform source separation on the original audio frames to obtain an object sound segment (or object track, for example a vocal segment or vocal track) and an environmental sound segment (or environmental track, for example a background sound segment or background track). The audio semantic feature extraction model 240w can be used, once the N object audio frames in the object sound segment are obtained, to perform frame-level semantic feature extraction on each object audio frame and obtain the audio semantic feature vector corresponding to each object audio frame. Further, the computer device can cluster the N audio semantic feature vectors to obtain M audio clusters, which are taken as the audio clustering result of the second module 202; one audio cluster can correspond to one business object.
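The separation into an object (vocal) track and an environmental track is performed by the trained source separation model 230w; purely to illustrate the subsequent locating step, the sketch below keeps only those frames of the separated vocal track whose RMS energy exceeds a threshold, so that silent frames do not disturb the clustering (the frame length and threshold are assumptions):

```python
import numpy as np

def locate_object_audio_frames(vocal_track, sr=16000, frame_len=0.5, energy_thresh=0.01):
    # vocal_track: mono waveform of the separated object (vocal) track
    # returns (start_sec, end_sec) spans of the object audio frames that actually contain voice
    hop = int(sr * frame_len)
    spans = []
    for idx in range(0, len(vocal_track), hop):
        frame = vocal_track[idx:idx + hop]
        if len(frame) and np.sqrt(np.mean(frame ** 2)) > energy_thresh:  # RMS energy gate
            t0 = idx / sr
            spans.append((t0, t0 + len(frame) / sr))
    return spans
```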
Further, the computer device can identify, based on the picture feature information, the M audio clusters and the object role mapping table 200B associated with the multimedia data 20S shown in Figure 2, the business role corresponding to each of P audio clusters, where P is a positive integer less than or equal to M. The object role mapping table 200B may include business roles that have a mapping relationship with the list business objects, and there are P overlapping business objects between the list business objects and the M business objects. The computer device can perform audio role recognition on the outputs of the first two modules through the third module 203. For example, based on the picture feature information output by the first module 201, the audio clustering result output by the second module 202 and the object role mapping table 200B, the computer device determines the playback times in the multimedia data 20S of the video frames in which the P overlapping business objects appear (the second playback time) and the playback times in the multimedia data 20S of the object audio frames corresponding to each audio cluster (the first playback time). By comparing these two playback times, the computer device can determine the audio cluster corresponding to each of the P business objects, and thereby the business role corresponding to each of the P audio clusters.

It can be seen that the computer device in the embodiments of the present application can associate audio with business roles in the third module 203 by combining the picture feature information (for example, face information) output by the first module 201 with the audio clustering result output by the second module 202, and can therefore accurately identify the business roles corresponding to the P audio clusters associated with the object role mapping table 200B. This way of recognizing audio roles improves not only the accuracy and efficiency of recognition but also its applicability.

For the specific implementation in which the computer device with the audio character recognition function identifies, by combining the picture feature information (for example, face information) automatically recognized from the video frames of the multimedia data with the M adaptively clustered audio clusters, the business roles corresponding to the P audio clusters associated with the object role mapping table, reference may be made to the embodiments corresponding to Figures 3 to 11 below.
Further, please refer to Figure 3, which is a schematic flowchart of a data processing method provided by an embodiment of the present application. As shown in Figure 3, the method may be executed by a computer device with an audio character recognition function; the computer device may be a terminal device (for example, any terminal device in the terminal device cluster shown in Figure 1, such as terminal device 100a) or a server (for example, the server 10F shown in Figure 1), which is not limited here. For ease of understanding, the embodiment of the present application is described by taking the method being executed by a server with an audio character recognition function as an example. The method may include at least the following steps S101 to S103:

Step S101: identify picture feature information from the video frames of the multimedia data.
The picture feature information may include the M business objects to which the character pictures in the video frames belong, where M is a positive integer. Specifically, the computer device can obtain video frames from the multimedia data and perform picture cutting on the character key parts in the video frames (that is, cut out the pictures containing character key parts from the video frames) to obtain the character pictures corresponding to the video frames. The character pictures may include X character cut pictures, where X is a positive integer greater than or equal to M. Further, the computer device can take the character cut picture Ti among the X character cut pictures and encode it to obtain the picture information vector Li corresponding to the character cut picture Ti, where i is a positive integer less than or equal to X. The computer device can then determine, from the information vector database associated with the candidate business objects, the object key information vector that matches the picture information vector Li, and take the candidate business object corresponding to the matched object key information vector as the business object corresponding to the character cut picture Ti. Further, the computer device can determine the picture feature information corresponding to the video frames based on the business objects corresponding to the X character cut pictures.

The picture recognition system used by the computer device to detect and recognize the character key parts in the video frames may be composed of a detection sub-module and a recognition sub-module, or may be an integrated detection-and-recognition network for character key parts, which is not limited here.
For example, when determining the character pictures corresponding to a video frame, the computer device can detect and locate the character key parts in the video frame to determine the position information of the character key parts in the video frame. Based on the position information, the computer device can cut the character key parts out of the video frame to obtain X character cut pictures containing character key parts, which are taken as the character pictures corresponding to the video frame. The computer device can then take the character cut picture Ti among the X character cut pictures and encode it to obtain the picture information vector Li corresponding to the character cut picture Ti, where i is a positive integer less than or equal to X. The computer device can then obtain, from its internal memory or from an external source, the information vector database associated with the candidate business objects in order to look up the candidate business object that matches the picture information vector Li. The information vector database can store the object key information vectors corresponding to Y candidate business objects, where Y is a positive integer greater than or equal to M.

When the information vector database is obtained, the computer device can look up the candidate business object that matches the picture information vector Li directly in the database. The computer device can determine the vector distance between the picture information vector Li and each of the Y object key information vectors, obtaining Y vector distances; it can then take, among the Y vector distances, the smallest vector distance that is less than or equal to a distance threshold, determine the candidate business object corresponding to the object key information vector with that smallest distance, and take that candidate business object as the business object corresponding to the character cut picture Ti. The distance threshold is a value set in advance by the computer device to ensure that the found candidate business object genuinely matches the character cut picture; it can be adjusted dynamically according to the actual situation and is not limited here.
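A minimal sketch of this lookup, assuming Euclidean distance over the stored object key information vectors; the threshold value and the flat array layout are illustrative:

```python
import numpy as np

def match_business_object(picture_vec, key_vectors, candidate_objects, dist_threshold=0.8):
    # key_vectors: (Y, d) object key information vectors from the information vector database
    # candidate_objects: the Y candidate business objects, in the same order
    dists = np.linalg.norm(key_vectors - picture_vec, axis=1)   # the Y vector distances
    k = int(dists.argmin())
    # accept the closest candidate only if it is within the distance threshold
    return candidate_objects[k] if dists[k] <= dist_threshold else None
```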
To improve matching efficiency, the computer device can obtain the object role mapping table associated with the multimedia data and use the object role mapping table together with the information vector database to look up the candidate business object that matches the picture information vector Li. For ease of understanding, please further refer to Table 1, which is an object role mapping table associated with multimedia data provided by an embodiment of the present application. As shown in Table 1:
Table 1

  Business role    Business object with a mapping relationship
  Role 1           Object a
  Role 2           Object a
  Role 3           Object b
  Role 4           Object c
  Role 5           Object d
其中,为便于理解,表1所示的对象角色映射表中的业务角色可以包括H个,H为大于或者等于M的正整数。这里以5个业务角色为例,具体可以包括角色1、角色2、角色3、角色4以及角色5。其中,该角色1与角色2均可以与同一业务对象(例如,对象a)具有映射关系。即该角色1与角色2均由对象a所饰演。角色3与对象b具有映射关系,角色4与对象c具有映射关系,角色5与对象d具有映射关系。For ease of understanding, the business roles in the object role mapping table shown in Table 1 may include H, where H is a positive integer greater than or equal to M. Here we take five business roles as an example, which may include role 1, role 2, role 3, role 4 and role 5. Wherein, both role 1 and role 2 may have a mapping relationship with the same business object (for example, object a). That is, both role 1 and role 2 are played by object a. Role 3 has a mapping relationship with object b, role 4 has a mapping relationship with object c, and role 5 has a mapping relationship with object d.
According to Table 1, the computer device may select from the information vector database only the object key information vectors corresponding to the listed business objects in the object role mapping table, for example the object key information vectors of object a, object b and object c. The computer device may then determine the vector distance between the picture information vector Li and each of these three selected object key information vectors, take the minimum vector distance that is less than or equal to the distance threshold, determine the candidate business object corresponding to the object key information vector with that minimum distance, and use the determined candidate business object as the business object corresponding to the character cut picture Ti. In this way, the computer device does not need to compute the vector distance to every object key information vector in the information vector database when matching candidate business objects; by selecting through the object role mapping table it greatly reduces the matching time and thus improves the efficiency of finding matching candidate business objects in the information vector database.
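The mapping-table-restricted lookup described above can be pictured with a minimal Python sketch. It assumes the object key information vectors are kept as a dictionary of NumPy arrays and uses Euclidean distance as the vector distance; the function name and the threshold value are illustrative only and not prescribed by the embodiment.

```python
import numpy as np

def match_business_object(picture_vec, vector_db, candidate_ids=None, dist_threshold=1.0):
    """Return the candidate business object whose key information vector is closest
    to picture_vec, or None if the smallest distance exceeds the distance threshold.

    vector_db: dict mapping business-object id -> 1-D np.ndarray (object key information vector)
    candidate_ids: optional ids taken from the object role mapping table; when given,
                   only these entries of the database are compared.
    """
    ids = list(candidate_ids) if candidate_ids is not None else list(vector_db.keys())
    best_id, best_dist = None, float("inf")
    for obj_id in ids:
        dist = np.linalg.norm(vector_db[obj_id] - picture_vec)  # Euclidean vector distance
        if dist < best_dist:
            best_id, best_dist = obj_id, dist
    return best_id if best_dist <= dist_threshold else None

# Example: restrict the search to the objects listed in Table 1 (a, b, c)
# obj = match_business_object(L_i, vector_db, candidate_ids=["a", "b", "c"], dist_threshold=0.8)
```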
For ease of understanding, refer to Figure 4, which is a schematic architecture diagram for obtaining picture feature information from video frames provided by an embodiment of the present application. As shown in Figure 4, this architecture may correspond to the first module 201 in the embodiment corresponding to Figure 2 above. The video frame 4V shown in Figure 4 may be a video frame of the multimedia data (for example, the multimedia data 20S in the embodiment corresponding to Figure 2). The key part detection model 410w shown in Figure 4 may be used to detect key parts in the video frame 4V and may be the key part detection model 210w in the embodiment corresponding to Figure 2. The picture encoding model 420w shown in Figure 4 may be used to encode the character cut picture 400S and may correspond to the picture encoding model in the embodiment corresponding to Figure 2. The information vector database 400K shown in Figure 4 may be the information vector database 200K in the embodiment corresponding to Figure 2.
As shown in Figure 4, when performing image recognition on the video frame 4V, the computer device in this embodiment of the present application may input the video frame 4V into the key part detection model 410w shown in Figure 4 and, through the key part detection model 410w, detect and locate the key character parts in the video frame 4V (for example, the character's facial features), so as to determine the position information of the key character parts in the video frame 4V (for example, the facial feature positions marked in the region 40Q shown in Figure 4). Further, based on the position information marked in the region 40Q, the computer device may cut the key character parts out of the video frame 4V to obtain a character cut picture containing the key character parts as shown in Figure 4 (for example, the character cut picture 400T shown in Figure 4).
The key part detection model 410w shown in Figure 4 may be a network structure used to detect and locate key character parts (for example, a character's face), such as a face detection model based on Multi-task Cascaded Convolutional Networks (MTCNN). For ease of understanding, refer to Figure 5, which is a model architecture diagram of a key part detection model provided by an embodiment of the present application. As shown in Figure 5, the key part detection model in this embodiment may be the key part detection model 410w in the embodiment corresponding to Figure 4. The key part detection model may be used to detect key parts in the video frame 5V shown in Figure 5, where the video frame 5V may be the video frame 4V in the embodiment corresponding to Figure 4 above.
As shown in Figure 5, the key part detection model may include three network layers: a screening network layer 5W1 (for example, a Proposal Network, P-Net for short), a refinement network layer 5W2 (for example, a Refinement Network, R-Net for short) and an output network layer 5W3 (for example, an Output Network, O-Net for short).
When the computer device in this embodiment obtains the video frame 5V, it may resize the picture of the video frame 5V to obtain an image pyramid corresponding to the video frame 5V. For example, the computer device may obtain a resizing coefficient (for example, 0.7) from its internal memory or from an external source and resize the video frame 5V repeatedly based on the resizing coefficient until the picture size of the resized video frame 5V matches the picture size threshold associated with the screening network layer 5W1 (for example, 12*12*3). The computer device may then form the image pyramid corresponding to the video frame 5V from the versions of the video frame 5V with different picture sizes obtained by the repeated resizing. The resizing coefficient may be set dynamically by the computer device according to how the key character parts are distributed across the video frame: if the resizing coefficient is set too large, the time needed to detect and locate the key character parts is prolonged; if it is set too small, key character parts that occupy a small area of the video frame (for example, small and medium-sized faces) may be missed. For this reason, the resizing coefficient in this embodiment of the present application may be set between 0.7 and 0.8. The image pyramid may include the original picture (for example, the video frame 5V shown in Figure 5), a first resized picture (the picture obtained by resizing the video frame 5V), a second resized picture (the picture obtained by resizing the first resized picture), ..., and an Nth resized picture (the picture obtained by resizing the (N-1)th resized picture), where the picture size of the Nth resized picture may be the picture size threshold associated with the screening network layer 5W1 (for example, 12*12).
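The construction of the image pyramid can be sketched as follows, assuming OpenCV is used for resizing; the scale factor of 0.7 and the minimum size of 12 follow the example values given above and are not prescribed by the embodiment.

```python
import cv2

def build_image_pyramid(frame, scale=0.7, min_size=12):
    """Repeatedly shrink the frame by `scale` until the shorter side would fall
    below `min_size` (the input size associated with P-Net), keeping every level."""
    pyramid = [frame]
    h, w = frame.shape[:2]
    while min(int(h * scale), int(w * scale)) >= min_size:
        h, w = int(h * scale), int(w * scale)
        pyramid.append(cv2.resize(pyramid[-1], (w, h)))  # cv2.resize takes (width, height)
    return pyramid
```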
Further, the computer device may input the image pyramid corresponding to the video frame 5V into the screening network layer 5W1 shown in Figure 5, thereby obtaining a large number of candidates. In this embodiment of the present application, a picture obtained by cutting the video frame 5V according to the bounding box position information produced by the screening network layer 5W1 is called a first cut picture. The computer device may feed the pictures of the image pyramid into the screening network layer 5W1 to obtain output features of shape (m, n, 16), where m and n characterize the length and width of the picture and 16 is the channel dimension. According to the classification scores produced by the screening network layer 5W1, the computer device may filter out a large proportion of the candidates, obtaining one or more first candidates. The computer device then calibrates the bounding boxes (bbox for short) according to the four predicted offsets, obtaining the position information of the calibrated bounding boxes (for example, the coordinates of the upper-left and lower-right corners). The computer device may then screen the first candidates again according to the intersection over union (IoU), that is, filter out a large proportion of the first candidates by non-maximum suppression (the NMS algorithm) to obtain second candidates. In other words, the computer device may sort the classification scores (for example, in descending order) to obtain a tensor of shape (num_left, 4), namely the absolute upper-left and lower-right coordinates of num_left bounding boxes. In each round, the computer device computes the IoU between the bounding box with the highest remaining score and the remaining boxes, filters out the boxes whose IoU exceeds an IoU threshold (for example, 0.6, a value set in advance by the computer device), and moves the highest-scoring box into the final result; this embodiment of the present application refers to this as the filtering operation. By repeating the filtering operation, the computer device filters out many heavily overlapping bounding boxes and finally obtains (num_left_after_nms, 16) candidates. The video frame 5V is then cut according to the position information of these candidate bounding boxes, yielding pictures of size 24*24 that are input to the refinement network layer 5W2 shown in Figure 5 (that is, the first cut pictures). Here, a first cut picture may be a square cut from the video frame 5V whose side length equals the longer side of the bounding box, which effectively ensures that no deformation is introduced during resizing and that more details of the key character parts are retained.
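The filtering operation (non-maximum suppression with an IoU threshold) can be illustrated with a minimal NumPy sketch; boxes are assumed to be given as [x1, y1, x2, y2] corner coordinates with one score per box, and the 0.6 threshold follows the example above.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes; all boxes are [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.6):
    """Keep the highest-scoring boxes, discarding boxes that overlap a kept box
    by more than iou_threshold (the filtering operation described above)."""
    order = np.argsort(scores)[::-1]             # sort scores in descending order
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        if order.size == 1:
            break
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]  # drop heavily overlapping boxes
    return keep
```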
The computer device may then perform refinement on the first cut pictures through the refinement network layer 5W2 to obtain the second cut pictures shown in Figure 5. The refinement network layer 5W2 may produce 2 outputs corresponding to the binary one-hot classification, 4 outputs corresponding to the coordinate offsets of the bounding box, and 10 outputs corresponding to the landmarks. Based on the binary classification scores, the refinement network layer 5W2 may filter out most of the candidates that do not contain key character parts (for example, the character's face). After the bounding boxes are adjusted according to the offsets, the filtering operation described for the screening network layer 5W1 is repeated to obtain (num_left_after_Rnet, 16) candidates. The video frame 5V is then cut according to the position information of these adjusted bounding boxes, yielding pictures of size 48*48 that are input to the output network layer 5W3 shown in Figure 5 (that is, the second cut pictures). The specific way the computer device obtains the second cut pictures may follow the way it obtains the first cut pictures, so as to avoid deformation and retain more detail.
Further, through the output network layer 5W3, the computer device may accurately output the position information of the key character parts in the video frame 5V, including the coordinate information of the bounding boxes and the coordinate information of the landmarks. In the output network layer 5W3, after classification screening and NMS screening of the adjusted bounding boxes, the computer device outputs not only the bounding box coordinates but also the landmark coordinates, thereby obtaining the position information of the key character parts in the video frame 5V, so that the key character parts can subsequently be cut out of the video frame 5V to obtain a picture containing the key character parts (for example, the character cut picture 400T shown in Figure 4).
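As one possible realization of this P-Net/R-Net/O-Net cascade, an off-the-shelf MTCNN implementation could be used; the sketch below relies on the third-party facenet-pytorch package and a hypothetical frame path, and the square-crop step mirrors the max-side-length cropping described above rather than the exact procedure of the embodiment.

```python
from PIL import Image
from facenet_pytorch import MTCNN  # off-the-shelf P-Net/R-Net/O-Net cascade

detector = MTCNN(keep_all=True)           # keep every detected face in the frame
frame = Image.open("video_frame.jpg")     # hypothetical path to one decoded video frame

# boxes: (num_faces, 4) corner coordinates; points: (num_faces, 5, 2) landmark coordinates
boxes, probs, points = detector.detect(frame, landmarks=True)

crops = []
if boxes is not None:
    for (x1, y1, x2, y2) in boxes:
        side = max(x2 - x1, y2 - y1)                 # square crop with the longer box side,
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2        # so resizing does not deform the face
        crops.append(frame.crop((cx - side / 2, cy - side / 2,
                                 cx + side / 2, cy + side / 2)))
```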
Further, the computer device may input the character cut picture 400T into the picture encoding model 420w shown in Figure 4 and encode the character cut picture 400T through the picture encoding model 420w to obtain the picture information vector corresponding to the character cut picture 400T. The picture encoding model 420w in this embodiment of the present application is a model based on a residual network (ResNet); this family of networks is widely used in fields such as object classification and as part of the classic backbone networks for computer vision tasks, with typical examples including ResNet50 and ResNet101. For example, the picture encoding model 420w in this embodiment may be a ResNet50 network model. As shown in Figure 4, the ResNet50 network model may include five stages: a first stage (for example, Stage 0), a second stage (for example, Stage 1), a third stage (for example, Stage 2), a fourth stage (for example, Stage 3) and a fifth stage (for example, Stage 4). Stage 0 has a relatively simple structure and can be regarded as preprocessing of the character cut picture 400T; the remaining four stages are all composed of bottleneck layers (Bottleneck) and have similar structures. Stage 1 may contain 3 bottlenecks, Stage 2 may contain 4 bottlenecks, Stage 3 may contain 6 bottlenecks, and Stage 4 may contain 3 bottlenecks. The computer device inputs the character cut picture 400T into the picture encoding model 420w and, through the five stages of the picture encoding model 420w, converts the character cut picture 400T into a 2048-dimensional picture information vector, which can be used to characterize the semantic feature information of the key character parts (for example, the face).
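A minimal sketch of such a picture encoding step follows, assuming a torchvision ResNet-50 backbone with its classification head removed so that each character cut picture yields a 2048-dimensional vector; this is one plausible instantiation under those assumptions, not necessarily the encoder used in the embodiment.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# ResNet-50 backbone; dropping the final fully connected layer leaves the
# 2048-dimensional pooled feature as the picture information vector.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
encoder = nn.Sequential(*list(backbone.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def encode_cut_picture(path):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        vec = encoder(img)                 # shape (1, 2048, 1, 1)
    return vec.flatten()                   # 2048-dimensional picture information vector
```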
Further, the computer device may obtain the information vector database 400K associated with the candidate business objects shown in Figure 4. The information vector database 400K may be used to store the object key information vectors corresponding to Y candidate business objects, where Y is a positive integer greater than or equal to M. Each object key information vector in the information vector database 400K may have been extracted by the computer device using the same encoding process as that applied to the character cut picture 400T, and one object key information vector may be used to characterize the key part identifier corresponding to one candidate business object (for example, a face ID). The computer device may then determine the vector distance between the picture information vector corresponding to the character cut picture 400T and each of the Y object key information vectors, obtaining Y vector distances. Further, in order to ensure that the computer device can accurately match the corresponding candidate business object in the information vector database 400K, the computer device may set a distance threshold in advance. If the minimum vector distance determined by the computer device is greater than the distance threshold, it can be considered that the computer device has not matched, in the information vector database 400K, the object key information vector corresponding to the character cut picture 400T, that is, it has not matched the business object corresponding to the character cut picture 400T. If the minimum vector distance determined by the computer device is less than or equal to the distance threshold, it can be considered that the computer device can match, in the information vector database 400K, the object key information vector corresponding to the character cut picture 400T, that is, it can successfully match the business object corresponding to the character cut picture 400T.
Therefore, when the computer device obtains, from the Y vector distances, the minimum vector distance that is less than or equal to the distance threshold, it may determine the candidate business object corresponding to the object key information vector with that minimum distance and use the determined candidate business object as the business object corresponding to the character cut picture 400T. When performing image recognition on each video frame of the multimedia data, the computer device may follow the key part recognition of the video frame 5V shown in Figure 5 to obtain X character cut pictures containing key character parts, which will not be repeated here. If a video frame contains key parts of several different characters, the computer device may cut a corresponding number of key character parts out of that video frame. Further, following the object matching performed on the character cut picture 400T in the embodiment corresponding to Figure 4, the computer device may perform object matching on each of the X character cut pictures and then determine the picture feature information corresponding to the video frames of the multimedia data based on the business objects respectively corresponding to the obtained character cut pictures.
Step S102: locate and separate the audio frames containing human voices from the original audio frames of the multimedia data to obtain N object audio frames, extract the corresponding audio semantic feature vector from each of the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters.
The N object audio frames are obtained after the computer device performs object locating and separation processing on the original audio frames of the multimedia data, where N is a positive integer, and one audio cluster may correspond to one business object. Specifically, the computer device may obtain the original audio frames from the multimedia data and perform object locating and separation processing on them to obtain the N object audio frames. Further, the computer device may perform semantic feature extraction on each of the N object audio frames to obtain the audio semantic feature vector corresponding to each object audio frame. The computer device may then set M as the number of cluster centers for clustering and, based on this number of cluster centers, cluster the audio semantic feature vectors corresponding to the object audio frames to obtain M audio clusters. The audio semantic features can be understood as characteristics of the speaker's voiceprint.
During clustering, the embodiments of the present application innovatively use the number M of business objects indicated by the picture feature information to choose the number of cluster centers. Using the picture feature information as prior knowledge in this way lets the system know how many business objects appear in the multimedia data, which gives the audio clustering a prior setting for the number of cluster centers and allows that number to be set automatically, thereby speeding up the convergence of the whole system, improving the overall recognition performance and saving computing resources.
For ease of understanding, refer to Figure 6, which is a schematic architecture diagram of audio semantic feature clustering provided by an embodiment of the present application. As shown in Figure 6, this architecture may correspond to the second module 202 in the embodiment corresponding to Figure 2 above. The original audio frames shown in Figure 6 may be the original audio frames of the multimedia data (for example, the multimedia data 20S in the embodiment corresponding to Figure 2). The source separation model 630w shown in Figure 6 may be used to perform source separation on the original audio frames and may be the source separation model 230w in the embodiment corresponding to Figure 2. The audio semantic feature extraction model 640w shown in Figure 6 may be used to perform semantic feature extraction on each object audio frame and may be the audio semantic feature extraction model 240w in the embodiment corresponding to Figure 2.
As shown in Figure 6, the architecture in this embodiment of the present application may include three nodes: an audio segment cutting node, an audio semantic feature extraction node and a clustering node. At the audio segment cutting node, the computer device may obtain the original audio frames from the multimedia data and perform source separation on them, thereby obtaining the to-be-processed audio frames that contain human voices and are directed at the business objects. Further, based on an audio boundary detection strategy for removing silent frames, the computer device may locate and cut the non-silent segments among the audio impulse signal frames of the to-be-processed audio frames, thereby obtaining the N object audio frames. Source separation here refers to separating, by signal processing or other algorithms, a mixed audio signal in which multiple audio signals are blended, extracting the specified type of audio signal sequence from the mixture and finally generating a separate audio file; for example, extracting the to-be-processed audio frames directed at the business objects (that is, the object sound segments) from the original audio frames.
After the computer device shown in Figure 6 inputs the original audio frames into the source separation model 630w, it may perform source separation on the original audio frames through the source separation model 630w to obtain the object sound segments (or object track) and the ambient sound segments (or ambient track) shown in Figure 6. Since the object sound segments may contain a large number of silent segments, which would interfere with the audio clustering results of the subsequent clustering and waste computing resources, the computer device may treat the object sound segments as the to-be-processed audio frames directed at the business objects and obtain an audio boundary detection strategy. For example, the audio boundary detection strategy here may be a VAD (Voice Activity Detection) algorithm, which is widely used in speech coding, noise reduction and ASR scenarios; what is meant here is speech/non-speech (non-speech/silence) detection, and a VAD system usually includes two parts: feature extraction and a speech/non-speech decision. Further, based on the audio boundary detection strategy, the computer device may locate and cut the audio impulse signal frames in the to-be-processed audio frames, that is, precisely locate the non-silent segments, thereby obtaining the N object audio frames shown in Figure 6, where N is a positive integer.
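A minimal energy-based sketch of this locate-and-cut step is given below; production VAD algorithms (for example WebRTC VAD) use more elaborate features, so the frame length and energy ratio here are illustrative assumptions only.

```python
import numpy as np

def cut_non_silent_segments(samples, sr, frame_ms=30, energy_ratio=0.1):
    """Split a mono float waveform (the separated object track) into non-silent segments.

    A frame is kept when its RMS energy exceeds energy_ratio * overall RMS;
    consecutive kept frames are merged into one object audio segment.
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    rms = np.array([
        np.sqrt(np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2))
        for i in range(n_frames)
    ])
    voiced = rms > energy_ratio * np.sqrt(np.mean(samples ** 2))

    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                                   # segment begins
        elif not v and start is not None:
            segments.append(samples[start * frame_len:i * frame_len])
            start = None                                # segment ends at a silent frame
    if start is not None:
        segments.append(samples[start * frame_len:])    # trailing voiced segment
    return segments
```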
At the audio semantic feature extraction node, the computer device may input the N object audio frames into the audio semantic feature extraction model 640w shown in Figure 6. For example, the audio semantic feature extraction model 640w may be an audio neural network pretrained on a large audio dataset (for example, a PANNs network), which is typically used for audio pattern recognition or frame-level audio embedding and serves as the front-end encoding network of many models. Further, through the audio semantic feature extraction model 640w, the computer device may perform semantic feature extraction on each of the N object audio frames to obtain the audio semantic feature vector corresponding to each object audio frame; as shown in Figure 6, these may include audio semantic feature vector 1, audio semantic feature vector 2, ..., and audio semantic feature vector N.
Further, as shown in Figure 6, at the clustering node the computer device may use the number M of business objects to which the character pictures in the video frames belong, as indicated by the picture feature information, as prior information, that is, set M as the number of cluster centers. Based on this number of cluster centers, the computer device may then cluster the audio semantic feature vectors corresponding to the object audio frames to obtain M audio clusters. The clustering strategy used in this embodiment of the present application may be the k-means clustering algorithm, an iteratively solved cluster analysis algorithm. For example, the computer device may first divide the N audio semantic feature vectors into M initial clusters and randomly select M audio semantic feature vectors as the initial cluster centers of the M initial clusters. Then, for every audio semantic feature vector in the set other than the M vectors selected as cluster centers (that is, every vector to be assigned), the computer device may determine the vector distance between that vector and the cluster center of each initial cluster and assign the vector to the initial cluster with the smallest vector distance, after which it updates the cluster center of the cluster that received the vector. Proceeding in this way, the computer device can determine the M audio clusters shown in Figure 6, which may include audio cluster C1, audio cluster C2, ..., and audio cluster CM.
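A minimal sketch of this clustering step using scikit-learn's KMeans follows, where the number of cluster centers is the M supplied by the picture feature information; the shape of the feature matrix follows the 2048-dimensional vectors described above, and the parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_audio_vectors(audio_vectors, M):
    """Cluster the audio semantic feature vectors into M audio clusters.

    audio_vectors: (N, 2048) array, one audio semantic feature vector per object audio frame
    M: number of business objects indicated by the picture feature information (the prior)
    """
    kmeans = KMeans(n_clusters=M, n_init=10, random_state=0).fit(audio_vectors)
    # clusters[k] holds the indices of the object audio frames assigned to cluster C_(k+1)
    clusters = [np.where(kmeans.labels_ == k)[0] for k in range(M)]
    return clusters
```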
The embodiments of the present application classify the N audio semantic feature vectors by audio semantic feature clustering rather than by training a voiceprint classifier with a neural network, thereby removing the dependence on actors' voiceprint IDs and avoiding privacy violations. At the same time, the embodiments of the present application work directly with the object audio frames of the multimedia data and extract the audio semantic feature vector corresponding to each object audio frame, which deeply decouples the method from the personal voiceprint ID of a business object and instead correlates it with the voiceprint information of the character itself, so that business characters dubbed by professional voice actors can also be recognized. In other words, even when a business character is not voiced by the business object themselves, the embodiments of the present application can still accurately identify the character information of the spoken lines, thereby improving the accuracy of audio character recognition. In addition, performing audio character recognition by clustering the N audio semantic feature vectors makes the whole system portable: the audio character recognition system becomes more general and can be applied to scenarios in which different multimedia data involve different business objects, thereby effectively improving the applicability of the recognition.
For ease of understanding, refer to Figure 7, which is a model architecture diagram of a source separation model provided by an embodiment of the present application. As shown in Figure 7, the source separation model in this embodiment may be the source separation model 630w in the embodiment corresponding to Figure 6. The source separation model may include a segmentation network layer 7W1 (that is, a first segmentation network layer, for example VACAL-Unet) and a segmentation network layer 7W2 (that is, a second segmentation network layer, for example BGM-Unet).
Unet is one of the algorithms that use a fully convolutional network for semantic segmentation; it uses a symmetric U-shaped structure containing a contracting path and an expanding path. A typical Unet has a U-shaped symmetric structure that may contain 4 convolutional layers and 4 corresponding upsampling layers. When implementing it, one can either build the network from scratch, initialize the weights and train the model, or reuse the convolutional layer structure of an existing network together with its trained weight file and add the subsequent upsampling layers before training. Because trained weight files can be reused during deep learning model training, Unet training is greatly accelerated. Another characteristic is that the feature map obtained by each convolutional layer of the Unet is connected to the corresponding upsampling layer, so that the feature map of every layer is effectively used in subsequent computation; these skip connections help alleviate the vanishing gradient problem and improve the efficiency of model training. Compared with some other network structures (for example, the fully convolutional network FCN), Unet thus avoids performing supervision and loss computation directly on high-level feature maps alone and instead combines the features of low-level feature maps, so that the resulting feature maps contain both first-level features (high-level features) and many second-level features (low-level features), achieving feature fusion across different levels and improving the accuracy of the model's results.
When the computer device inputs the original audio frames into the source separation model, it may generate the spectral magnitude spectrum corresponding to the original audio frames through the source separation model shown in Figure 7. For example, the computer device may perform a spectral transform on the audio track of the original audio frames to obtain the track spectrum corresponding to the original audio frames, and then generate the spectral magnitude spectrum corresponding to the original audio frames by discarding the phase of the track spectrum. Further, the computer device may input the spectral magnitude spectrum into the segmentation network layer 7W1 and the segmentation network layer 7W2 respectively, so as to generate the first-type features corresponding to the spectral magnitude spectrum (for example, object track features) through the segmentation network layer 7W1 and the second-type features corresponding to the spectral magnitude spectrum (for example, ambient track features) through the segmentation network layer 7W2.
Further, the computer device may merge and mask the first-type features and the second-type features to obtain the target mask map corresponding to the first-type features (that is, the first mask map). Based on the target mask map and the spectral magnitude spectrum, the computer device may then generate the target-type audio frames (that is, the audio frames of the object sound segment) and use them as the to-be-processed audio frames containing human voices that the source separation model outputs for the business objects. For example, after generating the first-type features and second-type features shown in Figure 7, the computer device may splice the first-type features and the second-type features to obtain spliced features, and then perform two kinds of mask computation on the spliced features to obtain the first mask map corresponding to the first-type features and the second mask map corresponding to the second-type features; the mask computation may, for example, compare the feature value at each position with the merged value after splicing. Further, the computer device may perform a position-wise computation (for example, multiplication) between the first mask map and the spectral magnitude spectrum corresponding to the original audio frames and then apply an inverse spectral transform to generate the first-type audio frames (that is, the audio frames of the object sound segment). At the same time, the computer device may perform the corresponding position-wise computation between the second mask map and the spectral magnitude spectrum of the original audio frames and apply the inverse spectral transform to generate the second-type audio frames (that is, the audio frames of the ambient sound segment). Since the masking and magnitude-spectrum computation above yields the magnitude spectra corresponding to the first-type and second-type features, the inverse spectral transform yields the one-dimensional sample points of the first-type and second-type features, that is, the audio signals.
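A minimal sketch of the mask-and-reconstruct step is given below, assuming librosa for the spectral transform and stubbing the two segmentation network layers as callables; note that the mixture phase is reused for the inverse transform, which is an assumption of this sketch rather than something the embodiment specifies.

```python
import numpy as np
import librosa

def separate(samples, vocal_net, bgm_net, n_fft=2048, hop=512):
    """Split a mixture waveform into an object (vocal) track and an ambient track.

    vocal_net / bgm_net: callables mapping a magnitude spectrogram to a
    same-shaped feature map (stand-ins for the two U-Net branches).
    """
    spec = librosa.stft(samples, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(spec), np.angle(spec)      # drop the phase, keep the magnitude spectrum

    f_vocal = vocal_net(mag)                       # first-type features (object track)
    f_bgm = bgm_net(mag)                           # second-type features (ambient track)

    # Soft masks: each feature map is compared against the merged value per bin.
    total = f_vocal + f_bgm + 1e-8
    mask_vocal, mask_bgm = f_vocal / total, f_bgm / total

    # Apply the masks to the mixture magnitude, reuse the mixture phase, invert the STFT.
    vocal = librosa.istft(mask_vocal * mag * np.exp(1j * phase), hop_length=hop)
    ambient = librosa.istft(mask_bgm * mag * np.exp(1j * phase), hop_length=hop)
    return vocal, ambient
```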
It can thus be seen that, through the source separation model shown in Figure 7, the computer device can separate the ambient sound (for example, BGM) from the original audio frames of the multimedia data, eliminating the influence of the ambient sound on the subsequent clustering and thereby improving the clustering accuracy.
For ease of understanding, refer to Figure 8, which is a schematic model architecture diagram of an audio semantic feature extraction model provided by an embodiment of the present application. As shown in Figure 8, the audio semantic feature extraction model in this embodiment may be the audio semantic feature extraction model 640w in the embodiment corresponding to Figure 6. For example, the audio semantic feature extraction model shown in Figure 8 may be a Wavegram_Logmel128_Cnn14 model, whose most distinctive characteristic is that it takes the raw audio sample sequence as input, that is, the input of the whole network is the N object audio frames of the audio signal, so that basic audio features do not have to be extracted in advance. Since extracting basic audio features is quite time-consuming and using them as input would occupy a particularly large amount of hardware resources, processing the N object audio frames of the input audio signal with this audio semantic feature extraction model saves computing resources and improves computational efficiency.
As shown in Figure 8, the audio semantic feature extraction model may include a time-domain branch network layer, a frequency-domain branch network layer and a convolution network layer.
The computer device may input the N object audio frames into the audio semantic feature extraction model shown in Figure 8 and perform feature learning on the N object audio frames through the time-domain branch network layer to obtain a learned time-domain feature map (time-domain learned features). As shown in Figure 8, the time-domain branch network layer may include a convolutional layer 801w (for example, a one-dimensional convolutional layer with kernel size 1 and stride 5), a convolutional layer 802w (for example, a one-dimensional convolutional layer composed of basic blocks), a max-pooling layer 803w (for example, a max-pooling layer with stride 4), a convolutional layer 804w (for example, a one-dimensional convolutional layer composed of basic blocks), a max-pooling layer 805w (for example, a max-pooling layer with stride 4), a convolutional layer 806w (for example, a one-dimensional convolutional layer composed of basic blocks), a max-pooling layer 807w (for example, a max-pooling layer with stride 4) and a reshaping layer 808w. Through this stack of one-dimensional convolutional layers, the computer device can learn the time-domain characteristics of the audio signal directly from the time-domain signal, in particular information such as audio loudness and sample amplitude. After the stack of one-dimensional convolutional layers, a two-dimensional map (wavegram) representing the learned time-domain feature map is obtained, so that the output of the time-domain branch can be combined with that of the frequency-domain branch.
At the same time, the computer device may also perform feature learning on the N object audio frames through the frequency-domain branch network layer to obtain a learned frequency-domain feature map (frequency-domain learned features) whose feature dimensions are the same as those of the learned time-domain feature map. As shown in Figure 8, the frequency-domain branch network layer may include a convolutional layer 809w (for example, a two-dimensional convolutional layer composed of basic blocks). The computer device may input the N object audio frames into the frequency-domain branch network layer to generate the frequency-domain spectrum corresponding to the N object audio frames (for example, using the mel frequency scale to generate a log-mel spectrum). Further, the computer device inputs the frequency-domain spectrum into the convolutional layer 809w shown in Figure 8, so as to obtain, through the multiple two-dimensional convolutional layers in the convolutional layer 809w, a learned frequency-domain feature map with the same feature dimensions as the learned time-domain feature map.
Further, the computer device may superimpose (for example, concatenate) the learned frequency-domain feature map and the learned time-domain feature map to obtain superimposed features, input the superimposed features into the convolution network layer, perform max and average processing on them, and output the audio semantic feature vector corresponding to each object audio frame. As shown in Figure 8, the convolution network layer may include a convolutional layer 810w (for example, a two-dimensional convolutional layer) and an activation layer 811w. The computer device may concatenate the feature map representing the learned frequency-domain feature map with the feature map representing the learned time-domain feature map to jointly form a set of two-dimensional feature maps identifying the superimposed features. Further, the computer device may input the two-dimensional feature maps representing the superimposed features into the convolutional layer 810w shown in Figure 8, then apply two-dimensional pooling to the features output by the convolutional layer 810w to perform max processing and average processing respectively, so as to extract the maximum representation and the average representation of the current features. The computer device may determine the max-processed features as first sub-features and the average-processed features as second sub-features, merge the first sub-features and the second sub-features, and input the merged features into the activation layer 811w shown in Figure 8, finally generating an audio semantic feature vector set with 2048 dimensions, which may include the audio semantic feature vector corresponding to each of the N object audio frames.
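A minimal PyTorch sketch of the final max/average pooling and merge is given below, assuming the output of the convolutional layer 810w is a (batch, channels, time, frequency) tensor; the exact pooling order used by the embodiment may differ, so this only illustrates the idea of combining a maximum representation and an average representation.

```python
import torch

def pool_embedding(features):
    """Combine the maximum and average representations of the convolutional features.

    features: tensor of shape (batch, channels, time, freq) produced by the
    convolution network layer; returns one embedding of shape (batch, channels).
    """
    x = torch.mean(features, dim=3)         # collapse the frequency axis
    x_max = torch.max(x, dim=2).values      # maximum representation over time (first sub-feature)
    x_avg = torch.mean(x, dim=2)            # average representation over time (second sub-feature)
    return x_max + x_avg                    # merged embedding fed to the activation layer
```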
It can thus be seen that, through the audio semantic feature extraction model shown in Figure 8, the computer device can quickly perform audio semantic feature extraction on each of the N object audio frames, obtaining the audio semantic feature vector corresponding to each object audio frame more quickly and accurately.
Step S103: identify the business role corresponding to each of P audio clusters based on the picture feature information, the M audio clusters and the object role mapping table associated with the multimedia data.
Here P may be a positive integer less than or equal to M. The object role mapping table (for example, the one shown in Table 1 above) may include business roles that have mapping relationships with listed business objects, and there are P business objects that the listed business objects have in common with the M business objects. Specifically, the computer device may obtain an audio cluster Ck from the M audio clusters and extract the first playback time of the audio cluster Ck in the multimedia data, where k is a positive integer less than or equal to M; the first playback time of the audio cluster Ck in the multimedia data is the one or more playback times, in the multimedia data, of the object audio frames corresponding to the audio semantic feature vectors included in the audio cluster Ck. Further, the computer device may obtain, from the listed business objects in the object role mapping table associated with the multimedia data, the P business objects that overlap with the M business objects and then, based on the picture feature information, extract the second playback time of each of the P business objects in the multimedia data; the second playback time of a business object is the one or more playback times, in the multimedia data, of the video frames in which that business object appears. The computer device may then determine the time overlap degree between the first playback time of the audio cluster Ck and each second playback time, and take the business object corresponding to the second playback time with the highest time overlap degree as the business object corresponding to the audio cluster Ck. Further, the computer device may obtain, from the object role mapping table, the business role corresponding to the business object that corresponds to the audio cluster Ck and use the obtained business role as the business role corresponding to the audio cluster Ck.
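A minimal sketch of this time-overlap matching follows, with playback times represented as lists of (start, end) intervals in seconds; it computes the raw overlapping duration rather than a normalized overlap degree, and all names and example values are illustrative only.

```python
def overlap_seconds(intervals_a, intervals_b):
    """Total overlap (in seconds) between two lists of (start, end) play-time intervals."""
    total = 0.0
    for a_start, a_end in intervals_a:
        for b_start, b_end in intervals_b:
            total += max(0.0, min(a_end, b_end) - max(a_start, b_start))
    return total

def assign_cluster_to_object(cluster_times, object_times):
    """Pick the business object whose on-screen time overlaps most with the cluster's audio time.

    cluster_times: (start, end) intervals of the object audio frames in cluster C_k
    object_times:  dict mapping business-object id -> list of (start, end) intervals
    """
    return max(object_times, key=lambda obj: overlap_seconds(cluster_times, object_times[obj]))

# Illustrative values loosely following the Figure 9 scenario (times in seconds):
# c1_times = [(30, 610), (2108, 2452)]                                  # segments 1 and 3
# obj_times = {"a": [(0, 600), (1845, 2280)], "b": [(605, 1713)], "c": [(4830, 5330)]}
# assign_cluster_to_object(c1_times, obj_times)                         # -> "a"
```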
Starting from the audio perspective, the embodiments of the present application identify the characters in the multimedia data and assign each spoken audio line to a character, which makes it possible to supply accurate line-to-character information even for shots and scenes of other characters in which no key character part information is available, thereby improving the precision of character recognition.
For ease of understanding, refer to Figure 9, which is a schematic diagram of an audio character recognition scenario provided by an embodiment of the present application. As shown in Figure 9, after the computer device executes step S101, the picture feature information recognized by the first module 201 indicates that the number M of business objects to which the character pictures in the video frames of the multimedia data belong is 3, namely object a, object b and object c. After the computer device executes step S102, the audio processing result clustered by the second module 202 yields 3 audio clusters, namely the audio cluster C1, audio cluster C2 and audio cluster C3 shown in Figure 9.
The N object audio frames in this embodiment of the present application may include sound segment 1, sound segment 2, sound segment 3, sound segment 4, sound segment 5 and sound segment 6 shown in Figure 9, arranged in playback order. The object audio frames corresponding to the audio cluster C1 may include those in sound segment 1 and sound segment 3; the object audio frames corresponding to the audio cluster C2 may include those in sound segment 2, sound segment 4 and sound segment 6; and the object audio frames corresponding to the audio cluster C3 may include those in sound segment 5.
The computer device may obtain, from the listed business objects in the object role mapping table shown in Table 1 above, the business objects that overlap with the M business objects obtained by the first module. For example, the listed business objects in Table 1 include the four business objects object a, object b, object c and object d, while the M business objects obtained by the computer device in this embodiment include the three business objects object a, object b and object c; therefore, the computer device can determine from Table 1 that the number of overlapping business objects is 3, namely object a, object b and object c. The computer device may then, based on the picture feature information, extract the playback time of each of these three overlapping business objects in the multimedia data (that is, their second playback times).
例如,对象a在多媒体数据中的第二播放时间为播放时间T1(例如,00:00-10:00)以及播放时间T3(例如,30:45-38:00);对象b在多媒体数据中的第二播放时间为播放时间T2(例如,10:05-28:33),播放时间T4(例如,40:05-55:39)以及播放时间T6(例如,100:03-113:57);对象c在多媒体数据中的第二播放时间为播放时间T5(例如,80:30-88:50)。For example, the second playback time of object a in the multimedia data is playback time T 1 (for example, 00:00-10:00) and playback time T 3 (for example, 30:45-38:00); object b is in the multimedia data The second playback time in the data is playback time T 2 (for example, 10:05-28:33), playback time T 4 (for example, 40:05-55:39), and playback time T 6 (for example, 100:03 -113:57); the second playback time of object c in the multimedia data is the playback time T 5 (for example, 80:30-88:50).
计算机设备可以从这3个音频聚类簇中获取音频聚类簇C1,进而可以提取音频聚类簇C1在多媒体数据中的播放时间(即音频聚类簇C1的第一播放时间)。其中,该音频聚类簇C1在多媒体数据中的第一播放时间可以包括音段1对应的播放时间t1(例如,00:30-10:10)和音段3对应的播放时间t3(例如,35:08-40:52)。此时,该计算机设备可以分别确定音频聚类簇C1与每个业务对象对应的第二播放时间之间的时间重叠度。例如,音频聚类簇C1的第一播放时间与对象a的第二播放时间之间的时间重叠度为98%,与对象b的第二播放时间之间的时间重叠度为5%,与对象c的第二播放时间之间的时间重叠度为1%。然后,该计算机设备可以从这3个时间重叠度中确定具有最高时间重叠度的第二播放时间,即对象a的第二播放时间,进一步地,该计算机设备可以将对象a作为音频聚类簇C1对应的业务对象,且从上述表1中获取与对象a具有映射关系的业务角色(即角色1与角色2)作为该音频聚类簇C1对应的业务角色。这意味着该计算机设备可以识别出音频聚类簇C1中的每句音频台词均是由角色1或角色2所说出的。The computer device can obtain the audio cluster C 1 from these three audio clusters, and then can extract the playback time of the audio cluster C 1 in the multimedia data (ie, the first playback time of the audio cluster C 1 ). . The first playback time of the audio cluster C 1 in the multimedia data may include the playback time t 1 corresponding to the sound segment 1 (for example, 00:30-10:10) and the playback time t 3 (for example, 00:30-10:10) corresponding to the sound segment 3 ( For example, 35:08-40:52). At this time, the computer device can respectively determine the time overlap between the audio cluster C 1 and the second playback time corresponding to each business object. For example, the time overlap between the first playback time of audio cluster C 1 and the second playback time of object a is 98%, the time overlap with the second playback time of object b is 5%, and the time overlap with the second playback time of object b is 5%. The temporal overlap between the second playback times of object c is 1%. Then, the computer device can determine the second playback time with the highest time overlap degree from the three time overlap degrees, that is, the second playback time of object a. Further, the computer device can use object a as an audio clustering cluster. The business object corresponding to C 1 , and the business roles (ie, role 1 and role 2) that have a mapping relationship with object a are obtained from the above Table 1 as the business role corresponding to the audio cluster C 1 . This means that the computer device can identify that each audio line in audio cluster C 1 is spoken by either character 1 or character 2.
以此类推,该计算机设备可以参见音频聚类簇C1对应的业务角色的音频角色识别方式,确定音频聚类簇C2对应的业务角色可以为与对象b具有映射关系的角色3,音频聚类簇C3对应的业务角色可以为与对象c具有映射关系的角色4。By analogy, the computer device can refer to the audio role identification method of the business role corresponding to the audio cluster C 1 and determine that the business role corresponding to the audio cluster C 2 can be the role 3 that has a mapping relationship with the object b. The audio cluster The business role corresponding to cluster C 3 may be role 4 that has a mapping relationship with object c.
在本申请实施例中,具有音频角色识别功能的计算机设备可以通过结合从视频帧中自动识别出的图片特征信息以及自适应聚类的M个音频聚类簇,将声音与角色关联识别,从而可以准确识别出与对象角色映射表相关联的P个音频聚类簇分别对应的业务角色,这种音频角色识别方式无需人工标注每一句音频台词所归属的业务角色,不仅可以减少消耗的人力时间,还能够解决相似音色识别错误的情况,以至于提高了识别的精确度以及效率。此外,本申请实施例在音频角色识别过程中可以采用音频语义特征聚类的方法,使得整个音频角色识别系统更具通用性,可适用不同多媒体数据中业务对象不同的场景,从而有效提高了识别的适用性。 In the embodiment of the present application, a computer device with an audio character recognition function can associate sounds with characters by combining picture feature information automatically recognized from video frames and M audio clusters of adaptive clustering, thereby The business roles corresponding to the P audio clusters associated with the object role mapping table can be accurately identified. This audio role identification method does not require manual annotation of the business role to which each audio line belongs, and can not only reduce the consumption of manpower and time , can also solve the problem of similar timbre recognition errors, so as to improve the accuracy and efficiency of recognition. In addition, the embodiments of the present application can adopt the audio semantic feature clustering method in the audio character recognition process, making the entire audio character recognition system more versatile and applicable to different scenarios of business objects in different multimedia data, thereby effectively improving the recognition applicability.
Further, refer to Figure 10, which is another schematic flowchart of a data processing method provided by an embodiment of this application. The method may be executed by a terminal device with an audio character recognition function (for example, any terminal device in the terminal device cluster shown in Figure 1 above, such as terminal device 100a), by a server with an audio character recognition function (for example, server 10F shown in Figure 1 above), or interactively by a target terminal device with a multimedia data playback function and a server with an audio character recognition function, which is not limited here. The method may include at least the following steps S201-S205:
Step S201: Identify picture feature information from the video frames of the multimedia data.
Step S202: Locate and separate the audio frames containing human voice from the original audio frames of the multimedia data to obtain N object audio frames, extract a corresponding audio semantic feature vector from each of the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters.
Step S203: Identify the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data.
For the specific implementation of steps S201-S203, refer to the description of steps S101-S103 in the embodiment corresponding to Figure 3 above, which will not be repeated here.
Step S204: Determine the service playback time of each of the P business objects in the multimedia data based on the first playback times of the P audio clusters in the multimedia data (specifically, of the object audio frames corresponding to the P audio clusters) and the second playback times, in the multimedia data, of the business objects corresponding to the P audio clusters (specifically, of the video frames in which those business objects appear).
Specifically, the computer device may obtain a target audio cluster from the P audio clusters and then determine the first playback time of the target audio cluster in the multimedia data and the second playback time, in the multimedia data, of the business object corresponding to the target audio cluster. Further, the computer device may determine the time intersection or the time union of the first playback time and the second playback time of the target audio cluster and take the determined time intersection or time union as the service playback time, in the multimedia data, of the business object corresponding to the target audio cluster, until the service playback time of each of the P business objects in the multimedia data is obtained.
The embodiments of this application use an audio semantic feature clustering method for audio character recognition, which compensates for the case where some video frames contain no facial or object information for a character yet audio is present, so that the character cannot be recognized from the picture. The business role corresponding to the current audio cluster can be derived automatically from the semantic features of the object audio frames, filling the gap left by image-based character recognition and ensuring the completeness of the character time-positioning information across the entire multimedia data.
As shown in Figure 9, the first playback time of audio cluster C_1 in the multimedia data may include playback time t_1 corresponding to segment 1 (for example, 00:30-10:10) and playback time t_3 corresponding to segment 3 (for example, 35:08-40:52), and the second playback time in the multimedia data of the business object corresponding to audio cluster C_1 (for example, object a) is playback time T_1 (for example, 00:00-10:00) and playback time T_3 (for example, 30:45-38:00). If the computer device determines the service playback time by time intersection, the service playback time of object a determined by the computer device may be 00:30-10:00 and 35:08-38:00. If the computer device determines the service playback time by time union, the service playback time of object a may be 00:00-10:10 and 30:45-40:52.
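The following sketch illustrates the time-intersection and time-union options using the Figure 9 values above. Converting the times to seconds and the helper functions are assumptions made for illustration only; the embodiment specifies merely that either set operation may be used.

```python
# A sketch of deriving the business playback time from the first and second
# playback times by interval intersection or union.
from typing import List, Tuple

Interval = Tuple[float, float]


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    out = []
    for s1, e1 in a:
        for s2, e2 in b:
            s, e = max(s1, s2), min(e1, e2)
            if s < e:
                out.append((s, e))
    return sorted(out)


def union(a: List[Interval], b: List[Interval]) -> List[Interval]:
    merged: List[Interval] = []
    for s, e in sorted(a + b):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


# Figure 9 example, with times converted to seconds:
t_cluster = [(30, 610), (2108, 2452)]    # t_1 = 00:30-10:10, t_3 = 35:08-40:52
t_object_a = [(0, 600), (1845, 2280)]    # T_1 = 00:00-10:00, T_3 = 30:45-38:00
print(intersect(t_cluster, t_object_a))  # [(30, 600), (2108, 2280)] -> 00:30-10:00, 35:08-38:00
print(union(t_cluster, t_object_a))      # [(0, 610), (1845, 2452)]  -> 00:00-10:10, 30:45-40:52
```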
Step S205: Obtain the multimedia segment data corresponding to each of the P business objects from the multimedia data based on the service playback time of each of the P business objects.
The multimedia segment data here may include the audio frames associated with the corresponding business object and the video frames associated with the corresponding business object.
As shown in Figure 9, after obtaining the service playback times of object a, object b, and object c, the computer device can obtain the multimedia segment data corresponding to each of these three business objects. For example, the computer device can obtain from the multimedia data the multimedia segment data matching the service playback time of object a (i.e., including the video frames associated with object a and the audio frames associated with object a) as the multimedia segment data corresponding to object a (for example, multimedia segment data 1). Similarly, the computer device can obtain the multimedia segment data matching the service playback time of object b (i.e., including the video frames and audio frames associated with object b) as the multimedia segment data corresponding to object b (for example, multimedia segment data 2), and obtain the multimedia segment data matching the service playback time of object c (i.e., including the video frames and audio frames associated with object c) as the multimedia segment data corresponding to object c (for example, multimedia segment data 3).
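One possible way to realize the segment extraction described above is sketched below; the use of ffmpeg, the command-line flags, and the file naming are illustrative assumptions, not part of the embodiment.

```python
# A sketch: cut per-object multimedia segment data (audio + video) out of the
# source media once each business object's service playback time is known.
import subprocess
from typing import List, Tuple

Interval = Tuple[float, float]


def extract_segments(source: str, obj_id: str, play_time: List[Interval]) -> List[str]:
    """Cut one clip per interval of the object's service playback time."""
    clips = []
    for idx, (start, end) in enumerate(play_time):
        out = f"{obj_id}_segment_{idx}.mp4"
        subprocess.run(
            ["ffmpeg", "-y", "-i", source, "-ss", str(start), "-to", str(end),
             "-c", "copy", out],
            check=True,
        )
        clips.append(out)
    return clips


# e.g. the clip set for object a (hypothetical file name):
# extract_segments("episode.mp4", "object_a", [(30, 600), (2108, 2280)])
```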
The fully automatic audio character recognition scheme based on audio semantic feature clustering provided by the embodiments of this application can automatically combine picture feature information (for example, character facial information) to identify the business roles in multimedia data, which saves a large amount of manual labeling cost and time and accelerates the realization of video applications. When the computer device has obtained the multimedia segment data corresponding to each business object, it can apply the data to the "只看TA" (watch only this character) user-specific service in multimedia playback scenarios, selecting shots for a particular business object (or business role) in the multimedia data. When the target user triggers this service, multimedia segment data not selected by the user is skipped automatically, so the computer device can locate the multimedia segment data of the user's preferred business object more precisely.
The computer device can play the multimedia data in a service playback display interface. The service playback display interface may include a playback selection control for triggering an object video data selection function. Further, when the target user performs a trigger operation on the playback selection control, the computer device may display an object playlist in response to the trigger operation. For example, the object playlist here may be presented in the bottom area of the service playback display interface as a floating window, an overlay layer, or in semi-transparent form, or it may be displayed in a collapsible interface whose display size can be changed by a drag operation and whose size is smaller than the service playback display interface. The object playlist may include object cover data corresponding to each of Z business objects, where Z is a positive integer less than or equal to P.
When the target user performs a trigger operation on target object cover data among the Z pieces of object cover data, the computer device may play the target multimedia segment data in the service playback interface in response to the trigger operation. The target multimedia segment data here may be the multimedia segment data corresponding to the business object to which the target object cover data corresponds, and that business object belongs to the P business objects. The trigger operation here may include contact operations such as taps and long presses, or contactless operations such as voice and gestures, which is not limited here.
For ease of understanding, refer further to Figure 11, which is a schematic diagram of a scenario for displaying multimedia segment data provided by an embodiment of this application. As shown in Figure 11, the computer device in this embodiment may be the target terminal device used by the target user. The target terminal device may be any terminal device in the terminal device cluster in the embodiment corresponding to Figure 1 above, for example, terminal device 100a. The interface 1101J and the interface 1102J shown in Figure 11 are both service playback display interfaces, at different times, provided by a client with a multimedia data playback function.
The target terminal device used by the target user can display multimedia data in interface 1101J; the multimedia data here may be the multimedia data 20S in the embodiment corresponding to Figure 2 above. Interface 1101J may include control 11U, which is the playback selection control for triggering the object video data selection function.
When the target user performs a trigger operation (for example, a tap) on control 11U, the target terminal device may display the object playlist 11B shown in Figure 11 in response to the trigger operation. The object playlist 11B here may include the object cover data corresponding to each of the Z business objects and the cover data corresponding to the complete multimedia data (for example, "watch the complete video"). Taking three business objects as an example, the object playlist 11B may specifically include object cover data 1 corresponding to object a (for example, "watch only the clips of object a"), object cover data 2 corresponding to object b (for example, "watch only the clips of object b"), and object cover data 3 corresponding to object c (for example, "watch only the clips of object c"). Object a, object b, and object c here all belong to the P business objects obtained by the target terminal device after performing audio character recognition on the multimedia data.
At this point, the target user can perform a trigger operation on target object cover data among the Z pieces of object cover data (for example, object cover data 1 corresponding to object a). In response to the trigger operation, the target terminal device can play, in interface 1102J shown in Figure 11, the multimedia segment data corresponding to object a, to which object cover data 1 corresponds. As shown in Figure 11, the target terminal device can also highlight, in the playback progress bar of the multimedia data displayed in interface 1102J, the playback progress corresponding to the multimedia segment data of object a, so that the target user can find the next segment of multimedia segment data for the object a they are interested in more quickly and accurately.
It should be noted that the interfaces and controls shown in Figure 11 are merely reference presentations. In actual business scenarios, developers can design them according to product requirements, and the embodiments of this application do not limit the specific forms of the interfaces and controls involved.
Further, when the computer device has obtained the multimedia segment data corresponding to each business object, it can also apply the data to merge-editing scenarios. For example, by classifying the audio data in the multimedia data, the computer device distinguishes the business role corresponding to each audio line and compiles the collection of line recordings corresponding to each business role across the entire multimedia data (i.e., the audio cluster) as production material, providing it to an intelligent video production team as candidate material for editing. For instance, the computer device can mix and cut multiple pieces of multimedia segment data of the same business object taken from different multimedia data, or it can merge and edit the multimedia segment data corresponding to different business objects.
The multimedia data here may include first multimedia data and second multimedia data, both of which include the object to be edited. The object to be edited belongs to the P business objects obtained by the computer device through audio character recognition. For example, the first multimedia data may be a war-themed television series in which the object to be edited appears, and the second multimedia data may be a xianxia (fantasy) television series in which the object to be edited appears.
The computer device can obtain, based on the object role mapping table associated with the first multimedia data, the first target business role corresponding to the object to be edited, and then obtain, from the first multimedia data, the first multimedia segment data associated with the first target business role. The first multimedia segment data here is determined based on the service playback time of the object to be edited in the first multimedia data. Similarly, the computer device can obtain, based on the object role mapping table associated with the second multimedia data, the second target business role corresponding to the object to be edited and obtain, from the second multimedia data, the second multimedia segment data associated with the second target business role, where the second multimedia segment data is determined based on the service playback time of the object to be edited in the second multimedia data. At this point, the computer device can merge and edit the first multimedia segment data and the second multimedia segment data to obtain the merged clip data corresponding to the object to be edited. The merged clip data here can be uploaded to the business data platform where the client resides, so that objects accessing the client can view it on their corresponding terminal devices.
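The merge-editing flow described above could, for example, be realized as sketched below; the use of ffmpeg's concat demuxer, the re-encoding choice, and the helper names are assumptions for illustration only.

```python
# A sketch: concatenate the clip lists produced for the same object to be edited
# from the first and second multimedia data into one piece of merged clip data.
import subprocess
from typing import List


def merge_clips(clips_from_first: List[str], clips_from_second: List[str],
                output: str = "merged_clip.mp4") -> str:
    playlist = "concat_list.txt"
    with open(playlist, "w") as f:
        for clip in clips_from_first + clips_from_second:
            f.write(f"file '{clip}'\n")
    # Re-encode so clips from different sources (resolution, codec) can be joined.
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", playlist, output],
        check=True,
    )
    return output
```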
In this embodiment of the application, a computer device with an audio character recognition function can associate voices with characters by combining the picture feature information automatically recognized from the video frames with the M adaptively clustered audio clusters, so that the business roles corresponding to the P audio clusters associated with the object role mapping table can be identified accurately. This audio character recognition approach does not require manually labeling the business role to which each audio line belongs; instead, the business roles and audio line information can be recognized and written automatically before the multimedia data goes online, quickly empowering downstream services (for example, user-specific playback services and merge-editing services). By using audio semantic feature clustering during audio character recognition, this embodiment not only reduces the labor and time consumed but also resolves misrecognition of similar timbres, improving recognition accuracy and efficiency, and at the same time makes the whole audio character recognition system more general and applicable to scenarios in which different multimedia data contain different business objects, effectively improving the applicability of the recognition.
Further, refer to Figure 12, which is a schematic structural diagram of a data processing apparatus provided by an embodiment of this application. As shown in Figure 12, the data processing apparatus 1 may include a picture information acquisition module 100, a clustering processing module 200, and an audio character recognition module 300.
The picture information acquisition module 100 is configured to identify picture feature information from the video frames of the multimedia data. The picture feature information includes the M business objects to which the character pictures in the video frames belong, where M is a positive integer.
The clustering processing module 200 is configured to locate and separate the audio frames containing human voice from the original audio frames of the multimedia data to obtain N object audio frames, extract a corresponding audio semantic feature vector from each of the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, where N is a positive integer and one audio cluster corresponds to one business object.
The audio character recognition module 300 is configured to identify the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data, where P is a positive integer less than or equal to M, the object role mapping table includes the business roles that have a mapping relationship with the listed business objects, and there are P business objects that overlap between the listed business objects and the M business objects.
For the specific implementation of the picture information acquisition module 100, the clustering processing module 200, and the audio character recognition module 300, refer to the description of steps S101-S103 in the embodiment corresponding to Figure 3 above, which will not be repeated here. Likewise, the description of the beneficial effects obtained by using the same method will not be repeated.
Further, refer to Figure 13, which is another schematic structural diagram of a data processing apparatus provided by an embodiment of this application. As shown in Figure 13, the data processing apparatus 2 may include a picture information acquisition module 11, a clustering processing module 12, an audio character recognition module 13, a service time determination module 14, a segment data determination module 15, a multimedia data playback module 16, an object list display module 17, a segment data playback module 18, a first segment data acquisition module 19, a second segment data acquisition module 20, and a merge editing module 21.
The picture information acquisition module 11 is configured to identify picture feature information from the video frames of the multimedia data, where the picture feature information includes the M business objects to which the character pictures in the video frames belong, and M is a positive integer.
The picture information acquisition module 11 includes a video frame acquisition unit 111, a picture cutting unit 112, a picture encoding unit 113, a vector matching unit 114, and a picture information acquisition unit 115.
The video frame acquisition unit 111 is configured to acquire video frames from the multimedia data.
The picture cutting unit 112 is configured to cut the pictures containing key parts of a character in a video frame to obtain the character pictures corresponding to the video frame. The character pictures include X character cut pictures, where X is a positive integer greater than or equal to M.
The picture cutting unit 112 includes a position determination subunit 1121 and a cutting subunit 1122.
The position determination subunit 1121 is configured to detect and locate the key parts of a character in a video frame to determine the position information of the key parts of the character in the video frame.
The cutting subunit 1122 is configured to cut the key parts of the character out of the video frame based on the position information to obtain X character cut pictures containing the key parts of the character, and to use the X character cut pictures as the character pictures corresponding to the video frame.
For the specific implementation of the position determination subunit 1121 and the cutting subunit 1122, refer to the description of the character cut pictures in the embodiment corresponding to Figure 5 above, which will not be repeated here.
The picture encoding unit 113 is configured to obtain a character cut picture T_i among the X character cut pictures and encode the character cut picture T_i to obtain the picture information vector L_i corresponding to the character cut picture T_i, where i is a positive integer less than or equal to X.
The vector matching unit 114 is configured to determine, from an information vector database associated with candidate business objects, the object key information vector that matches the picture information vector L_i, and to use the candidate business object corresponding to the matched object key information vector as the business object corresponding to the character cut picture T_i.
The vector matching unit 114 includes a database acquisition subunit 1141, a vector distance determination subunit 1142, and an object matching subunit 1143.
The database acquisition subunit 1141 is configured to acquire the information vector database associated with the candidate business objects, where the information vector database is used to store the object key information vectors corresponding to each of Y candidate business objects, and Y is a positive integer greater than or equal to M.
The vector distance determination subunit 1142 is configured to determine the vector distance between the picture information vector L_i and each of the Y object key information vectors, obtaining Y vector distances.
The object matching subunit 1143 is configured to obtain, from the Y vector distances, the minimum vector distance that is less than or equal to a distance threshold, determine the candidate business object corresponding to the object key information vector associated with the minimum vector distance, and use the determined candidate business object as the business object corresponding to the character cut picture T_i.
For the specific implementation of the database acquisition subunit 1141, the vector distance determination subunit 1142, and the object matching subunit 1143, refer to the description of object matching of character cut pictures in the embodiment corresponding to Figure 4 above, which will not be repeated here.
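A minimal sketch of the vector matching performed by subunits 1141-1143 might look as follows; the Euclidean distance and the example threshold value are assumptions, since the embodiment only requires some vector distance and a distance threshold.

```python
# A sketch: match the picture information vector L_i of a character cut picture
# against the Y object key information vectors in the information vector database.
import numpy as np


def match_business_object(picture_vec: np.ndarray,
                          key_vectors: dict,          # candidate business object -> key info vector
                          distance_threshold: float = 0.8):
    """Return the candidate business object whose key information vector has the
    smallest vector distance not exceeding the threshold, or None if none qualifies."""
    distances = {
        obj: float(np.linalg.norm(picture_vec - vec))
        for obj, vec in key_vectors.items()
    }
    best_obj = min(distances, key=distances.get)      # minimum vector distance
    return best_obj if distances[best_obj] <= distance_threshold else None
```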
The picture information acquisition unit 115 is configured to determine the picture feature information corresponding to the video frame based on the obtained business objects corresponding to the character cut pictures.
For the specific implementation of the video frame acquisition unit 111, the picture cutting unit 112, the picture encoding unit 113, the vector matching unit 114, and the picture information acquisition unit 115, refer to the description of step S101 in the embodiment corresponding to Figure 3 above, which will not be repeated here.
The clustering processing module 12 is configured to locate and separate the audio frames containing human voice from the original audio frames of the multimedia data to obtain N object audio frames, extract a corresponding audio semantic feature vector from each of the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, where N is a positive integer and one audio cluster corresponds to one business object.
The clustering processing module 12 includes an object audio frame determination unit 121, a semantic feature extraction unit 122, and a clustering processing unit 123.
The object audio frame determination unit 121 is configured to locate and separate the audio frames containing human voice from the original audio frames of the multimedia data to obtain the N object audio frames.
The object audio frame determination unit 121 includes an original audio frame acquisition subunit 1211, a source separation subunit 1212, and an object audio frame determination subunit 1213.
The original audio frame acquisition subunit 1211 is configured to acquire the original audio frames from the multimedia data.
The source separation subunit 1212 is configured to perform source separation on the original audio frames to obtain the to-be-processed audio frames containing human voice.
The source separation subunit 1212 includes an amplitude spectrum generation subunit 12121, a type feature generation subunit 12122, a merge mask subunit 12123, and a to-be-processed audio frame determination subunit 12124.
The amplitude spectrum generation subunit 12121 is configured to input the original audio frames into a source separation model and generate, through the source separation model, the spectral amplitude spectrum corresponding to the original audio frames. The source separation model includes a first segmentation network layer and a second segmentation network layer.
The type feature generation subunit 12122 is configured to input the spectral amplitude spectrum into the first segmentation network layer and the second segmentation network layer respectively, generate the first type feature corresponding to the spectral amplitude spectrum through the first segmentation network layer, and generate the second type feature corresponding to the spectral amplitude spectrum through the second segmentation network layer.
The merge mask subunit 12123 is configured to perform merge mask processing on the first type feature and the second type feature to obtain a target mask map corresponding to the first type feature.
The to-be-processed audio frame determination subunit 12124 is configured to generate a target type audio frame through an inverse spectral transform based on the corresponding positions of the target mask map and the spectral amplitude spectrum, and to use the target type audio frame as the to-be-processed audio frame containing human voice output by the source separation model.
For the specific implementation of the amplitude spectrum generation subunit 12121, the type feature generation subunit 12122, the merge mask subunit 12123, and the to-be-processed audio frame determination subunit 12124, refer to the description of the to-be-processed audio frames in the embodiment corresponding to Figure 7 above, which will not be repeated here.
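A rough sketch of the mask-based source separation flow performed by subunits 12121-12124 is given below. The two branch networks stand in for the first and second segmentation network layers, and the ratio-style merge mask, the STFT parameters, and the use of librosa and PyTorch are all assumptions made purely for illustration.

```python
# A sketch: magnitude spectrogram -> two branch feature maps -> merged (soft) mask
# for the vocal component -> masked spectrum -> inverse spectral transform.
import numpy as np
import librosa
import torch


def separate_vocals(waveform: np.ndarray, first_branch, second_branch) -> np.ndarray:
    stft = librosa.stft(waveform, n_fft=2048, hop_length=512)
    magnitude, phase = np.abs(stft), np.angle(stft)       # spectral amplitude spectrum

    mag = torch.from_numpy(magnitude)[None, None].float()
    with torch.no_grad():
        vocal_feat = first_branch(mag)                    # first type feature (vocal branch)
        other_feat = second_branch(mag)                   # second type feature (other branch)
        # Merge-mask processing: ratio of the vocal feature over the sum of both.
        mask = (vocal_feat / (vocal_feat + other_feat + 1e-8)).squeeze().numpy()

    vocal_spec = magnitude * mask * np.exp(1j * phase)    # apply mask position-wise
    return librosa.istft(vocal_spec, hop_length=512)      # inverse spectral transform
```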
The object audio frame determination subunit 1213 is configured to locate and cut the non-silent segments in the audio impulse signal frames of the to-be-processed audio frames based on an audio boundary detection strategy for removing silent frames, obtaining the N object audio frames.
For the specific implementation of the original audio frame acquisition subunit 1211, the source separation subunit 1212, and the object audio frame determination subunit 1213, refer to the description of object locating and separation of the original audio frames in the embodiment corresponding to Figure 3 above, which will not be repeated here.
The semantic feature extraction unit 122 is configured to perform semantic feature extraction on each of the N object audio frames to obtain the audio semantic feature vector corresponding to each object audio frame.
The semantic feature extraction unit 122 includes an audio frame input subunit 1221, a frequency-domain feature determination subunit 1222, a time-domain feature determination subunit 1223, and an audio feature vector determination subunit 1224.
The audio frame input subunit 1221 is configured to input the N object audio frames into an audio semantic feature extraction model. The audio semantic feature extraction model includes a frequency-domain branch network layer, a time-domain branch network layer, and a convolutional network layer.
The frequency-domain feature determination subunit 1222 is configured to perform feature learning on the N object audio frames through the frequency-domain branch network layer to obtain a learned frequency-domain feature map.
The time-domain feature determination subunit 1223 is configured to perform feature learning on the N object audio frames through the time-domain branch network layer to obtain a learned time-domain feature map. The learned frequency-domain feature map and the learned time-domain feature map have the same feature dimensions.
The audio feature vector determination subunit 1224 is configured to superimpose the learned frequency-domain feature map and the learned time-domain feature map to obtain a superimposed feature, input the superimposed feature into the convolutional network layer, perform max-average processing on the superimposed feature, and output the audio semantic feature vector corresponding to each object audio frame.
For the specific implementation of the audio frame input subunit 1221, the frequency-domain feature determination subunit 1222, the time-domain feature determination subunit 1223, and the audio feature vector determination subunit 1224, refer to the description of semantic feature extraction from object audio frames in the embodiment corresponding to Figure 8 above, which will not be repeated here.
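The two-branch audio semantic feature extraction model could be sketched as follows; the layer sizes, the shape of the time-domain input, and the interpretation of "max-average processing" as the mean of max pooling and average pooling are assumptions made purely for illustration.

```python
# A sketch of the audio semantic feature extraction model: a frequency-domain
# branch and a time-domain branch produce feature maps of the same dimensions,
# the maps are superimposed, and a convolutional layer plus pooling yields one
# semantic feature vector per object audio frame.
import torch
import torch.nn as nn


class AudioSemanticExtractor(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.freq_branch = nn.Sequential(            # frequency-domain branch network layer
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.time_branch = nn.Sequential(             # time-domain branch network layer
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.conv = nn.Conv2d(64, embed_dim, kernel_size=3, padding=1)

    def forward(self, freq_input: torch.Tensor, time_input: torch.Tensor) -> torch.Tensor:
        freq_map = self.freq_branch(freq_input)       # learned frequency-domain feature map
        time_map = self.time_branch(time_input)       # learned time-domain feature map (same dims)
        stacked = freq_map + time_map                 # superimpose the two feature maps
        feat = self.conv(stacked)
        # "Max-average" processing, interpreted here as averaging max and mean pooling.
        pooled = 0.5 * (feat.amax(dim=(2, 3)) + feat.mean(dim=(2, 3)))
        return pooled                                 # one semantic vector per audio frame
```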
The clustering processing unit 123 is configured to determine M as the number of cluster centers to be clustered and, based on the number of cluster centers, cluster the obtained audio semantic feature vector corresponding to each object audio frame to obtain the M audio clusters.
For the specific implementation of the object audio frame determination unit 121, the semantic feature extraction unit 122, and the clustering processing unit 123, refer to the description of step S102 in the embodiment corresponding to Figure 3 above, which will not be repeated here.
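A minimal sketch of the clustering step follows; k-means is used here only as an illustrative choice, since the embodiment does not mandate a particular clustering algorithm, and the number of business objects M found in the video frames is used directly as the number of cluster centers.

```python
# A sketch: cluster the N audio semantic feature vectors into M audio clusters.
import numpy as np
from sklearn.cluster import KMeans


def cluster_audio_features(feature_vectors: np.ndarray, m_objects: int):
    """feature_vectors: shape (N, D), one semantic vector per object audio frame."""
    kmeans = KMeans(n_clusters=m_objects, n_init=10, random_state=0)
    labels = kmeans.fit_predict(feature_vectors)
    # Group frame indices by cluster, giving the M audio clusters C_1 ... C_M.
    return {k: np.where(labels == k)[0].tolist() for k in range(m_objects)}
```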
The audio character recognition module 13 is configured to identify the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data, where P is a positive integer less than or equal to M, the object role mapping table includes the business roles that have a mapping relationship with the listed business objects, and there are P business objects that overlap between the listed business objects and the M business objects.
The audio character recognition module 13 includes a first time extraction unit 131, a second time extraction unit 132, a time overlap determination unit 133, and an audio character recognition unit 134.
The first time extraction unit 131 is configured to obtain an audio cluster C_k from the M audio clusters and extract the one or more playback times, in the multimedia data, of the object audio frames corresponding to the audio semantic feature vectors included in the audio cluster C_k as the first playback time of the audio cluster C_k, where k is a positive integer less than or equal to M.
The second time extraction unit 132 is configured to obtain, from the listed business objects of the object role mapping table, the P business objects that overlap with the M business objects, and to extract, based on the picture feature information, the one or more playback times, in the multimedia data, of the video frames in which each of the P business objects appears as the second playback time of that business object.
The time overlap determination unit 133 is configured to determine the degree of temporal overlap between the first playback time of the audio cluster C_k and the second playback time corresponding to each business object, and to take the business object corresponding to the second playback time with the highest degree of temporal overlap as the business object corresponding to the audio cluster C_k.
The audio character recognition unit 134 is configured to obtain, from the object role mapping table, the business role corresponding to the business object associated with the audio cluster C_k, and to use the obtained business role as the business role corresponding to the audio cluster C_k.
For the specific implementation of the first time extraction unit 131, the second time extraction unit 132, the time overlap determination unit 133, and the audio character recognition unit 134, refer to the description of step S103 in the embodiment corresponding to Figure 3 above, which will not be repeated here.
The service time determination module 14 is configured to determine the service playback time of each of the P business objects in the multimedia data based on the first playback times of the P audio clusters in the multimedia data and the second playback times, in the multimedia data, of the business objects corresponding to the P audio clusters.
The segment data determination module 15 is configured to obtain, from the multimedia data, the multimedia segment data corresponding to each of the P business objects based on the service playback time corresponding to each business object. The multimedia segment data includes the audio frames associated with the corresponding business object and the video frames associated with the corresponding business object.
The multimedia data playback module 16 is configured to play the multimedia data in a service playback display interface. The service playback display interface includes a playback selection control for triggering an object video data selection function.
The object list display module 17 is configured to display an object playlist in response to a trigger operation on the playback selection control, where the object playlist includes the object cover data corresponding to each of Z business objects, and Z is a positive integer less than or equal to P.
The segment data playback module 18 is configured to play target multimedia segment data in the service playback interface in response to a trigger operation on target object cover data among the Z pieces of object cover data, where the target multimedia segment data is the multimedia segment data corresponding to the business object to which the target object cover data corresponds, and that business object belongs to the P business objects.
The multimedia data includes first multimedia data and second multimedia data, both of which include the object to be edited. The object to be edited belongs to the P business objects.
The first segment data acquisition module 19 is configured to obtain, based on the object role mapping table associated with the first multimedia data, the first target business role corresponding to the object to be edited, and to obtain, from the first multimedia data, the first multimedia segment data associated with the first target business role. The first multimedia segment data is determined based on the service playback time of the object to be edited in the first multimedia data.
The second segment data acquisition module 20 is configured to obtain, based on the object role mapping table associated with the second multimedia data, the second target business role corresponding to the object to be edited, and to obtain, from the second multimedia data, the second multimedia segment data associated with the second target business role. The second multimedia segment data is determined based on the service playback time of the object to be edited in the second multimedia data.
The merge editing module 21 is configured to merge and edit the first multimedia segment data and the second multimedia segment data to obtain the merged clip data corresponding to the object to be edited.
For the specific implementation of the picture information acquisition module 11, the clustering processing module 12, the audio character recognition module 13, the service time determination module 14, the segment data determination module 15, the multimedia data playback module 16, the object list display module 17, the segment data playback module 18, the first segment data acquisition module 19, the second segment data acquisition module 20, and the merge editing module 21, refer to the description of steps S201-S205 in the embodiment corresponding to Figure 10 above, which will not be repeated here. Likewise, the description of the beneficial effects obtained by using the same method will not be repeated.
Further, refer to Figure 14, which is a schematic diagram of a computer device provided by an embodiment of this application. As shown in Figure 14, the computer device 1000 may be a computer device with an audio character recognition function and may include at least one processor 1001 (for example, a CPU), at least one network interface 1004, a memory 1005, and at least one communication bus 1002. The communication bus 1002 is used to implement connection and communication among these components. The network interface 1004 may include a standard wired interface or a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, for example, at least one magnetic disk memory; the memory 1005 may also be at least one storage apparatus located remotely from the aforementioned processor 1001. As shown in Figure 14, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application. In some embodiments, the computer device may also include the user interface 1003 shown in Figure 14. For example, if the computer device is the terminal device with the audio character recognition function shown in Figure 1 (for example, terminal device 100a), the computer device may also include the user interface 1003, which may include a display screen (Display), a keyboard (Keyboard), and the like.
In the computer device 1000 shown in Figure 14, the network interface 1004 is mainly used for network communication, the user interface 1003 is mainly used to provide an input interface for the user, and the processor 1001 can be used to invoke the device control application stored in the memory 1005 to implement the following:
identifying picture feature information from the video frames of the multimedia data, where the picture feature information includes the M business objects to which the character pictures in the video frames belong, and M is a positive integer;
locating and separating the audio frames containing human voice from the original audio frames of the multimedia data to obtain N object audio frames, extracting a corresponding audio semantic feature vector from each of the N object audio frames, and clustering the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, where N is a positive integer and one audio cluster corresponds to one business object;
identifying the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data, where P is a positive integer less than or equal to M, the object role mapping table includes the business roles that have a mapping relationship with the listed business objects, and there are P business objects that overlap between the listed business objects and the M business objects.
The computer device 1000 described in this embodiment of the application can perform the description of the data processing method in the embodiments corresponding to Figure 3 and Figure 10 above, and can also perform the description of the data processing apparatus 1 in the embodiment corresponding to Figure 12 above and of the data processing apparatus 2 in the embodiment corresponding to Figure 13 above, which will not be repeated here. Likewise, the description of the beneficial effects obtained by using the same method will not be repeated.
本申请实施例还提供一种计算机可读存储介质,该计算机可读存储介质存储有计算机程序,该计算机程序包括程序指令,该程序指令被处理器执行时实现图3和图10中各个步骤所提供的数据处理方法,具体可参见图3以及图10各个步骤所提供的实现方式,在此不再赘述。Embodiments of the present application also provide a computer-readable storage medium. The computer-readable storage medium stores a computer program. The computer program includes program instructions. When the program instructions are executed by a processor, the steps in Figure 3 and Figure 10 are implemented. For the data processing method provided, please refer to Figure 3 and the implementation provided in each step of Figure 10 for details, which will not be described again here.
The computer-readable storage medium may be the data transmission apparatus provided in any of the foregoing embodiments, or an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device. Further, the computer-readable storage medium may include both an internal storage unit of the computer device and an external storage device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device, and may also be used to temporarily store data that has been output or is to be output.
In one aspect, the present application provides a computer program product. The computer program product includes a computer program stored in a computer-readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium and executes it, so that the computer device can perform the description of the data processing method in the embodiment corresponding to FIG. 3 or FIG. 10, which will not be repeated here. In addition, the description of the beneficial effects of using the same method will not be repeated either.
The terms "first", "second", and the like in the specification, claims, and drawings of the embodiments of this application are used to distinguish different objects rather than to describe a specific order. Furthermore, the term "include" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, product, or device that includes a series of steps or units is not limited to the listed steps or modules, but may further include steps or modules that are not listed, or other step units inherent to the process, method, apparatus, product, or device.
A person of ordinary skill in the art may realize that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of functions. Whether these functions are performed by hardware or by software depends on the specific application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
What is disclosed above is merely a preferred embodiment of the present application and certainly cannot be used to limit the scope of rights of the present application. Therefore, equivalent changes made according to the claims of the present application still fall within the scope covered by the present application.

Claims (15)

  1. A data processing method, comprising:
    identifying picture feature information from video frames of multimedia data, wherein the picture feature information comprises M business objects to which character pictures in the video frames belong, and M is a positive integer;
    locating and separating audio frames containing human voice from original audio frames of the multimedia data to obtain N object audio frames, extracting corresponding audio semantic feature vectors from the N object audio frames respectively, and clustering the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, wherein N is a positive integer and one audio cluster corresponds to one business object;
    identifying, based on the picture feature information, the M audio clusters, and an object-role mapping table associated with the multimedia data, a business role corresponding to each of P audio clusters, wherein P is a positive integer less than or equal to M, the object-role mapping table comprises business roles that have a mapping relationship with listed business objects, and there are P overlapping business objects between the listed business objects and the M business objects.
  2. The method according to claim 1, wherein the identifying picture feature information from video frames of multimedia data comprises:
    obtaining video frames from the multimedia data;
    cutting pictures containing key character parts in the video frames to obtain character pictures corresponding to the video frames, wherein the character pictures comprise X character cut pictures, and X is a positive integer greater than or equal to M;
    obtaining a character cut picture Ti from the X character cut pictures, and encoding the character cut picture Ti to obtain a picture information vector Li corresponding to the character cut picture Ti, wherein i is a positive integer less than or equal to X;
    determining, from an information vector database associated with candidate business objects, an object key information vector matching the picture information vector Li, and taking the candidate business object corresponding to the matched object key information vector as the business object corresponding to the character cut picture Ti;
    determining, based on the business objects respectively corresponding to the X character cut pictures, the picture feature information corresponding to the video frames.
  3. The method according to claim 2, wherein the cutting pictures containing key character parts in the video frames to obtain character pictures corresponding to the video frames comprises:
    detecting and locating the key character parts in the video frames to determine position information of the key character parts in the video frames;
    cutting the key character parts in the video frames based on the position information to obtain X character cut pictures containing the key character parts, and taking the X character cut pictures as the character pictures corresponding to the video frames.
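A minimal sketch of the cutting step in claim 3, assuming a detector has already returned (x, y, w, h) position boxes for the key character parts; the detector itself is outside this sketch.

# Cut character key parts (for example, faces) out of a video frame given
# detected bounding boxes; each crop is one character cut picture.
import numpy as np

def cut_character_pictures(frame, boxes):
    """frame: H x W x 3 uint8 array; boxes: list of (x, y, w, h) positions."""
    crops = []
    h, w = frame.shape[:2]
    for x, y, bw, bh in boxes:
        x0, y0 = max(0, x), max(0, y)
        x1, y1 = min(w, x + bw), min(h, y + bh)
        crops.append(frame[y0:y1, x0:x1].copy())   # one character cut picture
    return crops                                   # the X cut pictures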
  4. The method according to claim 2, wherein the determining, from an information vector database associated with candidate business objects, an object key information vector matching the picture information vector Li, and taking the candidate business object corresponding to the matched object key information vector as the business object corresponding to the character cut picture Ti comprises:
    obtaining the information vector database associated with candidate business objects, wherein the information vector database is used to store object key information vectors respectively corresponding to Y candidate objects, and Y is a positive integer greater than or equal to M;
    determining a vector distance between the picture information vector Li and each of the Y object key information vectors to obtain Y vector distances;
    obtaining, from the Y vector distances, a minimum vector distance that is less than or equal to a distance threshold, determining the candidate business object corresponding to the object key information vector corresponding to the minimum vector distance, and taking the determined candidate business object as the business object corresponding to the character cut picture Ti.
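A minimal sketch of the matching step in claim 4, assuming Euclidean distance as the vector distance; the claim itself does not fix a particular distance metric, and the threshold value is left to the caller.

# Match a picture information vector L_i against the key information vectors
# of Y candidate objects and keep the match only if it is close enough.
import numpy as np

def match_business_object(l_i, key_vectors, candidate_ids, distance_threshold):
    """l_i: (D,) query vector; key_vectors: (Y, D); candidate_ids: length-Y list."""
    distances = np.linalg.norm(key_vectors - l_i, axis=1)   # Y vector distances
    best = int(np.argmin(distances))
    if distances[best] <= distance_threshold:
        return candidate_ids[best]     # business object for cut picture T_i
    return None                        # no candidate is close enough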
  5. The method according to any one of claims 1 to 4, wherein the extracting corresponding audio semantic feature vectors from the N object audio frames respectively and clustering the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters comprises:
    performing semantic feature extraction on each of the N object audio frames to obtain an audio semantic feature vector corresponding to each object audio frame;
    determining M as the number of cluster centers to be clustered, and clustering, based on the number of cluster centers, the obtained audio semantic feature vector corresponding to each object audio frame to obtain M audio clusters.
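A minimal sketch of claim 5, assuming scikit-learn's KMeans as the clustering algorithm; any clustering method that accepts M as the number of cluster centers would fit the claim equally well.

# Cluster the per-frame audio semantic vectors into M clusters, where M is
# the number of business objects identified in the video frames.
import numpy as np
from sklearn.cluster import KMeans

def cluster_audio_vectors(audio_vectors, m):
    """audio_vectors: (N, D) array of audio semantic feature vectors."""
    labels = KMeans(n_clusters=m, n_init=10).fit_predict(audio_vectors)
    clusters = [np.flatnonzero(labels == c) for c in range(m)]  # M clusters
    return clusters   # each entry holds the frame indices of one cluster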
  6. The method according to claim 5, wherein the locating and separating audio frames containing human voice from the original audio frames of the multimedia data to obtain N object audio frames comprises:
    obtaining the original audio frames from the multimedia data;
    performing source separation on the original audio frames to obtain to-be-processed audio frames containing human voice;
    locating and cutting, based on an audio boundary detection strategy for removing silent frames, non-silent segments in audio impact signal frames in the to-be-processed audio frames to obtain the N object audio frames.
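A minimal sketch of the boundary-detection step in claim 6, assuming a simple energy threshold as the audio boundary detection strategy; production systems would typically use a trained voice activity detector, and the frame length and threshold values here are only illustrative.

# Drop silent stretches from the separated vocal track and keep the
# non-silent segments as object audio frames.
import numpy as np

def split_non_silent(vocals, sample_rate, frame_ms=30, silence_db=-40.0):
    """vocals: 1-D float array of the separated human-voice signal."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(vocals) // frame_len
    frames = vocals[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    loud = 20 * np.log10(rms + 1e-12) > silence_db       # non-silent frames
    segments, start = [], None
    for i, keep in enumerate(loud):
        if keep and start is None:
            start = i
        elif not keep and start is not None:
            segments.append(frames[start:i].reshape(-1))
            start = None
    if start is not None:
        segments.append(frames[start:].reshape(-1))
    return segments            # the N object audio frames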
  7. The method according to claim 6, wherein the performing source separation on the original audio frames to obtain to-be-processed audio frames containing human voice comprises:
    inputting the original audio frames into a source separation model, and generating, by the source separation model, a spectral magnitude spectrum corresponding to the original audio frames, wherein the source separation model comprises a first segmentation network layer and a second segmentation network layer;
    inputting the spectral magnitude spectrum into the first segmentation network layer and the second segmentation network layer respectively, generating, by the first segmentation network layer, first-type features corresponding to the spectral magnitude spectrum, and generating, by the second segmentation network layer, second-type features corresponding to the spectral magnitude spectrum;
    merging and masking the first-type features and the second-type features to obtain a target mask map corresponding to the first-type features;
    generating target-type audio frames through inverse spectral transformation based on corresponding positions of the target mask map and the spectral magnitude spectrum, and taking the target-type audio frames as the to-be-processed audio frames containing human voice output by the source separation model.
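A minimal PyTorch sketch of the two-branch masking separator in claim 7, assuming small convolutional stacks for the two segmentation network layers and a sigmoid mask head; the concrete layer sizes are placeholders and are not taken from the disclosure.

# Magnitude spectrum -> two segmentation branches -> merged mask -> masked
# spectrum -> inverse STFT back to a vocal waveform.
import torch
import torch.nn as nn

class TwoBranchSeparator(nn.Module):
    def __init__(self, n_fft=1024, hop=256):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.branch1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
        self.branch2 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
        self.mask_head = nn.Conv2d(32, 1, 1)            # merge -> mask logits

    def forward(self, waveform):                        # waveform: (batch, samples)
        window = torch.hann_window(self.n_fft, device=waveform.device)
        spec = torch.stft(waveform, self.n_fft, self.hop,
                          window=window, return_complex=True)
        mag, phase = spec.abs(), torch.angle(spec)      # magnitude spectrum
        x = mag.unsqueeze(1)                            # (B, 1, F, T)
        merged = torch.cat([self.branch1(x), self.branch2(x)], dim=1)
        mask = torch.sigmoid(self.mask_head(merged)).squeeze(1)   # target mask
        vocal_spec = torch.polar(mask * mag, phase)                # apply mask
        return torch.istft(vocal_spec, self.n_fft, self.hop,
                           window=window, length=waveform.shape[-1])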
  8. The method according to claim 5, wherein the performing semantic feature extraction on each of the N object audio frames to obtain an audio semantic feature vector corresponding to each object audio frame comprises:
    inputting the N object audio frames into an audio semantic feature extraction model, wherein the audio semantic feature extraction model comprises a frequency-domain branch network layer, a time-domain branch network layer, and a convolutional network layer;
    performing feature learning on the N object audio frames through the frequency-domain branch network layer to obtain a learned frequency-domain feature map;
    performing feature learning on the N object audio frames through the time-domain branch network layer to obtain a learned time-domain feature map, wherein the learned frequency-domain feature map and the learned time-domain feature map have the same feature dimensions;
    superimposing the learned frequency-domain feature map and the learned time-domain feature map to obtain superimposed features, inputting the superimposed features into the convolutional network layer, performing max-average processing on the superimposed features, and outputting the audio semantic feature vector corresponding to each object audio frame.
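A minimal PyTorch sketch of the dual-branch extractor in claim 8, assuming an STFT magnitude for the frequency-domain branch, a strided convolution for the time-domain branch, and interpreting "max-average processing" as the mean of max pooling and average pooling; these choices are assumptions of the sketch rather than the disclosed architecture.

# Frequency-domain and time-domain branches produce feature maps of the same
# size; the maps are added, refined by a convolutional head, and pooled into
# one semantic vector per object audio frame.
import torch
import torch.nn as nn

class AudioSemanticExtractor(nn.Module):
    def __init__(self, n_fft=512, hop=128, embed_dim=128):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.freq_branch = nn.Sequential(
            nn.Conv1d(n_fft // 2 + 1, embed_dim, 3, padding=1), nn.ReLU())
        self.time_branch = nn.Sequential(
            nn.Conv1d(1, embed_dim, 256, stride=hop, padding=128), nn.ReLU())
        self.head = nn.Sequential(
            nn.Conv1d(embed_dim, embed_dim, 3, padding=1), nn.ReLU())

    def forward(self, wave):                             # wave: (batch, samples)
        window = torch.hann_window(self.n_fft, device=wave.device)
        spec = torch.stft(wave, self.n_fft, self.hop,
                          window=window, return_complex=True).abs()  # (B, F, T)
        f = self.freq_branch(spec)                        # (B, D, T)
        t = self.time_branch(wave.unsqueeze(1))           # (B, D, T')
        t = nn.functional.interpolate(t, size=f.shape[-1])  # align time steps
        fused = self.head(f + t)                           # superimpose maps
        # "max-average" pooling taken here as mean of max and average pooling
        pooled = 0.5 * (fused.max(dim=-1).values + fused.mean(dim=-1))
        return pooled                                      # (B, D) embedding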
  9. The method according to any one of claims 1 to 4, wherein the identifying, based on the picture feature information, the M audio clusters, and the object-role mapping table associated with the multimedia data, the business role corresponding to each of the P audio clusters comprises:
    obtaining an audio cluster Ck from the M audio clusters, and extracting one or more playback times, in the multimedia data, of the object audio frames corresponding to the audio semantic feature vectors included in the audio cluster Ck, as a first playback time of the audio cluster Ck, wherein k is a positive integer less than or equal to M;
    obtaining, from the listed business objects of the object-role mapping table, P business objects that overlap with the M business objects, and extracting, based on the picture feature information, one or more playback times, in the multimedia data, of the video frames in which each of the P business objects appears, as a second playback time of each business object;
    determining a time overlap degree between the first playback time of the audio cluster Ck and each second playback time respectively, and taking the business object corresponding to the second playback time with the highest time overlap degree as the business object corresponding to the audio cluster Ck;
    obtaining, from the object-role mapping table, the business role corresponding to the business object corresponding to the audio cluster Ck, and taking the obtained business role as the business role corresponding to the audio cluster Ck.
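A minimal sketch of claim 9, assuming playback times are represented as (start, end) intervals in seconds; the overlap measure and the interval representation are assumptions of this sketch.

# Measure how much the playback times of audio cluster C_k overlap the
# on-screen times of each candidate business object, pick the object with
# the highest overlap, then read its role from the mapping table.
def overlap_seconds(intervals_a, intervals_b):
    total = 0.0
    for a0, a1 in intervals_a:
        for b0, b1 in intervals_b:
            total += max(0.0, min(a1, b1) - max(a0, b0))
    return total

def role_for_cluster(cluster_times, object_times, role_map):
    """cluster_times: intervals of C_k; object_times: dict object -> intervals;
    role_map: dict object -> business role (the object-role mapping table)."""
    best_object = max(object_times,
                      key=lambda o: overlap_seconds(cluster_times, object_times[o]))
    return role_map.get(best_object)      # business role of audio cluster C_k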
  10. The method according to any one of claims 1 to 4, further comprising:
    determining a business playback time, in the multimedia data, of each of the P business objects based on first playback times of the P audio clusters in the multimedia data and second playback times, in the multimedia data, of the business objects respectively corresponding to the P audio clusters;
    obtaining, from the multimedia data based on the business playback time corresponding to each of the P business objects, multimedia segment data respectively corresponding to the P business objects, wherein the multimedia segment data comprises audio frames associated with the corresponding business object and video frames associated with the corresponding business object.
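A minimal sketch of claim 10, assuming the business playback time of an object is formed by merging its audio-cluster intervals (first playback time) with its on-screen intervals (second playback time); the merge rule is an assumption, since the claim only requires that both playback times are used.

# Merge the two sets of intervals for one business object; the merged
# intervals define which spans of the multimedia data to cut out as that
# object's multimedia segment data.
def merge_intervals(intervals):
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def business_segments(first_play_time, second_play_time):
    """first_play_time: audio-cluster intervals; second_play_time: frame intervals."""
    return merge_intervals(list(first_play_time) + list(second_play_time))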
  11. The method according to claim 10, further comprising:
    playing the multimedia data in a business playback display interface, wherein the business playback display interface comprises a playback selection control for triggering an object video data selection function;
    displaying an object playlist in response to a trigger operation on the playback selection control, wherein the object playlist comprises object cover data respectively corresponding to Z business objects, and Z is a positive integer less than or equal to P;
    playing target multimedia segment data in the business playback interface in response to a trigger operation on target object cover data among the Z pieces of object cover data, wherein the target multimedia segment data is the multimedia segment data corresponding to the business object corresponding to the target object cover data, and the business object corresponding to the target object cover data belongs to the P business objects.
  12. The method according to claim 10, wherein the multimedia data comprises first multimedia data and second multimedia data, both the first multimedia data and the second multimedia data comprise a to-be-edited object, and the to-be-edited object belongs to the P business objects;
    the method further comprises:
    obtaining, based on an object-role mapping table associated with the first multimedia data, a first target business role corresponding to the to-be-edited object, and obtaining, from the first multimedia data, first multimedia segment data associated with the first target business role, wherein the first multimedia segment data is determined based on the business playback time of the to-be-edited object in the first multimedia data;
    obtaining, based on an object-role mapping table associated with the second multimedia data, a second target business role corresponding to the to-be-edited object, and obtaining, from the second multimedia data, second multimedia segment data associated with the second target business role, wherein the second multimedia segment data is determined based on the business playback time of the to-be-edited object in the second multimedia data;
    merging and editing the first multimedia segment data and the second multimedia segment data to obtain merged edited data corresponding to the to-be-edited object.
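A minimal sketch of claim 12, assuming each multimedia source exposes its object-role mapping table and its per-role segment intervals as plain dictionaries; the clip representation (source id plus time interval) is hypothetical and only illustrates the cross-source merge.

# Collect the clips of one character from two multimedia sources by looking
# up the character's role in each source's mapping table, then splice the
# clips into one merged cut.
def merged_edit(clip_object, sources):
    """sources: list of dicts with keys 'id', 'role_map' (object -> role)
    and 'segments' (role -> list of (start, end) intervals)."""
    timeline = []
    for source in sources:
        role = source['role_map'].get(clip_object)       # target business role
        for start, end in source['segments'].get(role, []):
            timeline.append((source['id'], start, end))  # clip to concatenate
    return timeline          # play these clips back to back as the merged cut

# Usage example: clips of the same actor from two dramas, spliced together.
example = merged_edit(
    'actor_a',
    [{'id': 'drama_1', 'role_map': {'actor_a': 'role_x'},
      'segments': {'role_x': [(10.0, 25.0)]}},
     {'id': 'drama_2', 'role_map': {'actor_a': 'role_y'},
      'segments': {'role_y': [(5.0, 12.0)]}}])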
  13. A data processing apparatus, comprising:
    a picture information obtaining module, configured to identify picture feature information from video frames of multimedia data, wherein the picture feature information comprises M business objects to which character pictures in the video frames belong, and M is a positive integer;
    a clustering processing module, configured to locate and separate audio frames containing human voice from original audio frames of the multimedia data to obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames respectively, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, wherein N is a positive integer and one audio cluster corresponds to one business object;
    an audio role recognition module, configured to identify, based on the picture feature information, the M audio clusters, and an object-role mapping table associated with the multimedia data, a business role corresponding to each of P audio clusters, wherein P is a positive integer less than or equal to M, the object-role mapping table comprises business roles that have a mapping relationship with listed business objects, and there are P overlapping business objects between the listed business objects and the M business objects.
  14. A computer device, comprising: a processor and a memory;
    wherein the processor is connected to the memory, the memory is configured to store a computer program, and the processor is configured to invoke the computer program, so that the computer device performs the method according to any one of claims 1 to 12.
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program is adapted to be loaded and executed by a processor, so that a computer device having the processor performs the method according to any one of claims 1 to 12.
PCT/CN2023/087208 2022-04-13 2023-04-10 Data processing method and apparatus, and computer device and storage medium WO2023197979A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210383918.3 2022-04-13
CN202210383918.3A CN114465737B (en) 2022-04-13 2022-04-13 Data processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023197979A1 true WO2023197979A1 (en) 2023-10-19

Family

ID=81418551

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/087208 WO2023197979A1 (en) 2022-04-13 2023-04-10 Data processing method and apparatus, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN114465737B (en)
WO (1) WO2023197979A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114465737B (en) * 2022-04-13 2022-06-24 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium
CN114895817B (en) * 2022-05-24 2023-08-04 北京百度网讯科技有限公司 Interactive information processing method, network model training method and device
CN115083435B (en) * 2022-07-28 2022-11-04 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium
CN115033734B (en) * 2022-08-11 2022-11-11 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium
CN116597828B (en) * 2023-07-06 2023-10-03 腾讯科技(深圳)有限公司 Model determination method, model application method and related device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108337532A (en) * 2018-02-13 2018-07-27 腾讯科技(深圳)有限公司 Perform mask method, video broadcasting method, the apparatus and system of segment
WO2020010338A1 (en) * 2018-07-05 2020-01-09 Dts, Inc. Hybrid audio synthesis using neural networks
WO2020119508A1 (en) * 2018-12-14 2020-06-18 深圳壹账通智能科技有限公司 Video cutting method and apparatus, computer device and storage medium
CN111400543A (en) * 2020-03-20 2020-07-10 腾讯科技(深圳)有限公司 Audio segment matching method, device, equipment and storage medium
WO2021073416A1 (en) * 2019-10-18 2021-04-22 平安科技(深圳)有限公司 Method for generating virtual character video on the basis of neural network, and related device
CN113573161A (en) * 2021-09-22 2021-10-29 腾讯科技(深圳)有限公司 Multimedia data processing method, device, equipment and storage medium
CN114465737A (en) * 2022-04-13 2022-05-10 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0406504D0 (en) * 2004-03-23 2004-04-28 British Telecomm Method and system for detecting audio and video scene changes
CN102521340B (en) * 2011-12-08 2014-09-03 中国科学院自动化研究所 Method for analyzing TV video based on role
US9047376B2 (en) * 2012-05-01 2015-06-02 Hulu, LLC Augmenting video with facial recognition
US10185917B2 (en) * 2013-01-31 2019-01-22 Lf Technology Development Corporation Limited Computer-aided decision systems
CN106683661B (en) * 2015-11-05 2021-02-05 阿里巴巴集团控股有限公司 Role separation method and device based on voice
CN106021496A (en) * 2016-05-19 2016-10-12 海信集团有限公司 Video search method and video search device
US11417343B2 (en) * 2017-05-24 2022-08-16 Zoominfo Converse Llc Automatic speaker identification in calls using multiple speaker-identification parameters
CN109376603A (en) * 2018-09-25 2019-02-22 北京周同科技有限公司 A kind of video frequency identifying method, device, computer equipment and storage medium
CN110166818B (en) * 2018-11-30 2021-08-17 腾讯科技(深圳)有限公司 Method for generating audio/video to be matched, computer equipment and storage medium
CN110691258A (en) * 2019-10-30 2020-01-14 中央电视台 Program material manufacturing method and device, computer storage medium and electronic equipment
CN111462758A (en) * 2020-03-02 2020-07-28 深圳壹账通智能科技有限公司 Method, device and equipment for intelligent conference role classification and storage medium
CN113744742B (en) * 2020-05-29 2024-01-30 中国电信股份有限公司 Role identification method, device and system under dialogue scene
CN112565825B (en) * 2020-12-02 2022-05-13 腾讯科技(深圳)有限公司 Video data processing method, device, equipment and medium
CN113192516A (en) * 2021-04-22 2021-07-30 平安科技(深圳)有限公司 Voice role segmentation method and device, computer equipment and storage medium
CN113157965B (en) * 2021-05-07 2022-05-20 杭州网易云音乐科技有限公司 Audio visual model training and audio visual method, device and equipment
CN113327628B (en) * 2021-05-27 2023-12-22 抖音视界有限公司 Audio processing method, device, readable medium and electronic equipment
CN113822142A (en) * 2021-07-28 2021-12-21 腾讯科技(深圳)有限公司 Role recognition method and device, computer equipment and storage medium
CN113808578B (en) * 2021-11-16 2022-04-15 阿里巴巴达摩院(杭州)科技有限公司 Audio signal processing method, device, equipment and storage medium
CN113923521B (en) * 2021-12-14 2022-03-08 深圳市大头兄弟科技有限公司 Video scripting method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108337532A (en) * 2018-02-13 2018-07-27 腾讯科技(深圳)有限公司 Perform mask method, video broadcasting method, the apparatus and system of segment
WO2020010338A1 (en) * 2018-07-05 2020-01-09 Dts, Inc. Hybrid audio synthesis using neural networks
WO2020119508A1 (en) * 2018-12-14 2020-06-18 深圳壹账通智能科技有限公司 Video cutting method and apparatus, computer device and storage medium
WO2021073416A1 (en) * 2019-10-18 2021-04-22 平安科技(深圳)有限公司 Method for generating virtual character video on the basis of neural network, and related device
CN111400543A (en) * 2020-03-20 2020-07-10 腾讯科技(深圳)有限公司 Audio segment matching method, device, equipment and storage medium
CN113573161A (en) * 2021-09-22 2021-10-29 腾讯科技(深圳)有限公司 Multimedia data processing method, device, equipment and storage medium
CN114465737A (en) * 2022-04-13 2022-05-10 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN114465737A (en) 2022-05-10
CN114465737B (en) 2022-06-24

Similar Documents

Publication Publication Date Title
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
CN109117777B (en) Method and device for generating information
KR102148392B1 (en) Video metadata tagging system and method thereof
CN113709561B (en) Video editing method, device, equipment and storage medium
CN108307229B (en) Video and audio data processing method and device
JP2022020647A (en) Video processing method, apparatus, electronic device, storage medium, and program
CN111681678B (en) Method, system, device and storage medium for automatically generating sound effects and matching videos
CN113395578A (en) Method, device and equipment for extracting video theme text and storage medium
CN112738557A (en) Video processing method and device
WO2023197749A9 (en) Background music insertion time point determining method and apparatus, device, and storage medium
CN110750996A (en) Multimedia information generation method and device and readable storage medium
WO2023071578A1 (en) Text-voice alignment method and apparatus, device and medium
Soares et al. An optimization model for temporal video lecture segmentation using word2vec and acoustic features
CN113923521B (en) Video scripting method
CN111488813A (en) Video emotion marking method and device, electronic equipment and storage medium
CN113936236A (en) Video entity relationship and interaction identification method based on multi-modal characteristics
CN113301382A (en) Video processing method, device, medium, and program product
WO2023142590A1 (en) Sign language video generation method and apparatus, computer device, and storage medium
CN116958342A (en) Method for generating actions of virtual image, method and device for constructing action library
CN116708055A (en) Intelligent multimedia audiovisual image processing method, system and storage medium
CN113642536B (en) Data processing method, computer device and readable storage medium
CN111681680B (en) Method, system, device and readable storage medium for acquiring audio frequency by video recognition object
CN111681676B (en) Method, system, device and readable storage medium for constructing audio frequency by video object identification
CN113762056A (en) Singing video recognition method, device, equipment and storage medium
CN114495946A (en) Voiceprint clustering method, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23787623

Country of ref document: EP

Kind code of ref document: A1