WO2023197979A1 - Data processing method and apparatus, computer device, and storage medium - Google Patents


Info

Publication number
WO2023197979A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
business
multimedia data
picture
frames
Prior art date
Application number
PCT/CN2023/087208
Other languages
English (en)
Chinese (zh)
Inventor
冯鑫
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2023197979A1


Classifications

    • H: ELECTRICITY
      • H04: ELECTRIC COMMUNICATION TECHNIQUE
        • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
          • H04L 9/00: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
            • H04L 9/32: including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
              • H04L 9/3236: using cryptographic hash functions
              • H04L 9/3263: involving certificates, e.g. public key certificate [PKC] or attribute certificate [AC]; Public key infrastructure [PKI] arrangements
                • H04L 9/3265: using certificate chains, trees or paths; Hierarchical trust model
          • H04L 12/00: Data switching networks
            • H04L 12/66: Arrangements for connecting between networks having differing types of switching systems, e.g. gateways
          • H04L 63/00: Network architectures or network communication protocols for network security
            • H04L 63/04: for providing a confidential data exchange among entities communicating through data packet networks
              • H04L 63/0428: wherein the data content is protected, e.g. by encrypting or encapsulating the payload
            • H04L 63/08: for authentication of entities
              • H04L 63/0823: using certificates
          • H04L 67/00: Network arrangements or protocols for supporting network services or applications
            • H04L 67/01: Protocols
              • H04L 67/10: Protocols in which an application is distributed across nodes in the network
            • H04L 67/14: Session management
              • H04L 67/141: Setup of application sessions

Definitions

  • the present application relates to the field of computer technology, and in particular, to a data processing method, device, computer equipment and storage medium.
  • Embodiments of the present application provide a data processing method, device, computer equipment and storage medium, which can improve the accuracy, efficiency and applicability of audio character recognition.
  • embodiments of the present application provide a data processing method, including:
  • identifying picture feature information from video frames of multimedia data, where the picture feature information includes M business objects to which the character pictures in the video frames belong, and M is a positive integer;
  • locating and separating audio frames containing human voices from the original audio frames of the multimedia data to obtain N object audio frames, extracting corresponding audio semantic feature vectors from the N object audio frames, and clustering the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, where N is a positive integer and one audio cluster corresponds to one business object; and
  • identifying, based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data, the business role corresponding to each of the P audio clusters, where P is a positive integer less than or equal to M.
  • the object role mapping table includes business roles that have a mapping relationship with the list business object. There are P overlapping business objects between the list business object and the M business objects.
  • a data processing device including:
  • the picture information acquisition module is used to identify picture feature information from the video frame of the multimedia data.
  • the picture feature information includes M business objects to which the character pictures in the video frame belong, and M is a positive integer;
  • the clustering processing module is used to locate and separate audio frames containing human voices from the original audio frames of the multimedia data to obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, where N is a positive integer and one audio cluster corresponds to one business object;
  • the audio role recognition module is used to identify the business role corresponding to each of the P audio clusters based on the picture feature information, M audio clusters and the object role mapping table associated with the multimedia data.
  • P is a positive integer less than or equal to M
  • the object role mapping table includes business roles that have a mapping relationship with the list business object; there are P overlapping business objects between the list business object and the M business objects.
  • embodiments of the present application provide a computer device, including: a processor and a memory;
  • the processor is connected to a memory, where the memory is used to store a computer program.
  • when the computer program is executed by the processor, the computer device executes the method provided by the embodiments of the present application.
  • embodiments of the present application provide a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • the computer program is adapted to be loaded and executed by a processor, so that a computer device having the processor executes the method provided by the embodiments of the present application.
  • embodiments of the present application provide a computer program product.
  • the computer program product includes a computer program.
  • the computer program is stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer program from the computer-readable storage medium.
  • the processor executes the computer program, so that the computer device executes the method in the embodiment of the present application.
  • a computer device with an audio character recognition function can associate sounds with characters by combining the picture feature information automatically recognized from video frames with the M adaptively clustered audio clusters, so that the business roles corresponding to the P audio clusters associated with the object role mapping table can be accurately identified.
  • This audio role identification method does not require manual annotation of the business role to which each audio line belongs; it not only reduces the consumption of manpower and time, but also mitigates recognition errors caused by similar timbres, thereby improving the accuracy and efficiency of recognition.
  • in addition, the embodiments of the present application adopt audio semantic feature clustering in the audio character recognition process, making the entire audio character recognition system more versatile and applicable to business objects in different multimedia data and different scenarios, thereby effectively improving the applicability of recognition.
  • Figure 1 is a schematic structural diagram of a network architecture provided by an embodiment of the present application.
  • Figure 2 is a schematic flow diagram of a system for audio character recognition provided by an embodiment of the present application
  • Figure 3 is a schematic flow chart of a data processing method provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram of an architecture for obtaining image feature information from video frames according to an embodiment of the present application
  • Figure 5 is a model architecture diagram of a key part detection model provided by an embodiment of the present application.
  • Figure 6 is a schematic architectural diagram of an audio semantic feature clustering provided by an embodiment of the present application.
  • Figure 7 is a model architecture diagram of a source separation model provided by an embodiment of the present application.
  • Figure 8 is a schematic diagram of the model architecture of an audio semantic feature extraction model provided by an embodiment of the present application.
  • Figure 9 is a schematic diagram of a scene for audio character recognition provided by an embodiment of the present application.
  • Figure 10 is another schematic flowchart of a data processing method provided by an embodiment of the present application.
  • Figure 11 is a schematic diagram of a scene for displaying multimedia segment data provided by an embodiment of the present application.
  • Figure 12 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • Figure 13 is another structural schematic diagram of a data processing device provided by an embodiment of the present application.
  • Figure 14 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • the embodiment of the present application provides a character recognition method based on audio semantic feature clustering, which can be applied to the field of artificial intelligence.
  • Artificial Intelligence is the theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
  • artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive subject that covers a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and other technologies.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, autonomous driving, smart transportation and other major directions.
  • Computer Vision is a science that studies how to make machines "see". More specifically, it refers to machine vision such as using cameras and computers instead of human eyes to identify and measure targets, and further processing the captured images so that they become images more suitable for human observation or for transmission to instruments for detection.
  • computer vision studies related theories and technologies trying to build artificial intelligence systems that can obtain information from images or multi-dimensional data.
  • Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving, smart transportation and other technologies, as well as common biometric identification technologies such as facial recognition and fingerprint recognition.
  • the key technologies of speech technology include automatic speech recognition technology, speech synthesis technology and voiceprint recognition technology. Allowing computers to hear, see, speak, and feel is the future development direction of human-computer interaction, among which voice has become one of the most promising human-computer interaction methods in the future.
  • Natural Language Processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field will involve natural language, that is, the language that people use every day, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, robot question answering, knowledge graph and other technologies.
  • Machine Learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent. Its applications cover all fields of artificial intelligence.
  • Machine learning and deep learning often include artificial neural networks, Belief network, reinforcement learning, transfer learning, inductive learning, teaching learning and other technologies.
  • Figure 1 is a schematic structural diagram of a network architecture provided by an embodiment of the present application.
  • the network architecture may include a server 10F and a terminal device cluster.
  • the terminal device cluster may include one or more terminal devices, and there will be no limit on the number of terminal devices here.
  • the terminal device cluster may specifically include terminal devices 100a, terminal devices 100b, terminal devices 100c,..., terminal devices 100n.
  • the terminal device 100a, the terminal device 100b, the terminal device 100c, ..., the terminal device 100n can each have a network connection with the above-mentioned server 10F, so that each terminal device can perform data interaction with the server 10F through the network connection.
  • the network connection here is not limited to a connection method. It can be connected directly or indirectly through wired communication, or directly or indirectly through wireless communication. It can also be connected through other methods. This application does not limit it here.
  • Each terminal device in the terminal device cluster may include: smart phones, tablets, laptops, desktop computers, smart speakers, smart watches, vehicle-mounted terminals, smart TVs and other smart terminals with audio role recognition functions.
  • Each terminal device in the terminal device cluster as shown in Figure 1 can be installed with a target application (for example, a client). When the client is running in each terminal device, it can perform data interaction with the server 10F shown in FIG. 1 .
  • the client may include a social client, a multimedia client (for example, a video client), an entertainment client (for example, a game client), an information flow client, an education client, a live broadcast client, and other clients.
  • the client can be an independent client or an embedded sub-client integrated in a certain client (for example, a social client, an education client, a multimedia client, etc.), which is not limited here.
  • the server 10F in the embodiment of the present application can be the server corresponding to the client.
  • the server 10F can be an independent physical server, a server cluster or a distributed system composed of multiple physical servers, or a cloud server that provides cloud computing services.
  • the embodiment of this application will not limit the number of servers.
  • one terminal device may be selected as the target terminal device among the multiple terminal devices shown in FIG. 1 .
  • the terminal device 100a shown in FIG. 1 can be used as a target terminal device, and the target terminal device can be integrated with a target application (for example, a client).
  • the target terminal device can realize data interaction with the server 10F through the business data platform corresponding to the client.
  • the client here may have a frame sequence (for example, frame animation sequence) loading and playback function, which is used to play the video frames, audio frames and text (for example, lines) of the multimedia data in the service playback display interface provided by the client.
  • the service playback display interface here refers to the interface displayed by the terminal device for playing the multimedia data.
  • the data type of the multimedia data may include film and television drama types, animation types, variety show types, etc. The data type of multimedia data will not be limited here.
  • when a computer device with an audio character recognition function obtains multimedia data (for example, TV series A), it can identify picture feature information from video frames of the multimedia data.
  • the picture feature information here may include M business objects to which the character pictures in the video frames belong, where M is a positive integer.
  • the picture feature information may indicate which actor plays the character in a certain character picture including key parts of the character (for example, the character's face) in the TV series A.
  • the computer device can also extract corresponding audio semantic feature vectors from the N object audio frames, and then perform clustering processing on the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters.
  • N is a positive integer
  • the N object audio frames here are obtained by the computer device locating and separating the audio frames containing human voices from the original audio frames in the multimedia data.
  • the computer device performs object positioning and separation processing on the original audio frame in order to reduce the interference caused by the silent frames in the environmental audio track and the object audio track (for example, the vocal track) in subsequent clustering processing, so as to improve the clustering accuracy, thereby improving the accuracy of character voice recognition.
  • the computer device can identify, based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data, the business role corresponding to each of the P audio clusters.
  • P can be a positive integer less than or equal to M.
  • the object role mapping table here (for example, the cast list of TV series A) may include business roles (roles) that have a mapping relationship with the list business objects (actors). There are P overlapping business objects between the list business objects in the object role mapping table and the M business objects recognized by the computer.
  • the object role mapping table may be the initial object role mapping table provided by the business editor of the multimedia data (for example, the editing user of TV series A) and acquired by the computer device, or it may be a table obtained after a target user accessing the client updates the initial object role mapping table provided by the business editor; this is not limited here.
  • the target user can add a mapping relationship between a certain business role in TV series A (for example, waiter in a restaurant) and a certain business object (for example, actor 1) in the initial object role mapping table, that is, the waiter in the restaurant is played by actor 1.
  • the computer device in the embodiment of the present application can perform associated recognition of sounds and characters by combining the picture feature information (for example, face information) automatically recognized from the video frames with the M adaptively clustered audio clusters, so that the business roles corresponding to the P audio clusters associated with the object role mapping table can be accurately identified.
  • This audio role recognition method does not require manual annotation of the business role to which each audio line belongs. It not only reduces the manpower and time consumed, but also mitigates recognition errors caused by similar timbres, thereby improving the accuracy and efficiency of recognition.
  • the embodiments of the present application adopt audio semantic feature clustering in the audio character recognition process, making the entire audio character recognition system more versatile and applicable to business objects in different multimedia data and different scenarios, thereby effectively improving the applicability of recognition.
  • Figure 2 is a schematic flow chart of a system for audio character recognition provided by an embodiment of the present application.
  • the computer device in the embodiment of the present application may be a computer device with audio character recognition function.
  • the computer device may be any terminal device in the terminal device cluster shown in FIG. 1 , for example, the terminal device 100a, or may be the server 10F shown in FIG. 1 .
  • Computer equipment will not be limited here.
  • the audio character recognition system may include three modules, which may specifically include a first module 201 (for example, a key image recognition module), a second module 202 (for example, an audio semantic feature clustering module) and a third module 203 (for example, a character recognition module).
  • the multimedia data 20S in the embodiment of the present application may be multimedia data acquired by the computer device that requires audio character recognition.
  • the multimedia data 20S can be multimedia data corresponding to a certain episode in a certain TV series, multimedia data corresponding to a certain movie, or multimedia data corresponding to a certain variety show, which will not be discussed one by one here.
  • the multimedia data 20S is composed of video data including original video frames and audio data including original audio frames.
  • the computer device can obtain video frames from video data including raw video frames.
  • the video frame here may refer to a video frame sequence obtained by deleting the beginning and end of the original video frame in the video data.
  • the computer device can identify picture feature information from the video frames of the multimedia data 20S through the first module 201 shown in FIG. 2 .
  • the first module 201 may include a key part detection model 210w and a picture encoding model 220w.
  • the key part detection model 210w can be used to detect character pictures in video frames.
  • the character picture here refers to a picture including the key parts of the character (for example, the character's face).
  • the picture encoding model 220w can be used to encode each character cut picture in the character picture to obtain picture vector information corresponding to the character cut picture.
  • the computer device may also obtain the information vector database 200K shown in FIG. 2 from its internal memory or externally, for example.
  • the information vector database 200K can be an information index database established in advance by the computer device, based on a large amount of material data (for example, multimedia data of film and television drama types, variety show types, etc.), using the same key image recognition method; it is an information base specially used for key image recognition.
  • the information vector database 200K can be used to store object key information vectors respectively corresponding to Y candidate business objects.
  • the object key information vectors here may also be determined through the picture encoding model 220w, and Y is a positive integer greater than or equal to M.
  • the information vector database 200K may also include object information of each candidate business object, for example, the object attribute type of the candidate business object (including singing and dancing singers, modern idol dramas, ancient palace dramas, fairy tale dramas, war-themed dramas, etc. ).
  • the computer device can obtain the picture feature information shown in Figure 2 based on the information vector database 200K and the picture information vector output by the picture coding model 220w.
  • the computer device can also obtain audio clustering results associated with the N object audio frames in the multimedia data 20S through the second module 202 shown in FIG. 2 .
  • the N object audio frames here are obtained by subjecting the original audio frames in the multimedia data to object positioning and separation processing, and N is a positive integer.
  • the second module 202 here may include a source separation model 230w and an audio semantic feature extraction model 240w.
  • the source separation model 230w here can be used to perform source separation on the original audio frame to obtain the object sound segment (or object audio track) (for example, the vocal segment (or vocal track)) and the environmental sound segment (or ambient sound track) (e.g., background sound segment (or backing track)).
  • the audio semantic feature extraction model 240w here can be used to perform frame-level semantic feature extraction on each object audio frame when the N object audio frames in the object sound segment are obtained, so as to obtain the audio semantic feature vector corresponding to each object audio frame. Further, the computer device can perform clustering processing on the N audio semantic feature vectors to obtain M audio clusters, and these M audio clusters can then be used as the audio clustering result obtained by the second module 202. Among them, one audio cluster can correspond to one business object.
  • the computer device can identify the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table 200B associated with the multimedia data 20S shown in FIG. 2.
  • P is a positive integer less than or equal to M.
  • the object role mapping table 200B here may include business roles that have a mapping relationship with the list business objects. There are P overlapping business objects between the list business object and the M business objects.
  • the computer device can perform audio character recognition on the output information of the first two modules through the third module 203.
  • specifically, the computer device can determine the playback time of the video frames in the multimedia data 20S in which the P overlapping business objects appear (i.e., the second playback time), and the playback time of the audio frames contained in each audio cluster (i.e., the first playback time).
  • by comparing the two playback times, the computer device can determine the audio clusters corresponding to the P business objects, and further determine the business roles corresponding to each of the P audio clusters.
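  • as an illustration only, the following Python sketch shows one way such a comparison of playback times could be implemented; the data layout (lists of timestamps in seconds per cluster and per business object) and the tolerance-based overlap count are assumptions made for this sketch, not details fixed by the present application.

```python
# Hedged sketch: associate each audio cluster with the business object whose on-screen
# playback times (second playback time) best overlap the cluster's audio playback times
# (first playback time). Timestamps in seconds; tolerance and data layout are assumed.

def match_clusters_to_objects(cluster_times, object_times, tolerance=1.0):
    """cluster_times: {cluster_id: [t, ...]}, object_times: {business_object: [t, ...]}"""
    assignments = {}
    for cluster_id, audio_ts in cluster_times.items():
        best_object, best_hits = None, 0
        for business_object, frame_ts in object_times.items():
            # Count audio timestamps that fall near a video frame showing this object.
            hits = sum(any(abs(a - v) <= tolerance for v in frame_ts) for a in audio_ts)
            if hits > best_hits:
                best_object, best_hits = business_object, hits
        assignments[cluster_id] = best_object
    return assignments

# Example: cluster 0 speaks around the times "actor_a" is on screen.
clusters = {0: [12.0, 13.5, 40.2], 1: [75.0, 76.4]}
appearances = {"actor_a": [11.8, 13.6, 41.0], "actor_b": [74.9, 76.0]}
print(match_clusters_to_objects(clusters, appearances))  # {0: 'actor_a', 1: 'actor_b'}
```

  • the business object assigned to each cluster can then be looked up in the object role mapping table 200B to obtain the corresponding business role.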
  • the computer device in the embodiment of the present application can, in the third module 203, combine the picture feature information (for example, face information) output by the first module 201 with the audio clustering result output by the second module 202 to perform associated recognition of audio and business roles, so that the business roles respectively corresponding to the P audio clusters associated with the object role mapping table 200B can be accurately identified.
  • This audio character recognition method not only improves the accuracy and efficiency of recognition, but also improves the applicability of recognition.
  • for the specific implementation in which the computer device with the audio character recognition function combines the picture feature information (for example, face information) automatically recognized from the video frames of the multimedia data with the M adaptively clustered audio clusters to identify the business roles corresponding to the P audio clusters associated with the object role mapping table, please refer to the embodiments corresponding to Figures 3 to 11 below.
  • Figure 3 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • the method can be performed by a computer device with audio character recognition capabilities.
  • the computer device may be a terminal device (for example, any terminal device in the terminal device cluster shown in Figure 1 above, for example, the terminal device 100a), or it may be a server (for example, the server 10F shown in Figure 1 above), No limitation is made here.
  • in the embodiment of the present application, the method is described by taking execution by a server with the audio character recognition function as an example. The method may at least include the following steps S101 to S103:
  • Step S101: Identify picture feature information from video frames of multimedia data.
  • the picture feature information here may include M business objects to which the character pictures in the video frames belong, where M is a positive integer.
  • the computer device can obtain video frames from the multimedia data, and can then perform picture cutting processing on the key parts of the character in the video frame (cutting out the pictures containing the key parts of the character in the video frame) to obtain the character picture corresponding to the video frame.
  • the character pictures here may include X character cut pictures, where X is a positive integer greater than or equal to M.
  • the computer device can obtain the character cutting picture Ti among the X character cutting pictures, and encode the character cutting picture Ti to obtain the picture information vector Li corresponding to the character cutting picture Ti .
  • i is a positive integer less than or equal to X.
  • the computer device can determine the object key information vector matching the picture information vector Li from the information vector database associated with the candidate business objects, and use the candidate business object corresponding to the matched object key information vector as the business object corresponding to the character cutting picture Ti. Further, the computer device can determine the picture feature information corresponding to the video frame based on the business objects corresponding to the X character cutting pictures.
  • the picture recognition system used when the computer device detects and recognizes the key parts of the character in the video frame can be composed of a detection sub-module and a recognition sub-module, or it can be an integrated detection and recognition network that both detects and recognizes the key parts of the character; this is not limited here.
  • when determining the character picture corresponding to the video frame, the computer device can detect and locate the key parts of the character in the video frame, thereby determining the position information of the key parts of the character in the video frame. Further, the computer device can cut the key parts of the character in the video frame based on the position information, obtain X character cut pictures containing the key parts of the character, and use the X character cut pictures as the character pictures corresponding to the video frame. Then, the computer device can obtain the character cutting picture Ti among the X character cutting pictures, and encode the character cutting picture Ti to obtain the picture information vector Li corresponding to the character cutting picture Ti. Among them, i here is a positive integer less than or equal to X.
  • the computer device can obtain the information vector database associated with the candidate business object from its internal memory or externally to find the candidate business object that has a matching relationship with the picture information vector Li .
  • the information vector database here can be used to store object key information vectors corresponding to Y candidate business objects, where Y is a positive integer greater than or equal to M.
  • when the computer device obtains the information vector database, it can directly search the information vector database for candidate business objects that have a matching relationship with the picture information vector Li. Specifically, the computer device can respectively determine the vector distance between the picture information vector Li and each of the Y object key information vectors, obtaining Y vector distances. The computer device can then obtain, from the Y vector distances, the minimum vector distance that is less than or equal to the distance threshold, determine the candidate business object corresponding to the object key information vector associated with the minimum vector distance, and use the determined candidate business object as the business object corresponding to the character cutting picture Ti.
  • the distance threshold here is a value set in advance by the computer device to ensure that the found candidate business object indeed matches the character cut picture; it can be dynamically adjusted according to the actual situation and is not limited here.
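  • purely as an illustration, a minimal numpy sketch of this matching step is given below, assuming Euclidean distance, toy 4-dimensional vectors and an arbitrary distance threshold; the actual vector dimensionality (for example, 2048) and the threshold value are implementation choices of the system and are not fixed here.

```python
import numpy as np

# Hedged sketch: find the candidate business object whose object key information vector
# is closest to the picture information vector Li, subject to a distance threshold.

def match_business_object(picture_vec, key_vectors, distance_threshold):
    """key_vectors: {candidate_business_object: np.ndarray}; returns object name or None."""
    best_object, best_dist = None, float("inf")
    for candidate, key_vec in key_vectors.items():
        dist = np.linalg.norm(picture_vec - key_vec)   # one of the Y vector distances
        if dist < best_dist:
            best_object, best_dist = candidate, dist
    # Only accept the minimum distance if it does not exceed the preset threshold.
    return best_object if best_dist <= distance_threshold else None

# Toy example with 4-dimensional vectors (real picture information vectors are larger).
database = {"object_a": np.array([0.9, 0.1, 0.0, 0.2]),
            "object_b": np.array([0.0, 0.8, 0.7, 0.1])}
li = np.array([0.85, 0.15, 0.05, 0.25])
print(match_business_object(li, database, distance_threshold=0.5))  # object_a
```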
  • the computer device can obtain the object role mapping table associated with the multimedia data, and use the object role mapping table and the information vector database to find candidate business objects that have a matching relationship with the picture information vector Li .
  • Table 1 is an object role mapping table associated with multimedia data provided by an embodiment of the present application. As shown in Table 1:
  • the object role mapping table shown in Table 1 may include H business roles, where H is a positive integer greater than or equal to M.
  • both role 1 and role 2 may have a mapping relationship with the same business object (for example, object a). That is, both role 1 and role 2 are played by object a.
  • Role 3 has a mapping relationship with object b
  • role 4 has a mapping relationship with object c
  • role 5 has a mapping relationship with object d.
  • the computer device can select the object key information vector corresponding to the list business object in the object role mapping table from the information vector database according to the above Table 1, for example, the object key information vector of object a, the object key information vector of object b, and Object key information vector of object c. Further, the computer device can respectively determine the vector distance between the picture information vector Li and each of the selected three object key information vectors. Furthermore, the computer device can obtain the minimum vector distance that is less than or equal to the distance threshold from the three vector distances, determine the candidate business object corresponding to the object key information vector corresponding to the minimum vector distance, and use the determined candidate business object as the role cutting picture. The business object corresponding to Ti .
  • in this way, the computer device does not need to compute the vector distance to every object key information vector in the information vector database, but instead pre-selects candidates through the object role mapping table, which greatly reduces the matching time and thus improves the efficiency of finding candidate business objects with a matching relationship in the information vector database.
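  • the pre-selection described above could look like the following sketch, which reuses the hypothetical match_business_object helper from the previous example; the layout of the object role mapping table as a role-to-object dictionary is likewise an assumption made for illustration.

```python
# Hedged sketch: restrict matching to the list business objects named in the
# object role mapping table (cast list) instead of the full information vector database.

role_table = {"role_1": "object_a", "role_2": "object_a",
              "role_3": "object_b", "role_4": "object_c"}   # assumed layout of Table 1

def match_with_cast_list(picture_vec, info_vector_db, role_table, distance_threshold):
    cast = set(role_table.values())                          # list business objects
    candidate_vectors = {obj: vec for obj, vec in info_vector_db.items() if obj in cast}
    return match_business_object(picture_vec, candidate_vectors, distance_threshold)
```

  • restricting the search to the cast list shrinks the number of vector distances to compute from Y to the number of list business objects.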
  • FIG. 4 is a schematic diagram of an architecture for obtaining image feature information from video frames according to an embodiment of the present application.
  • the schematic architectural diagram in the embodiment of the present application may be the schematic architectural diagram corresponding to the first module 201 in the embodiment corresponding to FIG. 2 .
  • the video frame 4V shown in Figure 4 may be a video frame in multimedia data (for example, the multimedia data 20S in the embodiment corresponding to Figure 2 described above).
  • the key part detection model 410w shown in Figure 4 can be used to detect key parts in the video frame 4V.
  • the key part detection model 410w may be the key part detection model 210w in the embodiment corresponding to FIG. 2 mentioned above.
  • the picture coding model 420w may be the picture coding model 220w in the embodiment corresponding to FIG. 2 described above.
  • the information vector database 400K shown in Figure 4 may be the information vector database 200K in the embodiment corresponding to Figure 2 described above.
  • specifically, the video frame 4V can be input to the key part detection model 410w shown in Figure 4, and through the key part detection model 410w the key parts of the character in the video frame 4V (for example, the character's facial features) are detected and located, so as to determine the position information of the key parts of the character in the video frame 4V (for example, the facial feature position information marked in the area 40Q shown in Figure 4). Further, the computer device can cut the key parts of the character in the video frame 4V based on the position information marked in the area 40Q, and obtain the character cutting picture containing the key parts of the character (for example, the character cutting picture 400T shown in Figure 4).
  • the key part detection model 410w shown in Figure 4 may be a network structure used to detect and locate key parts of a character (for example, a character's face), for example, a face detection network (Multi-task Cascaded Convolutional Networks, MTCNN for short).
  • FIG. 5 is a model architecture diagram of a key part detection model provided by an embodiment of the present application.
  • the key part detection model in the embodiment of the present application may be the key part detection model 410w in the embodiment corresponding to Figure 4.
  • This key part detection model can be used to detect key parts in the video frame 5V shown in Figure 5, where the video frame 5V may be the video frame 4V in the embodiment corresponding to FIG. 4 mentioned above.
  • the key part detection model may include three network layers, which may specifically include a filtering network layer 5W 1 (for example, Proposal Network, P-Net for short), a fine-tuning network layer 5W 2 (for example, Refinement network, referred to as R-Net) and the output network layer 5W 3 (for example, Output network, referred to as O-Net).
  • the computer device in the embodiment of the present application can adjust the image size of the video frame 5V, so that the image pyramid corresponding to the video frame 5V can be obtained.
  • the computer device can obtain the resizing coefficient (for example, 0.7) from its internal memory or externally, and adjust the video frame 5V multiple times based on the resizing coefficient until the picture size of the adjusted video frame 5V matches the image size threshold associated with the filtering network layer 5W 1 (for example, 12*12*3).
  • the computer device can form a picture pyramid corresponding to the video frame 5V based on the video frames 5V with different picture sizes after multiple adjustments.
  • the size adjustment coefficient here may be dynamically set by the computer device according to the distribution of the key parts of the character in the video frame. If the size adjustment coefficient is set too large, it is easy to extend the time for detecting and locating the key parts of the character. If the size adjustment coefficient is set too small, the key parts of the character with a small distribution area in the video frame may be missed (for example, small and medium-sized faces). Based on this, the size adjustment coefficient in the embodiment of the present application can be set between 0.7-0.8.
  • the picture pyramid here may include the original picture (for example, the video frame 5V shown in Figure 5), the first adjusted picture (that is, the picture obtained by adjusting the picture size of the video frame 5V), the second adjusted picture (that is, the picture obtained by adjusting the picture size of the video frame 5V) The picture obtained by adjusting the picture size of the first adjusted picture), ..., and the Nth adjusted picture (that is, the picture obtained by adjusting the picture size of the N-1th adjusted picture).
  • the image size of the Nth adjusted image here may be the image size threshold associated with the filtering network layer 5W 1 (for example, 12*12).
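  • for illustration, the following Pillow-based sketch builds such a picture pyramid by repeatedly shrinking the frame with a resize coefficient of 0.7 until the 12-pixel input size is reached; stopping on the shorter side of the picture is an assumption made only for this sketch.

```python
from PIL import Image

# Hedged sketch: build the picture pyramid for the filtering network layer by repeatedly
# shrinking the frame with a resize coefficient (0.7-0.8) until the 12-pixel threshold.

def build_picture_pyramid(frame, resize_coefficient=0.7, min_size=12):
    pyramid = [frame]                         # the original picture (video frame 5V)
    width, height = frame.size
    while min(width, height) * resize_coefficient >= min_size:
        width = int(width * resize_coefficient)
        height = int(height * resize_coefficient)
        pyramid.append(frame.resize((width, height)))   # the next adjusted picture
    return pyramid

# Example: a 640x360 frame yields a pyramid whose smallest level is close to 12 pixels.
levels = build_picture_pyramid(Image.new("RGB", (640, 360)))
print([im.size for im in levels])
```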
  • the computer device can input the picture pyramid corresponding to the video frame 5V into the filtering network layer 5W 1 shown in Figure 5, so that a large number of candidate bounding boxes can be obtained.
  • in the embodiment of the present application, the picture obtained by cutting the video frame 5V according to the bounding box position information output by the filtering network layer 5W 1 is called the first cut picture.
  • the computer device can input the pictures in the picture pyramid to the filtering network layer 5W 1 to obtain the output features (m, n, 16).
  • m and n here can be used to characterize the length and width of the image, and 16 is the dimension of the channel.
  • the computer device can screen out a large portion of candidates, thereby obtaining one or more first candidates.
  • the computer device then calibrates the bounding box (bbox for short) based on the obtained four offsets, and obtains the position information of the calibrated bounding box (for example, the coordinate information of the upper left and lower right).
  • the computer device can screen these first candidates again according to the Intersection over Union (IoU), that is, by performing Non-Maximum Suppression (NMS) to filter the first candidates.
  • specifically, the computer device can sort the classification scores (for example, in descending order) to obtain a tensor of shape (num_left, 4), that is, the absolute upper-left and lower-right coordinates of num_left bboxes. Further, each time, the computer device can compute the IoU between the bounding box with the maximum score after sorting and the remaining bounding boxes, filter out those whose IoU is greater than the intersection-over-union threshold (for example, 0.6; the threshold is preset by the computer device), and move the box with the maximum score into the final result. In this embodiment of the present application, the above operation may be called a filtering operation.
  • the computer device repeats this filtering operation to filter out the many bounding boxes with large overlapping areas, and finally obtains (num_left_after_nms, 16) candidates. For these candidates, the video frame 5V needs to be cut according to the bounding box position information, and the cut pictures resized to 24*24, yielding the pictures to be input to the fine-tuning network layer 5W 2 shown in Figure 5 (i.e., the first cut pictures).
  • the first cut picture here may be a square cropped from the video frame 5V with side length equal to the longer side of the bounding box, thereby effectively ensuring that no deformation occurs during resizing and that more details of the key parts of the character are retained.
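  • the filtering operation (non-maximum suppression) described above can be sketched in a few lines of numpy; the [x1, y1, x2, y2] box layout and the example IoU threshold of 0.6 follow the description above, while everything else is illustrative.

```python
import numpy as np

# Hedged sketch of the filtering operation: keep the box with the highest classification
# score, discard remaining boxes whose IoU with it exceeds the threshold, and repeat.
# Boxes are [x1, y1, x2, y2] in absolute coordinates (upper-left, lower-right).

def non_maximum_suppression(boxes, scores, iou_threshold=0.6):
    order = np.argsort(scores)[::-1]          # descending classification scores
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        # Intersection of the best box with the remaining boxes.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        order = rest[iou <= iou_threshold]    # drop heavily overlapping candidates
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.95, 0.90, 0.80])
print(non_maximum_suppression(boxes, scores))  # keeps boxes 0 and 2
```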
  • further, the computer device can fine-tune the first cut picture through the fine-tuning network layer 5W 2 to obtain the second cut picture shown in Figure 5.
  • the fine-tuning network layer 5W 2 can output 2 values corresponding to the two-class one-hot classification, 4 values corresponding to the coordinate offsets of the bounding box, and 10 values corresponding to the landmark points (landmarks).
  • the fine-tuned network layer 5W 2 can filter out most candidates that do not include key parts of the character (for example, the character's face) according to the binary classification score. After adjusting the bounding box according to the offset, repeat the filtering operation in the above filtering network layer 5W 1 again to obtain (num_left_after_Rnet, 16) candidates.
  • the computer device can accurately output the position information of the character's key parts in the video frame 5V through the output network layer 5W 3 , including the coordinate information of the bounding box and the coordinate information of the landmark points.
  • in other words, in the output network layer 5W 3 , after classification screening, bounding box adjustment and NMS screening, the computer device outputs not only the coordinate information of the bounding box but also the coordinate information of the landmark points, thereby obtaining the position information of the key parts of the character in the video frame 5V, which is subsequently used to cut the key parts of the character out of the video frame 5V and obtain a picture containing the key parts of the character (for example, the character cutting picture 400T shown in Figure 4).
  • the computer device can input the character cutting picture 400T to the picture coding model 420w shown in FIG. 4.
  • the picture coding model 420w is a model based on Residual Network (Resnet).
  • This series of residual networks is widely used in fields such as object classification, and serves as a classic neural network backbone for computer vision tasks.
  • typical networks include Resnet50, Resnet101, etc.
  • the picture coding model 420w in the embodiment of this application may be a Resnet50 network model.
  • the Resnet50 network model can include 5 stages, which may specifically include the first stage (for example, Stage 0), the second stage (for example, Stage 1), the third stage (for example, Stage 2), the fourth stage (for example, Stage 3) and the fifth stage (for example, Stage 4).
  • the structure of Stage 0 is relatively simple. It can be regarded as the preprocessing of the character cutting image 400T.
  • the last four stages are all composed of bottleneck layers (Bottleneck), and the structures are relatively similar. Among them, Stage 1 can contain 3 Bottlenecks, Stage 2 can contain 4 Bottlenecks, Stage 3 can contain 6 Bottlenecks, and Stage 4 can contain 3 Bottlenecks.
  • the computer device inputs the character cutting picture 400T into the picture encoding model 420w.
  • the character cutting picture 400T can be converted into a picture information vector with 2048 dimensions.
  • the picture information vector can be used to represent the semantic feature information of the key parts of the character (for example, the face).
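  • a hedged PyTorch sketch of this encoding step is shown below: a Resnet50 backbone with its classification head replaced by an identity layer yields a 2048-dimensional picture information vector. The use of torchvision and ImageNet-style preprocessing is an assumption made only for illustration; the application trains and deploys its own picture encoding model.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Hedged sketch: encode a character cut picture into a 2048-dimensional picture
# information vector with a Resnet50 backbone (Stage 0 plus 3/4/6/3 bottleneck stages).
# The network is randomly initialized here; real systems would load trained weights.

encoder = models.resnet50()
encoder.fc = torch.nn.Identity()             # drop the classifier, keep the 2048-d features
encoder.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

character_cut_picture = Image.new("RGB", (160, 160))        # stand-in for picture 400T
with torch.no_grad():
    picture_information_vector = encoder(preprocess(character_cut_picture).unsqueeze(0))
print(picture_information_vector.shape)       # torch.Size([1, 2048])
```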
  • the computer device may obtain the information vector database 400K associated with the candidate business object shown in FIG. 4 .
  • the information vector database 400K here can be used to store object key information vectors corresponding to Y candidate business objects, where Y is a positive integer greater than or equal to M.
  • each object key information vector in the information vector database 400K can be extracted by the computer device using the same encoding processing method as for the character cutting picture 400T.
  • an object key information vector can be used to represent the key part identification (for example, a Face ID) corresponding to a candidate business object.
  • the computer device can respectively determine the vector distance between the picture information vector corresponding to the character cutting picture 400T and each of the Y object key information vectors, thereby obtaining Y vector distances.
  • the computer device can set a distance threshold in advance. If the minimum vector distance determined by the computer device is greater than the distance threshold, it can be considered that the computer device has not matched an object key information vector corresponding to the character cutting picture 400T in the information vector database 400K, that is, it has not matched a business object corresponding to the character cutting picture 400T. If the minimum vector distance determined by the computer device is less than or equal to the distance threshold, it can be considered that the computer device has matched the object key information vector corresponding to the character cutting picture 400T in the information vector database 400K, that is, it has successfully matched the business object corresponding to the character cutting picture 400T.
  • when the computer device obtains the minimum vector distance that is less than or equal to the distance threshold from the Y vector distances, it can determine the candidate business object corresponding to the object key information vector associated with the minimum vector distance, and the determined candidate business object can then be used as the business object corresponding to the character cutting picture 400T.
  • when the computer device performs image recognition on each video frame in the multimedia data, it can refer to the specific implementation of key part recognition of the video frame 5V shown in Figure 5 to obtain the X character cut pictures containing the key parts of the character, which will not be described further here. Wherein, if a video frame includes key parts of multiple different characters, the computer device can cut out a corresponding number of pictures of the characters' key parts from the video frame.
  • further, the computer device can refer to the specific implementation of object matching for the character cutting picture 400T in the embodiment corresponding to Figure 4, perform object matching on each of the X character cutting pictures, and then determine the picture feature information corresponding to the video frames in the multimedia data based on the business objects corresponding to the obtained character cut pictures.
  • Step S102: Locate and separate audio frames containing human voices from the original audio frames of the multimedia data to obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters.
  • N object audio frames are obtained after the computer device performs object positioning and separation processing on the original audio frames in the multimedia data, where N is a positive integer.
  • An audio cluster can correspond to a business object.
  • the computer device can obtain original audio frames from multimedia data, and can then perform object positioning and separation processing on the original audio frames to obtain N object audio frames.
  • the computer device can perform semantic feature extraction on each of the N object audio frames to obtain an audio semantic feature vector corresponding to each object audio frame.
  • further, the computer device can determine M as the number of cluster centers for clustering, and, based on this number of cluster centers, perform clustering processing on the audio semantic feature vector corresponding to each acquired object audio frame, so that M audio clusters can be obtained. Audio semantic features can be understood as characteristics of the speaker's voiceprint.
  • the embodiment of the present application innovatively uses the number M of business objects indicated by the picture feature information as the number of cluster centers.
  • using picture feature information as prior knowledge in this way lets the system know the number of business objects in the multimedia data, thereby giving audio clustering an a priori setting for the cluster centers. Automatically setting the number of cluster centers improves the convergence speed of the entire system and the overall recognition performance, and saves computing resources.
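  • as a minimal illustration of this step, the scikit-learn sketch below clusters random stand-in embeddings into M clusters, with M taken from the picture feature information; K-means is only one possible choice of clustering algorithm and is not mandated by the present application.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hedged sketch: cluster the N audio semantic feature vectors into M clusters, where M
# (the number of business objects found in the video frames) sets the cluster-center count.

rng = np.random.default_rng(0)
M = 3                                   # business objects indicated by picture feature info
N = 120                                 # object audio frames containing human voices
audio_semantic_vectors = rng.normal(size=(N, 256))   # stand-ins for real embeddings

kmeans = KMeans(n_clusters=M, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(audio_semantic_vectors)

# Each audio cluster groups the audio frames of (ideally) one business object.
audio_clusters = {m: np.where(cluster_labels == m)[0] for m in range(M)}
print({m: len(frames) for m, frames in audio_clusters.items()})
```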
  • FIG. 6 is a schematic architectural diagram of audio semantic feature clustering provided by an embodiment of the present application.
  • the schematic architectural diagram in the embodiment of the present application may be the schematic architectural diagram corresponding to the second module 202 in the embodiment corresponding to FIG. 2 .
  • the original audio frame shown in FIG. 6 may be an original audio frame in multimedia data (for example, the multimedia data 20S in the embodiment corresponding to FIG. 2 mentioned above).
  • the source separation model 630w shown in Figure 6 can be used to perform source separation on the original audio frame.
  • the information source separation model 630w may be the information source separation model 230w in the embodiment corresponding to FIG. 2 described above.
  • the audio semantic feature extraction model 640w shown in Figure 6 can be used to extract semantic features for each object audio frame.
  • the audio semantic feature extraction model 640w may be the audio semantic feature extraction model 240w in the embodiment corresponding to FIG. 2 described above.
  • the architectural schematic diagram in the embodiment of the present application may include three nodes, namely an audio paragraph cutting node, an audio semantic feature extraction node, and a clustering processing node.
• when the computer device is at the audio segment cutting node, it can obtain the original audio frame from the multimedia data and perform source separation on the original audio frame, thereby obtaining the audio frame to be processed that contains the business object's voice. Further, the computer device can locate and cut the non-silent segments of the audio impact signal frames in the audio frame to be processed based on an audio boundary detection strategy for eliminating silent frames, so that N object audio frames can be obtained.
• source separation refers to separating a mixed audio signal containing multiple audio signals through signal processing or other algorithms, extracting a specified type of audio signal sequence from the mixed signal, and finally generating a separate audio file.
• the audio frame to be processed for the business object (that is, the object segment) is extracted from the original audio frame.
• the source separation model 630w can be used to perform source separation on the original audio frame to obtain the object segment (or object track) and the ambience segment (or ambience track). Since there may be a large number of silent segments in the object segment, and these silent segments would interfere with the audio clustering results of the subsequent clustering processing and waste computing resources, the computer device can at this time take the object segment as the audio frame to be processed for the business object. The computer device can then obtain the audio boundary detection strategy.
  • the audio boundary detection strategy here can be the VAD (Voice Activity Detection) algorithm.
  • the VAD algorithm here can be widely used in speech coding, noise reduction and ASR scenarios.
• a VAD system can usually include two parts: feature extraction and a speech/non-speech decision. Further, based on the audio boundary detection strategy, the computer device can locate and cut the audio impact signal frames in the audio frame to be processed, that is, accurately locate the non-silent segments, so that the N object audio frames shown in Figure 6 can be obtained, where N is a positive integer.
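• As a rough illustration of how an audio boundary detection strategy can locate non-silent segments, the sketch below uses a simple frame-energy decision. A production VAD (feature extraction plus a speech/non-speech classifier, as described above) would be more elaborate; the frame length, hop and threshold here are illustrative assumptions.

```python
import numpy as np

def locate_non_silent_segments(samples, sample_rate, frame_ms=30, energy_threshold=1e-4):
    """Return (start, end) sample indices of non-silent segments in a mono signal.

    A frame is kept when its mean energy exceeds energy_threshold; consecutive
    kept frames are merged into one segment. This stands in for the VAD-based
    boundary detection used to cut the N object audio frames.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    segments, start = [], None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len].astype(np.float64)
        is_speech = float(np.mean(frame ** 2)) > energy_threshold
        if is_speech and start is None:
            start = i                      # a non-silent segment begins
        elif not is_speech and start is not None:
            segments.append((start, i))    # segment ends at a silent frame
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments
```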
  • the computer device may input the N object audio frames to the audio semantic feature extraction model 640w shown in FIG. 6 .
• the audio semantic feature extraction model 640w can be an audio neural network (for example, a PANNs network) trained on a large audio data set, which is usually used for audio pattern recognition or audio frame-level embedding and serves as the front-end encoding network of many models.
• the computer device can extract semantic features for each of the N object audio frames through the audio semantic feature extraction model 640w, and obtain the audio semantic feature vector corresponding to each object audio frame. As shown in Figure 6, these may specifically include audio semantic feature vector 1, audio semantic feature vector 2, ..., and audio semantic feature vector N.
• the clustering strategy used for the clustering processing in the embodiment of the present application may be a k-means clustering algorithm (k-means for short).
  • the k-means clustering algorithm is an iterative clustering analysis algorithm.
• the computer device may divide the N audio semantic feature vectors into M initial clusters in advance. Specifically, the computer device can randomly select M audio semantic feature vectors as the initial cluster centers of the M initial clusters. Then, for each audio semantic feature vector in the audio semantic feature vector set other than the M vectors selected as cluster centers (i.e., each vector to be attributed), the computer device may determine the vector distance between that vector to be attributed and the cluster center of each initial cluster, and assign the vector to be attributed to the initial cluster with the minimum vector distance. At this time, the computer device can update the cluster centers of the divided initial clusters. By iterating in this way, the computer device can determine the M audio clusters shown in FIG. 6, which may specifically include audio cluster C 1 , audio cluster C 2 , ..., and audio cluster C M .
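• A minimal sketch of this clustering step follows, with the number of cluster centers set directly to the number M of business objects taken from the picture feature information. The embodiment describes the k-means iteration explicitly; the scikit-learn call below is a stand-in implementation of the same algorithm, and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_audio_semantic_vectors(audio_vectors, num_business_objects):
    """Cluster N audio semantic feature vectors into M audio clusters.

    audio_vectors:        (N, D) array, one semantic vector per object audio frame.
    num_business_objects: M, taken from the picture feature information and used
                          directly as the number of cluster centers.
    Returns a list of M clusters, each a list of object-audio-frame indices.
    """
    kmeans = KMeans(n_clusters=num_business_objects, n_init=10, random_state=0)
    labels = kmeans.fit_predict(np.asarray(audio_vectors))
    clusters = [[] for _ in range(num_business_objects)]
    for frame_index, label in enumerate(labels):
        clusters[label].append(frame_index)
    return clusters
```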
  • the embodiment of this application uses the audio semantic feature clustering method to classify N audio semantic feature vectors instead of training voiceprint classification through neural networks, thereby getting rid of the dependence on the actor's voiceprint ID and avoiding privacy violations.
• the embodiments of this application can directly use the object audio frames in the multimedia data to extract the audio semantic feature vector corresponding to each object audio frame. This is deeply decoupled from the personal voiceprint ID of the business object and is instead correlated with the voice pattern information of the character itself, so that business characters voiced by professional voice actors can also be identified. That is to say, the embodiment of the present application can still accurately identify line-character information even when the business character is not dubbed by the business object himself, thereby improving the accuracy of audio character recognition.
• the embodiment of the present application uses the audio semantic feature clustering method to cluster the N audio semantic feature vectors for audio character recognition, which makes the entire audio character recognition system portable and versatile, so that it can be applied to different scenarios of business objects in different multimedia data, thus effectively improving the applicability of recognition.
  • FIG. 7 is a model architecture diagram of a source separation model provided by an embodiment of the present application.
  • the information source separation model in the embodiment of the present application may be the information source separation model 630w in the embodiment corresponding to Figure 6.
  • the source separation model may include a split network layer 7W 1 (ie, a first split network layer, for example, VACAL-Unet) and a split network layer 7W 2 (ie, a second split network layer, for example, BGM-Unet).
  • Unet is one of the algorithms that uses a fully convolutional network for semantic segmentation, using a symmetric U-shaped structure containing a compression path and an expansion path.
• a typical feature of the Unet network is its U-shaped symmetrical structure, which can contain 4 convolutional layers and 4 corresponding upsampling layers. Therefore, when implementing it, one can either build the network from scratch, initialize the weights and then train the model, or borrow the convolutional layer structure of existing networks together with the corresponding trained weight files, add the subsequent upsampling layers, and perform the training. Since trained weight files can be reused in deep learning model training, the speed of Unet training is greatly accelerated.
  • Another feature is that the feature map obtained by each convolutional layer of the Unet network will be connected to the corresponding upsampling layer, so that the feature map of each layer can be effectively used in subsequent calculations, that is, skip connection (skip-connection). It can effectively solve the problem of gradient dissipation and improve the efficiency of model training.
• Unet avoids performing supervision and loss calculation on high-level feature maps alone and instead combines them with the features in low-level feature maps, so that the finally obtained feature map contains both first-level features (i.e., high-level features) and many second-level features (i.e., low-level features), achieving feature fusion at different levels and thereby improving the accuracy of the model's results.
• when the computer device inputs the original audio frame into the source separation model, it can generate the spectrum amplitude spectrum corresponding to the original audio frame through the source separation model shown in Figure 7. For example, the computer device can perform spectrum conversion on the audio track of the original audio frame to obtain the audio track spectrum corresponding to the original audio frame, and can then generate the spectrum amplitude spectrum corresponding to the original audio frame by eliminating the phase of the audio track spectrum.
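• The spectrum amplitude spectrum referred to here can be obtained by a short-time Fourier transform whose phase is discarded. The sketch below uses scipy for illustration; the window and FFT sizes are assumptions not fixed by the embodiment.

```python
import numpy as np
from scipy.signal import stft

def spectrum_amplitude_spectrum(samples, sample_rate, n_fft=1024, hop=256):
    """Spectrum-convert an audio track and eliminate phase, keeping only magnitude."""
    _, _, track_spectrum = stft(samples, fs=sample_rate, nperseg=n_fft,
                                noverlap=n_fft - hop)      # complex audio track spectrum
    return np.abs(track_spectrum)                          # phase eliminated -> amplitude spectrum
```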
• the computer device can input the spectrum amplitude spectrum into the segmentation network layer 7W 1 and the segmentation network layer 7W 2 respectively, so as to generate the first type of features (for example, object track features) corresponding to the spectrum amplitude spectrum through the segmentation network layer 7W 1 , and to generate the second type of features (for example, ambience track features) corresponding to the spectrum amplitude spectrum through the segmentation network layer 7W 2 .
• the computer device can merge and mask the first type features and the second type features to obtain a target mask map corresponding to the first type features (i.e., the first mask map). Furthermore, the computer device can generate a target type audio frame (i.e., an audio frame in the object segment) based on the target mask map and the spectrum amplitude spectrum, and use the target type audio frame as the audio frame to be processed, containing the human voice of the business object, that is output by the source separation model. For example, when the computer device generates the first type features and the second type features shown in Figure 7, it can perform splicing processing on the first type features and the second type features to obtain spliced type features.
  • the computer device performs two types of mask calculations on the splicing type features, so that a first mask image corresponding to the first type feature and a second mask image corresponding to the second type feature can be obtained.
• the mask calculation is performed, for example, by comparing the feature values at each point with the merged values obtained after the splicing processing.
• the computer device can perform a corresponding-position calculation (for example, multiplication) between the first mask image and the spectrum amplitude spectrum corresponding to the original audio frame, and then generate the first type audio frame (i.e., the audio frame in the object segment) through inverse spectrum transformation.
• the computer device can likewise perform a corresponding-position calculation between the second mask image and the spectrum amplitude spectrum corresponding to the original audio frame, and then generate the second type audio frame (i.e., the audio frame in the ambience segment) through inverse spectrum transformation. Since the mask and amplitude spectrum calculations above yield the amplitude spectra corresponding to the first type features and the second type features, the inverse spectrum transformation yields the one-dimensional sampling-point sequences of the first type and second type audio frames, i.e., the audio signals themselves.
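• The merge-mask and reconstruction steps can be sketched as follows: the two feature maps are compared point by point to form soft masks that sum to one, the target mask is multiplied with the track spectrum at corresponding positions, and an inverse transform yields the one-dimensional sampling-point sequence. The softmax-style mask, the reuse of the original phase, and the scipy inverse STFT are illustrative assumptions about how the described calculation might be realised.

```python
import numpy as np
from scipy.signal import istft

def separate_object_track(obj_features, ambience_features, track_spectrum,
                          sample_rate, n_fft=1024, hop=256):
    """Merge-mask the two feature types and reconstruct the object (vocal) audio.

    obj_features, ambience_features: (F, T) non-negative feature maps from the two
        segmentation network layers, aligned with the amplitude spectrum.
    track_spectrum: (F, T) complex audio track spectrum of the original audio frame.
    """
    eps = 1e-8
    stacked = np.stack([obj_features, ambience_features])          # splice the two feature types
    masks = stacked / (stacked.sum(axis=0, keepdims=True) + eps)   # point-wise comparison -> two mask maps
    object_mask = masks[0]                                         # target (first) mask map
    # Corresponding-position multiplication with the spectrum; the original phase is
    # reused here, which is an assumption about the inverse spectrum transformation.
    masked_spectrum = object_mask * track_spectrum
    _, object_audio = istft(masked_spectrum, fs=sample_rate,
                            nperseg=n_fft, noverlap=n_fft - hop)
    return object_audio                                            # 1-D sampling-point sequence
```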
• the computer device can separate environmental sounds (for example, BGM sounds) from the original audio frames of the multimedia data through the source separation model shown in Figure 7, thereby eliminating the impact of environmental sounds on the subsequent clustering and improving the clustering accuracy.
  • FIG. 8 is a schematic diagram of the model architecture of an audio semantic feature extraction model provided by an embodiment of the present application.
  • the audio semantic feature extraction model in the embodiment of the present application may be the audio semantic feature extraction model 640w in the embodiment corresponding to Figure 6.
  • the audio semantic feature extraction model shown in Figure 8 can be the Wavegram_Logmel128_Cnn14 model.
• the biggest feature of this audio semantic feature extraction model is that the model's input is the original audio sampling-point sequence, that is, the input of the entire network is the N object audio frames of the audio signal. This eliminates the need to extract basic audio features in advance. Since extracting basic audio features is very time-consuming, and using basic audio features as input occupies a particularly large amount of hardware resources, processing the N object audio frames of the input audio signal with this audio semantic feature extraction model saves computer resources and improves computing efficiency.
  • the audio semantic feature extraction model may include a time domain branch network layer, a frequency domain branch network layer and a convolution network layer.
• the time domain branch network layer here may include a convolution layer 801w (for example, a one-dimensional convolution layer with a convolution size of 1 and a stride of 5), a convolution layer 802w (for example, a one-dimensional convolution layer including a basic block), a max-pooling layer 803w (for example, a max-pooling layer with a stride of 4), a convolution layer 804w (for example, a one-dimensional convolution layer including a basic block), a max-pooling layer 805w (for example, a max-pooling layer with a stride of 4), and a convolution layer 806w (for example, a one-dimensional convolution layer including a basic block).
  • the computer device can directly learn the time domain characteristics of the audio signal in the time domain signal through these large one-dimensional convolution layers, especially information such as audio loudness and sampling point amplitude. After a large number of one-dimensional convolutional layers, a two-dimensional wavegram is obtained to represent the learned time domain feature map, so that the output of the time domain branch and the frequency domain branch can be combined.
  • the computer device can also perform feature learning on N object audio frames through the frequency domain branch network layer to obtain the learned frequency domain feature map (frequency domain learning feature).
  • the frequency domain branch network layer here may include a convolution layer 809w (for example, a two-dimensional convolution layer including a basic block).
  • the computer device can input N object audio frames to the frequency domain branch network layer and generate frequency domain spectra corresponding to the N object audio frames (for example, using Mel frequency to generate a log-mel spectrum).
• the computer device inputs the frequency domain spectrum to the convolution layer 809w shown in Figure 8, so as to obtain, through the multiple two-dimensional convolution layers in the convolution layer 809w, a learned frequency domain feature map with the same feature dimensions as the learned time domain feature map.
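• For the frequency-domain branch, the log-mel spectrum mentioned above could be computed as follows. The librosa calls and the 128-mel setting (matching the Wavegram_Logmel128 naming) are used for illustration only; the embodiment does not prescribe a particular library.

```python
import numpy as np
import librosa

def log_mel_spectrum(samples, sample_rate, n_mels=128, n_fft=1024, hop_length=320):
    """Frequency-domain spectrum of an object audio frame as a log-mel spectrogram."""
    mel = librosa.feature.melspectrogram(y=samples.astype(np.float32), sr=sample_rate,
                                         n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel)   # (n_mels, frames) input to the 2-D convolution layers
```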
  • the computer device can superimpose (for example, splice) the learned frequency domain feature map and the learned time domain feature map, so that the superimposed feature can be obtained.
• the computer device then inputs the superimposed features into the convolutional network layer, performs maximum and average processing on the superimposed features, and outputs the audio semantic feature vector corresponding to each object audio frame.
  • the convolutional network layer here may include a convolutional layer 810w (for example, a two-dimensional convolutional layer) and an activation layer 811w.
• the computer device can splice the feature map representing the learned frequency domain feature map and the feature map representing the learned time domain feature map to form a set of two-dimensional feature maps used to represent the superimposed features.
• the computer device can input the two-dimensional feature map representing the superimposed features into the convolution layer 810w shown in Figure 8, and then separately apply two-dimensional pooling to the features output by the convolution layer 810w, performing maximum processing and average processing to extract the maximum representation and the average representation of the current features. Furthermore, the computer device can determine the maximum-processed feature as the first sub-feature and the average-processed feature as the second sub-feature. At this time, the computer device can merge the first sub-feature and the second sub-feature, and then input the merged feature to the activation layer 811w shown in Figure 8 to finally generate an audio semantic feature vector set with 2048 dimensions.
  • the audio semantic feature vector set may include an audio semantic feature vector corresponding to each of the N object audio frames.
• the computer device can quickly perform audio semantic feature extraction on each of the N object audio frames through the audio semantic feature extraction model shown in Figure 8, so as to obtain the audio semantic feature vector corresponding to each object audio frame more quickly and accurately.
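• The final pooling step can be sketched as below: the feature map is reduced along its time axis by a maximum and an average, and the two sub-features are merged into the fixed-length audio semantic feature vector. The 2048-dimension figure comes from the description above; the addition-based merge is an illustrative assumption.

```python
import numpy as np

def pool_audio_semantic_vector(feature_map):
    """Reduce a (time, channels) feature map to one audio semantic feature vector.

    The maximum-pooled representation (first sub-feature) and the average-pooled
    representation (second sub-feature) are merged, e.g. yielding a
    2048-dimensional vector for a 2048-channel feature map.
    """
    first_sub_feature = feature_map.max(axis=0)    # maximum processing
    second_sub_feature = feature_map.mean(axis=0)  # average processing
    return first_sub_feature + second_sub_feature  # merged representation
```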
  • Step S103 Identify the business role corresponding to each of the P audio clusters based on the picture feature information, M audio clusters and the object role mapping table associated with the multimedia data.
  • the object role mapping table (for example, the object role mapping table shown in Table 1 above) may include business roles that have a mapping relationship with the list business object, and there are P overlapping business objects between the list business object and the M business objects.
  • the computer device may obtain the audio cluster C k from the M audio clusters.
  • the computer device can extract the first playing time of the audio cluster C k in the multimedia data, where k is a positive integer less than or equal to M.
  • the first playback time of the audio cluster C k in the multimedia data is one or more playback times in the multimedia data of the object audio frame corresponding to the audio semantic feature vector included in the audio cluster C k .
  • the computer device can obtain P business objects that overlap with the M business objects from the list of business objects in the object role mapping table associated with the multimedia data. Furthermore, the computer device can extract the second playback time of each of the P business objects in the multimedia data based on the picture feature information. The second playback time of each of the P business objects in the multimedia data is one or more playback times in the multimedia data of the video frame in which each of the P business objects is located. At this time, the computer device can respectively determine the time overlap between the first playback time and each second playback time of the audio cluster C k . Furthermore, the computer device can use the business object corresponding to the second playback time with the highest degree of time overlap as the business object corresponding to the audio cluster C k . Further, the computer device can obtain the business role corresponding to the business object corresponding to the audio cluster C k from the object role mapping table, and use the obtained business role as the business role corresponding to the audio cluster C k .
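• A minimal sketch of this overlap-based assignment follows. Playback times are represented as lists of (start, end) intervals in seconds; the overlap measure (total intersection length) and all names are illustrative assumptions.

```python
def interval_overlap(intervals_a, intervals_b):
    """Total overlapping duration between two lists of (start, end) intervals."""
    total = 0.0
    for a_start, a_end in intervals_a:
        for b_start, b_end in intervals_b:
            total += max(0.0, min(a_end, b_end) - max(a_start, b_start))
    return total

def assign_object_to_cluster(first_playback_time, second_playback_times):
    """Pick the business object whose second playback time overlaps most with the
    audio cluster's first playback time.

    first_playback_time:   intervals of the object audio frames in one audio cluster.
    second_playback_times: dict mapping business object -> intervals of its video frames.
    """
    return max(second_playback_times,
               key=lambda obj: interval_overlap(first_playback_time,
                                                second_playback_times[obj]))
```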
• the embodiments of this application start from the audio perspective, identify characters in the multimedia data, and classify each audio line into a role. This can supplement accurate line-role information in shots and scenes where there is no information about the key parts of the character, thus improving the accuracy of role recognition.
  • FIG. 9 is a schematic diagram of a scenario for audio character recognition provided by an embodiment of the present application.
• when the computer device executes step S101, it can determine, through the picture feature information recognized by the first module 201, that the number M of business objects to which the character pictures in the video frames of the multimedia data belong is 3; specifically, the business objects may include object a, object b and object c.
• when the computer device executes step S102, it can determine, through the audio processing results clustered by the second module 202, that there are three audio clusters, which may specifically include the audio cluster C 1 , the audio cluster C 2 and the audio cluster C 3 shown in FIG. 9 .
  • the N object audio frames in the embodiment of the present application may include segment 1, segment 2, segment 3, segment 4, segment 5, and segment 6 shown in FIG. 9 .
  • these 6 segments are arranged according to playing time.
  • the object audio frames corresponding to audio cluster C 1 may include object audio frames in segment 1 and segment 3 .
  • the object audio frames corresponding to audio cluster C 2 may include object audio frames in segment 2, segment 4, and segment 6.
  • the object audio frame corresponding to audio cluster C 3 may include the object audio frame in segment 5 .
  • the computer device can obtain, from the list of business objects in the object role mapping table shown in Table 1, business objects that overlap with the M business objects obtained by the computer device in the first module.
  • the list business objects in Table 1 above may include four business objects: object a, object b, object c, and object d.
• the M business objects obtained by the computer device in the embodiment of the present application may include object a, object b, and object c, so the P overlapping business objects are object a, object b, and object c.
  • the computer device can extract the playback time (ie, the second playback time) of each of the three overlapping business objects in the multimedia data based on the picture feature information.
• the second playback time of object a in the multimedia data is playback time T 1 (for example, 00:00-10:00) and playback time T 3 (for example, 30:45-38:00); the second playback time of object b in the multimedia data is playback time T 2 (for example, 10:05-28:33), playback time T 4 (for example, 40:05-55:39), and playback time T 6 (for example, 100:03-113:57);
  • the second playback time of object c in the multimedia data is the playback time T 5 (for example, 80:30-88:50).
  • the computer device can obtain the audio cluster C 1 from these three audio clusters, and then can extract the playback time of the audio cluster C 1 in the multimedia data (ie, the first playback time of the audio cluster C 1 ).
• the first playback time of the audio cluster C 1 in the multimedia data may include the playback time t 1 corresponding to segment 1 (for example, 00:30-10:10) and the playback time t 3 corresponding to segment 3 (for example, 35:08-40:52).
  • the computer device can respectively determine the time overlap between the audio cluster C 1 and the second playback time corresponding to each business object.
  • the time overlap between the first playback time of audio cluster C 1 and the second playback time of object a is 98%
  • the time overlap with the second playback time of object b is 5%
• and the time overlap with the second playback time of object c is 1%.
• the computer device can determine, from the three time overlap degrees, the second playback time with the highest time overlap degree, that is, the second playback time of object a. Further, the computer device can use object a as the business object corresponding to the audio cluster C 1 , and obtain from the above Table 1 the business roles that have a mapping relationship with object a (i.e., role 1 and role 2) as the business roles corresponding to the audio cluster C 1 .
  • the computer device can refer to the audio role identification method of the business role corresponding to the audio cluster C 1 and determine that the business role corresponding to the audio cluster C 2 can be the role 3 that has a mapping relationship with the object b.
• the business role corresponding to audio cluster C 3 may be role 4, which has a mapping relationship with object c.
• a computer device with an audio character recognition function can associate sounds with characters by combining the picture feature information automatically recognized from the video frames with the M adaptively clustered audio clusters, so that the business roles corresponding to the P audio clusters associated with the object role mapping table can be accurately identified. This audio role identification method does not require manual annotation of the business role to which each audio line belongs; it can not only reduce the consumption of manpower and time, but also solve the problem of recognition errors caused by similar timbres, thereby improving the accuracy and efficiency of recognition.
  • the embodiments of the present application can adopt the audio semantic feature clustering method in the audio character recognition process, making the entire audio character recognition system more versatile and applicable to different scenarios of business objects in different multimedia data, thereby effectively improving the recognition applicability.
  • FIG. 10 is another schematic flowchart of a data processing method provided by an embodiment of the present application.
• This method can be executed by a terminal device with an audio role recognition function (for example, any terminal device in the terminal device cluster shown in FIG. 1 above, such as the terminal device 100a), by a server with an audio role recognition function (for example, the server 10F shown in FIG. 1), or interactively by a target terminal device with a multimedia data playback function and a server with an audio role recognition function, which is not limited here.
  • the method may at least include the following steps S201 to S205:
  • Step S201 Identify picture feature information from video frames of multimedia data.
  • Step S202 locate and separate the audio frames containing human voices from the original audio frames of the multimedia data, obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames in the multimedia data, and compare The audio semantic feature vectors corresponding to N object audio frames are clustered to obtain M audio clusters.
  • Step S203 Identify the business role corresponding to each of the P audio clusters based on the picture feature information, M audio clusters and the object role mapping table associated with the multimedia data.
  • step S201 to step S203 please refer to the description of step S101 to step S103 in the embodiment corresponding to FIG. 3, which will not be described again here.
• Step S204 Determine the service playback time of each of the P business objects in the multimedia data based on the first playback time in the multimedia data of the P audio clusters (specifically, of the object audio frames corresponding to the P audio clusters) and the second playback time in the multimedia data of the business objects corresponding to the P audio clusters (specifically, of the video frames in which those business objects are located).
• the computer device can obtain a target audio cluster from the P audio clusters, and can further determine the first playback time of the target audio cluster in the multimedia data and the second playback time, in the multimedia data, of the business object corresponding to the target audio cluster. Further, the computer device can determine the time intersection or the time union of the first playback time and the second playback time of the target audio cluster, and then use the determined time intersection or time union as the service playback time in the multimedia data of the business object corresponding to the target audio cluster, until the service playback time of each of the P business objects in the multimedia data is obtained.
• the embodiment of the present application uses the audio semantic feature clustering method to perform audio character recognition, which can make up for cases where some video frames contain no character facial information or object information and the character therefore cannot be recognized even though audio is present: the business role corresponding to the current audio cluster can be identified automatically based on the semantic features of the object audio frames. This fills the shortcomings of using only image recognition for role recognition and ensures the integrity of the role's time positioning information across the entire multimedia data.
  • the first playback time of audio cluster C 1 in the multimedia data may include the playback time t 1 corresponding to segment 1 (for example, 00:30-10:10) and the playback time corresponding to segment 3 t 3 (e.g., 35:08-40:52).
  • the second playback time of the business object (for example, object a) corresponding to the audio cluster C 1 in the multimedia data is the playback time T 1 (for example, 00:00-10:00) and the playback time T 3 (for example, 30 :45-38:00).
• if the computer device uses the time intersection method to determine the service playback time, the service playback time of object a determined by the computer device can be 00:30-10:00 and 35:08-38:00; if the computer device uses the time union method, the service playback time of object a determined by the computer device can be 00:00-10:10 and 30:45-40:52.
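• The time-intersection and time-union options above can be sketched as below for interval lists; the function names and the (start, end) representation in seconds are illustrative. Applied to the example values, the intersection yields 00:30-10:00 and 35:08-38:00, and the union yields 00:00-10:10 and 30:45-40:52.

```python
def playback_time_intersection(first_time, second_time):
    """Pairwise intersection of two lists of (start, end) playback intervals."""
    result = []
    for a_start, a_end in first_time:
        for b_start, b_end in second_time:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                result.append((start, end))
    return result

def playback_time_union(first_time, second_time):
    """Union of two interval lists, merging any intervals that touch or overlap."""
    intervals = sorted(first_time + second_time)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```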
  • Step S205 Based on the service playback time corresponding to each of the P business objects, obtain the multimedia segment data corresponding to the P business objects from the multimedia data.
• the multimedia segment data here may include the audio frames associated with the corresponding business object and the video frames associated with the corresponding business object.
  • the computer device when it obtains the service playback time of object a, the service playback time of object b, and the service playback time of object c, it can respectively obtain the multimedia segment data corresponding to these three service objects.
• the computer device can obtain, from the multimedia data, the multimedia segment data that matches the service playback time of object a (that is, including the video frames associated with object a and the audio frames associated with object a) as the multimedia segment data corresponding to object a (for example, multimedia segment data 1).
• the computer device can obtain the multimedia segment data that matches the service playback time of object b (that is, including the video frames associated with object b and the audio frames associated with object b) as the multimedia segment data corresponding to object b (for example, multimedia segment data 2), and can obtain the multimedia segment data that matches the service playback time of object c (that is, including the video frames associated with object c and the audio frames associated with object c) as the multimedia segment data corresponding to object c (for example, multimedia segment data 3).
• the fully automatic audio character recognition solution based on the audio semantic feature clustering method provided by the embodiments of the present application can automatically combine picture feature information (for example, character facial information) to identify business characters in multimedia data, which can save a great deal of manual annotation cost and time and accelerate the implementation of video applications.
• when the computer device obtains the multimedia segment data corresponding to each business object, this can be applied to the "watch TA only" user-specific service in the multimedia data playback scenario, where storyboards are selected for the business objects (or business roles) in the multimedia data, so that when the target user triggers this user-specific service, the multimedia segment data not selected by the user is automatically skipped and the computer device can more precisely locate the multimedia segment data of the business objects that the user likes.
  • the computer device can play multimedia data in a business playback display interface.
  • the service playback display interface may include a playback selection control for triggering a target video data selection function.
  • the computer device may display the object playlist in response to the triggering operation.
• the object playlist here can be displayed in the bottom area of the business playback display interface in a floating window form, a masked form, or a translucent form, or it can be displayed on a shrinkable interface whose display size can be changed through drag-and-drop operations and whose size is smaller than the service playback display interface.
  • the object playlist here may include object cover data corresponding to Z business objects respectively; and Z is a positive integer less than or equal to P.
  • the target multimedia segment data here may be the multimedia segment data corresponding to the business object corresponding to the target object cover data, and the business object corresponding to the target object cover data belongs to P business objects.
  • the triggering operations here may include contact operations such as clicks and long presses, and may also include non-contact operations such as voice and gestures, which will not be limited here.
  • FIG. 11 is a schematic diagram of a scene for displaying multimedia segment data according to an embodiment of the present application.
  • the computer device in the embodiment of the present application may be a target terminal device used by the target user.
  • the target terminal device may be any terminal device in the terminal device cluster in the embodiment corresponding to Figure 1, for example, the terminal device 100a.
  • the interface 1101J and the interface 1102J shown in Figure 11 are both service playback display interfaces at different times provided by a client with a multimedia data playback function.
  • the target terminal device used by the target user can display multimedia data in the interface 1101J.
  • the multimedia data here can be the multimedia data 20S in the embodiment corresponding to Figure 2.
  • the interface 1101J may include a control 11U, which is a playback selection control used to trigger the target video data selection function.
  • the target terminal device may display the object playlist 11B shown in FIG. 11 in response to the triggering operation.
  • the object playlist 11B here may include object cover data corresponding to the Z business objects and cover data corresponding to the multimedia data (for example, "watch the complete video").
• the object playlist 11B may specifically include the object cover data 1 corresponding to object a (for example, "watch only the clips of object a"), the object cover data 2 corresponding to object b (for example, "watch only the clips of object b"), and the object cover data 3 corresponding to object c (for example, "watch only the clips of object c").
  • the object a, the object b and the object c here all belong to the P business objects obtained by the target terminal device and obtained by performing audio role recognition on the multimedia data.
  • the target user can perform a triggering operation on the target object cover data (for example, the object cover data 1 corresponding to object a) among the Z pieces of object cover data.
  • the target terminal device can play the multimedia segment data corresponding to the object a corresponding to the object cover data 1 in the interface 1102J shown in FIG. 11 .
• the target terminal device can also highlight, in the playback progress bar of the multimedia data displayed on the interface 1102J, the playback progress corresponding to the multimedia segment data of object a, so that the target user can more quickly and accurately find the next piece of multimedia segment data of object a that they are interested in.
• when the computer device obtains the multimedia segment data corresponding to each business object, it can also apply it in merged editing scenarios. For example, the computer device classifies the audio data in the multimedia data, distinguishes the business role corresponding to each audio line, and organizes the line voice collection (i.e., the audio cluster) corresponding to each business role across the entire multimedia data, using it as production material provided to intelligent video production teams as candidate information for editing. For example, the computer device can perform mixed editing of multiple pieces of multimedia segment data of the same business object across different multimedia data; for another example, the computer device can merge and edit the corresponding multimedia segment data of different business objects.
  • the multimedia data here may include first multimedia data and second multimedia data. Both the first multimedia data and the second multimedia data include objects to be edited.
  • the objects to be edited belong to the P business objects obtained through audio role recognition by the computer equipment.
  • the first multimedia data here can be a war-themed TV series in which the subject to be edited participates.
  • the second multimedia data here can be the TV series with the theme of fairy tales in which the subject to be edited participates.
• the computer device can obtain the first target business role corresponding to the object to be edited based on the object role mapping table associated with the first multimedia data, and can further obtain, from the first multimedia data, the first multimedia segment data associated with the first target business role.
  • the first multimedia segment data here is determined by the computer device based on the service playback time of the object to be edited in the first multimedia data.
• the computer device can also obtain the second target business role corresponding to the object to be edited based on the object role mapping table associated with the second multimedia data, and obtain, from the second multimedia data, the second multimedia segment data associated with the second target business role. The second multimedia segment data here may be determined by the computer device based on the service playback time of the object to be edited in the second multimedia data.
  • the computer device can perform merge and edit processing on the first multimedia segment data and the second multimedia segment data to obtain merged editing data corresponding to the object to be edited.
  • the merged clip data here can be used to upload to the business data platform where the client is located, so that objects accessing the client can check it on the corresponding terminal device.
• a computer device with an audio character recognition function can associate sounds with characters by combining the picture feature information automatically recognized from the video frames with the M adaptively clustered audio clusters, so that the business roles respectively corresponding to the P audio clusters associated with the object role mapping table can be accurately identified.
• This audio role recognition method does not require manual annotation of the business role to which each audio line belongs; instead, it can automatically identify and write the business role and audio line information before the multimedia data goes online, so that downstream services (for example, the user-specific service, the merged editing service, etc.) can be quickly empowered.
  • the embodiment of the present application adopts the audio semantic feature clustering method in the audio character recognition process, which can not only reduce the manpower time consumed, but also solve the problem of similar timbre recognition errors, so as to improve the accuracy and efficiency of recognition.
  • the entire audio character recognition system is more versatile and can be applied to different scenarios of business objects in different multimedia data, thus effectively improving the applicability of recognition.
  • FIG. 12 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • the data processing device 1 may include: a picture information acquisition module 100, a clustering processing module 200, and an audio character recognition module 300.
  • the picture information acquisition module 100 is used to identify picture feature information from video frames of multimedia data.
  • the picture feature information includes M business objects to which the character pictures in the video frame belong, and M is a positive integer.
  • the clustering processing module 200 is used to locate and separate audio frames containing human voices from the original audio frames of multimedia data, obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames, and The audio semantic feature vectors corresponding to N object audio frames are clustered to obtain M audio clusters, where N is a positive integer, and one audio cluster corresponds to one business object.
• the audio role recognition module 300 is used to identify the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data, where P is a positive integer less than or equal to M.
  • the object role mapping table includes business roles that have a mapping relationship with the list business object. There are P overlapping business objects between the list business object and M business objects.
  • FIG. 13 is another schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • the data processing device 2 may include: a picture information acquisition module 11, a clustering processing module 12, an audio role recognition module 13, a business time determination module 14, a segment data determination module 15, a multimedia data playback module 16, Object list display module 17, segment data playback module 18, first segment data acquisition module 19, second segment data acquisition module 20 and merge editing module 21.
  • the picture information acquisition module 11 is used to identify picture feature information from video frames of multimedia data, where the picture feature information includes M business objects to which character pictures in the video frames belong, and M is a positive integer.
  • the picture information acquisition module 11 includes: a video frame acquisition unit 111, a picture cutting unit 112, a picture encoding unit 113, a vector matching unit 114 and a picture information acquisition unit 115.
  • the video frame acquisition unit 111 is used to acquire video frames from multimedia data.
  • the picture cutting unit 112 is used to cut pictures containing key parts of the character in the video frame to obtain the character picture corresponding to the video frame.
  • the character pictures include X character cut pictures, where X is a positive integer greater than or equal to M.
  • the picture cutting unit 112 includes: a position determining subunit 1121 and a cutting subunit 1122.
  • the position determination subunit 1121 is used to detect and locate the key parts of the character in the video frame to determine the position information of the key parts of the character in the video frame.
  • the cutting subunit 1122 is used to cut the key parts of the character in the video frame based on the position information, obtain X character cut pictures containing the key parts of the character, and use the X character cut pictures as the character pictures corresponding to the video frame.
• the picture encoding unit 113 is used to obtain the character cut picture T i among the X character cut pictures and perform picture encoding on the character cut picture T i to obtain the picture information vector L i corresponding to the character cut picture T i , where i is a positive integer less than or equal to X.
  • the vector matching unit 114 is used to determine the object key information vector that matches the picture information vector Li from the information vector database associated with the candidate business object, and use the candidate business object corresponding to the matched object key information vector as the role Cut the business object corresponding to picture T i .
  • the vector matching unit 114 includes: a database acquisition subunit 1141, a vector distance determination subunit 1142, and an object matching subunit 1143.
  • the database acquisition subunit 1141 is used to acquire an information vector database associated with candidate business objects, where the information vector database is used to store object key information vectors corresponding to Y candidate business objects, where Y is a positive integer greater than or equal to M.
  • the vector distance determination subunit 1142 is used to respectively determine the vector distance between the picture information vector Li and each object key information vector in the Y object key information vectors, to obtain Y vector distances.
  • the object matching subunit 1143 is used to obtain the minimum vector distance that is less than or equal to the distance threshold from Y vector distances, determine the candidate business object corresponding to the object key information vector corresponding to the minimum vector distance, and use the determined candidate business object as a role Cut the business object corresponding to picture T i .
  • the picture information acquisition unit 115 is configured to determine the picture feature information corresponding to the video frame based on the obtained business objects corresponding to the character cut pictures.
• for the specific implementation of the video frame acquisition unit 111, the picture cutting unit 112, the picture encoding unit 113, the vector matching unit 114 and the picture information acquisition unit 115, please refer to the description of step S101 in the embodiment corresponding to Figure 3 above; no further details will be given here.
  • the clustering processing module 12 is used to locate and separate audio frames containing human voices from the original audio frames of multimedia data, obtain N object audio frames, extract corresponding audio semantic feature vectors from the N object audio frames, and The audio semantic feature vectors corresponding to N object audio frames are clustered to obtain M audio clusters, where N is a positive integer, and one audio cluster corresponds to one business object.
  • the clustering processing module 12 includes: an object audio frame determination unit 121, a semantic feature extraction unit 122, and a clustering processing unit 123.
  • the object audio frame determining unit 121 is used to locate and separate audio frames containing human voices from original audio frames of multimedia data to obtain N object audio frames.
  • the object audio frame determination unit 121 includes: an original audio frame acquisition subunit 1211, a source separation subunit 1212, and an object audio frame determination subunit 1213.
  • the original audio frame acquisition subunit 1211 is used to acquire original audio frames from multimedia data.
  • the source separation subunit 1212 is used to perform source separation on the original audio frame to obtain an audio frame to be processed that contains human voice.
  • the source separation sub-unit 1212 includes: an amplitude spectrum generating sub-unit 12121, a type feature generating sub-unit 12122, a merging mask sub-unit 12123 and an audio frame to be processed determining sub-unit 12124.
  • the amplitude spectrum generation subunit 12121 is used to input the original audio frame to the source separation model, and generate the spectrum amplitude spectrum corresponding to the original audio frame through the source separation model.
  • the source separation model includes a first segmentation network layer and a second segmentation network layer.
  • the type feature generation subunit 12122 is used to input the spectrum amplitude spectrum into the first segmentation network layer and the second segmentation network layer respectively, generate the first type feature corresponding to the spectrum amplitude spectrum through the first segmentation network layer, and generate the first type feature corresponding to the spectrum amplitude spectrum through the second segmentation network layer. Generate second type features corresponding to the spectral amplitude spectrum.
  • the merge mask subunit 12123 is used to perform merge mask processing on the first type features and the second type features to obtain a target mask map corresponding to the first type features.
• the to-be-processed audio frame determination subunit 12124 is used to generate a target type audio frame through inverse spectrum transformation based on the corresponding positions of the target mask map and the spectrum amplitude spectrum, and to use the target type audio frame as the audio frame to be processed, containing the human voice, that is output by the source separation model.
• for the specific implementation of the amplitude spectrum generation subunit 12121, the type feature generation subunit 12122, the merging mask subunit 12123 and the to-be-processed audio frame determination subunit 12124, please refer to the description of the audio frame to be processed in the embodiment corresponding to Figure 7; no further details will be given here.
  • the object audio frame determination subunit 1213 is used to locate and cut the non-silent segments in the audio impact signal frame in the audio frame to be processed based on the audio boundary detection strategy for eliminating silent frames, to obtain N object audio frames.
• for the specific implementation of the source separation subunit 1212 and the object audio frame determination subunit 1213, please refer to the description of the object positioning and separation processing of the original audio frame in the embodiment corresponding to Figure 3; no further details will be given here.
  • the semantic feature extraction unit 122 is used to extract semantic features from each of the N object audio frames, and obtain an audio semantic feature vector corresponding to each object audio frame.
  • the semantic feature extraction unit 122 includes: an audio frame input subunit 1221, a frequency domain feature determination subunit 1222, a time domain feature determination subunit 1223, and an audio feature vector determination subunit 1224.
  • the audio frame input subunit 1221 is used to input N object audio frames to the audio semantic feature extraction model.
  • the audio semantic feature extraction model includes frequency domain branch network layer, time domain branch network layer and convolution network layer.
  • the frequency domain feature determination subunit 1222 is used to perform feature learning on N object audio frames through the frequency domain branch network layer to obtain a learned frequency domain feature map.
  • the time domain feature determination subunit 1223 is used to perform feature learning on N object audio frames through the time domain branch network layer to obtain a learned time domain feature map.
  • the feature dimensions between the learned frequency domain feature map and the learned time domain feature map are the same.
  • the audio feature vector determination subunit 1224 is used to superimpose the learned frequency domain feature map and the learned time domain feature map to obtain superimposed features, input the superimposed features to the convolution network layer, and perform maximum average processing on the superimposed features. Output the audio semantic feature vector corresponding to each object audio frame.
• for the specific implementation of the audio frame input subunit 1221, the frequency domain feature determination subunit 1222, the time domain feature determination subunit 1223 and the audio feature vector determination subunit 1224, please refer to the description of the semantic feature extraction of the object audio frames in the embodiment corresponding to FIG. 8; no further details will be given here.
• the clustering processing unit 123 is used to determine M as the number of cluster centers to be clustered, and to perform clustering processing on the audio semantic feature vector corresponding to each obtained object audio frame based on the number of cluster centers to obtain M audio clusters.
• for the specific implementation of the object audio frame determination unit 121, the semantic feature extraction unit 122 and the clustering processing unit 123, please refer to the description of step S102 in the embodiment corresponding to Figure 3 above, which will not be repeated here.
• the audio role recognition module 13 is used to identify the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data, where P is a positive integer less than or equal to M.
  • the object role mapping table includes business roles that have a mapping relationship with the list business object. There are P overlapping business objects between the list business object and M business objects.
  • the audio character recognition module 13 includes: a first time extraction unit 131, a second time extraction unit 132, a time overlap determination unit 133, and an audio character recognition unit 134.
• the first time extraction unit 131 is used to obtain the audio cluster C k from the M audio clusters, and to extract one or more playback times, in the multimedia data, of the object audio frames corresponding to the audio semantic feature vectors included in the audio cluster C k as the first playback time of the audio cluster C k , where k is a positive integer less than or equal to M.
  • the second time extraction unit 132 is used to obtain, from the list business objects in the object role mapping table, the P business objects that overlap with the M business objects, and, based on the picture feature information, to extract one or more playback times, in the multimedia data, of the video frames in which each of the P business objects is located as the second playback time of each business object.
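A sketch of how the first and second playback times might be collected, assuming each playback time is represented as a timestamp in seconds; this representation is an assumption, not something the disclosure prescribes.

```python
from collections import defaultdict


def first_playback_times(cluster_frame_indices, frame_timestamps):
    """Playback times of the object audio frames whose vectors fall in one audio cluster."""
    return {frame_timestamps[i] for i in cluster_frame_indices}


def second_playback_times(video_frame_objects, frame_times):
    """Playback times of the video frames in which each business object appears.

    video_frame_objects: list aligned with frame_times; each entry is the set of business
    objects detected in that video frame (taken from the picture feature information).
    """
    times = defaultdict(set)
    for t, objects in zip(frame_times, video_frame_objects):
        for obj in objects:
            times[obj].add(t)
    return times
```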
  • the time overlap determination unit 133 is used to respectively determine the time overlap between the first playback time of the audio cluster C k and the second playback time corresponding to each business object, and to take the business object corresponding to the second playback time with the highest time overlap as the business object corresponding to the audio cluster C k.
  • the audio role identification unit 134 is used to obtain the business role corresponding to the business object corresponding to the audio cluster C k from the object role mapping table, and use the obtained business role as the business role corresponding to the audio cluster C k .
  • For the specific implementation of the first time extraction unit 131, the second time extraction unit 132, the time overlap determination unit 133, and the audio character recognition unit 134, please refer to the description of step S103 in the embodiment corresponding to Figure 3 above, which will not be repeated here.
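Continuing the sketch above, the time overlap can be measured, for example, as the number of shared (rounded) seconds between the two time sets; the disclosure does not fix a particular overlap measure, so the one below is only an assumption.

```python
def assign_role_to_cluster(cluster_times, object_times, object_role_mapping):
    """Pick the business object whose second playback time overlaps most with the cluster's
    first playback time, then look up its business role (illustrative overlap measure)."""
    def overlap(a, b):
        return len({round(t) for t in a} & {round(t) for t in b})  # shared seconds

    best_object = max(object_times, key=lambda obj: overlap(cluster_times, object_times[obj]))
    return best_object, object_role_mapping[best_object]


# Example with the placeholder names used earlier:
# business_object, business_role = assign_role_to_cluster(
#     cluster_times={12.0, 12.5, 13.0},
#     object_times={"actor_A": {12.0, 13.0}, "actor_C": {40.0}},
#     object_role_mapping={"actor_A": "role_1", "actor_C": "role_3"})
```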
  • the service time determination module 14 is configured to determine the service playback time of each of the P business objects in the multimedia data based on the first playback time of the P audio clusters in the multimedia data and the second playback time, in the multimedia data, of the business objects corresponding to the P audio clusters.
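One plausible reading of combining the first and second playback times is to take their union and merge nearby timestamps into continuous segments; the sketch below follows that reading, and the max_gap threshold is an assumed parameter.

```python
def service_playback_intervals(first_times, second_times, max_gap=1.0):
    """Merge the union of first and second playback times into continuous (start, end) segments."""
    times = sorted(set(first_times) | set(second_times))
    if not times:
        return []
    intervals, start, prev = [], times[0], times[0]
    for t in times[1:]:
        if t - prev > max_gap:              # a gap larger than max_gap starts a new segment
            intervals.append((start, prev))
            start = t
        prev = t
    intervals.append((start, prev))
    return intervals
```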
  • the segment data determination module 15 is used to obtain multimedia segment data corresponding to P business objects from the multimedia data based on the service playback time corresponding to each business object.
  • the multimedia segment data includes audio frames associated with the corresponding business object and video frames associated with the corresponding business object.
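The disclosure does not prescribe a tool for extracting the multimedia segment data; as an illustration, the intervals from the previous sketch could be cut with ffmpeg (assumed to be installed), keeping both the audio frames and video frames of each segment.

```python
import subprocess


def cut_segments(source_path: str, intervals, out_prefix: str):
    """Cut each service playback interval of the multimedia data into its own clip."""
    for i, (start, end) in enumerate(intervals):
        subprocess.run(
            ["ffmpeg", "-y", "-i", source_path, "-ss", str(start), "-to", str(end),
             "-c", "copy", f"{out_prefix}_{i:03d}.mp4"],
            check=True,
        )
```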
  • the multimedia data playing module 16 is used to play multimedia data in the service playing display interface.
  • the service playback display interface includes a playback selection control used to trigger the object video data selection function.
  • the object list display module 17 is used to display the object playlist in response to a trigger operation on the playback selection control, where the object playlist includes object cover data corresponding to Z business objects, and Z is a positive integer less than or equal to P;
  • the segment data playback module 18 is used to play the target multimedia segment data in the service playback display interface in response to a trigger operation on the target object cover data among the Z pieces of object cover data, where the target multimedia segment data is the multimedia segment data corresponding to the business object of the target object cover data, and the business object corresponding to the target object cover data belongs to the P business objects.
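The object playlist and object cover data can be illustrated with a simple data shape such as the one below; the field names and file paths are hypothetical and only show how a trigger operation on a cover could be resolved to the target multimedia segment data.

```python
# Hypothetical shape of the object playlist handed to the service playback display interface.
object_playlist = [
    {"business_object": "actor_A", "cover_image": "covers/actor_A.jpg",
     "segments": ["clips/actor_A_000.mp4", "clips/actor_A_001.mp4"]},
    {"business_object": "actor_C", "cover_image": "covers/actor_C.jpg",
     "segments": ["clips/actor_C_000.mp4"]},
]


def on_cover_selected(playlist, index):
    """Return the target multimedia segment data for the selected object cover data."""
    return playlist[index]["segments"]
```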
  • the multimedia data includes first multimedia data and second multimedia data; both the first multimedia data and the second multimedia data include the object to be edited.
  • the object to be edited belongs to P business objects.
  • the first segment data acquisition module 19 is configured to obtain the first target business role corresponding to the object to be edited based on the object role mapping table associated with the first multimedia data, and to obtain, from the first multimedia data, the first multimedia segment data associated with the first target business role; the first multimedia segment data is determined based on the service playback time of the object to be edited in the first multimedia data.
  • the second segment data acquisition module 20 is configured to obtain the second target business role corresponding to the object to be edited based on the object role mapping table associated with the second multimedia data, and to obtain, from the second multimedia data, the second multimedia segment data associated with the second target business role; the second multimedia segment data is determined based on the service playback time of the object to be edited in the second multimedia data.
  • the merging and editing module 21 is used to perform merging and editing processing on the first multimedia segment data and the second multimedia segment data to obtain merged editing data corresponding to the object to be edited.
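As a sketch of the merging and editing processing, the first and second multimedia segment data could be concatenated with ffmpeg's concat demuxer, assuming the segments share compatible codecs; otherwise re-encoding would be needed instead of stream copying.

```python
import subprocess
import tempfile


def merge_segments(segment_paths, output_path):
    """Concatenate multimedia segment data from different source videos into merged editing data."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for path in segment_paths:
            f.write(f"file '{path}'\n")      # concat demuxer list file
        list_file = f.name
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", list_file,
         "-c", "copy", output_path],
        check=True,
    )


# Example: merge_segments(first_segments + second_segments, "merged_object_to_edit.mp4")
```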
  • For the specific implementation of the picture information acquisition module 11, the clustering processing module 12, the audio role recognition module 13, the service time determination module 14, the segment data determination module 15, the multimedia data playback module 16, the object list display module 17, the segment data playback module 18, the first segment data acquisition module 19, the second segment data acquisition module 20, and the merge editing module 21, please refer to the description of steps S201 to S205 in the embodiment corresponding to Figure 10 above, which will not be repeated here. In addition, the beneficial effects of using the same method will not be described again.
  • the computer device 1000 may be a computer device with an audio character recognition function.
  • the computer device 1000 may include: at least one processor 1001 (for example, a CPU), at least one network interface 1004, a memory 1005, and at least one communication bus 1002.
  • the communication bus 1002 is used to realize connection communication between these components.
  • the network interface 1004 may include standard wired interfaces and wireless interfaces (such as WI-FI interfaces).
  • the memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory.
  • the memory 1005 may also be at least one storage device located remotely from the aforementioned processor 1001.
  • the memory 1005, which is a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
  • the computer device may also include the user interface 1003 shown in Figure 14.
  • the user interface 1003 may include a display screen (Display), a keyboard (Keyboard), etc.
  • the network interface 1004 is mainly used for network communication; the user interface 1003 is mainly used to provide an input interface for the user; and the processor 1001 can be used to call the device control application program stored in the memory 1005 to implement the following:
  • identify the picture feature information of the video frames of the multimedia data, where the picture feature information includes M business objects to which the character pictures in the video frames belong, and M is a positive integer;
  • locate and separate the audio frames containing human voices from the original audio frames of the multimedia data to obtain N object audio frames, extract the corresponding audio semantic feature vectors from the N object audio frames, and cluster the audio semantic feature vectors corresponding to the N object audio frames to obtain M audio clusters, where N is a positive integer and one audio cluster corresponds to one business object;
  • identify the business role corresponding to each of the P audio clusters based on the picture feature information, the M audio clusters, and the object role mapping table associated with the multimedia data, where P is a positive integer less than or equal to M.
  • the object role mapping table includes business roles that have a mapping relationship with the list business objects; there are P overlapping business objects between the list business objects and the M business objects.
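Read together, the three operations above form a single pipeline. The sketch below wires up the illustrative helpers from the earlier snippets (cluster_audio_vectors, second_playback_times, assign_role_to_cluster); the object detector, voice locator/separator, and feature extractor are passed in as callables because their implementations are not specified here. All of this is an assumed decomposition, not the patented implementation.

```python
def recognize_audio_roles(video_frames, frame_times, original_audio, object_role_mapping,
                          detect_objects, locate_voiced_frames, extract_vectors):
    """High-level sketch of the three processor operations; all helper names are illustrative."""
    # 1. Picture feature information: M business objects appearing in the character pictures.
    video_frame_objects = [detect_objects(frame) for frame in video_frames]  # set of objects per frame
    m_objects = set().union(*video_frame_objects)

    # 2. Locate/separate voiced frames, extract audio semantic feature vectors, cluster into M clusters.
    object_frames, frame_timestamps = locate_voiced_frames(original_audio)
    vectors = extract_vectors(object_frames)
    clusters = cluster_audio_vectors(vectors, m_business_objects=len(m_objects))

    # 3. Identify the business role of each audio cluster via time overlap, restricted to the
    #    P business objects shared by the object role mapping table and the M objects.
    second_times = second_playback_times(video_frame_objects, frame_times)
    second_times = {o: t for o, t in second_times.items() if o in object_role_mapping}
    roles = {}
    for c, frame_idx in clusters.items():
        first_times = {frame_timestamps[i] for i in frame_idx}
        roles[c] = assign_role_to_cluster(first_times, second_times, object_role_mapping)
    return roles
```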
  • the computer device 1000 described in the embodiments of the present application can execute the data processing method described in the embodiments corresponding to Figure 3 and Figure 10, and can also execute the descriptions of the data processing device 1 in the embodiment corresponding to Figure 12 and the data processing device 2 in the embodiment corresponding to Figure 13, which will not be repeated here.
  • the beneficial effects of using the same method will not be described again.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • the computer program includes program instructions; when the program instructions are executed by a processor, the steps of the data processing method provided in the embodiments corresponding to Figure 3 and Figure 10 are implemented. For details, please refer to the implementations provided in each step of Figure 3 and Figure 10, which will not be described again here.
  • the computer-readable storage medium may be the data processing device provided in any of the foregoing embodiments or an internal storage unit of the computer device, such as a hard disk or memory of the computer device.
  • the computer-readable storage medium can also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.
  • the computer-readable storage medium may also include both an internal storage unit of the computer device and an external storage device.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by the computer device.
  • the computer-readable storage medium can also be used to temporarily store data that has been output or is to be output.
  • the present application provides a computer program product, which includes a computer program stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the computer device can execute the description of the data processing method in the embodiment corresponding to Figure 3 or Figure 10, which will not be repeated here.
  • the beneficial effects of using the same method will not be described again.

Abstract

Embodiments of the present application disclose a data processing method and apparatus, a computer device, and a storage medium, which can be applied to artificial intelligence scenarios. The method comprises: identifying picture feature information of video frames of multimedia data, the picture feature information comprising M business objects to which the character pictures in the video frames belong; locating and separating, from the original audio frames of the multimedia data, the audio frames that contain a human voice so as to obtain N object audio frames, respectively extracting the corresponding audio semantic feature vectors from the N object audio frames, and clustering the audio semantic feature vectors corresponding to the N object audio frames so as to obtain M audio clusters; and, on the basis of the picture feature information, the M audio clusters, and an object role mapping table associated with the multimedia data, identifying the business role corresponding to each of P audio clusters. By means of the embodiments of the present application, the accuracy, efficiency, and applicability of audio role identification can be improved.
PCT/CN2023/087208 2022-04-13 2023-04-10 Procédé et appareil de traitement de données, et dispositif informatique et support des stockage WO2023197979A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210383918.3A CN114465737B (zh) 2022-04-13 2022-04-13 一种数据处理方法、装置、计算机设备及存储介质
CN202210383918.3 2022-04-13

Publications (1)

Publication Number Publication Date
WO2023197979A1 true WO2023197979A1 (fr) 2023-10-19

Family

ID=81418551

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/087208 WO2023197979A1 (fr) 2022-04-13 2023-04-10 Procédé et appareil de traitement de données, et dispositif informatique et support des stockage

Country Status (2)

Country Link
CN (1) CN114465737B (fr)
WO (1) WO2023197979A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117872354A (zh) * 2024-03-11 2024-04-12 陕西欧卡电子智能科技有限公司 一种多毫米波雷达点云的融合方法、装置、设备及介质
CN117872354B (zh) * 2024-03-11 2024-05-31 陕西欧卡电子智能科技有限公司 一种多毫米波雷达点云的融合方法、装置、设备及介质

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114465737B (zh) * 2022-04-13 2022-06-24 腾讯科技(深圳)有限公司 一种数据处理方法、装置、计算机设备及存储介质
CN114895817B (zh) * 2022-05-24 2023-08-04 北京百度网讯科技有限公司 交互信息处理方法、网络模型的训练方法及装置
CN115083435B (zh) * 2022-07-28 2022-11-04 腾讯科技(深圳)有限公司 音频数据处理方法、装置、计算机设备和存储介质
CN115033734B (zh) * 2022-08-11 2022-11-11 腾讯科技(深圳)有限公司 一种音频数据处理方法、装置、计算机设备以及存储介质
CN116597828B (zh) * 2023-07-06 2023-10-03 腾讯科技(深圳)有限公司 模型确定方法、模型应用方法和相关装置

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108337532A (zh) * 2018-02-13 2018-07-27 腾讯科技(深圳)有限公司 演出片段的标注方法、视频播放方法、装置及系统
WO2020010338A1 (fr) * 2018-07-05 2020-01-09 Dts, Inc. Synthèse audio hybride utilisant des réseaux neuronaux
WO2020119508A1 (fr) * 2018-12-14 2020-06-18 深圳壹账通智能科技有限公司 Procédé et et appareil de découpage vidéo, dispositif informatique et support de stockage
CN111400543A (zh) * 2020-03-20 2020-07-10 腾讯科技(深圳)有限公司 音频片段的匹配方法、装置、设备及存储介质
WO2021073416A1 (fr) * 2019-10-18 2021-04-22 平安科技(深圳)有限公司 Procédé pour la génération d'une vidéo de personnage virtuel basé sur un réseau neuronal, et dispositif associé
CN113573161A (zh) * 2021-09-22 2021-10-29 腾讯科技(深圳)有限公司 多媒体数据处理方法、装置、设备及存储介质
CN114465737A (zh) * 2022-04-13 2022-05-10 腾讯科技(深圳)有限公司 一种数据处理方法、装置、计算机设备及存储介质

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0406504D0 (en) * 2004-03-23 2004-04-28 British Telecomm Method and system for detecting audio and video scene changes
CN102521340B (zh) * 2011-12-08 2014-09-03 中国科学院自动化研究所 一种基于角色的电视剧视频分析方法
US9047376B2 (en) * 2012-05-01 2015-06-02 Hulu, LLC Augmenting video with facial recognition
US10185917B2 (en) * 2013-01-31 2019-01-22 Lf Technology Development Corporation Limited Computer-aided decision systems
CN106683661B (zh) * 2015-11-05 2021-02-05 阿里巴巴集团控股有限公司 基于语音的角色分离方法及装置
CN106021496A (zh) * 2016-05-19 2016-10-12 海信集团有限公司 视频搜索方法及视频搜索装置
US11417343B2 (en) * 2017-05-24 2022-08-16 Zoominfo Converse Llc Automatic speaker identification in calls using multiple speaker-identification parameters
CN109376603A (zh) * 2018-09-25 2019-02-22 北京周同科技有限公司 一种视频识别方法、装置、计算机设备及存储介质
CN110166818B (zh) * 2018-11-30 2021-08-17 腾讯科技(深圳)有限公司 待配音视频的生成方法、计算机设备及存储介质
CN110691258A (zh) * 2019-10-30 2020-01-14 中央电视台 一种节目素材制作方法、装置及计算机存储介质、电子设备
CN111462758A (zh) * 2020-03-02 2020-07-28 深圳壹账通智能科技有限公司 智能会议角色分类的方法、装置、设备及存储介质
CN113744742B (zh) * 2020-05-29 2024-01-30 中国电信股份有限公司 对话场景下的角色识别方法、装置和系统
CN112565825B (zh) * 2020-12-02 2022-05-13 腾讯科技(深圳)有限公司 一种视频数据处理方法、装置、设备以及介质
CN113192516B (zh) * 2021-04-22 2024-05-07 平安科技(深圳)有限公司 语音角色分割方法、装置、计算机设备及存储介质
CN113157965B (zh) * 2021-05-07 2022-05-20 杭州网易云音乐科技有限公司 音频可视化模型训练及音频可视化方法、装置及设备
CN113327628B (zh) * 2021-05-27 2023-12-22 抖音视界有限公司 音频处理方法、装置、可读介质和电子设备
CN113822142A (zh) * 2021-07-28 2021-12-21 腾讯科技(深圳)有限公司 角色识别方法、装置、计算机设备和存储介质
CN113808578B (zh) * 2021-11-16 2022-04-15 阿里巴巴达摩院(杭州)科技有限公司 音频信号处理方法、装置、设备及存储介质
CN113923521B (zh) * 2021-12-14 2022-03-08 深圳市大头兄弟科技有限公司 一种视频的脚本化方法

Also Published As

Publication number Publication date
CN114465737B (zh) 2022-06-24
CN114465737A (zh) 2022-05-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23787623

Country of ref document: EP

Kind code of ref document: A1