CN116975823A - Data processing method, device, computer equipment, storage medium and product - Google Patents

Data processing method, device, computer equipment, storage medium and product

Info

Publication number: CN116975823A
Application number: CN202310714946.3A
Authority: CN (China)
Prior art keywords: voiceprint, identified, audio, processing, data
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to its accuracy)
Other languages: Chinese (zh)
Inventors: 朱鸿宁, 赵楚涵
Current Assignee: Shenzhen Tencent Network Information Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original Assignee: Shenzhen Tencent Network Information Technology Co Ltd

Events:
Application filed by Shenzhen Tencent Network Information Technology Co Ltd
Priority to CN202310714946.3A
Publication of CN116975823A

Classifications

    • G06F 21/32 User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045 Combinations of networks
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G10L 17/02 Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/04 Training, enrolment or model building
    • G10L 17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G10L 17/18 Artificial neural networks; connectionist approaches
    • G10L 25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • H04L 63/0861 Network architectures or network communication protocols for network security for authentication of entities using biometrical features, e.g. fingerprint, retina-scan
    • H04N 21/2187 Live feed
    • H04N 21/25875 Management of end-user data involving end-user authentication

Abstract

The application provides a data processing method, a data processing apparatus, a computer device, a storage medium and a computer program product. The method comprises the following steps: if the currently opened service platform is detected to be the target service platform, acquiring audio data of an object to be identified in the target service platform; performing voiceprint extraction processing on the audio data of the object to be identified to obtain voiceprint features of the object to be identified; performing identity recognition processing on the object to be identified based on the voiceprint features, so as to determine the object type to which the object to be identified belongs; and acquiring a service management rule corresponding to the target service platform, and performing service management on the object to be identified in the target service platform based on the service management rule and the object type. The application can efficiently and conveniently identify the object to be identified in the target service platform based on voiceprint features, and can thereby manage services for the object more efficiently.

Description

Data processing method, device, computer equipment, storage medium and product
Technical Field
The present application relates to the field of computer technology, and in particular, to a data processing method, a data processing apparatus, a computer device, a computer readable storage medium, and a computer program product.
Background
With the continuous development of internet technology, applications on service platforms such as game platforms, live broadcast platforms and advertisement platforms have emerged in large numbers. Before the corresponding application runs on a service platform, identity recognition of the user is often involved; for example, identity verification needs to be performed before logging in to a game, and identity recognition is also required before watching a live broadcast.
At present, identity recognition on a service platform is mainly performed by face recognition, that is, by collecting face data of the user. This approach requires the user's cooperation in collecting complete face data, which is not convenient enough and results in low recognition efficiency.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing apparatus, a computer device, a storage medium and a computer program product, which can efficiently and conveniently identify an object to be identified in a target service platform based on voiceprint features, so that service management can be performed on the object more efficiently.
In one aspect, an embodiment of the present application provides a data processing method, where the method includes:
if the currently opened service platform is detected to be the target service platform, acquiring audio data of an object to be identified in the target service platform;
performing voiceprint extraction processing on the audio data of the object to be identified to obtain voiceprint features of the object to be identified;
performing identity recognition processing on the object to be identified based on the voiceprint features, so as to determine the object type to which the object to be identified belongs;
and acquiring a service management rule corresponding to the target service platform, and performing service management on the object to be identified in the target service platform based on the service management rule and the object type.
In one aspect, an embodiment of the present application provides a data processing apparatus, including:
the acquisition unit is used for acquiring the audio data of the object to be identified in the target service platform if the currently opened service platform is detected to be the target service platform;
the processing unit is used for carrying out voiceprint extraction processing on the audio data of the object to be identified to obtain voiceprint characteristics of the object to be identified;
the processing unit is also used for carrying out identity recognition processing on the object to be recognized based on the voiceprint characteristics so as to determine the object type of the object to be recognized;
the processing unit is further used for acquiring a service management rule corresponding to the target service platform and performing service management on the object to be identified in the target service platform based on the service management rule and the object type.
In one possible implementation manner, when performing voiceprint extraction processing on the audio data of the object to be identified to obtain the voiceprint features of the object to be identified, the processing unit is configured to perform the following operations:
performing feature extraction processing on the audio data to obtain audio features of the audio data;
acquiring a voiceprint recognition model, wherein the voiceprint recognition model is used for carrying out voiceprint recognition on any audio feature;
and carrying out recognition processing on the audio features based on the voiceprint recognition model to obtain voiceprint features corresponding to the object to be recognized.
In one possible implementation manner, when performing feature extraction processing on the audio data to obtain the audio features of the audio data, the processing unit is configured to perform the following operations:
preprocessing the audio data of the object to be identified to obtain preprocessed audio data;
extracting the characteristics of the preprocessed audio data to obtain the audio characteristics of the audio data;
wherein the preprocessing comprises at least one of the following: denoising processing, volume enhancement processing, audio clipping processing and audio alignment processing.
In one possible implementation, the audio features include Mel-frequency cepstral coefficient (MFCC) features; when performing feature extraction on the preprocessed audio data to obtain the audio features of the audio data, the processing unit is configured to perform the following operations:
performing framing processing on the preprocessed audio data, and performing a windowing operation on the plurality of audio frames obtained by the framing processing to obtain a plurality of windowed signal frames;
performing frequency domain conversion on each windowed signal frame to obtain a frequency domain signal frame corresponding to each windowed signal frame;
and filtering each frequency domain signal frame through a Mel filter bank to obtain the Mel-frequency cepstral coefficient features of the audio data.
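As a concrete illustration of these steps, here is a minimal NumPy sketch of MFCC extraction. The frame length, hop size, FFT size and filter count are illustrative assumptions rather than values fixed by this application, and the final log/DCT cepstral step is standard practice that the steps above leave implicit:

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_features(audio, sr=16000, frame_len=400, hop=160, n_fft=512,
                  n_mels=40, n_mfcc=13):
    # Framing: split the preprocessed audio into overlapping frames.
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: apply a Hamming window to each audio frame.
    windowed = frames * np.hamming(frame_len)
    # Frequency-domain conversion: power spectrum of each windowed signal frame.
    power = np.abs(np.fft.rfft(windowed, n=n_fft, axis=1)) ** 2
    # Mel filtering: weight each frequency domain signal frame with a Mel filter bank.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    log_mel = np.log(power @ mel_fb.T + 1e-10)
    # Cepstral step: the DCT of the log Mel energies gives the MFCC features.
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```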
In one possible implementation manner, when performing recognition processing on the audio features based on the voiceprint recognition model to obtain the voiceprint features corresponding to the object to be identified, the processing unit is configured to perform the following operations:
performing convolution processing on the audio features to obtain high-level semantic features, wherein the high-level semantic features are used for representing time sequence characteristics of the audio features;
performing aggregation processing on the high-level semantic features to obtain aggregated audio features;
and performing weighted average pooling on the aggregated audio features to obtain the voiceprint features corresponding to the object to be identified.
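The text above does not fix a concrete network; the following PyTorch sketch is one plausible reading of the convolution → aggregation → weighted-average-pooling pipeline, with the layer sizes and the attention-based weights being illustrative assumptions:

```python
import torch
import torch.nn as nn

class VoiceprintEncoder(nn.Module):
    # Sketch of the described pipeline; the concrete architecture is an assumption.
    def __init__(self, n_mfcc=13, hidden=256, embed_dim=192):
        super().__init__()
        # Convolution over time extracts high-level semantic (time sequence) features.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mfcc, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Per-frame attention scores realise the "weighted" part of the pooling.
        self.attn = nn.Conv1d(hidden, 1, kernel_size=1)
        self.proj = nn.Linear(hidden, embed_dim)

    def forward(self, x):                 # x: (batch, n_mfcc, frames)
        h = self.conv(x)                  # aggregated high-level features
        w = torch.softmax(self.attn(h), dim=-1)
        pooled = (h * w).sum(dim=-1)      # weighted average pooling over frames
        return self.proj(pooled)          # fixed-length voiceprint feature
```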
In a possible implementation manner, when performing identity recognition processing on the object to be identified based on the voiceprint features to determine the object type to which the object to be identified belongs, the processing unit is configured to perform the following operations:
acquiring a voiceprint database, wherein the voiceprint database comprises a plurality of voiceprint labels, and each voiceprint label is used for indicating the voiceprint features of one object type;
performing similarity calculation between the voiceprint features of the object to be identified and each voiceprint label in the voiceprint database to obtain a plurality of voiceprint similarities;
and determining the object type to which the object to be identified belongs based on the obtained voiceprint similarities.
In a possible implementation manner, the processing unit determines an object type to which the object to be identified belongs based on the obtained multiple voiceprint similarities, including any one of the following:
determining the object type indicated by the voiceprint label corresponding to the maximum voiceprint similarity in the plurality of voiceprint similarities as the object type of the object to be identified;
and acquiring object types corresponding to one or more voiceprint similarities meeting a similarity threshold in the plurality of voiceprint similarities, and determining the object type of the object to be identified based on the acquired one or more object types.
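Both decision strategies can be written compactly. The sketch below assumes the voiceprint similarities have already been computed; the 0.7 threshold and the majority vote over surviving candidates are illustrative assumptions:

```python
from collections import Counter

def decide_object_type(similarities, threshold=0.7):
    # similarities: list of (object_type, voiceprint_similarity) pairs.
    # Strategy 1: take the object type with the maximum voiceprint similarity.
    best_type, _ = max(similarities, key=lambda pair: pair[1])
    # Strategy 2: keep object types whose similarity meets the threshold,
    # then derive the final type (majority vote is one assumed choice).
    candidates = [t for t, s in similarities if s >= threshold]
    voted_type = Counter(candidates).most_common(1)[0][0] if candidates else best_type
    return best_type, voted_type
```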
In one possible implementation manner, the target service platform is a game platform, and when performing service management on the object to be identified in the target service platform based on the service management rule and the object type, the processing unit is configured to perform the following operations:
if the object type is the first type, determining a processing rule corresponding to the first type from the service management rules, and performing service management on the object to be identified according to the processing rule corresponding to the first type, wherein this service management comprises at least one of the following: management of game permissions, management of game modes, management of game duration and management of account resources;
if the object type is the second type, determining a processing rule corresponding to the second type from the service management rules, and performing service management on the object to be identified according to the processing rule corresponding to the second type, wherein this service management comprises at least one of the following: verification of the game account, regulation of game operations and analysis of game data.
In one possible implementation, the processing unit is further configured to perform the following operations:
collecting voiceprint data of the object to be identified during the service management performed on the object to be identified according to the processing rule corresponding to the second type;
performing voiceprint extraction processing on the voiceprint data of the object to be identified to obtain a verification feature of the object to be identified;
performing identity verification processing on the object to be identified based on the verification feature;
and if the identity verification fails, freezing the game account of the object to be identified.
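A minimal sketch of this in-game re-verification flow is given below; the helper names (extract_fn, cosine_fn, account.freeze) and the 0.6 threshold are hypothetical stand-ins for the platform's own interfaces, not APIs defined by this application:

```python
def reverify_during_game(voice_clip, enrolled_voiceprint, extract_fn,
                         cosine_fn, account, threshold=0.6):
    # extract_fn, cosine_fn and account.freeze are hypothetical interfaces;
    # the 0.6 threshold is an illustrative assumption.
    verification_feature = extract_fn(voice_clip)      # voiceprint extraction
    score = cosine_fn(verification_feature, enrolled_voiceprint)
    if score < threshold:                              # identity verification failed
        account.freeze()                               # freeze the game account
        return False
    return True
```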
In one possible implementation, the processing unit is further configured to perform the following operations:
acquiring a training data set, wherein the training data set comprises a plurality of sample audios and a sample label for each sample audio, and each sample audio is obtained by preprocessing a source audio;
acquiring an initial neural network model, and training the initial neural network model based on the training data set;
and when the trained initial neural network model meets a model convergence condition, determining the trained initial neural network model to be the voiceprint recognition model.
In one possible implementation, the processing unit trains the initial neural network model based on the training data set for performing the following operations:
calculating a first loss for each sample audio in the training dataset using the first loss function;
calculating a second loss for each sample audio in the training dataset using a second loss function;
based on the first loss and the second loss of each sample audio, jointly adjusting model parameters of an initial neural network model;
wherein the first loss and the second loss are each determined based on the sample label of each sample audio and its voiceprint recognition result, the voiceprint recognition result being obtained by the initial neural network model performing recognition processing on that sample audio.
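A minimal PyTorch-style sketch of this joint training loop is given below. The concrete forms of the two losses are not fixed by the text above; pairing a classification loss with a metric loss (for example cross-entropy plus the cosine-distance metric learning mentioned for FIG. 7) is an assumption made for illustration:

```python
import torch

def train_epoch(model, loader, first_loss_fn, second_loss_fn, optimizer):
    # first_loss_fn / second_loss_fn stand in for the first and second loss
    # functions; their concrete forms are assumptions, not fixed by the text.
    model.train()
    for sample_audio, sample_label in loader:
        recognition = model(sample_audio)                  # voiceprint recognition result
        loss1 = first_loss_fn(recognition, sample_label)   # first loss
        loss2 = second_loss_fn(recognition, sample_label)  # second loss
        loss = loss1 + loss2                               # jointly adjust model parameters
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```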
In one aspect, an embodiment of the present application provides a computer device, where the computer device includes a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the data processing method described above.
In one aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program that, when read and executed by a processor of a computer device, causes the computer device to perform the above-described data processing method.
In one aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the data processing method described above.
In the embodiment of the application, the audio data of the object to be identified in the target service platform can be acquired, and feature extraction processing is then performed on the audio data to obtain the voiceprint features of the object to be identified; identity recognition processing can then be performed on the object to be identified based on the voiceprint features, so as to determine the object type to which the object to be identified belongs; finally, a service management rule corresponding to the target service platform is acquired, and service management is performed on the object to be identified in the target service platform based on the service management rule and the object type. Therefore, on one hand, identity recognition can be performed based on the voiceprint features of the user without disturbing the user during the service experience (for example, while playing a game or watching a live broadcast), so that voiceprint recognition is more flexible and convenient than face recognition; on the other hand, service management can be performed on the object to be identified according to the object type determined by the identity recognition, so that targeted service management can be performed in the target service platform, making service management more efficient and convenient.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of a data processing scheme according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a data processing system according to an embodiment of the present application;
FIG. 3 is a flow chart of a data processing method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a voiceprint recognition model according to an embodiment of the present application;
FIG. 5 is a flowchart of another data processing method according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of model training according to an embodiment of the present application;
FIG. 7 is a schematic flow chart of cosine distance metric learning according to an embodiment of the present application;
FIG. 8 is a schematic view of a scenario of a data processing method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
1. Principle of data processing scheme:
The application provides a data processing scheme that can be applied to service platforms such as game platforms, live broadcast platforms and video playing platforms. In particular, it provides a scheme for performing identity recognition and service management based on voiceprint features in a target service platform, so that service management can be performed on objects to be identified in the target service platform more conveniently and efficiently. Referring to FIG. 1, FIG. 1 is a schematic diagram of a data processing scheme according to an embodiment of the application. Next, the flow involved in the data processing scheme provided by the present application is outlined with reference to FIG. 1:
(1) Constructing a voiceprint recognition model. The application can construct a voiceprint recognition model and train it; the trained voiceprint recognition model can be used to perform voiceprint recognition on the audio data of a user to obtain the voiceprint features of the user.
(2) Extracting audio features. Specifically, the application can perform feature extraction processing on the audio data of the user to obtain the audio features of the user.
(3) Voiceprint recognition. Specifically, in a scenario in which user identity recognition is required, the audio data of the object to be identified can first be obtained from the target service platform; voiceprint extraction processing can then be performed on the audio data based on the trained voiceprint recognition model to obtain the voiceprint features of the object to be identified. Optionally, when the audio data of a user is identified based on the model, the audio features of the user may be input to the voiceprint recognition model for voiceprint recognition.
(4) Service management. Specifically, a service management rule corresponding to the target service platform can be acquired, and service management can be performed on the object to be identified in the target service platform based on the service management rule and the object type. For example, if the target service platform is a game platform, different game management measures can be adopted for objects to be identified of the teenager type and the adult type based on the service management rule; for example, if the object to be identified is a teenager, anti-addiction game management can be applied.
Therefore, on one hand, identity recognition can be performed based on the voiceprint features of the user without disturbing the user during the service experience (for example, while playing a game or watching a live broadcast), so that voiceprint recognition is more flexible and convenient than face recognition; on the other hand, service management can be performed on the object to be identified according to the object type determined by the identity recognition, so that targeted service management can be performed in the target service platform, making service management more efficient and convenient.
2. Related art terms related to data processing schemes:
(1) Target service platform:
The target service platform refers to an APP (application), client or applet (installation-free application) capable of providing a service. For example, the target service platform may be a game platform, which may be used to provide game services (e.g., game recharge services, game establishment services, game experience services, etc.); as another example, the target service platform may be a live broadcast platform, which may be used to provide live services (e.g., a host object may stream on the live platform, and viewer objects may watch the live pictures on it); as yet another example, the target service platform may be a video playing platform used to provide video playing services. The embodiment of the present application does not specifically limit the type of the target service platform.
(2) Audio data of the object to be identified:
the audio data refers to audio data of an object to be identified in the target service platform, and the audio data may be data generated during operation of a service provided by the target service platform or may be audio data generated before operation of the service provided by the target service platform. For example, if the target service platform is a game platform, the object to be identified may be a game player, and the audio data may include: the verification command recorded by the game player before playing the game, the game command sent by the game player in the game process, call data and other data; for another example, if the target service platform is a live broadcast platform, the object to be identified may be a main broadcast object, and the audio data may be live broadcast data generated by the main broadcast object in the live broadcast process; if the target service platform is a video playing platform, the object to be identified may be a viewing object, and the audio data may be a verification instruction that the viewing object inputs before viewing the video.
(3) Voiceprint features:
Voiceprint features are features, similar to fingerprint features, used to uniquely identify an object; as the name suggests, voiceprint features describe the sound of the object to be identified, and may include, but are not limited to: wavelength, frequency, intensity, cadence, pitch, timbre, etc. In general, voiceprint features are obtained through a series of feature extraction steps; for example, in the data processing scheme provided by the present application, voiceprint features may be extracted based on a voiceprint recognition model. A voiceprint feature is generally represented as a vector, a matrix, or a similar data structure; for example, it may be represented as an n-dimensional vector (x1, x2, x3, ..., xn), or as an n × m matrix.
(4) Object type:
an object type may be a concept that describes an object to be identified based on one or more dimensions. For example, when an object type is used to indicate the type of object to be identified in the age dimension, then the object type may include, but is not limited to: infant type, adolescent type, adult type, middle-aged type, elderly type; as another example, when an object type is used to indicate a type of object to be identified in the gender dimension, then the object type may include, but is not limited to: male and female.
(5) Artificial intelligence:
artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The data processing scheme provided by the embodiment of the application mainly relates to the combination of machine learning technology in the field of artificial intelligence. For example, the initial neural network model may be trained using machine learning techniques, such that the trained initial neural network model is applied as a voiceprint recognition model to each service platform (e.g., gaming platform, live platform) to voiceprint recognize a user based on the voiceprint recognition model. Among them, machine Learning (ML) is a multi-domain interdisciplinary, and involves multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, confidence networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like. So-called voiceprint recognition (Voiceprint Recognition, VPR): also known as speaker recognition (Speaker Recognition), is one type of biometric technology. Voiceprint recognition includes speaker recognition (Speaker Identification) and speaker verification (Speaker Verification). Wherein, the speaker identifies which of a plurality of objects a certain voice is said to be, which is a 'one-more' problem; speaker verification is a "one-to-one discrimination" problem for confirming whether a piece of speech is spoken by a specified subject. In the application, the speaker confirmation is mainly involved, for example, in a game scene, the speaker can be judged to determine whether the current speaker is an account object corresponding to the current game account, so as to avoid situations such as game playing instead of playing.
(6) Cloud technology:
cloud technology (Cloud technology) is based on the general terms of network technology, information technology, integration technology, management platform technology, application technology and the like applied by Cloud computing business models, and can form a resource pool, so that the Cloud computing business model is flexible and convenient as required. Cloud computing technology will become an important support. Background services of technical networking systems require a large amount of computing, storage resources, such as video websites, picture-like websites, and more portals. Along with the high development and application of the internet industry, each article possibly has an own identification mark in the future, the identification mark needs to be transmitted to a background system for logic processing, data with different levels can be processed separately, and various industry data needs strong system rear shield support and can be realized only through cloud computing.
According to the embodiment of the application, the processes of voiceprint extraction processing of the audio data of the object to be identified, identification processing of the object to be identified based on voiceprint characteristics and the like involve a large amount of data calculation and data storage service, and the processes require a large amount of computer operation cost, so that the application can realize related operation flows related to the processing processes based on a cloud computing technology. Among them, so-called cloud computing (cloud computing) is a computing mode that distributes computing tasks over a resource pool made up of a large number of computers, enabling various application systems to acquire computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". Resources in the cloud are infinitely expandable in the sense of users, and can be acquired at any time, used as needed, expanded at any time and paid for use as needed.
(7) Blockchain techniques:
blockchains are novel application modes of computer technologies such as distributed data storage, peer-to-Peer (P2P) transmission, consensus mechanisms, encryption algorithms, and the like. A blockchain is essentially a de-centralized database, which is a series of data blocks (also referred to as blocks) that are generated in association using cryptographic methods, each of which contains information from a batch of network transactions for verifying the validity (anti-counterfeiting) of the information and generating the next data block. The blockchain cryptographically ensures that the data is not tamperable and counterfeitable.
In the present application, the data processing process involves such steps as: optionally, the application can send the data to the blockchain for storage, and the security of the data processing process can be improved based on the characteristics of the blockchain such as non-falsification, traceability and the like so as to prevent the related data from being leaked or falsified.
It should be noted that the present application involves related data in the data processing process, such as audio data, voiceprint features of the object to be identified, and service management rules. When the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of the related data need to comply with the relevant laws, regulations and standards of the relevant countries and regions, follow the principles of legality, legitimacy and necessity, and must not involve data types prohibited or restricted by laws and regulations. In some alternative embodiments, the related data involved in the embodiments of the present application is obtained after the object is individually authorized, and the intended use of the data is indicated to the object at the time of that authorization.
Referring to FIG. 2, FIG. 2 is a block diagram illustrating the architecture of a data processing system according to an embodiment of the present application. The data processing system comprises a server 204 and a terminal device cluster, where the terminal device cluster includes a plurality of terminal devices such as terminal device 201, terminal device 202 and terminal device 203. Of course, the number of terminal devices in the cluster is only an example; the embodiment of the present application does not limit it. Any terminal device in the cluster may be directly or indirectly connected to the server 204 through wired or wireless communication.
Each terminal device in the terminal device cluster may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a mobile internet device (MID), a vehicle-mounted device, a roadside device, an aircraft, a wearable device (such as a smart watch, smart bracelet or pedometer), or a virtual reality device. The types of the terminal devices in the cluster may be the same or different; for example, terminal device 201 and terminal device 202 may both be mobile phones, or terminal device 201 may be a tablet computer while terminal device 203 is a vehicle-mounted device. The application does not limit the number or types of terminal devices in the terminal device cluster.
The server 204 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), big data and artificial intelligence platforms.
The following describes in detail the data interaction procedure between the server 204 and the terminal device 201:
(1) if it is detected that the currently opened service platform is the target service platform, the terminal device 201 may acquire audio data of the object to be identified in the target service platform.
(2) The terminal device 201 may transmit the audio data of the object to be recognized to the server 204.
(3) The server 204 may perform voiceprint extraction processing on the audio data of the object to be identified to obtain voiceprint features of the object to be identified; then, the server 204 performs identification processing on the object to be identified based on the voiceprint characteristics to determine the object type to which the object to be identified belongs; finally, the server 204 may return the identified object type to the terminal device 201.
(4) The terminal device 201 may obtain a service management rule corresponding to the target service platform, and perform service management on the object to be identified in the target service platform based on the service management rule and the object type.
It should be understood that the above-described interaction procedure is only an example and does not specifically limit how the steps are divided between the terminal device 201 and the server 204. In one possible implementation manner, the step of performing voiceprint extraction processing on the audio data of the object to be identified to obtain the voiceprint features of the object to be identified may also be performed by the terminal device 201; in another possible implementation, the above steps may also be performed alone by any terminal device in the terminal device cluster or by the server 204.
In one possible implementation, the data processing system provided by the embodiment of the present application may be deployed at a node of a blockchain, for example, each of the terminal devices (e.g., the terminal device 201, the terminal device 202, and the terminal device 203) included in the server 204 and the terminal device may be used as a node device of the blockchain, to jointly form a blockchain network. Therefore, the data processing process of the audio data (namely, voiceprint extraction processing is carried out on the audio data of the object to be identified to obtain voiceprint characteristics of the object to be identified, and identification processing is carried out on the object to be identified based on the voiceprint characteristics to determine the object type of the object to be identified) can be carried out on the blockchain, so that fairness and fairness of the data processing flow can be guaranteed, the data processing flow can be traceable, and meanwhile, the data security in the data processing process can be guaranteed, so that the security and reliability of the whole data processing flow are improved.
The data processing system provided by the embodiment of the application can acquire the audio data of the object to be identified in the target service platform, and then perform feature extraction processing on the audio data to obtain the voiceprint features of the object to be identified; identity recognition processing can then be performed on the object to be identified based on the voiceprint features to determine the object type to which it belongs; finally, a service management rule corresponding to the target service platform is acquired, and service management is performed on the object to be identified in the target service platform based on the service management rule and the object type. Therefore, on one hand, identity recognition can be performed based on the voiceprint features of the user without disturbing the user during the service experience (for example, while playing a game or watching a live broadcast), so that voiceprint recognition is more flexible and convenient than face recognition; on the other hand, service management can be performed on the object to be identified according to the object type determined by the identity recognition, so that targeted service management can be performed in the target service platform, making service management more efficient and convenient.
It may be understood that the system architecture diagram described in the embodiment of the present application is intended to describe the technical solution of the embodiment more clearly, and does not limit the technical solution provided by the embodiment. A person skilled in the art will appreciate that, with the evolution of system architectures and the emergence of new service scenarios, the technical solution provided by the embodiment of the present application is equally applicable to similar technical problems.
Based on the foregoing description of the data processing scheme and the data processing system of the present application, specific embodiments related to the data processing scheme will be described in detail below with reference to the accompanying drawings.
Referring to fig. 3, fig. 3 is a flowchart of a data processing method according to an embodiment of the application. The data processing method is performed by a computer device, such as any of the terminal devices or servers shown in fig. 2.
The data processing method mainly comprises, but is not limited to, the following steps S301 to S304:
s301: and if the current opened service platform is detected to be the target service platform, acquiring the audio data of the object to be identified in the target service platform.
The target service platform may be any type of service platform, for example, the target service platform may include any of the following: game platform, live platform, video play platform. If the target service platform is a game platform, the object to be identified can be a game player in the game platform, and the audio data can comprise game data generated by the game player in the game process; if the target service platform is a live broadcast platform, the object to be identified can be a main broadcast object in the live broadcast platform, and the audio data can comprise live broadcast data generated by the main broadcast object in a live broadcast process; if the target service platform is a video playing platform, the object to be identified may be a viewing object in the video playing platform, and the audio data may include video interaction data generated by the viewing object in the process of viewing video.
In one possible implementation manner, the audio data of the object to be identified in the target service platform may be acquired according to a preset instruction; that is, the object to be identified may generate corresponding audio data in the target service platform according to the preset instruction, where the preset instruction may specify a fixed text to be read, for example: "The experience of this game is truly excellent."
In another possible implementation manner, a section of audio data corresponding to the object to be identified may be obtained randomly in the target service platform, where the audio data may be data in a specified time period or a preset duration. For example, in a game platform, audio data for a period of time from the start of a game to the end of the game may be acquired by a game player; as another example, data having a duration of 10s may be intercepted from audio emitted by a game player as audio data.
S302: and carrying out voiceprint extraction processing on the audio data of the object to be identified to obtain voiceprint characteristics of the object to be identified.
In one possible implementation manner, the computer device performs voiceprint extraction processing on the audio data of the object to be identified to obtain voiceprint characteristics of the object to be identified, and specifically includes the following steps: firstly, carrying out feature extraction processing on audio data to obtain audio features of the audio data; then, a voiceprint recognition model is obtained, and the voiceprint recognition model is used for carrying out voiceprint recognition on any audio feature; and finally, carrying out recognition processing on the audio features based on the voiceprint recognition model to obtain voiceprint features corresponding to the object to be recognized.
In another possible implementation manner, the computer device may acquire a voiceprint recognition model, and directly perform voiceprint recognition processing on the audio data of the object to be recognized based on the voiceprint recognition model, so as to obtain voiceprint features of the object to be recognized.
S303: and carrying out identification processing on the object to be identified based on the voiceprint characteristics so as to determine the object type of the object to be identified.
Specifically, since voiceprint features can be used to uniquely identify an object, identity recognition processing can be performed on the object to be identified based on its voiceprint features, and the object type to which the object belongs can be determined after this processing. For example, in a game platform, the object type of a game player (e.g., teenager, adult, etc.) may be identified based on the game player's voiceprint features, so as to determine the category to which the game player belongs; as another example, in a live broadcast platform, the object type of a host object (game host, music host, outdoor host, etc.) may be identified based on the host object's voiceprint features.
In one possible implementation manner, the computer device performs identity recognition processing on the object to be identified based on the voiceprint features to determine the object type to which the object to be identified belongs, which may include the following steps: firstly, acquiring a voiceprint database, wherein the voiceprint database comprises a plurality of voiceprint labels, and each voiceprint label is used for indicating the voiceprint features of one object type; then, performing similarity calculation between the voiceprint features of the object to be identified and each voiceprint label in the voiceprint database to obtain a plurality of voiceprint similarities; and finally, determining the object type to which the object to be identified belongs based on the obtained voiceprint similarities. The similarity calculation method may include, but is not limited to: the cosine similarity algorithm, the PLDA (Probabilistic Linear Discriminant Analysis) algorithm, the Euclidean distance similarity algorithm, and the like.
Specifically, any voiceprint label in the voiceprint database is generated based on voiceprint extraction processing of audio data of a target service object in a target service platform, for example, if the target service platform is a game platform, game data can be collected from some open-source game databases, and voiceprint feature extraction processing can be performed on the game data subsequently, so that the voiceprint label corresponding to the game data is obtained. The specific process of performing the voiceprint feature extraction processing on the game data may refer to the voiceprint extraction processing manner of the audio data in step S302, which is not described herein in detail.
The voiceprint tag is used to indicate a voiceprint feature of an object type, which may be used to indicate the type of object to be identified in one or more dimensions. For example, the object types are: the voiceprint tag can be used to indicate the type in the age dimension of infant type, adolescent type, adult type, middle-aged type, elderly type, etc.: infant-type voiceprint features, teenager-type voiceprint features, adult-type voiceprint features, middle-aged-type voiceprint features, elderly-type voiceprint features. As another example, the object type is: male, female, voiceprint labels can be used to indicate: voiceprint characteristics of men, voiceprint characteristics of women.
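As a hedged illustration of the similarity calculation described above, the sketch below scores the query voiceprint of the object to be identified against every voiceprint label using cosine similarity; the dictionary-shaped database and the choice of cosine over PLDA or Euclidean-distance scoring are assumptions for illustration only:

```python
import numpy as np

def voiceprint_similarities(query, voiceprint_db):
    # query: voiceprint feature of the object to be identified (1-D vector).
    # voiceprint_db: {object_type: voiceprint_label_vector} -- an assumed layout.
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))
    return {obj_type: cosine(query, label)
            for obj_type, label in voiceprint_db.items()}
```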
S304: and acquiring a service management rule corresponding to the target service platform, and carrying out service management on the object to be identified based on the service management rule and the object type in the target service platform.
Specifically, if the target service platform is a game platform, the service management rule may be a game management rule, which may include, for example: game policies, game notices, game recharge information, and the like; if the target service platform is a live broadcast platform, the service management rule may be a live broadcast management rule, which may include: live broadcast codes of conduct, live broadcast notices, live broadcast reminder information, and the like.
Further, the business management rule may further include a plurality of processing rules, one for each object type. In particular, the game management rules in the game platform may include: teenager-type processing rules and adult-type processing rules; the live management rules in the live platform may include: processing rules of the anchor object and processing rules of the audience object.
In one possible implementation manner, the computer device performs service management on the object to be identified in the target service platform based on the service management rule and the object type, which may include: acquiring the processing rule matching the object type in the target service platform, and performing service management on the object to be identified according to that processing rule. Service management may include, but is not limited to: rights management, account management, resource management, and the like. In this way, service management can be performed on the object to be identified in the target service platform according to the processing rule corresponding to its object type, making service management more efficient and convenient.
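A toy sketch of rule dispatch per object type follows; the rule table contents and the session methods are platform-specific assumptions, not interfaces defined by this application:

```python
# Illustrative rule table; the concrete rules are assumptions.
GAME_MANAGEMENT_RULES = {
    "teenager": {"daily_play_limit_min": 90, "recharge_allowed": False},
    "adult":    {"daily_play_limit_min": None, "recharge_allowed": True},
}

def manage_object(object_type, session):
    rule = GAME_MANAGEMENT_RULES.get(object_type)
    if rule is None:
        return  # no matching processing rule for this object type
    if rule["daily_play_limit_min"] is not None:
        session.enforce_time_limit(rule["daily_play_limit_min"])  # hypothetical API
    session.set_recharge_allowed(rule["recharge_allowed"])        # hypothetical API
```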
In the embodiment of the application, the audio data of the object to be identified in the target service platform can be acquired, and feature extraction processing is then performed on the audio data to obtain the voiceprint features of the object to be identified; identity recognition processing can then be performed on the object to be identified based on the voiceprint features to determine the object type to which it belongs; finally, a service management rule corresponding to the target service platform is acquired, and service management is performed on the object to be identified in the target service platform based on the service management rule and the object type. Therefore, on one hand, identity recognition can be performed based on the voiceprint features of the user without disturbing the user during the service experience (for example, while playing a game or watching a live broadcast), so that voiceprint recognition is more flexible and convenient than face recognition; on the other hand, service management can be performed on the object to be identified according to the object type determined by the identity recognition, so that targeted service management can be performed in the target service platform, making service management more efficient and convenient.
The process of extracting audio features is described in detail below:
in one possible implementation manner, the computer device performs feature extraction processing on the audio data to obtain the audio features of the audio data, which specifically includes the following steps: firstly, preprocessing the audio data of the object to be identified to obtain preprocessed audio data; then, performing feature extraction on the preprocessed audio data to obtain the audio features of the audio data; wherein the preprocessing includes at least one of the following: denoising processing, volume enhancement processing, audio clipping processing, and audio alignment processing.
(1) Data preprocessing:
specifically, the preprocessing steps described above may be performed based on a speech preprocessing tool (e.g., a Kaldi tool). Wherein, (1) denoising: removing noise from the audio data, for example, if the audio data is game data, background sounds, environmental noise, and the like in the game process can be used as noise for the elimination processing; (2) volume enhancement processing: increasing the volume of the collected audio data, for example, the volume of the audio data may be increased to a specified volume (e.g., 100); (3) audio clip processing: cutting the audio data into voice segments with fixed length so as to facilitate subsequent feature extraction and voiceprint recognition, wherein in general, the audio clipping process cuts the audio data into voice segments with length of 1-3 seconds, and the length can be adjusted according to specific application scenes (for example, the voice segments cut into 1 second in a game scene and the voice segments cut into 2 seconds in a live broadcast scene); (4) audio alignment processing: the voice fragments clipped in the previous steps are aligned to the same length so as to facilitate subsequent feature extraction and voiceprint recognition. Since different speech segments may be of different lengths, they need to be aligned so that they have the same length, and in particular, the implementation of the audio alignment process is generally two: one is alignment processing based on linear interpolation, i.e. the speech segments are linearly interpolated so that they have the same length; the other is an alignment process based on dynamic time warping (Dynamic Time Warping, DTW), i.e. the alignment of speech segments to the same length by means of dynamic programming.
(2) The extraction process of the audio features comprises the following steps:
the audio features referred to in the embodiments of the present application mainly refer to frequency-domain features of the audio data; for example, the audio features may include, but are not limited to: Mel-frequency cepstral coefficient (MFCC) features, FBank (filter bank) features, and LPC (linear prediction coefficient) features. In the voiceprint recognition scenario of the embodiment of the present application, MFCC features are generally used as the extracted audio features. The Mel spectrum is a spectral representation commonly used in speech signal processing, obtained by weighting the spectrum of a sound signal so that it better conforms to the perceptual characteristics of the human ear: the human ear has different sensitivity to different frequencies, and the Mel spectrum is obtained by filtering the frequency-domain signal with a Mel filter bank, which reduces the resolution of the high-frequency part and raises the resolution of the low-frequency part, thereby better simulating human auditory perception.
In one possible implementation manner, the computer device performs feature extraction on the preprocessed audio data to obtain the MFCC features of the audio data, which specifically includes the following steps: firstly, framing the preprocessed audio data, and applying a windowing operation to the plurality of audio frames obtained by framing to obtain a plurality of windowed signal frames; then, performing frequency-domain conversion on each windowed signal frame to obtain the corresponding frequency-domain signal frame; and finally, filtering each frequency-domain signal frame through a Mel filter bank to obtain the MFCC features of the audio data. Specifically, the calculation of MFCC features can be divided into the following steps:
(1) Pre-emphasis: the audio data is subjected to a high-pass filtering process to enhance the energy of the high-frequency portion. Optionally, the embodiment of the application can perform high-pass filtering processing on the preprocessed audio data, so as to increase the energy of the high-frequency signal of the audio data.
(2) Framing: the pre-processed audio data is divided into a number of fixed length audio frames, typically one audio frame having a duration of 20-30 milliseconds.
(3) Windowing: and windowing each audio frame obtained by framing so as to reduce abrupt changes at the endpoints of the audio frame.
(4) Fourier transform: the windowed speech signal is fourier transformed for the purpose of converting the time domain signal into a frequency domain signal.
(5) Mel filter bank: the obtained frequency-domain signal is filtered through a set of Mel filters to simulate the way the human ear perceives sound, converting the spectrum (frequency-domain signal) of the signal into a Mel spectrum; the MFCC features are then obtained from the Mel spectrum (typically by taking its logarithm and applying a discrete cosine transform).
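The five steps above correspond to a standard MFCC pipeline. As a minimal sketch, the widely used librosa library performs the framing, windowing, Fourier transform, Mel filtering, and cepstral transform internally; the frame sizes and coefficient counts below are illustrative choices, not values from this application.

```python
import librosa
import numpy as np

sr = 16000
y = np.random.randn(sr * 2).astype(np.float32)   # placeholder 2 s waveform

y = librosa.effects.preemphasis(y)               # (1) pre-emphasis
mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,                    # number of cepstral coefficients kept
    n_fft=int(0.025 * sr),        # (2) framing: 25 ms frames
    hop_length=int(0.010 * sr),   # 10 ms frame shift
    window="hamming",             # (3) windowing
    n_mels=40,                    # (5) size of the Mel filter bank
)                                 # (4) the FFT is applied per frame internally
print(mfcc.shape)                 # (13, number_of_frames)
```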
The following describes the extraction process of voiceprint features in detail:
in one possible implementation manner, the computer device performs recognition processing on the audio feature based on the voiceprint recognition model to obtain a voiceprint feature corresponding to the object to be recognized, and may include the following steps: firstly, carrying out convolution processing on audio features to obtain high-level semantic features, wherein the high-level semantic features are used for representing time sequence features of the audio features; then, carrying out aggregation treatment on the high-level semantic features to obtain aggregated audio features; and finally, carrying out weighted average pooling treatment on the aggregated audio features to obtain voiceprint features corresponding to the objects to be identified.
The voiceprint recognition model may be a neural network model, which may specifically include: a CNN (Convolutional Neural Network) model, a DR-Res2Net (deep residual Res2Net) model, a GMM (Gaussian Mixture Model), a TDNN (Time-Delay Neural Network) model, an ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation TDNN, an improved time-delay neural network) model, and the like; the model structure of the voiceprint recognition model is not particularly limited by the present application.
It should be noted that the DR-Res2Net model is a voiceprint recognition model based on a deep residual network and Res2Net; compared with the ECAPA-TDNN model, it is larger in model depth and parameter amount, and therefore requires more computing resources and time to train and optimize. The GMM model is a voiceprint recognition model based on a Gaussian mixture model, which models and recognizes sounds by building a Gaussian mixture; compared with the ECAPA-TDNN model, it requires more manual feature extraction and preprocessing, and its accuracy and robustness are lower. Therefore, in the embodiment of the application, the ECAPA-TDNN model is selected as the voiceprint recognition model; it is a speaker recognition model based on a deep neural network, improved on the basis of the traditional TDNN model by adding an enhanced channel attention mechanism and information propagation and aggregation mechanisms. The ECAPA-TDNN model mainly has the following characteristics:
1. Enhanced channel attention mechanism: in the conventional TDNN model, the feature weights of all channels are equal, and the importance of different channels cannot be distinguished. The ECAPA-TDNN model can adaptively learn the importance of different channels by introducing a channel attention mechanism and weight the characteristics, so that the modeling capability of the voiceprint recognition model on different audio characteristics is improved.
2. Information propagation and aggregation mechanism: the ECAPA-TDNN model adopts an information propagation and aggregation mechanism, so that the model can better capture time sequence information. The mechanism allows each layer to pass and aggregate information from the previous layer by adding multiple TDNN layers in the network and concatenating them together. The model can thus better capture timing information in the audio data, thereby improving the performance of the model.
3. Random frame masking mechanism: to further increase the robustness of the model, the ECAPA-TDNN model also introduces a random frame masking mechanism. The mechanism may mask some of the input frames randomly, thereby enabling the model to better accommodate noise and other disturbances.
4. Efficient computation: the ECAPA-TDNN model is computationally very efficient and can process large amounts of audio data in a relatively short time, which is very useful for real-time applications such as speech recognition.
Specifically, the ECAPA-TDNN model may include: an input layer, an ECAPA-TDNN layer, a residual connection layer, an aggregation layer, and an output layer. In detail, referring to fig. 4, fig. 4 is a schematic structural diagram of a voiceprint recognition model according to an embodiment of the present application; the steps executed by each layer of the ECAPA-TDNN model during voiceprint feature extraction are described below with reference to fig. 4:
(1) Input layer: the extracted MFCC features are used as the input of the model and fed into the ECAPA-TDNN model. Typically, the MFCC features are organized into a two-dimensional matrix, where each row represents a time frame and each column represents an MFCC coefficient.
(2) ECAPA-TDNN layer: the TDNN is mainly used for extracting features of the speech signal; it consists of a series of convolution layers, nonlinear activation functions, and pooling layers, and can extract local information of the speech signal and convert it into high-level semantic features, thereby improving the performance of the ECAPA-TDNN model. The ECAPA-TDNN layer in the present application combines the timing characteristics of the TDNN with the channel attention mechanism of ECAPA. Specifically, one-dimensional convolution is applied in the time dimension to process the input MFCC features and capture the local timing characteristics of the audio data; meanwhile, a channel attention module is introduced to adaptively adjust the weights among different channels, strengthening or suppressing the features of different channels. The high-level semantic features are obtained through the convolution processing and self-attention processing of this layer.
(3) Residual connection layer: the residual connection layer can directly add the output of the front layer (i.e., ECAPA-TDNN layer) to the input of the back layer (i.e., aggregate layer), i.e., as a connection and bridge between the two layers, thereby improving the training stability and performance of the model. That is, the residual connection layer may input the high-level semantic features output by the ECAPA-TDNN layer into the aggregation layer for processing.
(4) Aggregation layer: after passing through a series of ECAPA-TDNN layers, the model aggregates the extracted high-level semantic features. The aggregation processing generally refers to: concatenating the features of the previous layers and then calculating the mean or variance in the time dimension. In this way, the global timing characteristics of the audio data can be captured while the richness of the local features is retained, so that the aggregated audio features are output.
(5) Output layer: the output layer can generate final output, namely voiceprint characteristics of an object to be identified, through one full connection layer and one activation function.
Based on the description, the ECAPA-TDNN model can better capture time sequence information in audio data by introducing technologies such as channel attention, information propagation and aggregation mechanisms, random frame masks and the like, improves robustness and recognition performance of the model, has good performance in aspects such as model depth, parameter quantity, accuracy and robustness, and is a relatively excellent voiceprint recognition model.
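For illustration, a pretrained ECAPA-TDNN encoder can be used to turn a waveform into a voiceprint embedding. The following minimal sketch uses the open-source SpeechBrain toolkit and its public spkrec-ecapa-voxceleb checkpoint as a stand-in for the model trained in this application; the API names are SpeechBrain's, not the patent's.

```python
import torch
from speechbrain.pretrained import EncoderClassifier

# Load a pretrained ECAPA-TDNN speaker encoder (illustrative substitute
# for the voiceprint recognition model described above).
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_ecapa",
)

waveform = torch.randn(1, 16000)             # placeholder 1 s of 16 kHz audio
embedding = encoder.encode_batch(waveform)   # voiceprint embedding vector
print(embedding.squeeze().shape)             # 192-dimensional for this model
```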
The following describes in detail how to determine the object type to which the object to be identified belongs:
(1) Obtaining a voiceprint database: the voiceprint database includes a plurality of voiceprint labels, and each voiceprint label in the voiceprint database can be stored in association with an object and an object type. For example, the storage format of the voiceprint database can be expressed as shown in Table 1 below:
TABLE 1 storage format of voiceprint database
Object    Object type      Voiceprint label
User 1    Teenager type    label1
User 2    Adult type       label2
User 3    Male             label3
User 4    Female           label4
...       ...              ...
(2) Calculating voiceprint similarity: similarity calculation is performed between the voiceprint feature of the object to be identified and each voiceprint label in the voiceprint database to obtain a plurality of voiceprint similarities. The voiceprint similarity may be calculated, for example, as the cosine similarity:

sim(Vm, Vn) = (Vm · Vn) / (||Vm|| × ||Vn||)

where Vm refers to the voiceprint feature of the object to be identified and Vn refers to any voiceprint label in the voiceprint database. Based on this formula, a plurality of voiceprint similarities can be calculated, one voiceprint similarity corresponding to one voiceprint label in the voiceprint database.
(3) Determining the object type:
mode one: the object type indicated by the voiceprint label corresponding to the largest of the plurality of voiceprint similarities is determined as the object type of the object to be identified. Specifically, suppose the voiceprint similarities calculated between the voiceprint feature of the object to be identified and the voiceprint labels label1, label2, label3, and label4 are denoted sim1, sim2, sim3, and sim4, respectively. Assuming that sim1 > sim2 > sim3 > sim4, the object type indicated by voiceprint label label1 (teenager type) may be determined as the object type of the object to be identified.
Mode two: the object types corresponding to one or more voiceprint similarities that meet a similarity threshold are acquired, and the object type of the object to be identified is determined based on the acquired one or more object types. Specifically, suppose again that the voiceprint similarities calculated against labels label1 to label4 are sim1 to sim4, and that the similarity threshold is 90%. If only sim1 and sim3 reach the threshold, the matching voiceprint labels are label1 and label3, and the object type of the object to be identified can be determined based on the object type indicated by label1 (teenager type) and the object type indicated by label3 (male), for example, a male teenager.
Based on the above description, in the embodiment of the application, the object type of the object to be identified can be determined based on the voiceprint similarities between the voiceprint feature and the voiceprint labels; since a voiceprint is a feature unique to each object, the object type determined from voiceprint similarity is more accurate and reliable.
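The two determination modes can be sketched as follows, assuming voiceprint features and voiceprint labels are comparable embedding vectors; the 192-dimensional size and the database entries mirror Table 1 and are illustrative.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

database = {                       # voiceprint label -> (vector, object type)
    "label1": (np.random.randn(192), "teenager"),
    "label3": (np.random.randn(192), "male"),
}
query = np.random.randn(192)       # voiceprint feature of object to identify

sims = {k: cosine(query, v) for k, (v, _) in database.items()}

# Mode one: take the type indicated by the most similar label.
best = max(sims, key=sims.get)
type_mode_one = database[best][1]

# Mode two: merge the types of all labels that reach the threshold.
threshold = 0.90
types_mode_two = {database[k][1] for k, s in sims.items() if s >= threshold}
print(type_mode_one, types_mode_two)
```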
In one possible implementation manner, taking an example that the target service platform is a game platform, a detailed description is given of a related process of performing service management on an object to be identified based on a service management rule and an object type in the game platform:
(1) If the object type is the first type, a processing rule corresponding to the first type is determined from the service management rules, and service management is performed on the object to be identified according to that processing rule; wherein performing service management on the object to be identified according to the processing rule corresponding to the first type includes at least one of the following: management of game permissions, management of game modes, management of game duration, and management of account resources. For example, the first type may be the teenager type, and performing service management on the object to be identified according to the teenager-type processing rule may include: restricting part of the teenager's game permissions (e.g., the game recharge permission), setting an anti-addiction game mode for the teenager (e.g., some game functions may be restricted or disabled), limiting the game duration (e.g., to one hour), setting an upper limit on the game resources in the teenager's game account, and so on.
(2) If the object type is the second type, a processing rule corresponding to the second type is determined from the service management rules, and service management is performed on the object to be identified according to that processing rule; wherein performing service management on the object to be identified according to the processing rule corresponding to the second type includes at least one of the following: verification of the game account, compliance checking of game operations, and analysis of game data. For example, the second type may be the adult type, and performing service management on the object to be identified according to the adult-type processing rule may include: opening more game permissions (e.g., the game recharge permission) for the adult, checking game operations for compliance, analyzing game data generated during play to facilitate optimization of subsequent game functions, and so forth.
In one possible implementation manner, in the process of performing service management on the object to be identified according to the processing rule corresponding to the second type, collecting voiceprint data of the object to be identified; voiceprint extraction processing is carried out on voiceprint data of the object to be identified, so that verification characteristics of the object to be identified are obtained; performing identity verification processing on the object to be identified based on the verification characteristics; and if the authentication is not passed, freezing the game account of the object to be identified. Specifically, when the object to be identified is identified as an adult type, the identity of the object can be continuously identified in a verification manner in the game process, wherein the specific verification manner is as follows:
1. collecting voiceprint data: the voice print data of the object to be identified can be acquired according to preset time intervals (such as 1 minute, 3 minutes and the like) in the game process;
2. identifying voiceprint data: the voiceprint data may be identified based on a voiceprint identification model (e.g., ECAPA-TDNN model) to obtain verification features (the model identification process may refer to step S302 in detail, and the embodiments of the present application are not described herein in detail);
3. Identity verification: firstly, the voiceprint features of the object registered to the current game account can be obtained from the target service platform; these voiceprint features are extracted from a piece of audio data that the registered object is required to record according to a preset instruction at the time of game registration. Then, similarity calculation can be performed between those voiceprint features and the verification features to obtain a feature similarity. If the feature similarity reaches a similarity threshold, the object to be identified can be judged to be the registered object of the current game account, i.e., the identity verification of the object to be identified passes.
In this way, situations such as playing games on another's behalf or account power-leveling can be detected and avoided during the game, so that game behavior can be regulated and the user experience of the game platform improved.
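A minimal sketch of the periodic in-game verification decision follows, assuming the registered and freshly collected voiceprint features have already been extracted as embedding vectors; the threshold and dimensions are illustrative.

```python
import numpy as np

def verify_identity(registered: np.ndarray, verification: np.ndarray,
                    threshold: float = 0.90) -> bool:
    """Pass if the feature similarity reaches the similarity threshold."""
    sim = float(np.dot(registered, verification) /
                (np.linalg.norm(registered) * np.linalg.norm(verification)))
    return sim >= threshold

registered_feature = np.random.randn(192)    # stored at account registration
verification_feature = np.random.randn(192)  # extracted every 1-3 minutes in-game

if not verify_identity(registered_feature, verification_feature):
    print("identity verification failed: freeze the game account")
```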
Referring to fig. 5, fig. 5 is a flowchart illustrating another data processing method according to an embodiment of the application. The data processing method is performed by a computer device, such as any of the terminal devices or servers shown in fig. 2.
The data processing method mainly comprises, but is not limited to, the following steps S501 to S505:
S501: if it is detected that the currently opened service platform is the target service platform, acquire the audio data of the object to be identified in the target service platform.
In one possible implementation manner, the audio data of the object to be identified in the target service platform may be obtained according to a preset instruction; that is, the object to be identified generates the corresponding audio data in the target service platform according to the preset instruction, where the preset instruction may specify a text to be read within a fixed duration. For example, the preset instruction may specifically be: "The experience of this game is truly too excellent."
In another possible implementation manner, a piece of audio data corresponding to the object to be identified may be obtained at random in the target service platform, where the audio data may be data within a specified time period or of a preset duration. For example, in a game platform, the audio emitted by a game player from the start of a game to its end may be acquired; as another example, a clip of 10 s duration may be intercepted from the audio emitted by a game player as the audio data.
S502: and carrying out feature extraction processing on the audio data to obtain the audio features of the audio data.
In one possible implementation manner, the computer device performs feature extraction processing on the audio data to obtain the audio features of the audio data, which may include the following steps: firstly, preprocessing the audio data of the object to be identified to obtain preprocessed audio data; then, performing feature extraction on the preprocessed audio data to obtain the audio features of the audio data; wherein the preprocessing includes at least one of the following: denoising processing, volume enhancement processing, audio clipping processing, and audio alignment processing.
Specifically, if the audio features include MFCC features, the computer device performs feature extraction on the preprocessed audio data to obtain the audio features of the audio data, which may include the following steps: firstly, framing the preprocessed audio data, and applying a windowing operation to the plurality of audio frames obtained by framing to obtain a plurality of windowed signal frames; then, performing frequency-domain conversion on each windowed signal frame to obtain the corresponding frequency-domain signal frame; and finally, filtering each frequency-domain signal frame through a Mel filter bank to obtain the MFCC features of the audio data.
It should be noted that, in the embodiment of the present application, the specific implementation of extracting the MFCC features may refer to the related steps in step S302 in the above embodiment, and is not described herein again.
S503: and carrying out recognition processing on the audio features based on the voiceprint recognition model to obtain voiceprint features corresponding to the object to be recognized.
In one possible implementation, a computer device may obtain a training data set including a plurality of sample audio and a sample tag for each sample audio, any sample audio being obtained by preprocessing source audio; then, acquiring an initial neural network model, and training the initial neural network model based on a training data set; and determining the trained initial neural network model as a voiceprint recognition model until the trained initial neural network model meets the model convergence condition.
In particular, the training of the initial neural network model by the computer device based on the training data set may comprise the steps of: firstly, calculating a first loss of each sample audio in a training data set by adopting a first loss function; then, calculating a second loss of each sample audio in the training dataset using a second loss function; finally, based on the first loss and the second loss of each sample audio, jointly adjusting model parameters of an initial neural network model; wherein the first loss or the second loss is: and determining based on the sample label and the voiceprint recognition result of each sample audio, wherein the voiceprint recognition result is obtained by performing recognition processing on the sample audio by the initial neural network model.
Among these, the so-called model convergence conditions may include any of the following: when the training times of the initial neural network model reach a preset training threshold, for example, 100 times, the initial neural network model meets a model convergence condition; when the error between the sample label of any sample audio and the voiceprint recognition result of the sample audio is smaller than an error threshold, the initial neural network model meets a model convergence condition; when the change between voiceprint recognition results obtained by training the initial neural network model twice is smaller than a change threshold, the initial neural network model meets the model convergence condition.
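A minimal sketch of checking the three convergence conditions follows; the 100-epoch cap follows the example above, while the error and change thresholds are illustrative assumptions of this sketch.

```python
def converged(epoch: int, label_error: float, result_change: float,
              max_epochs: int = 100,
              error_threshold: float = 0.01,
              change_threshold: float = 1e-4) -> bool:
    """Return True if any of the three convergence conditions holds."""
    return (epoch >= max_epochs                 # training-count condition
            or label_error < error_threshold    # label vs. result error condition
            or result_change < change_threshold)  # between-run change condition
```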
Next, a training process of the voiceprint recognition model will be described in detail:
1) Data acquisition: the first step in training the voiceprint recognition model is to collect enough speech data as training data (sample audio). During data collection, source data may be obtained from a speech database (e.g., the AISHELL open-source Chinese speech database). The AISHELL database contains a large amount of speech data covering microphone recordings of speakers of different ages, genders, dialects, and accents; the recordings are free of noise and echo, which facilitates subsequent model training.
2) Data preprocessing: after the speech data (source data) is collected, it needs to be preprocessed to generate sample audio for model training. In particular, the data preprocessing step may be accomplished using a speech processing tool, and the preprocessing may include at least one of: denoising processing, volume enhancement processing, audio clipping processing, audio alignment processing.
3) Feature extraction: feature extraction is performed on the training data obtained after preprocessing, thereby extracting the Mel-frequency cepstral coefficient (MFCC) features that characterize the audio. The details of the MFCC extraction process may refer to the relevant steps in step S302 in the above embodiment and are not repeated here.
4) Model training: the model training task of the embodiment of the application is a speaking-object classification task. When the initial neural network model is trained, the MFCC features of the training data are used as the model input, and after model processing, the speaking object to which the MFCC features belong can be identified. Referring to fig. 6, fig. 6 is a schematic flow chart of model training according to an embodiment of the application. As shown in fig. 6, when training the initial neural network model on the training data set, noise and background sounds may be added to the input sample audio, and masks may be applied to the MFCC features, as data augmentation for model training. Specifically:
First, the initial neural network model (e.g., a deep neural network model) may be adjusted using a first loss function (e.g., the AAM softmax loss function); the AAM softmax loss is a modified softmax loss that better optimizes the classification performance of the model. The AAM softmax loss function can be written in the standard form:

L = -(1/N) Σ_{i=1}^{N} log( e^{s·cos(θ_{y_i} + m)} / ( e^{s·cos(θ_{y_i} + m)} + Σ_{j≠y_i} e^{s·cos θ_j} ) )

where s is a scale factor controlling the relative sharpness of the class scores, m is the additive angular margin, θ_j is the angle between a sample's embedding and the weight vector of the j-th class, y_i is the ground-truth class of the i-th sample, N is the number of samples, and K (the range of j) refers to the number of classes that the initial neural network model can classify.
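A minimal PyTorch sketch of the AAM softmax (additive angular margin) loss described by the formula above; the scale s and margin m values, the embedding size, and the class count are illustrative assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    def __init__(self, emb_dim: int, n_classes: int,
                 s: float = 30.0, m: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.s, self.m = s, m

    def forward(self, emb: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # cos(theta_j): cosine between the embedding and each class weight
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the angular margin m only to the ground-truth class
        one_hot = F.one_hot(target, cosine.size(1)).float()
        logits = torch.cos(theta + self.m * one_hot)
        return F.cross_entropy(self.s * logits, target)

loss_fn = AAMSoftmax(emb_dim=192, n_classes=1000)
loss = loss_fn(torch.randn(8, 192), torch.randint(0, 1000, (8,)))
```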
The adjusted initial neural network model may then be fine-tuned using a second loss function (a softmax-type loss with a larger margin, such as the angular prototype loss function).
Finally, a triplet loss function can be used for cosine-distance metric learning on the adjusted initial neural network model. Referring to fig. 7, fig. 7 is a schematic flow chart of cosine-distance metric learning provided in the embodiment of the present application. As shown in fig. 7, the cosine distance is used to compute the similarity between the voiceprint recognition results of the adjusted initial neural network model (typically voiceprint embedding vectors, e.g., embedding1, embedding2, embedding3); the similarity matrix between the voiceprint embedding vectors is calculated with a cosine similarity algorithm, so that the degree of similarity between two embeddings can be distinguished more accurately. For example, the voiceprint similarity between embedding1 and embedding2 may be calculated based on the cosine similarity algorithm; likewise, the voiceprint similarity between embedding1 and embedding3 may be calculated based on the cosine similarity algorithm.
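The cosine similarity matrix used in this metric-learning step can be sketched as follows; the embedding count and dimensionality are illustrative.

```python
import torch
import torch.nn.functional as F

embeddings = torch.randn(3, 192)          # embedding1..embedding3 from the model
normed = F.normalize(embeddings, dim=1)   # unit-length vectors
similarity_matrix = normed @ normed.T     # entry (i, j) = cosine(e_i, e_j)
print(similarity_matrix)                  # e.g. sim(embedding1, embedding2)
```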
5) Model evaluation: after model training is completed, the trained initial neural network model can be evaluated. In particular, standard evaluation metrics such as the equal error rate, recall, precision, and accuracy may be employed to evaluate the performance of the model. If the trained initial neural network model meets the evaluation conditions, it can be determined to meet the model convergence conditions and can be used as the voiceprint recognition model. In practice, the model evaluation results of the present application can be shown in Table 2 below:
TABLE 2 model evaluation Effect
In the evaluation of the model, Table 2 presents several evaluation metrics, including recall, precision, and the equal error rate. In general, the present application focuses on the equal error rate (EER) of a model, which here refers to the probability that, during voiceprint recognition, two sample audios from different speaking objects are erroneously recognized as the same speaking object. If the equal error rate is high, the model easily confuses audio data of different speaking objects and its recognition results are not sufficiently accurate and reliable; conversely, if the equal error rate is low, the recognition results of the model are relatively accurate and reliable. By attending to the equal error rate, the embodiment of the application can uncover the cases in which the model confuses audio data of different speaking objects, thereby reducing misrecognition and missed recognition and improving recognition accuracy. As can be seen from Table 2, during training with the AAM softmax loss function, repeated fine-tuning with other larger-margin loss functions (e.g., angle prototype loss, large margin fine-tune, CSML, etc.) makes the model gradually converge (e.g., the equal error rate of the model gradually decreases, the recall gradually increases, and the precision also keeps improving), so that the finally trained voiceprint recognition model can have better model performance.
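For reference, the equal error rate can be estimated from verification-trial scores as the operating point where the false acceptance and false rejection rates coincide; a minimal sketch using scikit-learn, with random placeholder labels and scores, follows.

```python
import numpy as np
from sklearn.metrics import roc_curve

# labels: 1 for same-speaker trials, 0 for different-speaker trials;
# scores: similarity scores from the model (placeholders here).
labels = np.random.randint(0, 2, size=1000)
scores = np.random.rand(1000)

fpr, tpr, thresholds = roc_curve(labels, scores)
fnr = 1 - tpr
eer_index = np.nanargmin(np.abs(fpr - fnr))   # point where FAR == FRR
eer = (fpr[eer_index] + fnr[eer_index]) / 2
print(f"EER = {eer:.3f}")
```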
In the embodiment of the application, using the AAM softmax function as the loss function for model training can improve the accuracy and efficiency of training, with the following characteristics: (1) Improved classification performance: the AAM softmax function can adaptively adjust the weights of different classes, which is very useful for complex classification tasks, so the classification performance of the model can be better optimized. (2) Reduced overfitting: because the AAM softmax function can scale the weights of different classes, it can reduce overfitting of the model, which is very helpful when training a large deep neural network model. (3) Better handling of class imbalance: in some training data sets, the amount of sample audio may differ greatly across classes, which can lead to poor classification performance on minority classes; by adjusting the weights of the different classes, the AAM softmax function can better address the class imbalance problem.
Further, the trained voiceprint recognition model is used for voiceprint recognition of any audio feature. Specifically, the computer device performs recognition processing on the audio features based on the voiceprint recognition model to obtain voiceprint features corresponding to the object to be recognized, and the method comprises the following steps: firstly, carrying out convolution processing on audio features to obtain high-level semantic features, wherein the high-level semantic features are used for representing time sequence features of the audio features; then, carrying out aggregation treatment on the high-level semantic features to obtain aggregated audio features; and finally, carrying out weighted average pooling treatment on the aggregated audio features to obtain voiceprint features corresponding to the objects to be identified.
It should be noted that, the specific implementation steps of extracting the voiceprint features in the embodiment of the present application may refer to the relevant steps in the above embodiment in detail, and will not be described herein.
S504: and carrying out identification processing on the object to be identified based on the voiceprint characteristics so as to determine the object type of the object to be identified.
In one possible implementation manner, the computer device performs identification processing on the object to be identified based on the voiceprint feature to determine the object type to which the object to be identified belongs, and the method may include the following steps: firstly, acquiring a voiceprint database, wherein the voiceprint database comprises a plurality of voiceprint labels, and one voiceprint label is used for indicating voiceprint characteristics of one object type; then, similarity calculation is carried out on voiceprint characteristics of the object to be identified and each voiceprint label in a voiceprint database, so that a plurality of voiceprint similarities are obtained; and finally, determining the object type of the object to be identified based on the obtained voiceprint similarity. The similarity calculation method may include, but is not limited to: cosine similarity algorithm, euclidean distance similarity algorithm, and the like.
S505: and acquiring a service management rule corresponding to the target service platform, and carrying out service management on the object to be identified based on the service management rule and the object type in the target service platform.
In one possible implementation manner, if the target service platform is a game platform, the performing, by the computer device, service management on the object to be identified in the game platform based on the service management rule and the object type may include: if the object type is the first type, determining a processing rule corresponding to the first type from the service management rules, and carrying out service management on the object to be identified according to the processing rule corresponding to the first type; wherein, performing service management on the object to be identified according to the processing rule corresponding to the first type comprises at least one of the following: management of game authorities, management of game modes, management of game duration and management of account resources; if the object type is the second type, determining a processing rule corresponding to the second type from the service management rules, and carrying out service management on the object to be identified according to the processing rule corresponding to the second type; wherein, performing service management on the object to be identified according to the processing rule corresponding to the second type comprises at least one of the following: verification of game account, specification of game operation, analysis of game data.
It should be noted that, the specific execution process in steps S504-S505 may refer to the related process in steps S303-S304 in the above embodiment, and the embodiments of the present application are not described herein again.
The following describes the data processing scheme provided by the embodiment of the present application with reference to a game scenario:
referring to fig. 8, fig. 8 is a schematic diagram of a game scenario of a data processing method according to an embodiment of the present application. As shown in fig. 8, the game scenario is applied to a game system, which may include: a game client and a server. The game client is the device used by a game player (the object to be identified) and is used to collect the player's audio data; the server is the device that performs the data processing tasks on the audio data, and the voiceprint recognition model can run in it. (1) In an identity verification scenario of the game (such as an anti-addiction login scenario for minors), before a game player logs in to the game platform, the player can choose to log in via voiceprint verification; during voiceprint verification, the player needs to record sound according to a designated text or number, so that the player's audio data is collected in the game client. (2) The game client sends the collected audio data to the server; the server can first preprocess the audio data and extract features to obtain the audio features, and then feed the audio features to a trained voiceprint recognition model (such as the ECAPA-TDNN model) for voiceprint recognition. In particular, the preprocessing may include any one or more of the following: denoising processing, volume enhancement processing, audio clipping processing, and audio alignment processing. (3) The server obtains the voiceprint features of the game player from the ECAPA-TDNN model and performs identity verification on the player based on those voiceprint features. Specifically, the voiceprint label associated with the game account the player is currently logged in to can be obtained from the voiceprint database; when any game account is registered, the account owner is required to record a piece of audio data in the manner described above, the voiceprint features extracted from that audio data are used as the voiceprint label, and the voiceprint label is stored in the voiceprint database in association with the game account. Then, similarity calculation can be performed between the voiceprint features of the current player and the voiceprint label associated with the logged-in game account to obtain a feature similarity. If the feature similarity is greater than or equal to a similarity threshold, it can be determined that the player's identity verification passes; if the feature similarity is less than the threshold, it can be determined that the verification fails. (4) If the verification passes, the player is allowed to log in to the game; if it fails, the current game account can be frozen, thereby avoiding adverse situations such as account lending and playing on another's behalf, curbing unhealthy gaming practices, and improving the game experience.
In the embodiment of the application, an improved time-delay neural network model can be trained as the voiceprint recognition model, so that the voiceprint features of the object to be identified can be extracted based on the trained model; because the voiceprint recognition model improves on the traditional time-delay neural network model, the performance (efficiency and accuracy) of voiceprint feature extraction is better and the extracted voiceprint features are more accurate. Identity verification can then be performed more accurately in subsequent service scenarios (such as game scenarios) based on the more accurate voiceprint features, improving the reliability of subsequent service management.
The foregoing details of the method according to the embodiment of the present application are set forth in order to better implement the foregoing aspects of the embodiment of the present application, and accordingly, an apparatus according to the embodiment of the present application is provided below, and next, related apparatuses according to the embodiment of the present application are correspondingly described in connection with the foregoing data processing scheme provided by the embodiment of the present application.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 9, the data processing apparatus 900 may be applied to the computer device (e.g., terminal device or server) mentioned in the foregoing embodiment. In particular, the data processing apparatus 900 may be a computer program (comprising program code) running in a computer device, for example the data processing apparatus 900 is an application software; the data processing apparatus 900 may be configured to perform corresponding steps in a data processing method provided in an embodiment of the present application. In particular, the data processing apparatus 900 may specifically include:
An obtaining unit 901, configured to obtain audio data of an object to be identified in a target service platform if it is detected that the currently opened service platform is the target service platform;
the processing unit 902 is configured to perform voiceprint extraction processing on audio data of an object to be identified, so as to obtain voiceprint features of the object to be identified;
the processing unit 902 is further configured to perform identification processing on the object to be identified based on the voiceprint feature, so as to determine an object type to which the object to be identified belongs;
the processing unit 902 is further configured to obtain a service management rule corresponding to the target service platform, and perform service management on the object to be identified in the target service platform based on the service management rule and the object type.
In a possible implementation manner, the processing unit 902 performs voiceprint extraction processing on the audio data of the object to be identified to obtain the voiceprint features of the object to be identified, and is configured to perform the following operations:
performing feature extraction processing on the audio data to obtain audio features of the audio data;
acquiring a voiceprint recognition model, wherein the voiceprint recognition model is used for carrying out voiceprint recognition on any audio feature;
and carrying out recognition processing on the audio features based on the voiceprint recognition model to obtain voiceprint features corresponding to the object to be recognized.
In a possible implementation manner, the processing unit 902 performs feature extraction processing on the audio data to obtain audio features of the audio data, and is configured to perform the following operations:
preprocessing the audio data of the object to be identified to obtain preprocessed audio data;
extracting the characteristics of the preprocessed audio data to obtain the audio characteristics of the audio data;
wherein the preprocessing includes at least one of the following: denoising processing, volume enhancement processing, audio clipping processing, and audio alignment processing.
In one possible implementation, the audio features include Mel-frequency cepstral coefficient (MFCC) features; the processing unit 902 performs feature extraction on the preprocessed audio data to obtain the audio features of the audio data, and is configured to perform the following operations:
carrying out framing treatment on the preprocessed audio data, and carrying out windowing operation on a plurality of audio frames obtained by the framing treatment to obtain a plurality of windowed signal frames;
performing frequency domain conversion on each windowed signal frame to obtain a frequency domain signal frame corresponding to each windowed signal frame;
and filtering each frequency-domain signal frame through a Mel filter bank to obtain the MFCC features of the audio data.
In a possible implementation manner, the processing unit 902 performs recognition processing on the audio feature based on the voiceprint recognition model to obtain a voiceprint feature corresponding to the object to be recognized, and is configured to perform the following operations:
carrying out convolution processing on the audio features to obtain high-level semantic features, wherein the high-level semantic features are used for representing time sequence features of the audio features;
performing aggregation treatment on the high-level semantic features to obtain aggregated audio features;
and carrying out weighted average pooling treatment on the aggregated audio features to obtain voiceprint features corresponding to the objects to be identified.
In a possible implementation manner, the processing unit 902 performs identification processing on the object to be identified based on the voiceprint feature to determine an object type to which the object to be identified belongs, and is configured to perform the following operations:
acquiring a voiceprint database, wherein the voiceprint database comprises a plurality of voiceprint labels, and one voiceprint label is used for indicating voiceprint characteristics of one object type;
performing similarity calculation on voiceprint characteristics of an object to be identified and each voiceprint label in a voiceprint database to obtain a plurality of voiceprint similarities;
and determining the object type of the object to be identified based on the obtained voiceprint similarity.
In a possible implementation manner, the processing unit 902 determines, based on the obtained multiple voiceprint similarities, an object type to which the object to be identified belongs, including any one of the following:
Determining the object type indicated by the voiceprint label corresponding to the maximum voiceprint similarity in the plurality of voiceprint similarities as the object type of the object to be identified;
and acquiring object types corresponding to one or more voiceprint similarities meeting a similarity threshold in the plurality of voiceprint similarities, and determining the object type of the object to be identified based on the acquired one or more object types.
In a possible implementation manner, the target service platform is a game platform, and the processing unit 902 performs service management on the object to be identified based on the service management rule and the object type in the target service platform, and is configured to perform the following operations:
if the object type is the first type, determining a processing rule corresponding to the first type from the service management rules, and carrying out service management on the object to be identified according to the processing rule corresponding to the first type; wherein, performing service management on the object to be identified according to the processing rule corresponding to the first type comprises at least one of the following: management of game authorities, management of game modes, management of game duration and management of account resources;
if the object type is the second type, determining a processing rule corresponding to the second type from the service management rules, and carrying out service management on the object to be identified according to the processing rule corresponding to the second type; wherein, performing service management on the object to be identified according to the processing rule corresponding to the second type comprises at least one of the following: verification of game account, specification of game operation, analysis of game data.
In one possible implementation, the processing unit 902 is further configured to perform the following operations:
collecting voiceprint data of the object to be identified in the process of carrying out service management on the object to be identified according to the processing rule corresponding to the second type;
voiceprint extraction processing is carried out on voiceprint data of the object to be identified, so that verification characteristics of the object to be identified are obtained;
performing identity verification processing on the object to be identified based on the verification characteristics;
and if the authentication is not passed, freezing the game account of the object to be identified.
In one possible implementation, the processing unit 902 is further configured to perform the following operations:
acquiring a training data set, wherein the training data set comprises a plurality of sample audios and sample labels of each sample audio, and any sample audio is obtained by preprocessing source audio;
acquiring an initial neural network model, and training the initial neural network model based on a training data set;
and determining the trained initial neural network model as a voiceprint recognition model until the trained initial neural network model meets the model convergence condition.
In one possible implementation, the processing unit 902 trains the initial neural network model based on the training data set for performing the following operations:
Calculating a first loss for each sample audio in the training dataset using the first loss function;
calculating a second loss for each sample audio in the training dataset using a second loss function;
based on the first loss and the second loss of each sample audio, jointly adjusting model parameters of an initial neural network model;
the first loss or the second loss is determined based on a sample label and a voiceprint recognition result of each sample audio, and the voiceprint recognition result is obtained by performing recognition processing on the sample audio by the initial neural network model.
In the embodiment of the application, the audio data of the object to be identified in the target service platform can be obtained, and feature extraction processing is then performed on the audio data to obtain the voiceprint features of the object to be identified; identity recognition processing can then be performed on the object to be identified based on the voiceprint features to determine the object type to which it belongs; finally, the service management rule corresponding to the target service platform is acquired, and service management is performed on the object to be identified in the target service platform based on the service management rule and the object type. Therefore, on one hand, identity recognition can be performed based on the voiceprint features of the user without disturbing the user during the service experience (such as playing a game or watching a live broadcast), so the voiceprint recognition manner is more flexible and convenient than face recognition; on the other hand, service management can be performed on the object to be identified according to the object type determined by identity recognition, so that service management can be performed in the target service platform in a targeted manner, making it more efficient and convenient.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the application. The computer device 1000 is configured to perform the steps performed by a computer device (terminal device or server) in the foregoing method embodiments, and the computer device 1000 includes: one or more processors 1001, one or more input devices 1002, one or more output devices 1003, and a memory 1004. The processor 1001, the input device 1002, the output device 1003, and the memory 1004 are connected by a bus 1005. The processor 1001 (or CPU, Central Processing Unit) is the processing core of the computer device; the processor 1001 is adapted to implement one or more program instructions, and in particular to load and execute the one or more program instructions to implement the flow of the data processing method described above. The memory 1004 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory; optionally, it may be at least one storage device located remotely from the aforementioned processor. The memory 1004 provides storage space for storing the operating system of the computer device, and the storage space is also used for storing a computer program including program instructions adapted to be invoked and executed by the processor to perform the steps of the data processing method of the application.
Specifically, the memory 1004 is configured to store a computer program, where the computer program includes program instructions, and the processor 1001 is configured to call the program instructions stored in the memory 1004, and perform the following operations:
if it is detected that the currently opened service platform is a target service platform, acquiring audio data of an object to be identified in the target service platform;
performing voiceprint extraction processing on the audio data of the object to be identified to obtain voiceprint features of the object to be identified;
performing identity recognition processing on the object to be identified based on the voiceprint features to determine the object type to which the object to be identified belongs;
and acquiring a service management rule corresponding to the target service platform, and performing service management on the object to be identified in the target service platform based on the service management rule and the object type.
In one possible implementation, when performing voiceprint extraction processing on the audio data of the object to be identified to obtain the voiceprint features of the object to be identified, the processor 1001 is configured to perform the following operations:
performing feature extraction processing on the audio data to obtain audio features of the audio data;
acquiring a voiceprint recognition model, wherein the voiceprint recognition model is used to perform voiceprint recognition on any audio feature;
and performing recognition processing on the audio features based on the voiceprint recognition model to obtain the voiceprint features corresponding to the object to be identified.
In one possible implementation, when performing feature extraction processing on the audio data to obtain the audio features of the audio data, the processor 1001 is configured to perform the following operations:
preprocessing the audio data of the object to be identified to obtain preprocessed audio data;
performing feature extraction on the preprocessed audio data to obtain the audio features of the audio data;
wherein the preprocessing comprises at least one of the following: denoising, volume enhancement, audio clipping, and audio alignment.
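For illustration, a minimal preprocessing sketch is given below. It assumes the open-source librosa library, a 16 kHz sampling rate, and a fixed three-second alignment window; none of these choices are mandated by the embodiment, and true denoising (e.g., spectral subtraction) is omitted for brevity.

```python
import librosa
import numpy as np

def preprocess(path: str, sr: int = 16000, target_len: int = 48000) -> np.ndarray:
    """Load audio, trim silence, peak-normalize, and pad/crop to a fixed length."""
    audio, _ = librosa.load(path, sr=sr)                # resample to a fixed rate
    audio, _ = librosa.effects.trim(audio, top_db=30)   # audio clipping: drop silence
    audio = librosa.util.normalize(audio)               # volume enhancement (peak)
    # audio alignment: pad or crop so every utterance has the same length
    if len(audio) < target_len:
        audio = np.pad(audio, (0, target_len - len(audio)))
    return audio[:target_len]
```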
In one possible implementation, the audio features include mel-frequency cepstral coefficient (MFCC) features; when performing feature extraction on the preprocessed audio data to obtain the audio features of the audio data, the processor 1001 is configured to perform the following operations:
performing framing processing on the preprocessed audio data, and performing a windowing operation on the plurality of audio frames obtained by the framing processing to obtain a plurality of windowed signal frames;
performing frequency-domain conversion on each windowed signal frame to obtain a frequency-domain signal frame corresponding to each windowed signal frame;
and filtering each frequency-domain signal frame through a mel filter to obtain the mel-frequency cepstral coefficient features of the audio data.
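A sketch of this framing-windowing-filtering chain follows, using numpy with a mel filter bank from librosa. The 25 ms frame length, 10 ms hop, Hamming window, and the final log/DCT cepstral step are conventional choices assumed here rather than specified by the embodiment.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_features(audio: np.ndarray, sr: int = 16000, frame_len: int = 400,
                  hop: int = 160, n_mels: int = 40, n_mfcc: int = 13) -> np.ndarray:
    # framing: split the signal into overlapping frames
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop: i * hop + frame_len] for i in range(n_frames)])
    windowed = frames * np.hamming(frame_len)           # windowing operation
    # frequency-domain conversion: power spectrum of each windowed signal frame
    power = np.abs(np.fft.rfft(windowed, axis=1)) ** 2
    # filtering: project each frame's power spectrum onto a mel filter bank
    mel_fb = librosa.filters.mel(sr=sr, n_fft=frame_len, n_mels=n_mels)
    log_mel = np.log(power @ mel_fb.T + 1e-10)
    # cepstral step: the DCT of the log-mel energies yields the MFCC features
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```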
In one possible implementation, when performing recognition processing on the audio features based on the voiceprint recognition model to obtain the voiceprint features corresponding to the object to be identified, the processor 1001 is configured to perform the following operations:
performing convolution processing on the audio features to obtain high-level semantic features, wherein the high-level semantic features are used to represent the time-sequence characteristics of the audio features;
performing aggregation processing on the high-level semantic features to obtain aggregated audio features;
and performing weighted average pooling on the aggregated audio features to obtain the voiceprint features corresponding to the object to be identified.
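One way this convolution-aggregation-pooling flow might look, sketched in PyTorch. The layer sizes, the use of frame-level attention weights for the weighted average pooling, and the embedding dimension are illustrative assumptions; the embodiment does not fix a network topology.

```python
import torch
import torch.nn as nn

class VoiceprintNet(nn.Module):
    def __init__(self, n_mfcc: int = 13, channels: int = 128, emb_dim: int = 192):
        super().__init__()
        # convolution processing: extract high-level (time-sequence) semantics
        self.conv = nn.Sequential(
            nn.Conv1d(n_mfcc, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.attn = nn.Conv1d(channels, 1, kernel_size=1)  # per-frame weights
        self.proj = nn.Linear(channels, emb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, n_mfcc) -> (batch, n_mfcc, frames) for Conv1d
        h = self.conv(x.transpose(1, 2))            # high-level semantic features
        w = torch.softmax(self.attn(h), dim=-1)     # aggregation: frame weights
        pooled = (h * w).sum(dim=-1)                # weighted average pooling
        return self.proj(pooled)                    # fixed-size voiceprint feature
```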
In one possible implementation, when performing identity recognition processing on the object to be identified based on the voiceprint features to determine the object type to which the object to be identified belongs, the processor 1001 is configured to perform the following operations:
acquiring a voiceprint database, wherein the voiceprint database comprises a plurality of voiceprint labels, one voiceprint label being used to indicate the voiceprint features of one object type;
performing similarity calculation between the voiceprint features of the object to be identified and each voiceprint label in the voiceprint database to obtain a plurality of voiceprint similarities;
and determining the object type to which the object to be identified belongs based on the obtained plurality of voiceprint similarities.
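The embodiment leaves the similarity measure open; cosine similarity is a common choice for voiceprint embeddings and is assumed in the sketch below.

```python
import numpy as np

def voiceprint_similarities(query: np.ndarray,
                            voiceprint_db: dict[str, np.ndarray]) -> dict[str, float]:
    """Cosine similarity between a query voiceprint and each voiceprint label."""
    q = query / np.linalg.norm(query)
    return {obj_type: float(q @ (label / np.linalg.norm(label)))
            for obj_type, label in voiceprint_db.items()}
```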
In one possible implementation, when determining, based on the obtained plurality of voiceprint similarities, the object type to which the object to be identified belongs, the processor 1001 performs any one of the following:
determining the object type indicated by the voiceprint label corresponding to the maximum voiceprint similarity among the plurality of voiceprint similarities as the object type to which the object to be identified belongs;
or acquiring the object types corresponding to one or more voiceprint similarities that meet a similarity threshold among the plurality of voiceprint similarities, and determining the object type to which the object to be identified belongs based on the acquired one or more object types.
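Both decision alternatives can be expressed compactly; where several types pass the threshold, picking the best-scoring one is one reasonable (assumed) way of combining them.

```python
def decide_object_type(sims: dict[str, float],
                       threshold: float | None = None) -> str | None:
    if threshold is None:
        # alternative 1: the type with the maximum voiceprint similarity
        return max(sims, key=sims.get)
    # alternative 2: keep the types whose similarity meets the threshold,
    # then choose among them (here, the highest-scoring candidate)
    candidates = {t: s for t, s in sims.items() if s >= threshold}
    return max(candidates, key=candidates.get) if candidates else None
```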
In one possible implementation, the target service platform is a game platform, and when performing service management on the object to be identified in the target service platform based on the service management rule and the object type, the processor 1001 is configured to perform the following operations:
if the object type is a first type, determining a processing rule corresponding to the first type from the service management rules, and performing service management on the object to be identified according to the processing rule corresponding to the first type; wherein performing service management on the object to be identified according to the processing rule corresponding to the first type comprises at least one of the following: management of game permissions, management of game modes, management of game duration, and management of account resources;
if the object type is a second type, determining a processing rule corresponding to the second type from the service management rules, and performing service management on the object to be identified according to the processing rule corresponding to the second type; wherein performing service management on the object to be identified according to the processing rule corresponding to the second type comprises at least one of the following: verification of the game account, regulation of game operations, and analysis of game data.
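A rule-dispatch sketch for this step is shown below. The type keys, rule names, and the print placeholder are hypothetical; concrete handlers would come from the platform's service management configuration.

```python
MANAGEMENT_RULES = {
    "first_type": ["game_permissions", "game_mode", "play_duration", "account_resources"],
    "second_type": ["account_verification", "operation_regulation", "data_analysis"],
}

def manage(object_type: str, user_id: str) -> None:
    # look up the processing rules for the recognized object type and apply them
    for rule in MANAGEMENT_RULES.get(object_type, []):
        print(f"applying rule {rule!r} to user {user_id}")  # stand-in for real handlers
```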
In one possible implementation, the processor 1001 is further configured to:
collecting voiceprint data of the object to be identified in the process of performing service management on the object to be identified according to the processing rule corresponding to the second type;
performing voiceprint extraction processing on the voiceprint data of the object to be identified to obtain a verification feature of the object to be identified;
performing identity verification processing on the object to be identified based on the verification feature;
and if the identity verification fails, freezing the game account of the object to be identified.
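A minimal sketch of this in-session re-verification loop. The capture, extract, and freeze callables and the 0.7 threshold are assumptions injected by the caller; voiceprint_similarities is the helper sketched earlier.

```python
import numpy as np
from typing import Callable

def periodic_verification(user_id: str, enrolled: np.ndarray,
                          capture: Callable, extract: Callable, freeze: Callable,
                          threshold: float = 0.7) -> None:
    audio = capture(user_id)               # collect voiceprint data in-session
    verification_feature = extract(audio)  # voiceprint extraction processing
    sim = voiceprint_similarities(verification_feature, {"enrolled": enrolled})["enrolled"]
    if sim < threshold:
        freeze(user_id)                    # verification failed: freeze the game account
```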
In one possible implementation, the processor 1001 is further configured to:
acquiring a training data set, wherein the training data set comprises a plurality of sample audios and sample labels of each sample audio, and any sample audio is obtained by preprocessing source audio;
acquiring an initial neural network model, and training the initial neural network model based on a training data set;
and when the trained initial neural network model meets a model convergence condition, determining the trained initial neural network model as the voiceprint recognition model.
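A generic training skeleton consistent with this description; the batch size, optimizer, and fixed epoch budget standing in for the convergence condition are all assumptions.

```python
import torch
from torch.utils.data import DataLoader

def train(model: torch.nn.Module, criterion: torch.nn.Module,
          dataset, epochs: int = 10, lr: float = 1e-3) -> torch.nn.Module:
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    params = list(model.parameters()) + list(criterion.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):                            # epoch budget as convergence proxy
        for features, labels in loader:
            loss = criterion(model(features), labels)  # joint loss, sketched below
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```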
In one possible implementation, when training the initial neural network model based on the training data set, the processor 1001 is configured to perform the following operations:
calculating a first loss for each sample audio in the training data set using a first loss function;
calculating a second loss for each sample audio in the training data set using a second loss function;
and jointly adjusting model parameters of the initial neural network model based on the first loss and the second loss of each sample audio;
wherein the first loss and the second loss are each determined based on the sample label and a voiceprint recognition result of each sample audio, the voiceprint recognition result being obtained by the initial neural network model performing recognition processing on that sample audio.
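The embodiment does not name the two loss functions. A plain softmax cross-entropy paired with a margin-penalized (AAM-softmax-style) cross-entropy is a common pairing in speaker recognition and is used below purely as an assumed example; both terms are computed from the sample label and the model's recognition result, matching the description above.

```python
import torch
import torch.nn.functional as F

class JointLoss(torch.nn.Module):
    def __init__(self, emb_dim: int = 192, n_classes: int = 1000,
                 margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(n_classes, emb_dim))
        self.margin, self.scale = margin, scale

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        logits = F.normalize(emb) @ F.normalize(self.weight).T   # cosine logits
        first = F.cross_entropy(self.scale * logits, labels)     # first loss
        # second loss: penalize the true-class cosine by an additive margin
        margin_logits = logits.clone()
        margin_logits[torch.arange(len(labels)), labels] -= self.margin
        second = F.cross_entropy(self.scale * margin_logits, labels)
        return first + second   # jointly adjust parameters with both losses
```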
The computer device of this embodiment achieves the same beneficial effects as the foregoing method embodiment, which are not repeated here.
It should be further noted that an embodiment of the present application also provides a computer storage medium storing a computer program comprising program instructions; when executed by a processor, the program instructions can perform the method in the corresponding embodiments described above, so details are not repeated here. For technical details not disclosed in this computer storage medium embodiment, please refer to the description of the method embodiments of the application. As an example, the program instructions may be deployed on one computer device, or executed on multiple computer devices at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network.
According to one aspect of the present application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device can perform the method in the foregoing corresponding embodiments; details are therefore not repeated here.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium accessible to a computer, or a data processing device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)).
The foregoing is merely illustrative of the present application and does not limit it; any person skilled in the art can readily conceive of variations or substitutions within the scope disclosed by the application, and such variations shall fall within its protection scope. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A method of data processing, comprising:
if it is detected that the currently opened service platform is a target service platform, acquiring audio data of an object to be identified in the target service platform;
performing voiceprint extraction processing on the audio data of the object to be identified to obtain voiceprint characteristics of the object to be identified;
performing identity recognition processing on the object to be identified based on the voiceprint features to determine the object type to which the object to be identified belongs;
and acquiring a service management rule corresponding to the target service platform, and carrying out service management on the object to be identified in the target service platform based on the service management rule and the object type.
2. The method of claim 1, wherein the performing voiceprint extraction processing on the audio data of the object to be identified to obtain the voiceprint features of the object to be identified comprises:
performing feature extraction processing on the audio data to obtain audio features of the audio data;
acquiring a voiceprint recognition model, wherein the voiceprint recognition model is used to perform voiceprint recognition on any audio feature;
and performing recognition processing on the audio features based on the voiceprint recognition model to obtain voiceprint features corresponding to the object to be identified.
3. The method of claim 2, wherein the performing feature extraction processing on the audio data to obtain audio features of the audio data comprises:
preprocessing the audio data of the object to be identified to obtain preprocessed audio data;
performing feature extraction on the preprocessed audio data to obtain the audio features of the audio data;
wherein the preprocessing comprises at least one of: denoising, volume enhancement, audio clipping, and audio alignment.
4. The method of claim 3, wherein the audio features comprise mel-frequency cepstral coefficient features; and the performing feature extraction on the preprocessed audio data to obtain the audio features of the audio data comprises:
performing framing processing on the preprocessed audio data, and performing a windowing operation on the plurality of audio frames obtained by the framing processing to obtain a plurality of windowed signal frames;
performing frequency-domain conversion on each windowed signal frame to obtain a frequency-domain signal frame corresponding to each windowed signal frame;
and filtering each frequency-domain signal frame through a mel filter to obtain the mel-frequency cepstral coefficient features of the audio data.
5. The method of claim 2, wherein the performing recognition processing on the audio features based on the voiceprint recognition model to obtain the voiceprint features corresponding to the object to be identified comprises:
performing convolution processing on the audio features to obtain high-level semantic features, wherein the high-level semantic features are used to represent the time-sequence characteristics of the audio features;
performing aggregation processing on the high-level semantic features to obtain aggregated audio features;
and performing weighted average pooling on the aggregated audio features to obtain the voiceprint features corresponding to the object to be identified.
6. The method of claim 1, wherein the performing identity recognition processing on the object to be identified based on the voiceprint features to determine the object type to which the object to be identified belongs comprises:
acquiring a voiceprint database, wherein the voiceprint database comprises a plurality of voiceprint labels, one voiceprint label being used to indicate the voiceprint features of one object type;
performing similarity calculation between the voiceprint features of the object to be identified and each voiceprint label in the voiceprint database to obtain a plurality of voiceprint similarities;
and determining the object type to which the object to be identified belongs based on the obtained plurality of voiceprint similarities.
7. The method of claim 6, wherein the determining, based on the obtained plurality of voiceprint similarities, an object type to which the object to be identified belongs comprises any one of:
determining the object type indicated by the voiceprint label corresponding to the maximum voiceprint similarity among the plurality of voiceprint similarities as the object type to which the object to be identified belongs;
and acquiring object types corresponding to one or more voiceprint similarities meeting a similarity threshold in the plurality of voiceprint similarities, and determining the object type to which the object to be identified belongs based on the acquired one or more object types.
8. The method of claim 1, wherein the target service platform is a game platform, and the performing service management on the object to be identified in the target service platform based on the service management rule and the object type comprises:
if the object type is a first type, determining a processing rule corresponding to the first type from the service management rule, and performing service management on the object to be identified according to the processing rule corresponding to the first type; wherein the performing service management on the object to be identified according to the processing rule corresponding to the first type comprises at least one of the following: management of game permissions, management of game modes, management of game duration, and management of account resources;
if the object type is a second type, determining a processing rule corresponding to the second type from the service management rule, and performing service management on the object to be identified according to the processing rule corresponding to the second type; wherein the performing service management on the object to be identified according to the processing rule corresponding to the second type comprises at least one of the following: verification of the game account, regulation of game operations, and analysis of game data.
9. The method of claim 8, wherein the method further comprises:
collecting voiceprint data of the object to be identified in the process of performing service management on the object to be identified according to the processing rule corresponding to the second type;
performing voiceprint extraction processing on the voiceprint data of the object to be identified to obtain a verification feature of the object to be identified;
performing identity verification processing on the object to be identified based on the verification feature;
and if the identity verification fails, freezing the game account of the object to be identified.
10. The method of claim 2, wherein the method further comprises:
acquiring a training data set, wherein the training data set comprises a plurality of sample audios and sample labels of each sample audio, and any sample audio is obtained by preprocessing source audio;
acquiring an initial neural network model, and training the initial neural network model based on the training data set;
and when the trained initial neural network model meets a model convergence condition, determining the trained initial neural network model as the voiceprint recognition model.
11. The method of claim 10, wherein the training the initial neural network model based on the training data set comprises:
calculating a first loss for each sample audio in the training data set using a first loss function;
calculating a second loss for each sample audio in the training data set using a second loss function;
jointly adjusting model parameters of the initial neural network model based on the first loss and the second loss of each sample audio;
wherein the first loss and the second loss are each determined based on the sample label and a voiceprint recognition result of each sample audio, the voiceprint recognition result being obtained by the initial neural network model performing recognition processing on that sample audio.
12. A data processing apparatus, comprising:
an acquisition unit, configured to acquire audio data of an object to be identified in a target service platform if it is detected that the currently opened service platform is the target service platform;
a processing unit, configured to perform voiceprint extraction processing on the audio data of the object to be identified to obtain voiceprint features of the object to be identified;
the processing unit is further configured to perform identity recognition processing on the object to be identified based on the voiceprint features to determine the object type to which the object to be identified belongs;
and the processing unit is further configured to acquire a service management rule corresponding to the target service platform, and perform service management on the object to be identified in the target service platform based on the service management rule and the object type.
13. A computer device, comprising a memory and a processor; wherein:
the memory stores one or more computer programs; and
the processor is configured to load the one or more computer programs to implement the data processing method of any one of claims 1-11.
14. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program adapted to be loaded by a processor and to perform a data processing method according to any of claims 1-11.
15. A computer program product, characterized in that the computer program product comprises a computer program adapted to be loaded by a processor and to perform the data processing method according to any of claims 1-11.
CN202310714946.3A 2023-06-15 2023-06-15 Data processing method, device, computer equipment, storage medium and product Pending CN116975823A (en)

Priority Applications (1)

Application Number: CN202310714946.3A; Priority/Filing Date: 2023-06-15; Title: Data processing method, device, computer equipment, storage medium and product

Publications (1)

Publication Number: CN116975823A; Publication Date: 2023-10-31

Family ID: 88482274

Family Applications (1)

Application Number: CN202310714946.3A; Title: Data processing method, device, computer equipment, storage medium and product; Priority/Filing Date: 2023-06-15

Country Status (1)

Country: CN; Publication: CN116975823A (en)


Legal Events

PB01: Publication