CN113378977B - Recording data processing method and device

Recording data processing method and device

Info

Publication number
CN113378977B
Authority
CN
China
Prior art keywords
recording
matrix
probability
data
record
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110742770.3A
Other languages
Chinese (zh)
Other versions
CN113378977A (en)
Inventor
王珍珠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China
Priority to CN202110742770.3A
Publication of CN113378977A
Application granted
Publication of CN113378977B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A method and a device for processing recording data are provided. The method includes: acquiring model parameters and a model matrix based on a predefined probability generation model, the model parameters including K, L, α, β, φ, and ω, where K represents the number of topic clusters to which the obtained recordings belong, L represents the number of transfer relation clusters, α represents the prior probability of the hidden variable g, β represents the prior probability of the hidden variable z, φ represents the probability that a recording has a given attribute when it belongs to a given topic cluster, and ω represents the probability of a transfer relationship with another operation when an operation belongs to a given transfer relation cluster; the model matrix includes the attribute matrix Y of a first recording set, the transfer relation matrix A of a second recording set, and the relation matrix R between the recordings in the first recording set and the second recording set; and performing cluster analysis on the recordings based on the model parameters and the model matrix, so as to obtain a more accurate probability of the topic cluster to which each recording belongs.

Description

Recording data processing method and device
Technical Field
The present application relates to the field of computers, and more particularly, to a method and apparatus for processing recording data.
Background
Clustering algorithms are an important means of grouping similar nodes in massive data sets. Probability generation models, that is, Bayesian probability models, are by now very mature in the algorithmic field: a cluster structure is described by constructing a generative graph model and is inferred by defining different types of objective functions and applying different optimization methods. In existing techniques for processing recording data, clustering is completed based only on the content of the recording text, and the accuracy of the resulting clusters is not high.
Disclosure of Invention
The embodiments of the present application provide a recording data processing method and device, which perform recording cluster analysis taking into account both the text content of the recording data and the transfer relationships between recordings, so as to obtain more accurate clustering results.
In a first aspect, the present application provides a method for processing recording data, where the method includes: acquiring model parameters and a model matrix based on a predefined probability generation model; the model parameters include: K, L, α, β, φ, and ω; K represents the number of topic clusters to which a plurality of pre-obtained recordings may belong, and L represents the number of transfer relation clusters; α represents the prior probability of the hidden variable g; β represents the prior probability of the hidden variable z; φ represents the probability that a recording has a given attribute under the condition that it belongs to a given topic cluster; ω represents the probability that a transfer operation has a transfer relationship with another operation under the condition that it belongs to a given transfer relation cluster; the model matrix includes: the attribute matrix Y of a first recording set, the transfer relation matrix A of a second recording set, and the relation matrix R between the recordings in the first recording set and the second recording set; the first recording set includes a plurality of pieces of recording data obtained in advance, and each element in its attribute matrix Y represents the probability that a piece of recording data has a given attribute; the second recording set includes the pieces of recording data that have transfer relationships, and each element in its transfer relation matrix A indicates whether a transfer relationship exists between a given pair of those pieces of recording data; each element in the relation matrix R indicates the correspondence between a piece of recording data in the second recording set and a piece of recording data in the first recording set; K and L are positive integers; and performing cluster analysis on the plurality of pieces of recording data based on the model parameters and the model matrix, to obtain the topic cluster and the transfer relation cluster to which each piece of recording data belongs.
Based on this technical scheme, the recording text and the transfer relationships between recordings are used jointly as data sources for modeling: a probability generation model covering both the recording structure and the recording semantics is built, and the recording text clustering is performed by combining recording content with transfer relationships, so that a more accurate clustering result is obtained and, in turn, more accurate service is provided to customers.
Optionally, the performing cluster analysis on the plurality of pieces of recording data based on the model parameters and the model matrix includes: taking a first preset threshold as the threshold for a first iteration number, repeatedly executing the following steps to obtain the topic cluster to which each piece of recording data in the first recording set belongs: based on the most recently obtained model parameters α, β, φ, and ω, obtaining a first matrix, where each element in the first matrix represents the probability that a piece of recording data in the first recording set belongs to one of the K topic clusters; updating the model parameters α and φ based on the first matrix; and, taking a second preset threshold as the threshold for a second iteration number, repeatedly executing the following steps to obtain the transfer relation cluster to which each recording in the second recording set belongs: obtaining a second matrix based on the first matrix and the most recently obtained model parameters β and ω, where each element in the second matrix indicates the probability that a recording in the second recording set belongs to one of the L transfer relation clusters; and updating the model parameters β and ω based on the second matrix; the updated parameters α and φ are used for updating the first matrix and the second matrix.
Optionally, the topic cluster to which a first recording in the first recording set belongs is the topic cluster with the largest of the probabilities that the first recording belongs to each of the K topic clusters, where the first recording is any piece of recording data in the first recording set.
Optionally, the transfer relation cluster to which a second recording in the second recording set belongs is the transfer relation cluster with the largest of the probabilities that the second recording belongs to each of the L transfer relation clusters, where the second recording is any piece of recording data in the second recording set.
Optionally, the updating of the model parameters α and φ based on the first matrix includes: updating the model parameters α and φ based on the first matrix using a nested maximum expectation algorithm.
Optionally, updating the model parameters β and ω based on the second matrix includes: updating the model parameters β and ω based on the second matrix using a nested maximum expectation algorithm.
In a second aspect, the present application provides a recording data processing apparatus, which can implement the method in any one of the foregoing first aspect and the possible implementation manners of the first aspect. The apparatus comprises corresponding units or modules for performing the above-described methods.
In a third aspect, the present application provides a processing apparatus for recording data, the apparatus comprising a processor. The processor is coupled to the memory and is operable to execute a computer program in the memory to implement the method for processing sound recording data in any one of the possible implementations of the first aspect and the first aspect.
Optionally, the apparatus further comprises a memory.
Optionally, the apparatus further comprises a communication interface, the processor being coupled to the communication interface.
In a fourth aspect, the present application provides a chip system comprising at least one processor for supporting the implementation of the functions involved in the first aspect and any of its possible implementations.
In one possible design, the system on a chip further includes a memory to hold program instructions and data, the memory being located either within the processor or external to the processor.
The chip system may be formed of a chip or may include a chip and other discrete devices.
In a fifth aspect, the present application provides a computer readable storage medium having stored thereon a computer program (which may also be referred to as code, or instructions) which, when executed by a processor, causes the method of any one of the possible implementations of the first aspect and the first aspect to be performed.
In a sixth aspect, the present application provides a computer program product comprising: a computer program (which may also be referred to as code, or instructions) which, when executed, causes the method in the first aspect and any of its possible implementations to be performed.
It should be understood that, the second aspect to the sixth aspect of the present application correspond to the technical solutions of the first aspect of the present application, and the advantages obtained by each aspect and the corresponding possible embodiments are similar, and are not repeated.
Drawings
FIG. 1 is a schematic diagram of a Bayesian probability model suitable for use in a method for processing recorded data provided by an embodiment of the present application;
FIG. 2 is a schematic flow chart of a processing method for recording data provided by an embodiment of the present application;
FIG. 3 is a schematic block diagram of a processing device for recording data provided by an embodiment of the present application;
fig. 4 is a further schematic block diagram of a processing device for recording data according to an embodiment of the present application.
Detailed Description
The technical scheme of the application will be described below with reference to the accompanying drawings.
For ease of understanding, the following description will be given of relevant terms related to the present application.
1. Cluster: a set of nodes with highly similar characteristics; that is, the nodes within the same cluster have highly similar characteristics.
2. Likelihood function: in statistics, a likelihood function is a function of the parameters of a statistical model. Given an observed outcome x, the likelihood function L(θ|x) of the parameter θ is numerically equal to the probability that the variable X takes the value x given the parameter θ: L(θ|x) = P(X = x|θ).
3. Maximum expectation algorithm (expectation-maximization algorithm, EM): EM is a class of optimization algorithms that performs maximum likelihood estimation (MLE) through iteration, often used as an alternative to Newton's method for parameter estimation of probabilistic models containing hidden (latent) variables or missing data.
4. Observed variables: also known as "indicator variables" or "manifest variables", as opposed to hidden variables, observed variables are variables that can be directly observed or measured. They can be understood as data that can be obtained directly, i.e., real data. For example, the attribute matrix Y, the transfer relation matrix A, and the relation matrix R described below can all be obtained directly and are therefore called observed variables.
5. Hidden variables: in contrast to the observed variables, hidden variables cannot be directly observed or measured, and they are what is ultimately solved for. Examples are the probability variable g, describing which topic cluster each recording belongs to, and the probability variable z, describing which transfer relation cluster each recording belongs to, both described below.
6. Probability generation model: in probability and statistics, a generative model is a model that can randomly generate observation data, especially given certain implicit parameters; it assigns a joint probability distribution to the observations and the sequence of labels. In machine learning, a generative model can be used to model data directly (e.g., sampling data from a variable's probability density function) or to build conditional probability distributions between variables, which can be obtained from the generative model via Bayes' theorem.
For example, in the present application, the generation model may use plate notation for a Bayesian probability model; the overall probability generation model is shown in FIG. 1 and may be called a Bayesian probability model. In this notation, circles represent the variables, with observed variables drawn as black circles. An arrow represents a conditional probability: the variable at the arrow's tail is the condition for the variable at its head, i.e., the tail variable generates the head variable, and any variables connected by arrows are interdependent. A box (plate) represents a data-set dimension, and the size of that dimension is the number of times the generation process inside the plate is repeated. In this way, the entire process of obtaining and generating the data can be completed with a single model.
7. Model parameters: in a generative model, different parameters generate different observations. If the observed variables are obtained under a certain determined set of parameters, the model has successfully fit the data under that set of parameters, and the hidden variables obtained in the process, that is, the hidden variables under which the observed variables can be successfully fit, are the most important quantities to obtain. For example, in FIG. 1, α, β, φ, and ω are the model parameters. Their specific meanings are as follows:
the parameter α may be an a priori probability of the hidden variable g, representing that g obeys a probability distribution of the parameter α. In the embodiment of the application, for each recording node, the probability of each topic is subjected to the probability distribution, which means that g can be generated by alpha, and alpha can be understood as the conditional probability that a certain recording node belongs to a certain topic cluster.
The parameter β may be an a priori probability of the hidden variable z, representing that z obeys a probability distribution of the parameter β. For each transit operation node, the probability of the transit relation cluster corresponding to the transit operation node is subjected to the probability distribution, which means that z is generated by beta, and beta is the conditional probability that a certain recording node belongs to a certain transit relation cluster.
The parameter φ may represent, for each recording, the probability of containing a given attribute under the condition that the recording belongs to a given topic cluster. For example, in FIG. 1, φ_dh may express the probability that a recording contains attribute h when it belongs to topic d.
The parameter ω represents, for each transfer operation, the probability of having an edge to another transfer operation under the condition that the operation belongs to a given transfer relation cluster. For example, in FIG. 1, ω_cj may represent the probability that a transfer operation i has an edge to another transfer operation j under the condition that it belongs to transfer relation cluster c; this is likewise a conditional probability. ω and the hidden variable z jointly generate the observed transfer relation matrix A. The parameter ω is best understood as the probability of an association between a cluster and a node outside that cluster.
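Gathering the above into one place, a simple container for the model parameters might look as follows (a sketch only; φ is the symbol used in this text for the attribute-probability parameter, and the array shapes are assumptions based on the definitions above):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ModelParams:
    K: int              # number of topic clusters
    L: int              # number of transfer relation clusters
    alpha: np.ndarray   # shape (K,): prior over topic clusters, sums to 1
    beta: np.ndarray    # shape (L,): prior over transfer relation clusters, sums to 1
    phi: np.ndarray     # shape (K, e): P(attribute h | topic cluster)
    omega: np.ndarray   # shape (L, m): P(edge to transfer operation j | transfer relation cluster)
```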
For better understanding, the technical solution of the present application will be described with reference to fig. 1 and 2.
Fig. 2 is a schematic flowchart of a method for processing recording data according to an embodiment of the present application. The recording data processing method 2000 includes steps 2001 to 2010. The steps in method 2000 are described in detail below.
In step 2001, model parameters and a model matrix are acquired based on a predefined probability generation model, the model parameters including: K, L, α, β, φ, and ω, and the model matrix including: the attribute matrix Y of the first recording set, the transfer relation matrix A of the second recording set, and the relation matrix R between the recordings in the first recording set and the second recording set.
K represents the number of topic clusters to which a plurality of pre-obtained recordings may belong, and L represents the number of transfer relation clusters; α represents the prior probability of the hidden variable g; β represents the prior probability of the hidden variable z; φ represents the probability that a recording has a given attribute under the condition that it belongs to a given topic cluster; ω represents the probability that a transfer operation has a transfer relationship with another operation under the condition that it belongs to a given transfer relation cluster.
The first recording set includes a plurality of pieces of recording data obtained in advance, and each element in its attribute matrix Y represents the probability that a piece of recording data has a given attribute; the second recording set includes the pieces of recording data that have transfer relationships, and each element in its transfer relation matrix A indicates whether a transfer relationship exists between a given pair of those pieces of recording data; each element in the relation matrix R indicates the correspondence between a piece of recording data in the second recording set and a piece of recording data in the first recording set; K and L are positive integers.
It will be appreciated that some data preprocessing is required before implementing the method of the present application, such as text preprocessing: deleting spaces, digits, and letters; performing word segmentation and part-of-speech tagging; and deleting stop words, that is, removing prepositions, modal particles, and the like from the text.
In addition, a first recording set and a second recording set are also constructed, and a transfer relation matrix of the second recording set, an attribute matrix of the first recording set, a relation matrix of each recording data in the first recording set and the second recording set and the like are constructed according to the first recording set and the second recording set.
The first recording set is the set of recordings to be clustered. For example, it may be expressed as S = {s_1, s_2, …, s_n}, where n ≥ 1 and n is an integer; S is understood as a set of n recordings, and s_1 through s_n may be the vertex representations of the recording nodes in the attribute information network.
The second recording set is the set of recording data in the first recording set that have transfer relationships. For example, it may be expressed as O = {o_1, o_2, …, o_m}, where m ≥ 1 and m is an integer; O is understood as the set of the m recordings, out of the n recordings, that have transfer relationships, and o_1 through o_m may be the vertex representations in the transfer relation topology network. It should be understood that not all recordings have transfer relationships, so m ≤ n.
The transfer relation matrix of the second recording set represents the transfer relationships among the recordings in the second recording set. For example, it may be expressed as A = (a_ij)_{m×m}, where 1 ≤ i ≤ m, 1 ≤ j ≤ m, and i and j are integers. For i ≠ j, a_ij = 1 indicates a transfer relationship from recording vertex o_i to recording vertex o_j; that is, recording o_i ended because a transfer operation occurred, and that transfer then produced recording o_j. If a_ij = 0 with i ≠ j, no transfer relationship is considered to exist from recording vertex o_i to recording vertex o_j. When i = j, a recording has no transfer relationship with itself, so a_ii = 0.
The attribute matrix of the first recording set represents the relationship between the first recording set and the attributes. For example, it may be expressed as Y = (y_th)_{n×e}, where n is the number of recordings in the first recording set, 1 ≤ t ≤ n with t an integer, e is the total number of attributes contained in the first recording set, and 1 ≤ h ≤ e with h an integer. If y_th = 1, recording s_t contains, or has, the h-th attribute; if y_th = 0, recording s_t does not contain the h-th attribute.
The relation matrix between the first and second recording sets indicates whether a recording in the second set is the same recording as one in the first set. For example, it may be expressed as R = (r_ti)_{n×m}, where n is the number of recordings in the first set, 1 ≤ t ≤ n, m is the number of recordings in the second set, 1 ≤ i ≤ m, and m ≤ n. If r_ti = 1, the second-set recording o_i and the first-set recording s_t are the same recording; if r_ti = 0, they are not the same recording.
It should be understood that matrix A, matrix Y and matrix R are all "0-1" matrices. The "0-1" matrix refers to a matrix consisting of 0 and 1.
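A toy construction of the three 0-1 matrices under the definitions above (the data values are invented purely for illustration):

```python
import numpy as np

n, m, e = 4, 3, 5                  # recordings, recordings with transfers, attributes
Y = np.zeros((n, e), dtype=int)    # attribute matrix of the first recording set
Y[0, [1, 3]] = 1                   # recording s_1 has the 2nd and 4th attributes
Y[1, 2] = 1                        # recording s_2 has the 3rd attribute

A = np.zeros((m, m), dtype=int)    # transfer relation matrix of the second set
A[0, 1] = 1                        # o_1 ended in a transfer that produced o_2
A[1, 2] = 1                        # the diagonal stays 0: no self-transfer

R = np.zeros((n, m), dtype=int)    # relation matrix between the two sets
R[0, 0] = R[1, 1] = R[3, 2] = 1    # which first-set recording each o_i corresponds to
```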
Based on the model parameters and the model matrix, cluster analysis is performed on the plurality of pieces of recording data to obtain the topic cluster and the transfer relation cluster to which each piece of recording data belongs; the cluster analysis process is shown in steps 2002 to 2010.
Optionally, with a first preset threshold as the threshold for the first iteration number, the following steps are repeated to obtain the topic cluster to which each piece of recording data in the first recording set belongs: based on the most recently obtained model parameters α, β, φ, and ω, a first matrix is obtained, where each element represents the probability that a piece of recording data in the first recording set belongs to one of the K topic clusters; the model parameters α and φ are updated based on the first matrix; and, with a second preset threshold as the threshold for the second iteration number, the following steps are repeated to obtain the transfer relation cluster to which each recording in the second recording set belongs: a second matrix is obtained based on the first matrix and the most recently obtained model parameters β and ω, where each element indicates the probability that a recording in the second recording set belongs to one of the L transfer relation clusters; and the model parameters β and ω are updated based on the second matrix; the updated parameters α and φ are used for updating the first matrix and the second matrix.
The first matrix may be denoted q1 and the second matrix q2.
It should be understood that the topic cluster to which a first recording in the first recording set belongs is the topic cluster with the largest of the probabilities that the first recording belongs to each of the K topic clusters, where the first recording is any piece of recording data in the first recording set; the transfer relation cluster to which a second recording in the second recording set belongs is the transfer relation cluster with the largest of the probabilities that the second recording belongs to each of the L transfer relation clusters, where the second recording is any piece of recording data in the second recording set.
In step 2002, the first matrix q1 is calculated from the model parameters K, α, β, φ, and ω and from the matrices Y, A, and R.
The first matrix q1 holds the probabilities that each recording in the first recording set belongs to the different topic clusters. As mentioned above, these probabilities are denoted by g. Thus q1 = (g_dt)_{n×k}, where n is the number of recordings, k is the number of topic clusters, 1 ≤ d ≤ n, 1 ≤ t ≤ k, and d, t, n, and k are positive integers. g_dt is understood as the probability that the d-th recording belongs to the t-th topic cluster.
That is, according to the generative model shown in FIG. 1, the probability g that each recording in the first recording set belongs to each topic cluster, i.e., the first matrix q1, is calculated from the model parameters α, β, φ, and ω involved in the generation model, the number of topic clusters and the number of transfer relation clusters, the transfer relation matrix of the second recording set, the attribute matrix of the first recording set, and the relation matrix between the first and second recording sets.
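A hedged sketch of this computation, assuming a Bernoulli attribute model per topic cluster (the patent's exact posterior may differ and may include further terms from A and R):

```python
import numpy as np

def e_step_topics(Y, alpha, phi):
    """q1[d, t] = P(recording d belongs to topic t | Y), under this sketch's assumptions."""
    phi = np.clip(phi, 1e-9, 1 - 1e-9)            # keep the logarithms finite
    # log alpha_t + sum_h [ y_dh * log phi_th + (1 - y_dh) * log(1 - phi_th) ]
    log_q = np.log(alpha)[None, :] + Y @ np.log(phi.T) + (1 - Y) @ np.log(1 - phi.T)
    log_q -= log_q.max(axis=1, keepdims=True)     # subtract the max for numerical stability
    q1 = np.exp(log_q)
    return q1 / q1.sum(axis=1, keepdims=True)     # normalize over the K topic clusters
```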
The joint probability formula of the Bayesian probability model can be obtained from the Bayesian probability model shown in FIG. 1 and from the settings of the model parameters. The joint probability of the Bayesian probability model is understood as the joint probability of all variables in the model. From the direction of the arrows in the Bayesian probability model, the following joint probability formula is obtained:
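A minimal sketch of what such a joint probability can look like, assuming the attributes and the transfer edges are conditionally independent given the cluster assignments (an illustration; the patent's exact factorization may differ):

```latex
P(Y, A, g, z \mid \alpha, \beta, \phi, \omega)
  = \prod_{t=1}^{n} \Big[ P(g_t \mid \alpha) \prod_{h=1}^{e} P(y_{th} \mid g_t, \phi) \Big]
    \prod_{i=1}^{m} \Big[ P(z_i \mid \beta) \prod_{j=1}^{m} P(a_{ij} \mid z_i, \omega) \Big]
```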
According to Bayes' formula, the joint probability then needs to be converted into a conditional probability in order to find the model's objective function; the posterior probability of the observed variables and the hidden variables of the Bayesian probability model, given the model parameters, can be obtained by the following calculation:
In this way, the posterior probability of the observed variables and the hidden variables under the model parameters is obtained.
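In generic form, the posterior referred to here follows from the joint probability via Bayes' theorem (a sketch, with θ abbreviating the parameter set α, β, φ, ω):

```latex
P(g, z \mid Y, A, \theta)
  = \frac{P(Y, A, g, z \mid \theta)}{P(Y, A \mid \theta)}
  = \frac{P(Y, A, g, z \mid \theta)}{\sum_{g'} \sum_{z'} P(Y, A, g', z' \mid \theta)}
```

The denominator is exactly the marginalization over the hidden variables discussed next.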
It should be understood that, in general, the hidden variables of the model are what is ultimately to be obtained in the calculation, but they are not obtained directly; the hidden variables in the formula therefore need to be marginalized so that the calculation and the model generation process are not affected. For example, after the hidden variables g and z in the Bayesian probability model are marginalized, the following formula is obtained:
substituting the data of each dimension of each model parameter into the formula can obtain:
since many events of small probability may lead to numerical underflow problems, a natural logarithmic form of likelihood function may be used. In order to avoid the problem of underflow of the objective function value caused by multiplication of a large number of probabilities, the joint probabilities obtained before are used as logarithmic functions, so that non-negative constraint of parameter variables can be ensured. The method is as follows: In doing the solving of the likelihood function, if the following formula is to be used:
as a likelihood function, the sum of the hidden variables will lie inside the logarithm, i.e. the joint probability distribution belongs to an exponential distribution.
The presence of the summation in the above formula prevents the logarithmic operation from acting directly on the joint probability distribution, making the form of the maximum likelihood solution more complex. Thus, the Jensen inequality can be applied to the joint probability formula, from which:
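For reference, the generic form of this Jensen-inequality step in EM, with q(g) an arbitrary distribution over the hidden variable (a sketch of the bound's shape, not necessarily the patent's exact L1):

```latex
\log P(Y \mid \theta)
  = \log \sum_{g} q(g)\,\frac{P(Y, g \mid \theta)}{q(g)}
  \;\geq\; \sum_{g} q(g)\,\log \frac{P(Y, g \mid \theta)}{q(g)} \equiv L_1
```

Equality holds when q(g) equals the posterior P(g | Y, θ), which is why maximizing the bound over q(g) recovers the E step.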
where q(g_t) is the probability distribution over the topic clusters, and the marginal probability distribution obtained by marginalizing g out of q(g) can be written as:
Using the Jensen inequality, all possible maxima of q(g) at L = L1 can be obtained, at which point the following formula holds:
In this case, calculating the maximum of L1 amounts to finding the maxima over q(g) and over the model parameters α, β, φ, and ω. It will be appreciated that solving for the maximizing q(g) corresponds to the E step of the EM algorithm, and solving for the maximizing model parameters α, β, φ, and ω corresponds to the M step.
Substituting each parameter into the joint probability formula obtained in the previous step, that is, into the function defining the posterior probability of the hidden variable, gives:
Optionally, the model parameters α and φ are updated based on the first matrix using an EM algorithm.
In step 2003, the values of the model parameters α and φ are updated according to the first matrix q1.
As mentioned above, solving for the maximizing q(g) corresponds to the E step of the EM algorithm, and solving for the maximizing model parameters α, β, φ, and ω corresponds to the M step.
The values of the model parameters α and φ are updated with the calculated first matrix q1 using a nested EM algorithm: q1 is substituted into L1 so as to maximize α_d and φ in the likelihood function.
Taking the derivative of the likelihood function with respect to α_d to obtain its maximum, while bringing in the Lagrange multiplier constraint term, finally yields the update for α_d.
Similarly, taking the derivative of the likelihood function with respect to φ to obtain its maximum, and bringing in the Lagrange multiplier constraint term, finally yields the update for φ.
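A hedged sketch of the resulting closed-form updates, assuming the usual Lagrange-multiplier normalization of mixture models (the patent's exact expressions may differ):

```python
import numpy as np

def m_step_topics(q1, Y):
    """q1: (n, K) topic responsibilities; Y: (n, e) 0-1 attribute matrix."""
    n = q1.shape[0]
    alpha = q1.sum(axis=0) / n                    # mixing weights; sums to 1
    # P(attribute h | topic k): responsibility-weighted attribute frequency
    phi = (q1.T @ Y) / q1.sum(axis=0)[:, None]
    return alpha, phi
```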
In step 2004, it is determined whether the first iteration number is greater than the first preset threshold. If the first iteration number is less than or equal to the first preset threshold, step 2005 is performed; if the first iteration number is greater than the first preset threshold, step 2010 is performed.
For example, the first preset threshold may be denoted N and the first iteration number n. When n ≤ N, step 2005 is performed; when n > N, step 2010 is performed, i.e., the first matrix q1 and the second matrix q2 are output.
It should be appreciated that the first preset threshold may be set to a specific value by the user; in that case, the specific value of the first preset threshold may also be received in step 2001. The first preset threshold may also be a fixed value set in advance, for example, N = 200.
In step 2005, the second matrix q2 is calculated from the values of the model parameters β and ω and the first matrix q1.
The second matrix q2 holds the probabilities that each recording in the second recording set belongs to the different transfer relation clusters; these probabilities are denoted by z.
In the likelihood function L1 obtained above, the hidden variable z_i still appears in the latter half of the function, so the Jensen inequality is applied once more to the right half of the formula.
The inner EM algorithm is then carried out according to the values of the model parameters β and ω and the first matrix q1; the likelihood function in this inner EM algorithm is:
where the marginal probability of the probability distribution q(z) can be understood as the probability that the second-set recording o_i belongs to the c-th transfer relation cluster.
Maximizing the right half of the formula, which can be understood as the posterior probability of the hidden variable z_i, gives:
Optionally, the model parameters β and ω are updated based on the second matrix using an EM algorithm.
In step 2006, the values of the model parameters β and ω are updated according to the second matrix q2.
Assuming q2 is held at a fixed value, it is substituted into formula L1 so as to maximize β_c and ω_cj in L1.
Taking the derivative of L1 with respect to the parameter β_c to obtain its maximum, and bringing in the Lagrange multiplier constraint term, finally yields the update for β_c.
Taking the derivative of L1 with respect to the parameter ω_cj to obtain its maximum, and bringing in the Lagrange multiplier constraint term, finally yields the update for ω_cj.
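An analogous hedged sketch for the inner M step, under the same normalized-form assumption:

```python
import numpy as np

def m_step_transfers(q2, A):
    """q2: (m, L) transfer-cluster responsibilities; A: (m, m) 0-1 transfer matrix."""
    m = q2.shape[0]
    beta = q2.sum(axis=0) / m                     # cluster priors; sums to 1
    # P(edge to operation j | cluster c): responsibility-weighted edge frequency
    omega = (q2.T @ A) / q2.sum(axis=0)[:, None]
    return beta, omega
```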
In step 2007, the second iteration number is incremented by 1, resulting in a new second iteration number value.
For example, the second iteration number may be denoted m, and the new second iteration number is then m + 1; that is, m = m + 1.
In step 2008, it is determined whether the second iteration number is greater than a second preset threshold, and if the second iteration number is less than or equal to the second preset threshold, step 2005 is performed; if the second iteration number is greater than the second preset threshold, step 2009 is performed.
For example, the second preset threshold may be denoted by M, and when m.ltoreq.M, step 2005 is performed; when M > M, step 2009 is performed.
In step 2009, the first iteration number is increased by 1, and a new first iteration number value is obtained.
For example, the first iteration number may be denoted n, and the new first iteration number is then n + 1; that is, n = n + 1.
It is to be understood that after step 2009, step 2002 is continued.
In step 2010, a first matrix q1 and a second matrix q2 are output.
It should be understood that calculating the values of the hidden variables g and z, i.e., the first matrix q1 and the second matrix q2, can also be understood as the generation process of the generative model. In this process, the recording nodes are generated in turn: for each recording node, the cluster containing the node is generated probabilistically, and once the node's cluster is determined, the edges between that node and other recording nodes are generated; the probability distribution model is a multinomial distribution.
As mentioned above, the first matrix q1 is the matrix of probabilities g that the recordings in the first recording set belong to the different topic clusters, and the second matrix q2 is the matrix of probabilities z that the recordings in the second recording set belong to the different transfer relation clusters. Outputting the first matrix q1 can thus be understood as outputting the probability g that each first-set recording belongs to each topic cluster, and outputting the second matrix q2 as outputting the probability z that each second-set recording belongs to each transfer relation cluster.
It should also be appreciated that, in the algorithm implementation described above, the parameters of the generative model are first initialized with a random initial value for each parameter, and the EM iteration process is then entered. The outer-layer EM iteration is performed first, with a first iteration number n recording the number of outer EM iterations, starting from 0 and running until n > N. In each iteration, the parameters' current values are first substituted into the formula for q1 to solve for q1; q1 is then held fixed and substituted into the likelihood function above, which is maximized by taking its partial derivatives with respect to the parameters and finding the extrema, giving the parameter values that maximize the likelihood function for the current q1; the new parameters are obtained, and the iteration proceeds step by step. After each set of parameter values is obtained, it is checked whether the first iteration number n exceeds N: if n ≤ N, the inner-layer EM iteration process is entered; if n > N, the whole iteration ends, and the parameter values obtained are the final globally optimal solution. For each inner-layer EM iteration process, a second iteration number m records the number of inner EM iterations, starting from 0 and running until m > M. First, q2 is solved from the parameters' current values; the obtained q2 is then substituted into the likelihood function, partial derivatives are taken with respect to the parameters one by one, the extrema give the new parameters, the iteration number is incremented by 1, and the iteration proceeds step by step. Similarly, after the parameters are obtained each time, the second iteration number m is checked: if m > M, the locally optimal solution of the parameters has been obtained, the inner iteration ends, the outer iteration resumes, and the outer iteration number is incremented by 1; if m ≤ M, the iteration continues. Throughout the Bayesian probability model and this iterative calculation, the influence on recording clustering of both the recording content and the transfer relation operations between recordings is taken into account.
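The nested iteration just described, reduced to a control-flow sketch (step numbers follow FIG. 2; e_step_topics, m_step_topics, and m_step_transfers are the hedged sketches above, and e_step_transfers is an assumed helper standing in for the patent's inner E-step formula):

```python
def nested_em(Y, A, R, params, N=200, M=50):
    """Outer EM over topic clusters (q1) with an inner EM over transfer clusters (q2)."""
    q1 = q2 = None
    n = 0
    while n <= N:                                              # steps 2002-2004
        q1 = e_step_topics(Y, params.alpha, params.phi)        # step 2002: solve q1
        params.alpha, params.phi = m_step_topics(q1, Y)        # step 2003: update alpha, phi
        m = 0
        while m <= M:                                          # steps 2005-2008
            q2 = e_step_transfers(q1, A, R, params)            # step 2005: solve q2 (assumed helper)
            params.beta, params.omega = m_step_transfers(q2, A)  # step 2006: update beta, omega
            m += 1                                             # step 2007
        n += 1                                                 # step 2009
    return q1, q2                                              # step 2010: output q1 and q2
```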
Based on this technical scheme, the recording text and the transfer relationships between recordings are used jointly as data sources for modeling: a probability generation model covering both the recording structure and the recording semantics is built, and the recording text clustering is performed by combining recording content with transfer relationships, so that a more accurate clustering result is obtained and, in turn, more accurate service is provided to customers.
Fig. 3 is a schematic block diagram of a processing apparatus 300 for recording data according to an embodiment of the present application.
As shown in fig. 3, the apparatus 300 may include: a processing module 310 and an acquisition module 320. The acquisition module 320 is configured to acquire model parameters and a model matrix based on a predefined probability generation model; the processing module 310 may be configured to perform cluster analysis on the plurality of pieces of recording data based on the model parameters and the model matrix, so as to obtain the topic cluster and the transfer relation cluster to which each piece of recording data belongs.
Optionally, the processing module 310 may be configured to repeatedly perform the following steps, with a first preset threshold as the threshold for the first iteration number, to obtain the topic cluster to which each piece of recording data in the first recording set belongs: based on the most recently obtained model parameters α, β, φ, and ω, obtaining a first matrix, where each element in the first matrix represents the probability that a piece of recording data in the first recording set belongs to one of the K topic clusters; updating the model parameters α and φ based on the first matrix; and repeatedly performing the following steps, with a second preset threshold as the threshold for the second iteration number, to obtain the transfer relation cluster to which each recording in the second recording set belongs: obtaining a second matrix based on the first matrix and the most recently obtained model parameters β and ω, where each element in the second matrix indicates the probability that a recording in the second recording set belongs to one of the L transfer relation clusters; and updating the model parameters β and ω based on the second matrix; the updated parameters α and φ are used for updating the first matrix and the second matrix.
Optionally, the topic cluster to which a first recording in the first recording set belongs is the topic cluster with the largest of the probabilities that the first recording belongs to each of the K topic clusters, where the first recording is any piece of recording data in the first recording set.
Optionally, the transfer relation cluster to which a second recording in the second recording set belongs is the transfer relation cluster with the largest of the probabilities that the second recording belongs to each of the L transfer relation clusters, where the second recording is any piece of recording data in the second recording set.
Optionally, the processing module 310 may be further configured to update the model parameters α and φ based on the first matrix using a nested maximum expectation algorithm.
Optionally, the processing module 310 may be further configured to update the model parameters β and ω based on the second matrix using a nested maximum expectation algorithm.
It should be understood that the division into modules in the embodiments of the present application is illustrative and is merely a division by logical function; other divisions are possible in actual implementation. In addition, the functional modules in the embodiments of the present application may be integrated into one processor, may exist alone physically, or two or more modules may be integrated into one module. An integrated module may be implemented in hardware or as a software functional module.
Fig. 4 is a schematic block diagram of a processing apparatus 400 for recording data according to an embodiment of the present application.
As shown in fig. 4, the apparatus 400 may include at least one processor 410 for implementing the method provided by the embodiment of the present application. Wherein the device may be a system-on-chip. In the embodiment of the application, the chip system can be formed by a chip, and can also comprise the chip and other discrete devices.
Illustratively, when the apparatus 400 implements the method provided by the embodiments of the present application, the processor 410 may be configured to acquire model parameters and a model matrix based on a predefined probability generation model, and to perform cluster analysis on the plurality of pieces of recording data based on the model parameters and the model matrix, so as to obtain the topic cluster and the transfer relation cluster to which each piece of recording data belongs. Reference is made to the detailed description in the method embodiments; details are not repeated here.
The apparatus 400 may also include at least one memory 420 for storing program instructions and/or data. Memory 420 is coupled to processor 410. The coupling in the embodiments of the present application is an indirect coupling or communication connection between devices, units, or modules, which may be in electrical, mechanical, or other forms for information interaction between the devices, units, or modules. Processor 410 may operate in conjunction with memory 420. Processor 410 may execute program instructions stored in memory 420. At least one of the at least one memory may be included in the processor.
The apparatus 400 may also include a communication interface 430 for communicating with other devices over a transmission medium so that an apparatus for use in the apparatus 400 may communicate with other devices. The communication interface 430 may be, for example, a transceiver, an interface, a bus, a circuit, or a device capable of implementing a transceiver function. Processor 410 may utilize communication interface 430 to transceive data and/or information and to implement the methods performed in the corresponding embodiments of fig. 2.
The specific connection medium among the processor 410, the memory 420, and the communication interface 430 is not limited in the embodiments of the present application. In fig. 4, the processor 410, the memory 420, and the communication interface 430 are connected by a bus 440, which is drawn in bold; the connections between the other components are merely illustrative and not limiting. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 4, but this does not mean that there is only one bus or one type of bus.
The embodiment of the application provides a server, which comprises at least one memory and at least one processor, the at least one memory being configured to store a computer program and the data in a database; the at least one processor is configured to invoke the computer program to cause the server to perform the method of the embodiment shown in fig. 2.
The present application provides a chip system including at least one processor for supporting implementation of the method of the embodiment shown in fig. 2.
In one possible design, the system on a chip further includes a memory to hold program instructions and data, the memory being located either within the processor or external to the processor.
The chip system may be formed of a chip or may include a chip and other discrete devices.
The present application also provides a computer program product comprising: a computer program (which may also be referred to as code, or instructions) which, when executed, causes a computer to perform the method of the embodiment shown in fig. 2.
The present application also provides a computer-readable storage medium storing a computer program (which may also be referred to as code, or instructions). The computer program, when executed, causes the computer to perform the method of the embodiment shown in fig. 2.
It should be appreciated that the processor in embodiments of the present application may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method embodiments may be implemented by integrated logic circuits of hardware in a processor or instructions in software form. The processor may be a general purpose processor, a digital signal processor (digital signal processor, DSP), an application specific integrated circuit (application specific integrated circuit, ASIC), a field programmable gate array (field programmable gate array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
It should also be appreciated that the memory in embodiments of the present application may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be random access memory (random access memory, RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The terms used in this specification: "unit," "module," etc., may be used to refer to a computer-related entity, hardware, firmware, a combination of hardware and software, or software in execution.
Those of ordinary skill in the art will appreciate that the various illustrative logical blocks (illustrative logical block) and steps (steps) described in connection with the embodiments disclosed herein can be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application. In the several embodiments provided by the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
In the above-described embodiments, the functions of the respective functional units may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions (programs). When the computer program instructions (program) are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are fully or partially produced. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line (digital subscriber line, DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital versatile disk (digital video disc, DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for processing sound recording data, comprising:
acquiring model parameters and a model matrix based on a predefined probability generative model; the model parameters include: K, L, α, β, φ, and ω; K represents the number of topic clusters to which a plurality of pre-obtained recordings possibly belong, and L represents the number of switching relation clusters; α represents the prior probability of the hidden variable g; β represents the prior probability of the hidden variable z; φ represents the probability that a recording with a predefined attribute belongs to a topic cluster; ω represents the probability of a switching relation with an operation in the case where another operation belongs to a switching relation cluster; the model matrix includes: the attribute matrix Y of the first recording set, the transfer relation matrix A of the second recording set, and the relation matrix R between each piece of recording data in the first recording set and the second recording set; the first recording set comprises a plurality of pieces of pre-obtained recording data, and each element in the attribute matrix Y of the first recording set is used for representing the probability that each piece of the plurality of pieces of recording data has each attribute; the second recording set comprises a plurality of pieces of recording data having transfer relations, and each element in the transfer relation matrix A of the second recording set is used for representing whether a transfer relation exists between every two pieces of the recording data having transfer relations; each element in the relation matrix R is used for representing the correspondence between each piece of recording data in the second recording set and the recording data in the first recording set; K and L are positive integers; wherein α, β, φ, ω, g, z, A, Y, and R satisfy the following formula:
and carrying out cluster analysis on the plurality of pieces of recording data based on the model parameters and the model data, to obtain the topic cluster and the switching relation cluster to which each piece of recording data belongs.
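To make the data objects of claim 1 concrete, the following is a minimal NumPy sketch, not the patented implementation: the sizes N, F, and M are assumptions, the values are random stand-ins, and φ is the symbol used here for the attribute-emission parameter.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N first-set recordings with F attributes, M second-set
# recordings with transfer relations, K topic clusters, L switching clusters.
N, F, M, K, L = 100, 20, 40, 5, 3

# Model matrices of claim 1 (random stand-in values, not real data):
Y = rng.random((N, F))           # attribute matrix: P(recording i has attribute f)
A = rng.integers(0, 2, (M, M))   # transfer relation matrix: 1 if a relation exists
R = rng.integers(0, 2, (M, N))   # relation matrix: second-set/first-set correspondence

# Model parameters: priors of the hidden variables g and z, and the two
# emission probabilities (phi for attributes, omega for switching relations).
alpha = np.full(K, 1.0 / K)              # prior probability of hidden variable g
beta = np.full(L, 1.0 / L)               # prior probability of hidden variable z
phi = rng.dirichlet(np.ones(F), size=K)  # P(attribute f | topic cluster k)
omega = rng.random((L, L))               # P(switching relation | cluster pair)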
2. The method of claim 1, wherein the carrying out cluster analysis on the plurality of pieces of recording data based on the model parameters and the model data comprises:
taking a first preset threshold as the threshold on the number of first iterations, repeatedly executing the following steps to obtain the topic cluster to which each piece of recording data in the first recording set belongs:
based on the most recently obtained model parameters α, β, φ, and ω, obtaining a first matrix, wherein each element in the first matrix represents the probability that each piece of recording data in the first recording set belongs to each of the K topic clusters;
updating model parameters α and φ based on the first matrix; and
repeatedly executing the following steps with a second preset threshold as the threshold on the number of second iterations, to obtain the switching relation cluster to which each piece of recording data in the second recording set belongs:
obtaining a second matrix based on the first matrix and the most recently obtained model parameters β and ω, wherein each element in the second matrix is used for indicating the probability that each piece of recording data in the second recording set belongs to each of the L switching relation clusters; and
updating model parameters β and ω based on the second matrix, wherein the updated parameters α and φ are used for updating the first matrix and the second matrix.
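One way to organize the two nested loops of claim 2 is sketched below. The update functions and parameter re-estimates are generic mixture-model stand-ins rather than the patent's own formulas, the two thresholds are arbitrary, and the variable shapes follow the sketch after claim 1.

import numpy as np

def update_first_matrix(Y, alpha, phi):
    # Stand-in E-step: P(first-set recording i belongs to topic cluster k),
    # scored as log alpha[k] + sum_f Y[i, f] * log phi[k, f], then normalized.
    logp = np.log(alpha)[None, :] + Y @ np.log(phi).T
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def update_second_matrix(A, first_matrix, beta, omega):
    # Stand-in E-step: P(second-set recording m belongs to switching cluster l).
    # first_matrix is accepted to mirror the claim but unused in this stand-in;
    # the score here uses only each recording's degree in the relation matrix.
    degree = A.sum(axis=1, keepdims=True)
    logp = np.log(beta)[None, :] + degree * np.log(omega.diagonal())[None, :]
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

# Toy data with the shapes used in the sketch after claim 1.
rng = np.random.default_rng(0)
N, F, M, K, L = 30, 8, 12, 4, 2
Y, A = rng.random((N, F)), rng.integers(0, 2, (M, M))
alpha, beta = np.full(K, 1.0 / K), np.full(L, 1.0 / L)
phi = rng.dirichlet(np.ones(F), size=K)
omega = rng.random((L, L)) * 0.5 + 0.25

first_threshold, second_threshold = 20, 20               # preset iteration counts
for _ in range(first_threshold):                         # first loop of claim 2
    first_matrix = update_first_matrix(Y, alpha, phi)
    alpha = first_matrix.mean(axis=0)                    # stand-in alpha update
    phi = first_matrix.T @ Y + 1e-9
    phi /= phi.sum(axis=1, keepdims=True)                # stand-in phi update
    for _ in range(second_threshold):                    # second loop of claim 2
        second_matrix = update_second_matrix(A, first_matrix, beta, omega)
        beta = second_matrix.mean(axis=0)                # stand-in beta update
        omega = second_matrix.T @ A @ second_matrix + 1e-9
        omega /= np.outer(second_matrix.sum(0), second_matrix.sum(0)) + 1e-9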
3. The method of claim 2, wherein the topic cluster to which a first record in the first recording set belongs is the topic cluster corresponding to the greatest of the probabilities that the first record belongs to each of the K topic clusters, the first record being any piece of recording data in the first recording set.
4. The method of claim 2, wherein the switching relation cluster to which a second record in the second recording set belongs is the switching relation cluster corresponding to the greatest of the probabilities that the second record belongs to each of the L switching relation clusters, the second record being any piece of recording data in the second recording set.
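Claims 3 and 4 both select, for each recording, the cluster whose membership probability is greatest. With row-stochastic membership matrices of the shapes assumed in the sketches above, that selection is simply a row-wise argmax:

import numpy as np

# Assumed membership matrices: rows are recordings, columns are clusters,
# each row summing to 1 (as produced by the sketches above).
first_matrix = np.array([[0.1, 0.7, 0.2],
                         [0.5, 0.3, 0.2]])
second_matrix = np.array([[0.6, 0.4],
                          [0.2, 0.8]])

topic_cluster = first_matrix.argmax(axis=1)    # claim 3: index of greatest probability
switch_cluster = second_matrix.argmax(axis=1)  # claim 4: index of greatest probability
print(topic_cluster, switch_cluster)           # [1 0] [0 1]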
5. The method of claim 2, wherein the updating model parameters α and φ based on the first matrix comprises:
updating model parameters α and φ based on the first matrix using a nested expectation-maximization algorithm.
6. The method of claim 2, wherein the updating model parameters β and ω based on the second matrix comprises:
updating model parameters β and ω based on the second matrix using a nested expectation-maximization algorithm.
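Claims 5 and 6 invoke a nested expectation-maximization algorithm. The M-step below is a generic block-model-style re-estimate of β and ω from fixed responsibilities, given only to illustrate the kind of update such an algorithm performs; the patent's own formulas are not reproduced here.

import numpy as np

def m_step_switch(A, second_matrix):
    # Generic EM M-step: re-estimate beta and omega from responsibilities.
    # A is the (M, M) transfer relation matrix; second_matrix is (M, L) with
    # second_matrix[m, l] = P(recording m belongs to switching cluster l).
    beta = second_matrix.mean(axis=0)  # mixing weights of the L clusters
    # Expected relation counts between cluster pairs, normalized by the
    # expected number of recording pairs assigned to those clusters.
    num = second_matrix.T @ A @ second_matrix + 1e-9
    den = np.outer(second_matrix.sum(axis=0), second_matrix.sum(axis=0)) + 1e-9
    omega = num / den                  # P(transfer relation | cluster pair)
    return beta, omega

rng = np.random.default_rng(1)
A = rng.integers(0, 2, (6, 6))
second_matrix = rng.dirichlet(np.ones(3), size=6)
beta, omega = m_step_switch(A, second_matrix)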
7. A processing device for recording data, characterized in that it is configured to implement the method of any one of claims 1 to 6.
8. A processing device for recording data, characterized by comprising a processor configured to execute program code to cause the device to implement the method of any one of claims 1 to 6.
9. A computer readable storage medium comprising a computer program which, when run on a computer, causes the computer to perform the method of any one of claims 1 to 6.
10. A computer program product comprising a computer program which, when run, causes a computer to perform the method of any one of claims 1 to 6.
CN202110742770.3A 2021-06-30 2021-06-30 Recording data processing method and device Active CN113378977B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110742770.3A CN113378977B (en) 2021-06-30 2021-06-30 Recording data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110742770.3A CN113378977B (en) 2021-06-30 2021-06-30 Recording data processing method and device

Publications (2)

Publication Number Publication Date
CN113378977A CN113378977A (en) 2021-09-10
CN113378977B true CN113378977B (en) 2023-11-21

Family

ID=77580343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110742770.3A Active CN113378977B (en) 2021-06-30 2021-06-30 Recording data processing method and device

Country Status (1)

Country Link
CN (1) CN113378977B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679135A (en) * 2017-09-22 2018-02-09 深圳市易图资讯股份有限公司 The topic detection of network-oriented text big data and tracking, device
CN108388674A (en) * 2018-03-26 2018-08-10 百度在线网络技术(北京)有限公司 Method and apparatus for pushed information
CN109101518A (en) * 2018-05-21 2018-12-28 全球能源互联网研究院有限公司 Phonetic transcription text quality appraisal procedure, device, terminal and readable storage medium storing program for executing
CN109410954A (en) * 2018-11-09 2019-03-01 杨岳川 A kind of unsupervised more Speaker Identification device and method based on audio-video
CN109451182A (en) * 2018-10-19 2019-03-08 北京邮电大学 A kind of detection method and device of fraudulent call
CN111309824A (en) * 2020-02-18 2020-06-19 中国工商银行股份有限公司 Entity relationship map display method and system

Also Published As

Publication number Publication date
CN113378977A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
JP7392668B2 (en) Data processing methods and electronic equipment
JP6743934B2 (en) Method, apparatus and system for estimating causal relationship between observed variables
US10452793B2 (en) Multi-dimension variable predictive modeling for analysis acceleration
Liu et al. Iterative methods for private synthetic data: Unifying framework and new methods
CN108205570B (en) Data detection method and device
Größer et al. Copulae: An overview and recent developments
CN112232495B (en) Prediction model training method, device, medium and computing equipment
CN108140022B (en) Data query method and database system
CN112162860A (en) CPU load trend prediction method based on IF-EMD-LSTM
CN111046882B (en) Disease name standardization method and system based on profile hidden Markov model
Tian et al. Variable selection in the high-dimensional continuous generalized linear model with current status data
CN113110843B (en) Contract generation model training method, contract generation method and electronic equipment
CN113378977B (en) Recording data processing method and device
Wang et al. A Note on "Towards Efficient Data Valuation Based on the Shapley Value"
Liu et al. New method for multi-state system reliability analysis based on linear algebraic representation
Wan et al. Graphical lasso for extremes
Abdelaal et al. AutoCure: Automated Tabular Data Curation Technique for ML Pipelines
Zhu et al. A hybrid model for nonlinear regression with missing data using quasilinear kernel
CN115587125A (en) Metadata management method and device
Dahinden et al. Decomposition and model selection for large contingency tables
JPWO2008084842A1 (en) Kernel function generation method and device, data classification device
Aflakparast et al. Analysis of Twitter data with the Bayesian fused graphical lasso
JP5914291B2 (en) Transition probability calculation device, total value calculation device, transition probability calculation method, total value calculation method
Fernández Zeros of sections of power series: deterministic and random
JP5008096B2 (en) Automatic document classification method and automatic document classification system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant