CN115617992A - Label generation method and device, computer readable storage medium and computer equipment - Google Patents


Info

Publication number
CN115617992A
Authority
CN
China
Prior art keywords
scene
training
dimension
text data
dialogue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211236358.5A
Other languages
Chinese (zh)
Inventor
姜磊
胡加学
赵景鹤
贺志阳
鹿晓亮
魏思
胡国平
赵志伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Xunfei Medical Co ltd Wuhan Branch
Anhui Xunfei Medical Co ltd
Original Assignee
Anhui Xunfei Medical Co ltd Wuhan Branch
Anhui Xunfei Medical Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Xunfei Medical Co ltd Wuhan Branch, Anhui Xunfei Medical Co ltd filed Critical Anhui Xunfei Medical Co ltd Wuhan Branch
Priority to CN202211236358.5A
Publication of CN115617992A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a label generation method and device, a computer-readable storage medium, and computer equipment. The method comprises the following steps: obtaining dialogue audio data and a preset scale of a target object, wherein the preset scale comprises prior knowledge under each scene dimension; performing recognition processing on the dialogue audio data to obtain each scene dimension and the dialogue text data and answer text data under each scene dimension; determining a first classification result of the target object in a time-sequence coding dimension according to the dialogue text data, the answer text data and the prior knowledge under each scene dimension and the timing information of each scene dimension appearing in the dialogue audio data; determining a second classification result of the target object in the preset scale dimension according to the dialogue text data, the answer text data and the prior knowledge under each scene dimension; and generating a target classification label of the target object according to the first classification result and the second classification result, thereby improving the accuracy of generating the target classification label.

Description

Label generation method and device, computer readable storage medium and computer equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a tag generation method, an apparatus, a computer-readable storage medium, and a computer device.
Background
With the data industry developing at an explosive pace, collecting data across dimensions such as users' social attributes, consumption habits, and preference characteristics makes it possible to describe the characteristic attributes of a user or a product, and to analyze and statistically mine potential value information from these characteristics, thereby abstracting an information profile of the user. Such a profile can be regarded as the foundation of enterprise big-data applications and is a precondition for targeted advertising and personalized recommendation.
Moreover, as the pace of life quickens and social competition becomes more intense, generating user classification labels, such as depression rating labels, has also become a demand.
At present, user classification labels are mostly generated based on scale evaluation, for example, by questioning and scoring according to a scale and finally giving a classification label, or by analogical reasoning, in which sequence learning is performed on a large amount of labeled data and the interactive dialogue speech information is learned in a simple way. The scale-based evaluation is a manual evaluation method: the resulting classification labels are affected by subjective factors, and the labels given by experienced and inexperienced raters may differ greatly, which affects the accuracy of the classification labels. Analogy-based reasoning is easy to implement, but when there are many dialogue turns and the content is long, the key information in the dialogue is not captured adequately, which also affects the accuracy of the classification labels.
Disclosure of Invention
The embodiment of the application provides a label generation method, a label generation device, a computer readable storage medium and computer equipment, which can improve the accuracy of generating a target classification label of a target object.
The embodiment of the application provides a label generation method, which comprises the following steps:
obtaining dialogue audio data corresponding to a target object, and identifying and processing the dialogue audio data to obtain a plurality of scene dimensions, dialogue text data under each scene dimension and answer text data under each scene dimension, which are related in the dialogue audio data;
acquiring a preset scale corresponding to the target object, wherein the preset scale comprises prior knowledge under each scene dimension, and the prior knowledge is predetermined based on the dialogue audio data;
determining a first classification result of the target object in a time sequence coding dimension according to the dialog text data, the answer text data and the prior knowledge in each scene dimension and the time sequence information of each scene dimension in the dialog audio data;
determining a second classification result of the target object on the preset scale dimension according to the dialogue text data, the answer text data and the priori knowledge under each scene dimension;
and generating a target classification label of the target object according to the first classification result and the second classification result.
An embodiment of the present application further provides a tag generation method, including:
acquiring training dialogue audio data corresponding to a training target object and a first label result of the training target object in a time sequence coding dimension, and acquiring a training preset table of the training target object, wherein the training preset table comprises training prior knowledge under each scene dimension in a plurality of scene dimensions, and the training prior knowledge is predetermined based on the training audio data;
performing recognition processing on the training dialogue audio data to obtain a plurality of scene dimensions, training dialogue text data under each scene dimension, and training answer text data under each scene dimension, which are related in the training dialogue audio data;
obtaining an initial classification label generation model;
according to the training dialogue text data, the training answer text data and the training priori knowledge in each scene dimension and the time sequence information appearing in each scene dimension in the training dialogue audio data, generating a model by using the initial classification label, and determining a training first classification result of the training target object in the time sequence coding dimension;
according to the training dialogue text data, the training answer text data and the training priori knowledge in each scene dimension, generating a model by using the initial classification label, and determining a training scene score of the training target object in each scene dimension;
determining an overall loss value according to the training first classification result, the first label result, the training scenario score and the training prior knowledge;
and updating the training parameters in the initial classification label generation model according to the total loss value until the total loss value meets a preset condition so as to obtain a trained classification label generation model.
An embodiment of the present application further provides a tag generation apparatus, including:
the acquisition module is used for acquiring dialogue audio data corresponding to the target object;
the recognition module is used for recognizing the dialogue audio data to obtain a plurality of scene dimensions, dialogue text data under each scene dimension and answer text data under each scene dimension, which are related in the dialogue audio data;
the acquisition module is further configured to acquire a preset scale corresponding to the target object, where the preset scale includes prior knowledge in each scene dimension, and the prior knowledge is predetermined based on the dialog audio data;
a first determining module, configured to determine, according to the dialog text data, the answer text data, and the priori knowledge in each scene dimension, and timing information of occurrence of each scene dimension in the dialog audio data, a first classification result of the target object in a timing coding dimension;
a second determining module, configured to determine, according to the dialog text data, the answer text data, and the priori knowledge in each scene dimension, a second classification result of the target object in the preset table dimension;
and the generating module is used for generating a target classification label of the target object according to the first classification result and the second classification result.
An embodiment of the present application further provides a tag generation apparatus, including:
a training acquisition module, configured to acquire training dialogue audio data corresponding to a training target object and a first label result of the training target object in a time sequence coding dimension, and acquire a training preset table of the training target object, where the training preset table includes training prior knowledge in each of a plurality of scene dimensions, and the training prior knowledge is predetermined based on the training audio data;
the training recognition module is used for recognizing the training dialogue audio data to obtain a plurality of scene dimensions, training dialogue text data under each scene dimension and training answer text data under each scene dimension, which are related in the training dialogue audio data;
the training acquisition module is used for acquiring an initial classification label generation model;
a first training determination module, configured to determine, according to the training dialogue text data, the training answer text data, and the training priori knowledge in each scene dimension, and timing information appearing in each scene dimension in the training dialogue audio data, a training first classification result of the target object in the timing coding dimension by using the initial classification tag generation model;
a second training determination module, configured to determine a training scene score of the training target object in each scene dimension by using the initial classification label generation model according to the training dialogue text data, the training answer text data, and the training priori knowledge in each scene dimension;
a loss value determination module, configured to determine an overall loss value according to the training first classification result, the first label result, the training scenario score, and the training prior knowledge;
and the updating module is used for updating the training parameters in the initial classification label generation model according to the total loss value until the total loss value meets a preset condition so as to obtain a trained classification label generation model.
Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, where the computer program is suitable for being loaded by a processor to execute steps in a tag generation method according to any one of the above embodiments.
An embodiment of the present application further provides a computer device, where the computer device includes a memory and a processor, where the memory stores a computer program, and the processor executes the steps in the tag generation method according to any of the above embodiments by calling the computer program stored in the memory.
According to the tag generation method and apparatus, the computer-readable storage medium, and the computer device, when the target classification tag of the target object is generated, not only the first classification result in the time-sequence coding dimension but also the second classification result in the preset scale dimension is considered, which improves the accuracy of the generated target classification tag. Moreover, the prior knowledge in the preset scale is incorporated into the determination of both the first classification result and the second classification result, which further improves the accuracy of the generated target classification tag. In addition, when the first classification result in the time-sequence coding dimension is generated, the timing information of each scene dimension appearing in the dialogue audio data, that is, the context information of the dialogue audio data, is considered, so that the global information of the whole audio data is integrated; this improves the accuracy of the first classification result and thus the accuracy of the generated target classification tag of the target object.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a tag generation method provided in an embodiment of the present application.
Fig. 2 is a schematic diagram of a flow of data preprocessing provided in an embodiment of the present application.
Fig. 3 is a schematic diagram of text information provided in an embodiment of the present application.
Fig. 4 is a schematic diagram of a classification label generation model according to an embodiment of the present application.
Fig. 5 is a schematic diagram of a multi-granularity knowledge inference module provided in an embodiment of the present application.
Fig. 6 is a schematic sub-flow diagram of a tag generation method provided in the embodiment of the present application.
Fig. 7 is a sub-flow diagram of a tag generation method according to an embodiment of the present application.
Fig. 8 is a schematic sub-flow diagram of a tag generation method according to an embodiment of the present application.
Fig. 9 is another schematic flow chart of a tag generation method according to an embodiment of the present application.
Fig. 10 is a schematic structural diagram of a label generation apparatus according to an embodiment of the present application.
Fig. 11 is another schematic structural diagram of a label generation apparatus according to an embodiment of the present application.
Fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a label generation method, a label generation device, a computer readable storage medium and computer equipment. Specifically, the tag generation method in the embodiment of the present application may be executed by a computer device, and the tag generation apparatus in the embodiment of the present application is integrated in the computer device. The computer device may be a terminal or a server. The terminal can be a terminal device such as a smart phone, a tablet Computer, a notebook Computer, a touch screen, a Personal Computer (PC), and a robot. The server may be an independent physical server, a server cluster formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service and a cloud database.
Fig. 1 is a schematic flowchart of a tag generation method provided in an embodiment of the present application, where the tag generation method is applied to a computer device and includes the following steps.
101, obtaining dialogue audio data corresponding to a target object, and performing recognition processing on the dialogue audio data to obtain multiple scene dimensions, dialogue text data under each scene dimension, and answer text data under each scene dimension, which are related in the dialogue audio data.
At least two different character objects, such as a first character object and a second character object, are included in the conversational audio data. For example, the first character object is a doctor, the second character object is a patient, the corresponding target classification label may be one of a depression classification label and a disease classification label, and the second character object is a target object. In other embodiments, the first character object is a teacher, the second character object is a student, and the like, the second character object is a target object, and the corresponding target classification label may be one of the knowledge mastering grade labels, and the like.
The dialogue audio data corresponding to the target object may be understood as dialogue audio data that includes the audio data of the target object; for example, when a doctor asks a patient about condition information, the dialogue audio data between the doctor and the patient includes the audio data of the patient and is therefore the dialogue audio data corresponding to the target object. The dialogue audio data includes multiple rounds of interactive communication between the doctor and the user, for example, multiple rounds of questions and responses conducted according to the contents of a scale such as the Hamilton Depression Rating Scale (HAMD scale) or the PHQ-9 scale.
And acquiring dialogue audio data of the target object, and identifying the dialogue audio data.
As shown in fig. 2, the dialogue audio data is subjected to role separation processing, role classification processing, speech recognition processing, and scene dimension recognition processing, and finally the corresponding text information is obtained, where the text information includes the plurality of scene dimensions involved in the dialogue audio data, the dialogue text data under each scene dimension, and the answer text data under each scene dimension. The dialogue text data under each scene dimension refers to the text data corresponding to the dialogue between the at least two different character objects, and the answer text data under each scene dimension refers to the text data corresponding to the target object answering the other character objects. Note that the answer text data belongs to the target object; it is extracted separately in this step because the answers of the target object, such as a patient, influence the final target classification label.
In an embodiment, the step of performing recognition processing on the dialogue audio data includes: performing role separation processing and role classification processing on the dialogue audio data to obtain at least two different character objects, wherein the at least two different character objects include the target object; performing speech recognition processing on the dialogue audio data to obtain dialogue text data corresponding to the at least two different character objects and answer text data of the target object; and performing scene dimension recognition on the dialogue text data corresponding to the at least two different character objects to obtain the plurality of scene dimensions involved in the dialogue audio data, the dialogue text data under each scene dimension, and the answer text data under each scene dimension.
The dialogue audio data may be subjected to role separation processing, speech recognition processing, role classification processing, scene dimension recognition processing, and the like.
The role object of the speaker in the dialogue audio data is determined by the role separation engine, for example, the role separation engine acquires sound source information, a sound source position corresponding to the sound source information and a voiceprint feature in the dialogue audio data, and determines the role object corresponding to the sound source position according to the voiceprint feature, so that the first role object and the second role object are separated.
The speech recognition process converts dialogue audio data into dialogue text data. Specifically, reference may be made to a method of speech recognition processing in the prior art, which is not described herein in detail.
The role classification processing mainly distinguishes the separated first character object from the second character object, for example, distinguishing the doctor from the patient. The separated first character object and second character object can be distinguished using a rule-matching approach. For example, the approach includes: firstly, constructing a library of query sentence patterns commonly used by doctors when administering scales such as the HAMD scale, and judging whether a speaker is the doctor by computing the edit distance between the input text, such as the dialogue text data, and the patterns in the library; secondly, constructing a keyword library of common doctor inquiry terms, and distinguishing the first character object from the second character object by counting the keyword frequency of each character object over the dialogue audio data (the whole dialogue). Since the audio data is dialogue audio data based on the HAMD scale, the accuracy of the role classification processing is high.
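As a concrete illustration of the rule-matching role classification described above, the following is a minimal Python sketch; the sentence patterns, keywords, and scoring scheme are illustrative assumptions, not values taken from the patent.

```python
import difflib

# Illustrative doctor query sentence-pattern library and inquiry keyword library
# (the concrete entries are assumptions, not taken from the patent).
DOCTOR_PATTERNS = [
    "How have you been feeling over the last week?",
    "Have you had trouble sleeping recently?",
]
DOCTOR_KEYWORDS = {"feel", "sleep", "appetite", "mood", "week"}

def pattern_similarity(utterance: str) -> float:
    # Approximates the edit-distance match of an utterance against the pattern library.
    return max(difflib.SequenceMatcher(None, utterance.lower(), p.lower()).ratio()
               for p in DOCTOR_PATTERNS)

def classify_roles(utterances_by_speaker: dict) -> dict:
    # Scores each separated speaker by keyword frequency over the whole dialogue plus
    # the best pattern match, then labels the higher-scoring speaker as the doctor.
    scores = {}
    for speaker, utterances in utterances_by_speaker.items():
        keyword_hits = sum(any(k in u.lower() for k in DOCTOR_KEYWORDS) for u in utterances)
        pattern_score = max(pattern_similarity(u) for u in utterances)
        scores[speaker] = keyword_hits + pattern_score
    doctor = max(scores, key=scores.get)
    return {s: ("doctor" if s == doctor else "patient") for s in scores}

print(classify_roles({
    "speaker_0": ["How have you been feeling over the last week?", "Have you had trouble sleeping recently?"],
    "speaker_1": ["Rather low.", "I wake up very early."],
}))
```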
The scene dimension recognition processing mainly divides the whole dialogue text data into segments corresponding to the 17 scene dimensions of the HAMD scale, so as to meet the requirements of subsequent processing. The scene dimension recognition processing may be performed with a neural network model or a deep learning model; for example, a Bi-directional Long Short-Term Memory (BiLSTM) neural network model is used, where each time step of the BiLSTM model takes a single question-answer pair as input and outputs the scene dimension category to which the current question-answer pair belongs. The differences between the scene dimensions of the HAMD scale are obvious, so the actual accuracy of the scene dimension recognition processing is high.
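The BiLSTM-based scene dimension recognition just described can be sketched in PyTorch as follows; this is a minimal illustration that assumes each question-answer pair has already been embedded into a fixed-size vector, and the embedding and hidden sizes are illustrative, not values from the patent.

```python
import torch
import torch.nn as nn

class SceneDimensionTagger(nn.Module):
    """BiLSTM that labels each question-answer pair with one of the 17 HAMD scene dimensions."""
    def __init__(self, pair_emb_dim=768, hidden_dim=256, num_scene_dims=17):
        super().__init__()
        self.bilstm = nn.LSTM(pair_emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_scene_dims)

    def forward(self, pair_embeddings):
        # pair_embeddings: (batch, num_pairs, pair_emb_dim), one embedding per question-answer pair
        outputs, _ = self.bilstm(pair_embeddings)      # (batch, num_pairs, 2 * hidden_dim)
        return self.classifier(outputs)                # per-pair scene-dimension logits

tagger = SceneDimensionTagger()
logits = tagger(torch.randn(1, 30, 768))               # a dialogue with 30 question-answer pairs
scene_ids = logits.argmax(dim=-1)                      # predicted scene dimension for each pair
```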
After the role separation processing, speech recognition processing, role classification processing, and scene dimension recognition processing, collectively referred to as data preprocessing, the text information shown in fig. 2 is output.
102, obtaining a preset scale corresponding to the target object, where the preset scale includes prior knowledge in each scene dimension, where the prior knowledge is predetermined based on the dialog audio data.
The preset scale is a scale for the target object, such as the HAMD scale. It should be noted that the target object in this step is the same target object as the target object in the above step.
Each scene dimension of the HAMD scale has a separate scene score and scoring standard, which have been summarized and refined by a large number of psychiatric domain experts in practice, so the HAMD scale carries rich prior knowledge. That is, the preset scale includes prior knowledge under each scene dimension, and the prior knowledge may include the scene scores in the 17 scene dimensions of the predetermined HAMD scale and the option contents corresponding to those scene scores.
The a priori knowledge may be determined based on the dialog audio data, for example, the scene scores and the option contents corresponding to the scene scores in 17 scene dimensions in the HAMD scale are determined according to the contents in the dialog audio data. For example, in the process of inquiring the patient, the doctor inquires about 17 scene dimensions in the HAMD scale respectively, and determines scene scores and option contents corresponding to the scene scores in the 17 scene dimensions in the HAMD scale based on the answers of the patient and the performance of the patient.
The preset table is presented in a data table form, and prior knowledge under each scene dimension in the data table can be acquired, namely the scene score under each scene dimension and option content corresponding to the scene score.
As shown in table 1, the scoring criteria in a part of scene dimensions of the HAMD scale and the scene scoring results in corresponding scene dimensions are shown in a schematic table. The item refers to a scene dimension, the HAMD table has 17 scene dimensions, only 2 scene dimensions are shown in table 1, the scene score of the first scene dimension is 2, corresponding option content is spontaneously expressed in conversation, the scene score of the second scene dimension is 1, and the corresponding option content is responsible for oneself and feels that the oneself has tired of other people.
Table 1 scoring criteria under part of scene dimensions of HAMD scale and schematic table of scene scoring results under corresponding scene dimensions
Item (scene dimension)    Scene score    Option content
First scene dimension     2              Spontaneously expressed in conversation
Second scene dimension    1              Self-reproach; feels he or she has let others down
After the above steps, the text information shown in fig. 3 can be obtained, where scene_sequence represents the scene dimension sequence and "0" under scene_sequence represents the 0th scene dimension, the text content in sent represents the dialogue text data, "depressed mood" in scene is the specific scene dimension name, score represents the scene score (the scene score of the 0th scene dimension is 2), score_query is the option content of the scene score, and the text content in answer represents the answer text data.
As can be seen from fig. 3, the dialog text data, the answer text data, and the a priori knowledge are included in each scene dimension. The text information shown in fig. 3 is a basis of the subsequent processing, and a series of processing is performed based on the text information to generate a target classification tag of the target object.
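For illustration only, the text-information record described for fig. 3 could be held in a structure like the following; the field names follow the description above, while the concrete values are invented examples.

```python
# One preprocessed record for the 0th scene dimension; the values are invented examples.
record = {
    "scene_sequence": 0,                                   # 0th scene dimension
    "scene": "depressed mood",                             # scene dimension name
    "sent": "Q: How have you been feeling over the last week? A: Rather low. ...",  # dialogue text data
    "answer": "Rather low. It has just been a persistent low mood.",                # answer text data
    "score": 2,                                            # scene score from the preset scale
    "score_query": "spontaneously expressed in conversation",                       # option content
}
```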
In an embodiment, the present application provides a classification tag generation model, which may be represented by MT-MGKR, and the classification tag generation model may determine a first classification result of a target object in a time-series coding dimension, a second classification result in a preset table dimension, and generate a target classification tag of the target object according to the first classification result and the second classification result. Fig. 4 is a schematic diagram of a classification label generation model provided in the embodiment of the present application. Thereafter, a series of processes are performed on the obtained text information using the classification label generation model to generate a target classification label of the target object.
In order to introduce the prior knowledge of the preset scale, a multi-granularity knowledge reasoning module is designed in the embodiment of the present application. Fig. 5 is a schematic diagram of the multi-granularity knowledge reasoning module in the classification label generation model provided in the embodiment of the present application; the module may be denoted MGKI (multi-granular knowledge inference). When performing feature coding on the corresponding dialogue audio data, the multi-granularity knowledge reasoning module places particular weight on the influence of the answer text data of the target object on the final target classification label. Then, in order to better fuse the prior knowledge, a Cross-Attention module is designed to perform interaction between the dialogue text data and the prior knowledge, so as to strengthen the representation of the input information by the prior knowledge. Hereinafter, the classification label generation model provided by the embodiment of the present application will be understood in conjunction with fig. 4 and 5.
And 103, determining a first classification result of the target object on a time sequence coding dimension according to the dialog text data, the answer text data and the prior knowledge under each scene dimension and the time sequence information of the occurrence of each scene dimension in the dialog audio data.
The time sequence coding dimension can be understood as an information representation dimension fused with time sequence information. The time series information is fused when the feature is subjected to the feature encoding processing, and the result is determined based on the information representation.
Understandably, a first classification result of the target object in a time sequence coding dimension is determined by utilizing a classification label generation model according to the dialog text data, the answer text data and the prior knowledge in each scene dimension and the time sequence information of the occurrence of each scene dimension in the dialog audio data.
According to the dialogue text data, the answer text data and the priori knowledge under each scene dimension and the time sequence information of each scene dimension in the dialogue audio data, a classification label is used for generating a model, and a first classification result of the target object on the time sequence coding dimension is determined. Specifically, dialog text data, answer text data and priori knowledge in each scene dimension and time sequence information appearing in each scene dimension in dialog audio data are input into a classification label generation model, the classification label generation model is utilized to process, and a first classification result of a target object in the time sequence coding dimension is determined.
In an embodiment, the step 103 further includes: determining semantic fusion characteristics under each scene dimension according to the dialogue text data, the answer text data and the prior knowledge under each scene dimension; and performing time sequence coding processing on the semantic fusion characteristics under each scene dimension according to the time sequence information appearing in each scene dimension in the dialogue audio data to determine a first classification result of the target object on the time sequence coding dimension. The semantic fusion feature is obtained by performing semantic processing on fusion features obtained according to the dialogue text data, the answer text data and the priori knowledge in each scene dimension, and the semantic processing may be corresponding processing on the fusion features by using a Self-Attention mechanism, such as Self-Attention processing.
Further, in order to introduce the prior knowledge of the scale, the prior knowledge is fused with the input information. Correspondingly, the step of determining the semantic fusion characteristics in each scene dimension according to the dialogue text data, the answer text data and the priori knowledge in each scene dimension includes: respectively carrying out feature coding processing on the dialogue text data, the answer text data and the priori knowledge under each scene dimension to obtain dialogue text features, answer text features and priori knowledge features; carrying out fusion processing on the priori knowledge characteristic, the dialogue text characteristic and the answer text characteristic to obtain a fusion characteristic under each scene dimension; and performing semantic processing on the fusion features by using an attention mechanism to obtain semantic fusion features under each scene dimension.
The method for fusing the prior knowledge characteristic, the dialogue text characteristic and the answer text characteristic can be various, and the prior knowledge characteristic can be fused into the text characteristic to strengthen the representation of the prior knowledge on the input information such as the dialogue text data and the answer text data.
For example, the priori knowledge characteristic and the dialog text characteristic in each scene dimension may be subjected to interaction processing, and then the result of the interaction processing and the answer text characteristic are subjected to first fusion processing to obtain the fusion characteristic in each scene dimension, or the priori knowledge characteristic and the answer text characteristic may be subjected to interaction processing, and then the result of the interaction processing and the dialog text characteristic are subjected to first fusion processing to obtain the fusion characteristic in each scene dimension, or the priori knowledge characteristic and the dialog text characteristic may be subjected to first interaction processing, then the priori knowledge characteristic and the answer text characteristic are subjected to second interaction processing, and the result of the first interaction processing and the result of the second interaction processing are subjected to first fusion processing to obtain the fusion characteristic in each scene dimension.
In an embodiment, specifically, as shown in fig. 6, the step 103 includes the following steps.
And 201, respectively carrying out feature coding processing on the dialogue text data, the answer text data and the priori knowledge under each scene dimension to obtain dialogue text features, answer text features and priori knowledge features.
After data preprocessing, the dialogue text data can be represented as
X = {x_{i,j}, 0<=i<17, 0<=j<L}
where i is the scene dimension index (there are at most 17 scene dimensions for HAMD-17), j is the text index within each scene dimension, and the maximum text length is L. In order to emphasize the answers of the target object, the answer text data of the target object, such as a patient, in each scene dimension is represented separately; given the dialogue audio data, after data preprocessing the answer text data is correspondingly represented as
A = {a_{i,j}, 0<=i<17, 0<=j<L}
wherein i and j are consistent with i and j in the dialogue text data.
Reference 01_QA in fig. 4 corresponds to the dialogue text data in the first scene dimension, and reference 01_Answer corresponds to the answer text data in the first scene dimension.
Correspondingly, the scene question-answer in fig. 5 refers to the dialogue text data in a certain scene dimension, where [CLS] is the start identifier and [SEP] is the delimiter; here it refers to the dialogue text data in the first scene dimension, including:
[CLS] Question: How have you been feeling over the last week? Answer: Rather low. Question: Has anything unpleasant happened? Answer: Maybe not; it has just been a persistent low mood. Question: Can the family and friends around you see that you are not in a good mood? Answer: Yes, they can. [SEP]
The patient responses in fig. 5 refer to the answer text data in a certain scene dimension, here the answer text data in the first scene dimension, including:
[CLS] Rather low [SEP] Maybe not; it has just been a persistent low mood [SEP] Yes, they can [SEP]
In an embodiment, the step of performing feature coding processing on the dialog text data, the answer text data, and the priori knowledge in each scene dimension to obtain the dialog text feature, the answer text feature, and the priori knowledge feature includes: carrying out first feature coding processing on the dialogue text data and the answer text data under each scene dimension to obtain dialogue text features and answer text features; and carrying out second feature coding processing on the priori knowledge under each scene dimension to obtain the priori knowledge features.
The first feature encoding processing may be text encoding processing, for example, performing the first feature encoding processing on the dialogue text data and the answer text data under each scene dimension by using a BERT (Bidirectional Encoder Representations from Transformers) model to obtain the dialogue text features and the answer text features, as shown in formula (1) and formula (2).
H = Sequence(BERT(X)) (1)
H_A = Pool(BERT(A)) (2)
where H = {h_{i,j}, 0<=i<17, 0<=j<L, h_{i,j} ∈ R^d} is the sequential encoded representation of the dialogue text data, i.e., the dialogue text features, and H_A = {v_i, 0<=i<17, v_i ∈ R^d} is the encoded representation of the answer text data of the target object, i.e., the answer text features.
For example, for the HAMD scale, the maximum number of options across the scene dimensions is 5, while some scene dimensions have only 3 options, so the dimension of the determined prior knowledge is 5, and the option contents corresponding to the scene scores in the prior knowledge are mapped into this dimension. Finally, the obtained prior knowledge features can be expressed as
{k_{i,j}, 0<=i<17, 0<=j<5}
where i is the scene dimension subscript (there are at most 17 scene dimensions in HAMD-17) and j is the subscript of the individual knowledge features within the prior knowledge features under each scene dimension, with a maximum length of 5; a single scene dimension with fewer than 5 pieces of knowledge is padded with PAD or other identifiers.
Thus, the dialog text feature, the answer text feature and the prior knowledge feature under each scene dimension are obtained. It should be noted that the feature coding process may be performed on the dialog text data, the answer text data, and the prior knowledge in each scene dimension in other manners to obtain corresponding dialog text features, answer text features, and prior knowledge features.
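A minimal sketch of the first and second feature encoding processing is given below, assuming the Hugging Face Transformers library and the bert-base-chinese checkpoint, neither of which is specified in the patent: the dialogue text yields a token-level sequence H (formula (1)), the answers of the target object yield a pooled vector H_A (formula (2)), and the option contents of one scene dimension are encoded and padded to 5 prior-knowledge features. The option texts shown are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # illustrative checkpoint
bert = AutoModel.from_pretrained("bert-base-chinese")

def encode(texts, pooled=False):
    # Runs BERT over a list of texts; returns token-level states or the pooled [CLS] vector.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = bert(**batch)
    return out.pooler_output if pooled else out.last_hidden_state

dialog_text = ["Question: How have you been feeling over the last week? Answer: Rather low. ..."]
answer_text = ["Rather low. It has just been a persistent low mood."]
options = ["absent", "indicated only on questioning", "spontaneously expressed in conversation"]  # illustrative option contents

H = encode(dialog_text)                 # (1, L, d): dialogue text features, formula (1)
H_A = encode(answer_text, pooled=True)  # (1, d):    answer text features, formula (2)
K = encode(options, pooled=True)        # (3, d):    one feature per prior-knowledge option
K = torch.cat([K, K.new_zeros(5 - K.size(0), K.size(1))])  # pad to 5 knowledge features per scene
```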
After the prior knowledge features, the dialogue text features, and the answer text features are obtained, in order to fully learn the prior knowledge features of the preset scale, they are processed by a scene-customized multi-granularity knowledge reasoning module, so as to obtain the semantic fusion features under each scene dimension and the scene score under each scene dimension, which can be represented by the following formula (3).
s_i, m_i = MGKR_i(h_i, v_i), 0<=i<17 (3)
After processing by the multi-granularity knowledge reasoning module, the output is expressed as S = {s_i, 0<=i<17, s_i ∈ R^1} and M = {m_i, 0<=i<17, m_i ∈ R^d}, where S is the set of scene scores for each scene dimension, s_i is the scene score in the i-th scene dimension, M is the set of vector representations of the semantic fusion features in each scene dimension, and m_i is the vector representation of the semantic fusion feature in the i-th scene dimension. The scene score obtained here has the same meaning as the scene score described above; because it is produced by multi-granularity knowledge reasoning in this step, it is named separately in order to distinguish it from the scene score recorded in the prior knowledge.
As shown in fig. 4, S_scene01, S_scene02, S_scene03, ..., S_scene17 are the scene scores of the first, second, third, ..., 17th scene dimensions, and H01, H02, H03, ..., H17 are the semantic fusion features of the first, second, third, ..., 17th scene dimensions, respectively.
The following steps 202 to 203 further describe how the multi-granularity knowledge inference module obtains semantic fusion features in each scene dimension.
202, carrying out fusion processing on the priori knowledge characteristic, the dialogue text characteristic and the answer text characteristic to obtain a fusion characteristic under each scene dimension.
In order to introduce the prior knowledge characteristics of the preset scale, the prior knowledge characteristics are fused with other dialog text characteristics and answer text characteristics in the step. The fusion processing of the prior knowledge feature, the dialog text feature, and the answer text feature is described above, so that there are various implementation manners for obtaining the fusion feature in each scene dimension. In the embodiments of the present application, one implementation manner is described as an example.
Correspondingly, the step of performing fusion processing on the priori knowledge feature, the dialogue text feature and the answer text feature to obtain a fusion feature under each scene dimension includes: carrying out interactive processing on the priori knowledge characteristics and the dialogue text characteristics to obtain text knowledge interactive characteristics under each scene dimension; and performing first fusion processing on the text knowledge interaction characteristics and the answer text characteristics in each scene dimension to obtain fusion characteristics in each scene dimension, so that the conversation text data is fully represented by using the prior knowledge characteristics, and the representation of the prior knowledge on the conversation text data is strengthened.
The interaction processing refers to a processing mode that interactively fuses the prior knowledge features and the dialogue text features; as shown in fig. 5, it can be implemented by a Cross-Attention module.
In an embodiment, as shown in fig. 7, the step of performing interaction processing on the prior knowledge features and the dialogue text features to obtain the text knowledge interaction features under each scene dimension includes the following steps. It may also be understood that the Cross-Attention module implements the contents shown in fig. 7.
301, a relationship weight is determined between each knowledge feature and each of the dialog text features in each scene dimension.
To adequately represent input information using a priori knowledge features, first a relationship weight between each of the a priori knowledge features and each of the input dialog text features for each scene dimension needs to be determined. For example, the number of the prior knowledge features in each scene dimension may be 5, and the number of the text features in the dialog text features may be L, which may be understood as that one text corresponds to one text feature. A relationship weight between each of the a priori knowledge features and each of the input dialog text features in each scene dimension is determined using a non-linear function.
The relationship weight between each knowledge feature and text feature in each scene dimension can be determined using equation (4).
e_{t1,t2}^i = W_2 · σ(W_1 · [k_{t1}^i ; h_{t2}^i] + b_1) (4)
where W_1 ∈ R^{d×2d}, b_1 ∈ R^d, and W_2 ∈ R^{1×d} are learnable model parameters, σ is a nonlinear activation function, [· ; ·] denotes concatenation, and e_{t1,t2}^i is the relation weight between the t1-th knowledge feature and the t2-th text feature in the i-th scene dimension, reflecting the correlation information between the t1-th knowledge feature and the t2-th text feature.
302, according to the relation weight, determining a weight matrix between the prior knowledge characteristic and the dialogue text characteristic in each scene dimension.
To focus on the representation of the text by the prior knowledge, a weight matrix between the prior knowledge features and the input information, such as the dialogue text features, needs to be calculated.
For example, for each scene dimension, the sum of the relation weights between the t1-th knowledge feature and all text features under the scene dimension is first determined from the relation weights; the weight value between the t1-th knowledge feature and the t2-th text feature is then determined from the relation weight between the t1-th knowledge feature and the t2-th text feature and the sum of the relation weights corresponding to the t1-th knowledge feature; and the weight matrix between the prior knowledge features and the corresponding dialogue text features under the scene dimension is determined from these weight values.
In the i-th scene dimension, the weight value between the t1-th knowledge feature and the t2-th text feature can be determined by equation (5).
a_{t1,t2}^i = e_{t1,t2}^i / Σ_{t=0..L-1} e_{t1,t}^i (5)
The weight matrix between the prior knowledge features and the dialogue text features under the i-th scene dimension can then be expressed as A^i = {a_{t1,t2}^i} ∈ R^{5×L}.
303, processing the dialogue text features by using the weight matrix to obtain text knowledge interaction features in which the dialogue text features are fused with the prior knowledge.
Multiplying the dialogue text features by using the weight matrix to obtain text knowledge interaction features, wherein the text knowledge interaction features are fused with priori knowledge.
For example, the text knowledge interaction feature can be obtained by the following formula (6).
C = A · H (6)
And C is the obtained text knowledge interaction feature, and H is the dialog text feature processed by the BERT model.
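Under the reconstruction of formulas (4) to (6) given above, the Cross-Attention interaction could be sketched in PyTorch as follows; the tensor shapes, the choice of tanh as the nonlinear function, and the class name are assumptions rather than details given in the patent.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Interaction between prior-knowledge features and dialogue text features, per scene dimension."""
    def __init__(self, d=768):
        super().__init__()
        self.W1 = nn.Linear(2 * d, d)            # W_1 in R^{d x 2d}, b_1 in R^d
        self.W2 = nn.Linear(d, 1, bias=False)    # W_2 in R^{1 x d}

    def forward(self, K, H):
        # K: (5, d) prior-knowledge features; H: (L, d) dialogue text features
        pairs = torch.cat([K.unsqueeze(1).expand(-1, H.size(0), -1),
                           H.unsqueeze(0).expand(K.size(0), -1, -1)], dim=-1)   # (5, L, 2d)
        rel = self.W2(torch.tanh(self.W1(pairs))).squeeze(-1)  # formula (4): relation weights, (5, L)
        A = rel / rel.sum(dim=-1, keepdim=True)                # formula (5): weight matrix, (5, L)
        C = A @ H                                              # formula (6): text knowledge interaction, (5, d)
        return C, A

C, A = CrossAttention()(torch.randn(5, 768), torch.randn(40, 768))
```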
And after the text knowledge interaction characteristics under each scene dimension are obtained, fusing the answer text characteristics of the target object. The answer text features of the target object are finer-grained features and generally contain key judgment information, so that the text knowledge interaction features and the answer text features under each scene dimension are subjected to first fusion processing to obtain fusion features under each scene dimension.
The first fusion processing may be addition processing, in which the text knowledge interaction features and the answer text features are first brought to the same dimension; for example, the text knowledge interaction feature has 5 × L dimensions, so the answer text feature is copied multiple times to obtain 5 × L dimensions; then the first element of the text knowledge interaction feature and the first element of the answer text feature are added to obtain the first value of the fusion feature, and so on, giving the fusion feature under each scene dimension. In fig. 5, the first fusion processing is represented as a circle with a plus sign in the middle.
For example, the text knowledge interaction feature and the answer text feature in each scene dimension may be subjected to the first fusion processing by the following formula (7), so as to obtain a fusion feature.
G = element_wise_sum(C, H_A) (7)
Wherein G is a fusion feature obtained after fusing the answer text information of the target object.
It should be noted that the fusion feature in the embodiment of the present application may also be referred to as a multi-granularity fusion feature, which means that the fusion feature fuses information of multiple granularities, such as multiple dimensions of a priori knowledge information, answer text data of a target object, and dialog text data.
In one embodiment, after the target classification label of the target object is obtained, a certain scientific explanation needs to be given, because the current analogical reasoning or scale-based approaches do not give explanatory content related to the target classification label of the target object, and the interpretability problem is often a disadvantage of current deep learning models. Therefore, in the embodiment of the present application, an explanation of the reasonableness of the target classification label of the target object is given based on the weight matrix.
And 304, generating a first interpretation content of the target classification label of the target object according to the weight matrix.
Specifically, the explanatory content of the final target classification label is further generated according to the weight value of the weight matrix, i.e. explaining why the target classification label is obtained. The weight matrix represents the weight between the prior knowledge characteristic and the input information such as the dialog text characteristic, and the larger the weight value is, the larger the relevance of the corresponding knowledge characteristic to the text characteristic is, so that the corresponding knowledge characteristic and text characteristic with the larger weight value in the weight matrix are obtained, and the first interpretation content of the target classification label of the target object is generated according to the knowledge characteristic and text characteristic with the larger weight value.
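A minimal sketch of generating the first interpretation content from the weight matrix is shown below; selecting the top-weighted knowledge/text pairs and the wording of the output template are illustrative assumptions.

```python
import torch

def first_interpretation(A, option_texts, tokens, top_k=3):
    # A: (num_options, L) weight matrix between prior knowledge and dialogue text tokens.
    # Returns a short explanation built from the most strongly related knowledge/text pairs.
    top = A.flatten().topk(top_k)
    parts = []
    for weight, idx in zip(top.values.tolist(), top.indices.tolist()):
        o, t = divmod(idx, A.size(1))
        parts.append(f"option '{option_texts[o]}' is supported by '{tokens[t]}' (weight {weight:.2f})")
    return "; ".join(parts)
```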
And 203, performing semantic processing on the fusion features by using an attention mechanism to obtain semantic fusion features under each scene dimension.
After the fusion features under each scene dimension are obtained, semantic processing is performed on the fusion features by using an attention mechanism to obtain the semantic fusion features under each scene dimension. H01, H02, H03, ..., H17, etc. in fig. 4 are the semantic fusion features under each scene dimension.
Correspondingly, the semantic fusion feature in each scene dimension can be obtained according to formula (8).
M=Self-Attention(G) (8)
Wherein, M is a vector representation set of semantic fusion features in each scene dimension, and for specific meaning, please refer to the description in the foregoing.
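The first fusion processing (formula (7)) and the semantic processing with the attention mechanism (formula (8)) could be sketched as follows; using nn.MultiheadAttention for the self-attention step and mean pooling to obtain a single vector per scene dimension are assumptions, since the patent only names the mechanism.

```python
import torch
import torch.nn as nn

d = 768
self_attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

C = torch.randn(1, 5, d)     # text knowledge interaction features for one scene dimension
H_A = torch.randn(1, 1, d)   # pooled answer text feature of the target object

G = C + H_A                  # formula (7): element-wise sum, H_A broadcast over the rows of C
M, _ = self_attn(G, G, G)    # formula (8): semantic processing with self-attention
m_i = M.mean(dim=1)          # one semantic fusion vector per scene dimension (pooling is an assumption)
```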
All of the above is represented by learning within a single scene dimension. However, to understand the entire conversational audio data, learning only in a single scene dimension is clearly not sufficient.
In order to understand global information in the dialogue audio data, step 204 performs time-series coding on the semantic fusion features in each scene dimension to represent the semantic fusion features in each scene dimension in series.
And 204, performing time sequence coding processing on the semantic fusion features under each scene dimension according to the time sequence information appearing in each scene dimension in the dialogue audio data to determine a first classification result of the target object on the time sequence coding dimension.
The time sequence information of each scene dimension may be the time sequence of each scene dimension, or the time information of each scene dimension appearing in the dialog audio data, or other information that may characterize the time sequence. In the embodiment of the application, any one or more neural network models or deep learning models capable of expressing time sequence information can be used for performing time sequence coding processing on the semantic fusion features in each scene dimension to obtain a first classification result of the target object in the time sequence coding dimension.
In an embodiment, the step 204 includes: according to time sequence information appearing in each scene dimension in the dialogue audio data, carrying out time sequence series coding processing on the semantic fusion characteristics under each scene dimension to obtain global semantic representation information comprising context information; and carrying out semantic classification processing on the global semantic representation information to obtain a first classification result of the target object on the time sequence coding dimension.
The time-series encoding processing may be implemented by using a Bi-directional Long Short-Term Memory (BiLSTM) neural network model; in other embodiments, it may be implemented by using other neural networks that can combine preceding and following contextual information.
According to time sequence information appearing in each scene dimension in the dialogue audio data, semantic fusion features under each scene dimension are input into a bidirectional long-short time memory neural network model, forward processing is carried out by using a forward long-short time memory neural network to obtain a first output vector in the forward direction, backward processing is carried out by using a backward long-short time memory neural network to obtain a second output vector in the backward direction, and second fusion processing is carried out on the first output vector and the second output vector to obtain global semantic representation information including context information. The second fusion processing may be splicing processing, that is, combining the first output vector and the second output vector to obtain global semantic representation information.
For example, the forward processing may be implemented by using formula (9) to obtain a first output vector, the backward processing may be implemented by using formula (10) to obtain a second output vector, and the second fusion processing may be implemented by using formula (11) to obtain global semantic representation information.
M_F = LSTM_forward(M) (9)
M_b = LSTM_backward(M) (10)
M = [M_F, M_b] (11)
Wherein, M_F ∈ R^d is the first output vector in the forward direction, M_b ∈ R^d is the second output vector in the backward direction, and M ∈ R^(lr×2d) is the representation information considering the preceding and following context, i.e. the global semantic representation information including context information, which may also be understood as the global information of the dialogue audio data.
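A minimal sketch of formulas (9) to (11), assuming the semantic fusion features are fed to a bidirectional LSTM in the order in which the scene dimensions appear in the dialogue; the hidden size and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

num_scenes, feat_dim, hidden = 17, 256, 128
M_in = torch.randn(1, num_scenes, feat_dim)          # semantic fusion features in time order

bilstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden,
                 batch_first=True, bidirectional=True)
out, _ = bilstm(M_in)                                 # forward and backward passes in one call
M_F, M_b = out[..., :hidden], out[..., hidden:]       # first / second output vectors, formulas (9) and (10)
M_global = torch.cat([M_F, M_b], dim=-1)              # M = [M_F, M_b], formula (11): global semantic representation
print(M_global.shape)                                 # torch.Size([1, 17, 256])
```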
After the global semantic representation information is obtained, semantic classification processing is performed on the global semantic representation information to obtain a first classification result of the target object on the time sequence coding dimension. For example, after performing semantic processing on the global semantic representation information by using a self-attention mechanism, the processing result is input into a Multi-Layer Perceptron (MLP) for perception processing, and the result after the perception processing is input into a Softmax layer for normalization processing, so that the first classification result of the target object in the time sequence coding dimension can be obtained. After the global semantic representation information passes through Self-Attention, MLP and Softmax, the classification label and its probability for the target object from the global angle can be obtained.
For example, the processing procedure of the semantic classification process can be represented by formula (12).
S_e2e_level=Softmax(MLP(Self-Attention(M))) (12)
Wherein, S_e2e_level represents the end-to-end first classification result. After the Softmax processing, a plurality of classification labels and their probabilities are obtained.
For example, the depression level is divided into 4 grades: no depression, mild depression, moderate depression, and major depression, so that there are 4 classification labels and their corresponding probabilities, and the classification label with the highest probability is taken as the first classification result, that is, the first classification result includes the classification label with the highest probability and its corresponding probability.
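A minimal sketch of formula (12), assuming the 4 depression grades above and an assumed mean-pooling step before the MLP, since the text does not spell out how the attention output is reduced to a single vector.

```python
import torch
import torch.nn as nn

num_scenes, dim, num_labels = 17, 256, 4              # 4 labels: no / mild / moderate / major depression
M_global = torch.randn(1, num_scenes, dim)            # global semantic representation information

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
mlp = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, num_labels))

h, _ = attn(M_global, M_global, M_global)             # Self-Attention(M)
logits = mlp(h.mean(dim=1))                           # MLP(...) after mean pooling (assumed)
S_e2e_level = torch.softmax(logits, dim=-1)           # Softmax(...): probability of every label
label_id = S_e2e_level.argmax(dim=-1)                 # first classification result: highest-probability label
```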
The process of determining the first classification result of a target object in the time-sequential encoding dimension is described above.
And 104, determining a second classification result of the target object on the preset scale dimension according to the dialogue text data, the answer text data and the prior knowledge under each scene dimension.
The preset scale dimension may be understood as a statistical processing dimension based on the plurality of scene dimensions of the preset scale, that is, a result is determined from the plurality of scene dimensions in the preset scale in a statistical manner.
Understandably, the second classification result of the target object on the preset scale dimension is determined by using the classification label generation model according to the dialogue text data, the answer text data and the prior knowledge under each scene dimension. Specifically, the dialogue text data, the answer text data and the prior knowledge under each scene dimension are input into the classification label generation model, and the model processes them to determine the second classification result of the target object on the preset scale dimension.
In an embodiment, the step 104 further includes: determining a scene score under each scene dimension according to the dialogue text data, the answer text data and the prior knowledge under each scene dimension; and determining a second classification result of the target object on the preset scale dimension according to the scene score under each scene dimension. The scene score in each scene dimension may be 0, 1, 2, 3 or 4. Note that the scene score here is calculated, which is different from the scene score contained in the prior knowledge described above.
Further, the step of determining a scene score in each scene dimension according to the dialogue text data, the answer text data and the prior knowledge in each scene dimension includes: respectively carrying out feature coding processing on the dialogue text data, the answer text data and the prior knowledge under each scene dimension to obtain a dialogue text feature, an answer text feature and a prior knowledge feature; carrying out fusion processing on the priori knowledge characteristic, the dialogue text characteristic and the answer text characteristic to obtain a fusion characteristic under each scene dimension; and carrying out normalization processing on the fusion characteristics to obtain a scene score under each scene dimension.
In an embodiment, specifically, as shown in fig. 8, the step 104 includes the following steps.
401, feature coding processing is performed on the dialog text data, the answer text data and the prior knowledge in each scene dimension, so as to obtain a dialog text feature, an answer text feature and a prior knowledge feature.
And 402, carrying out fusion processing on the priori knowledge characteristic, the dialogue text characteristic and the answer text characteristic to obtain a fusion characteristic under each scene dimension.
Please refer to the above corresponding description for the steps 401 and 402, which are not described again here.
And 403, performing normalization processing on the fusion features to obtain a scene score under each scene dimension.
The normalization processing may be Softmax processing, or may be other processing that can implement similar functions. And carrying out normalization processing on the fusion characteristics to obtain a scene score under each scene dimension and a probability value of the scene score under each scene dimension. For example, equation (13) may be utilized to derive a scene score in each scene dimension, and a probability value of the scene score in each scene dimension. For each scene dimension, there are 5 scene scores and 5 corresponding probability values, and the scene score with the maximum probability value is used as the final corresponding scene score of the scene dimension.
S=softmax(G) (13)
Wherein, the meanings of S and G are shown in the corresponding parts of the description above.
As shown in FIG. 4, S_scene01, S_scene02, S_scene03, …, S_scene17 are the scene scores in the first, second, third, …, and 17th scene dimensions, respectively.
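A minimal sketch of formula (13), assuming each scene dimension's fusion feature is projected to the 5 candidate scene scores before the Softmax; the linear projection is an assumed detail not stated in the text.

```python
import torch
import torch.nn as nn

num_scenes, feat_dim, num_scores = 17, 256, 5         # scores 0..4 per scene dimension
G = torch.randn(num_scenes, feat_dim)                 # fusion features per scene dimension

score_head = nn.Linear(feat_dim, num_scores)          # assumed projection to the 5 candidate scores
S = torch.softmax(score_head(G), dim=-1)              # S = softmax(G): probability of every score
scene_scores = S.argmax(dim=-1)                       # S_scene01 ... S_scene17: score with maximum probability
```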
And 404, determining a second classification result of the target object on the preset scale dimension according to the scene score under each scene dimension.
After the scene score under each scene dimension and its probability value are obtained, the scene scores under each scene dimension are added to obtain the total score of the preset scale, such as the HAMD scale. A classification label is then obtained from the total score according to the segmentation defined by specialist experience, and the probability corresponding to the classification label can be obtained from the probability values of the scene scores under each scene dimension, for example by adding those probability values. The score evaluation result thus already gives a quantitative grade/classification label under the preset scale dimension; for example, if the total score is 20, the quantitative grade/classification label is moderate depression. The second classification result includes the classification label, such as moderate depression, and its corresponding probability.
As shown in table 2, a segmentation method based on the experience of the specialist is used for the score evaluation part of the HAMD scale.
Table 2 Example of segmentation of the HAMD scale

Classification label (grade) | Score range | Code ID
No depression | Score <= 7 | 0
Mild depression | 7 < Score <= 17 | 1
Moderate depression | 17 < Score <= 24 | 2
Major depression | 24 < Score | 3
The implementation of this step can be represented using equation (14).
S_scene_level = HAMD score evaluation(S) (14)
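A minimal sketch of formula (14) and Table 2, assuming the scene scores are already available as integers; the function name and the example scores are illustrative only.

```python
# Map the summed scene scores to a classification label using the Table 2 segmentation.
def hamd_score_evaluation(scene_scores):
    total = sum(scene_scores)
    if total <= 7:
        return total, "no depression", 0
    if total <= 17:
        return total, "mild depression", 1
    if total <= 24:
        return total, "moderate depression", 2
    return total, "major depression", 3

# Hypothetical scores for 17 scene dimensions; the total of 20 falls in the moderate band.
total, label, code_id = hamd_score_evaluation([1, 2, 0, 3, 1, 2, 2, 1, 0, 1, 2, 1, 0, 1, 2, 0, 1])
print(total, label, code_id)   # 20 moderate depression 2
```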
In an embodiment, an explanation of the rationality of the target classification label of the target object is given based on the scene score in the embodiment of the present application. Further, the method comprises the following steps.
And 405, generating second explanation content of the target classification label of the target object according to the scene score of each scene dimension.
For example, if the target classification label is major depression, the scene scores in several scene dimensions are necessarily higher. A scene dimension with a higher scene score, for example a score greater than 3, is determined as a target dimension, and the second interpretation content of the target classification label of the target object is generated according to the target dimension.
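A minimal sketch of step 405, assuming the scene scores are kept per named scene dimension and that a score greater than 3 marks a target dimension, as in the example above; the dimension names and the wording of the explanation are illustrative assumptions.

```python
# Select the high-scoring scene dimensions and turn them into second interpretation content.
def second_interpretation(scene_scores, threshold=3):
    target_dims = [name for name, score in scene_scores.items() if score > threshold]
    return "High scene scores were observed in: " + ", ".join(target_dims)

print(second_interpretation({"sleep": 4, "depressed mood": 4, "appetite": 2}))
# High scene scores were observed in: sleep, depressed mood
```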
It should be noted that, in the foregoing, the step of generating the first interpretation content of the target classification label of the target object according to the weight matrix, and the step of generating the second interpretation content of the target classification label of the target object according to the scene score in each scene dimension may be performed after obtaining the target classification label. After the first interpretation content and the second interpretation content are obtained, the final interpretation content is generated according to the first interpretation content and the second interpretation content so as to accord with reading habits.
In actual execution, the above steps 103 and 104 may be completed according to the execution manner of the classification label generation model shown in the drawing, that is, steps 103 and 104 are not clearly distinguished: in the overall execution process, the fusion feature under each scene dimension is obtained first, and then the subsequent processing is executed according to the fusion feature, so as to obtain the first classification result of the target object in the time sequence coding dimension and the second classification result of the target object in the preset scale dimension, respectively.
And 105, generating a target classification label of the target object according to the first classification result and the second classification result.
As shown in fig. 4, a Decision_Level Fusion module is used to fuse the first classification result and the second classification result and generate the target classification label of the target object. The Decision_Level Fusion module adopts three strategies to fuse the first classification result and the second classification result: taking the first classification result as the final classification result, that is, taking S_e2e_level as the final classification result; taking the second classification result as the final classification result, that is, taking S_scene_level as the final classification result; or weighting the probabilities of the classification labels in the second classification result, adding them to the corresponding probabilities in the first classification result, reordering by probability, and outputting the quantitative grade/classification label with the maximum probability as the final target classification label.
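A minimal sketch of the three Decision_Level Fusion strategies described above; the label-to-probability dictionaries and the weighting factor are assumptions made for illustration.

```python
def decision_level_fusion(s_e2e_level, s_scene_level, strategy="weighted", weight=0.5):
    if strategy == "e2e":        # strategy 1: keep the first classification result
        return max(s_e2e_level, key=s_e2e_level.get)
    if strategy == "scale":      # strategy 2: keep the second classification result
        return max(s_scene_level, key=s_scene_level.get)
    # strategy 3: weight the scale probabilities, add them to the e2e probabilities, re-rank
    fused = {label: p + weight * s_scene_level.get(label, 0.0)
             for label, p in s_e2e_level.items()}
    return max(fused, key=fused.get)

s_e2e = {"no": 0.10, "mild": 0.30, "moderate": 0.50, "major": 0.10}
s_scale = {"no": 0.00, "mild": 0.20, "moderate": 0.30, "major": 0.50}
print(decision_level_fusion(s_e2e, s_scale))   # 'moderate' under the assumed weight
```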
In the above embodiment, when generating the target classification label of the target object, not only the first classification result in the time sequence coding dimension but also the second classification result in the preset scale dimension is considered, which improves the accuracy of generating the target classification label of the target object. In addition, the prior knowledge is integrated into the determination of both the first classification result and the second classification result, which further improves the accuracy. Moreover, when generating the first classification result in the time sequence coding dimension, the time sequence information appearing in each scene dimension in the dialogue audio data, that is, the context information of the dialogue audio data, is considered, so that the global information of the whole dialogue audio data is synthesized, the accuracy of the first classification result is improved, and the accuracy of the target classification label is further improved. In addition, in this embodiment, interpretation content for the target classification label can be generated according to the weight matrix and the scene scores, which improves the interpretability of the target classification label.
The classification label generation model in the embodiment of the application is based on the multi-granularity knowledge reasoning module, conversation text data such as doctor-patient conversations in the depression field can be effectively coded and understood, the priori knowledge and fine-grained answer text data are fused for modeling, the expression capability of the classification label generation model is enhanced, and the understanding capability of the classification label generation model on global and local information is effectively improved through two different classification modes of time sequence coding dimension and preset scale dimension.
Fig. 9 is another schematic flowchart of a label generation method provided in an embodiment of the present application, where the label generation method mainly describes a process of how to generate a classification label generation model, and specifically includes the following steps.
501, acquiring training dialogue audio data corresponding to a training target object and a first label result of the training target object in a time sequence coding dimension, and acquiring a training preset table of the training target object, where the training preset table includes training prior knowledge in each scene dimension of a plurality of scene dimensions, and the training prior knowledge is predetermined based on the training audio data. The first label result is a classification result on a time sequence coding dimension which is determined in advance.
502, performing recognition processing on the training dialogue audio data to obtain a plurality of scene dimensions, training dialogue text data under each scene dimension, and training answer text data under each scene dimension, which are involved in the training dialogue audio data.
An initial classification label generation model is obtained 503.
And 504, according to the training dialogue text data, the training answer text data and the training priori knowledge in each scene dimension and the time sequence information appearing in each scene dimension in the training dialogue audio data, generating a model by using the initial classification label, and determining a training first classification result of the training target object in the time sequence coding dimension.
And 505, according to the training dialogue text data, the training answer text data and the training priori knowledge under each scene dimension, generating a model by using the initial classification label, and determining the training scene score of the training target object under each scene dimension.
The steps 501 to 505 are the same as the corresponding steps in the process of using the classification label generation model, except that the word "training" is added to the names for distinction; the implementation process is the same. It should be noted that in the step 505, only the training scene score of the training target object in each scene dimension needs to be obtained in the training stage.
It should be noted that if the above steps 501 to 505 need to be further expanded, reference may be made to the above process of using the classification label generation model, with the difference that the word "training" needs to be added to some of the names in the expansion.
An overall loss value is determined 506 based on the training first classification result, the first label result, the training scenario score, and the training prior knowledge.
The classification label generation model in the embodiment of the application belongs to a multi-task model, and the multi-task model can obtain two different results in two different dimensions, namely the time sequence coding dimension and the preset scale dimension. Therefore, a multi-task joint training mode is adopted during training, so as to obtain the trained classification label generation model.
In an embodiment, the step 506 includes: determining a first loss value according to the training first classification result and the first label result; determining a second loss value according to the training scene score and training prior knowledge under each scene dimension; and determining an overall loss value based on the first loss value and the second loss value.
Considering the problems of sample difficulty and quantity imbalance in the depression field, a sample balance loss function such as the Focal Loss function and a gradient balance loss function such as the GHM Loss function are combined as a joint loss function for optimization. The idea of the Focal Loss function is to reduce the weight of samples that are easy to classify (i.e. samples with high confidence) and increase the weight of samples that are difficult to classify, forcing the model to focus on the difficult samples. In the medical field, especially in psychiatric inquiry, many samples are difficult to distinguish due to the lack of information, and wrongly labeled or confusing samples exist in large numbers. The Focal Loss function would harm the stability and optimization direction of the model in this case, so the GHM Loss function is introduced to balance the model's attention to difficult samples: the GHM Loss function dynamically balances the weights of difficult samples, so that difficult samples, which are often mislabeled or confusing samples, do not receive excessive attention. In conclusion, the final loss value is obtained by a weighted balance of the loss value calculated by the Focal Loss function and the loss value calculated by the GHM Loss function.
For example, the loss value may be determined by equation (15).
loss = L_FL + γ * L_GHM (15)
Wherein, γ is a hyperparameter for balancing L_GHM, loss is the loss value, L_FL is the loss value calculated by the Focal Loss function, and L_GHM is the loss value calculated by the GHM Loss function.
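A minimal sketch of formula (15). The Focal Loss below follows its standard formulation; the GHM Loss is only stubbed with a plain cross-entropy placeholder, since its gradient-density bookkeeping is not detailed here, and all parameter values are assumptions.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, focusing=2.0):
    # Standard focal loss: down-weight easy (high-confidence) samples.
    log_p = F.log_softmax(logits, dim=-1)
    p_t = log_p.gather(1, targets.unsqueeze(1)).exp().squeeze(1)   # probability of the true class
    return ((1.0 - p_t) ** focusing * (-torch.log(p_t + 1e-8))).mean()

def ghm_loss(logits, targets):
    # Placeholder only: a real GHM loss re-weights samples by gradient density.
    return F.cross_entropy(logits, targets)

def combined_loss(logits, targets, gamma=0.5):
    return focal_loss(logits, targets) + gamma * ghm_loss(logits, targets)   # loss = L_FL + γ·L_GHM
```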
It was mentioned above that embodiments of the present application relate to training of multiple tasks and, thus, to loss values, such as a first loss value and a second loss value, for multiple tasks.
Wherein, the step of determining the first loss value according to the training first classification result and the first label result includes: calculating a first loss value corresponding to the first task according to the training first classification result and the first label result in the manner corresponding to formula (15), where the calculated first loss value may be denoted by loss_e2e.
Wherein, the step of determining the second loss value according to the training scene score and the training prior knowledge in each scene dimension includes: determining a scene loss value under each scene dimension according to the training scene score under that scene dimension and the training prior knowledge under the corresponding scene dimension; and determining a second loss value according to the scene loss values under each scene dimension. For example, according to the training scene score and the training prior knowledge in each scene dimension, the scene loss value in each scene dimension is calculated in the manner corresponding to formula (15), and the sum of the scene loss values over all scene dimensions is determined as the second loss value corresponding to the second task; the calculated second loss value may be denoted by loss_scene.
Wherein the step of determining the total loss value based on the first loss value and the second loss value comprises: and carrying out weighted summation on the first loss value and the second loss value to obtain an overall loss value. The overall loss value is the final loss value of the initial classification label generation model.
For example, the overall loss value can be determined by equation (16).
Loss = α * loss_scene + β * loss_e2e (16)
Wherein, Loss represents the overall loss value, and α and β are hyperparameters for balancing the weight ratio of the first task and the second task.
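A minimal sketch of formula (16) together with the per-scene summation described above, reusing the combined_loss sketch given for formula (15); the α and β defaults are assumed values.

```python
def overall_loss(scene_logits, scene_targets, e2e_logits, e2e_targets,
                 alpha=1.0, beta=1.0):
    # loss_scene: sum of the scene loss values over all scene dimensions (second task).
    loss_scene = sum(combined_loss(logits, target)
                     for logits, target in zip(scene_logits, scene_targets))
    # loss_e2e: loss of the first classification result on the time sequence coding dimension (first task).
    loss_e2e = combined_loss(e2e_logits, e2e_targets)
    return alpha * loss_scene + beta * loss_e2e        # Loss = α·loss_scene + β·loss_e2e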
And 507, updating the training parameters in the initial classification label generation model according to the total loss value until the total loss value meets a preset condition so as to obtain the trained classification label generation model. The preset condition may be a condition such as convergence of the total loss value.
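A minimal training-loop sketch for step 507, assuming a model object that returns the per-scene and end-to-end logits, a data iterator yielding the tensors named below, and the overall_loss sketch above; the optimizer choice and the convergence test are assumptions, not details taken from this application.

```python
import torch

def train(model, batches, epochs=50, lr=1e-4, tol=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    previous = float("inf")
    for _ in range(epochs):
        for scene_inputs, scene_targets, e2e_inputs, e2e_targets in batches:
            scene_logits, e2e_logits = model(scene_inputs, e2e_inputs)
            loss = overall_loss(scene_logits, scene_targets, e2e_logits, e2e_targets)
            optimizer.zero_grad()
            loss.backward()                      # update the training parameters from the total loss
            optimizer.step()
        if abs(previous - loss.item()) < tol:    # preset condition: the total loss has converged
            break
        previous = loss.item()
    return model                                 # trained classification label generation model
```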
The trained classification label generation model is the classification label generation model described in the above embodiment.
In one case, after the classification tag generation model is obtained, the dialog text data in each scene dimension of the multiple scene dimensions related to the target object, the answer text data in each scene dimension, and the preset table corresponding to the target object are processed by using the classification tag generation model, and a target classification tag of the target object is generated.
In one case, after the classification tag generation model is obtained, a plurality of scene dimensions, dialog text data under each scene dimension, answer text data under each scene dimension, and a preset table corresponding to the target object, which includes prior knowledge under each scene dimension, can be obtained from the dialog audio data corresponding to the target object, where the prior knowledge is predetermined based on the dialog audio data; and processing the answer text data, the dialogue text data and the prior knowledge under each scene dimension by using the classification label generation model to generate a target classification label of the target object.
It should be noted that the classification label generation model in the above embodiments does not include the process of data preprocessing the dialogue audio data of the target object, and in some embodiments, the process of data preprocessing the dialogue audio data of the target object may also be integrated in the classification label generation model. Correspondingly, when the classification label generation model is used, after the dialogue audio data of the target object are obtained, the dialogue audio data are input into the classification label generation model to be subjected to data preprocessing, and after a plurality of scene dimensions, dialogue text data under each scene dimension and answer text data under each scene dimension which are related in the dialogue audio data are obtained, subsequent processing is performed. In the training stage, after the classification label generation model is obtained, the dialogue audio data and the preset table of the target object are processed by the classification label generation model to obtain the target classification label of the target object.
All the above technical solutions can be combined arbitrarily to form the optional embodiments of the present application, and are not described herein again.
In order to better implement the tag generation method according to the embodiment of the present application, an embodiment of the present application further provides a tag generation apparatus, where the tag generation apparatus corresponds to a tag generation method using a classification tag generation model. Referring to fig. 10, fig. 10 is a schematic structural diagram of a label generating apparatus according to an embodiment of the present application. The tag generation apparatus may include an acquisition module 601, an identification module 602, a first determination module 603, a second determination module 604, and a generation module 605.
The obtaining module 601 is configured to obtain dialog audio data corresponding to a target object.
The recognition module 602 is configured to perform recognition processing on the dialog audio data to obtain a plurality of scene dimensions, dialog text data in each scene dimension, and answer text data in each scene dimension, which are related in the dialog audio data.
The obtaining module 601 is further configured to obtain a preset table corresponding to the target object, where the preset table includes prior knowledge in each scene dimension, and the prior knowledge is predetermined based on the dialog audio data.
A first determining module 603, configured to determine, according to the dialog text data, the answer text data, and the priori knowledge in each scene dimension, and timing information of occurrence of each scene dimension in the dialog audio data, a first classification result of the target object in a timing coding dimension.
A second determining module 604, configured to determine, according to the dialog text data, the answer text data, and the priori knowledge in each scene dimension, a second classification result of the target object in the preset table dimension.
A generating module 605, configured to generate a target classification label of the target object according to the first classification result and the second classification result.
The embodiment of the application further provides a label generation device, and the label generation device corresponds to a label generation method for training the obtained classification label generation model. Referring to fig. 11, fig. 11 is a schematic structural diagram of a tag generation apparatus according to an embodiment of the present disclosure. The label generation apparatus may include a training acquisition module 701, a training identification module 702, a first training determination module 703, a second training determination module 704, a loss value determination module 705, and an update module 706.
A training obtaining module 701, configured to obtain training dialog audio data corresponding to a training target object and a first label result of the training target object in a time sequence coding dimension, and obtain a training preset table of the training target object, where the training preset table includes training prior knowledge in each scene dimension of multiple scene dimensions, and the training prior knowledge is predetermined based on the training audio data.
A training recognition module 702, configured to perform recognition processing on the training dialogue audio data to obtain a plurality of scene dimensions, training dialogue text data in each scene dimension, and training answer text data in each scene dimension, which are involved in the training dialogue audio data.
A training obtaining module 701, configured to obtain an initial classification label generation model.
A first training determining module 703, configured to determine, according to the training dialogue text data, the training answer text data, and the training priori knowledge in each scene dimension, and timing information appearing in each scene dimension in the training dialogue audio data, a training first classification result of the target object in the timing coding dimension by using the initial classification tag generation model.
A second training determining module 704, configured to determine a training scene score of the training target object in each scene dimension by using the initial classification label generation model according to the training dialogue text data, the training answer text data, and the training priori knowledge in each scene dimension.
A loss value determining module 705, configured to determine an overall loss value according to the training first classification result, the first label result, the training scenario score, and the training prior knowledge.
An updating module 706, configured to update the training parameters in the initial classification label generation model according to the total loss value until the total loss value meets a preset condition, so as to obtain a trained classification label generation model.
In an embodiment, the tag generating apparatus may further include a processing module, where the processing module is configured to process, by using the classification tag generation model, the dialog text data in each scene dimension of the multiple scene dimensions related to the target object, the answer text data in each scene dimension, and the preset table corresponding to the target object, so as to generate a target classification tag of the target object.
In an embodiment, the training obtaining module is further configured to obtain a plurality of scene dimensions, dialog text data in each scene dimension, answer text data in each scene dimension, and a preset table corresponding to the target object, where the preset table includes a priori knowledge in each scene dimension, and the a priori knowledge is predetermined based on the dialog audio data. And the processing module is used for processing the answer text data, the dialogue text data and the priori knowledge under each scene dimension by using the classification label generation model to generate a target classification label of the target object.
All the above technical solutions can be combined arbitrarily to form the optional embodiments of the present application, and are not described herein again.
Correspondingly, the embodiment of the application also provides a computer device, and the computer device can be a terminal or a server. As shown in fig. 12, fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer apparatus 800 includes a processor 801 having one or more processing cores, a memory 802 having one or more computer-readable storage media, and a computer program stored on the memory 802 and operable on the processor. The processor 801 is electrically connected to the memory 802. Those skilled in the art will appreciate that the computer device configurations illustrated in the figures are not meant to be limiting of computer devices and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components.
The processor 801 is a control center of the computer apparatus 800, connects respective parts of the entire computer apparatus 800 by various interfaces and lines, performs various functions of the computer apparatus 800 and processes data by running or loading software programs (computer programs) and/or modules stored in the memory 802, and calling data stored in the memory 802, thereby monitoring the computer apparatus 800 as a whole.
In the embodiment of the present application, the processor 801 in the computer device 800 loads instructions corresponding to one or more processes of an application program/computer program into the memory 802, and the processor 801 executes the application programs/computer programs stored in the memory 802, thereby implementing various functions as follows:
obtaining dialogue audio data corresponding to a target object, and identifying and processing the dialogue audio data to obtain a plurality of scene dimensions, dialogue text data under each scene dimension and answer text data under each scene dimension, which are related in the dialogue audio data; acquiring a preset scale corresponding to the target object, wherein the preset scale comprises prior knowledge under each scene dimension, and the prior knowledge is predetermined based on the dialogue audio data; determining a first classification result of the target object on a time sequence coding dimension according to the dialog text data, the answer text data and the prior knowledge under each scene dimension and the time sequence information of each scene dimension in the dialog audio data; determining a second classification result of the target object on the preset scale dimension according to the dialogue text data, the answer text data and the priori knowledge under each scene dimension; generating a target classification label of the target object according to the first classification result and the second classification result; alternatively,
acquiring training dialogue audio data corresponding to a training target object and a first label result of the training target object in a time sequence coding dimension, and acquiring a training preset table of the training target object, wherein the training preset table comprises training prior knowledge under each scene dimension in a plurality of scene dimensions, and the training prior knowledge is predetermined based on the training audio data; performing recognition processing on the training dialogue audio data to obtain a plurality of scene dimensions, training dialogue text data under each scene dimension, and training answer text data under each scene dimension, which are related in the training dialogue audio data; acquiring an initial classification label generation model; according to the training dialogue text data, the training answer text data and the training priori knowledge in each scene dimension and the time sequence information appearing in each scene dimension in the training dialogue audio data, generating a model by using the initial classification label, and determining a training first classification result of the training target object in the time sequence coding dimension; according to the training dialogue text data, the training answer text data and the training priori knowledge in each scene dimension, generating a model by using the initial classification label, and determining a training scene score of the training target object in each scene dimension; determining a total loss value according to the training first classification result, the first label result, the training scene score and the training prior knowledge; and updating the training parameters in the initial classification label generation model according to the total loss value until the total loss value meets a preset condition so as to obtain a trained classification label generation model.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Optionally, as shown in fig. 12, the computer device 800 further includes: a touch display 803, a radio frequency circuit 804, an audio circuit 805, an input unit 806, and a power supply 807. The processor 801 is electrically connected to the touch display 803, the radio frequency circuit 804, the audio circuit 805, the input unit 806, and the power supply 807 respectively.
The touch display screen 803 can be used for displaying a graphical user interface and receiving operation instructions generated by a user acting on the graphical user interface. The touch display 803 may include a display panel and a touch panel. The display panel may be used, among other things, to display information entered by or provided to a user and various graphical user interfaces of the computer device, which may be made up of graphics, text, icons, video, and any combination thereof. Alternatively, the Display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. The touch panel may be used to collect touch operations of a user on or near the touch panel (for example, operations of the user on or near the touch panel using any suitable object or accessory such as a finger, a stylus pen, and the like), and generate corresponding operation instructions, and the operation instructions execute corresponding programs. The touch panel may overlay the display panel, and when the touch panel detects a touch operation thereon or nearby, the touch panel transmits the touch operation to the processor 801 to determine the type of the touch event, and then the processor 801 provides a corresponding visual output on the display panel according to the type of the touch event. In the embodiment of the present application, a touch panel and a display panel may be integrated into the touch display screen 803 to realize input and output functions. However, in some embodiments, the touch panel and the display panel can be implemented as two separate components to perform the input and output functions. That is, the touch display 803 may also be used as a part of the input unit 806 to implement an input function.
In the embodiment of the present application, the touch display screen 803 is used for presenting a graphical user interface and receiving an operation instruction generated by a user acting on the graphical user interface.
The radio frequency circuit 804 may be used for transceiving radio frequency signals to establish wireless communication with a network device or other computer device through wireless communication, and to transceive signals with the network device or other computer device.
The audio circuit 805 may be used to provide an audio interface between a user and a computer device through speakers and microphones. The audio circuit 805 may transmit the electrical signal converted from the received audio data to a speaker, and convert the electrical signal into an audio signal for output; on the other hand, the microphone converts the collected sound signal into an electrical signal, which is received by the audio circuit 805 and converted into audio data, and the audio data is processed by the audio data output processor 801 and then transmitted to another computer device via the rf circuit 804, or the audio data is output to the memory 802 for further processing. The audio circuit 805 may also include an earbud jack to provide communication of peripheral headphones with the computer device.
The input unit 806 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 807 is used to power the various components of the computer device 800. Optionally, the power supply 807 may be logically connected to the processor 801 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system. The power supply 807 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
Although not shown in fig. 12, the computer device 800 may further include a camera, a sensor, a wireless fidelity module, a bluetooth module, etc., which are not described in detail herein.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the present application provides a computer-readable storage medium, in which a plurality of computer programs are stored, where the computer programs can be loaded by a processor to execute the steps in any one of the label generation methods provided in the embodiments of the present application. For example, the computer program may perform the steps of:
obtaining dialogue audio data corresponding to a target object, and identifying and processing the dialogue audio data to obtain a plurality of scene dimensions, dialogue text data under each scene dimension and answer text data under each scene dimension, which are related in the dialogue audio data; acquiring a preset scale corresponding to the target object, wherein the preset scale comprises prior knowledge under each scene dimension, and the prior knowledge is predetermined based on the dialogue audio data; determining a first classification result of the target object on a time sequence coding dimension according to the dialog text data, the answer text data and the prior knowledge under each scene dimension and the time sequence information of each scene dimension in the dialog audio data; determining a second classification result of the target object on the preset scale dimension according to the conversation text data, the answer text data and the prior knowledge under each scene dimension; generating a target classification label of the target object according to the first classification result and the second classification result; alternatively,
acquiring training dialogue audio data corresponding to a training target object and a first label result of the training target object in a time sequence coding dimension, and acquiring a training preset table of the training target object, wherein the training preset table comprises training prior knowledge under each scene dimension in a plurality of scene dimensions, and the training prior knowledge is predetermined based on the training audio data; performing recognition processing on the training dialogue audio data to obtain a plurality of scene dimensions, training dialogue text data under each scene dimension, and training answer text data under each scene dimension, which are related in the training dialogue audio data; obtaining an initial classification label generation model; according to the training dialogue text data, the training answer text data and the training priori knowledge under each scene dimension and the time sequence information appearing in each scene dimension in the training dialogue audio data, a model is generated by utilizing the initial classification label, and a training first classification result of the training target object on the time sequence coding dimension is determined; according to the training dialogue text data, the training answer text data and the training priori knowledge in each scene dimension, generating a model by using the initial classification label, and determining a training scene score of the training target object in each scene dimension; determining an overall loss value according to the training first classification result, the first label result, the training scenario score and the training prior knowledge; and updating the training parameters in the initial classification label generation model according to the total loss value until the total loss value meets a preset condition so as to obtain a trained classification label generation model.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), random Access Memory (RAM), magnetic or optical disks, and the like.
Since the computer program stored in the storage medium can execute the steps in any label generation method provided in the embodiments of the present application, beneficial effects that can be achieved by any label generation method provided in the embodiments of the present application can be achieved, and detailed descriptions are omitted herein for the foregoing embodiments.
The foregoing detailed description is directed to a tag generation method, apparatus, storage medium, and computer device provided in the embodiments of the present application, and specific examples are applied in the present application to explain the principles and implementations of the present application, and the descriptions of the foregoing embodiments are only used to help understand the method and core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (16)

1. A tag generation method, comprising:
obtaining dialogue audio data corresponding to a target object, and identifying and processing the dialogue audio data to obtain a plurality of scene dimensions, dialogue text data under each scene dimension and answer text data under each scene dimension, which are related in the dialogue audio data;
acquiring a preset scale corresponding to the target object, wherein the preset scale comprises prior knowledge under each scene dimension, and the prior knowledge is predetermined based on the dialogue audio data;
determining a first classification result of the target object on a time sequence coding dimension according to the dialog text data, the answer text data and the prior knowledge under each scene dimension and the time sequence information of each scene dimension in the dialog audio data;
determining a second classification result of the target object on a preset scale dimension according to the conversation text data, the answer text data and the prior knowledge under each scene dimension;
and generating a target classification label of the target object according to the first classification result and the second classification result.
2. The method of claim 1, wherein the step of determining a first classification result of the target object in a time-series coding dimension according to the dialog text data, the answer text data, and the a priori knowledge in each scene dimension, and the time-series information of the occurrence of each scene dimension in the dialog audio data comprises:
determining semantic fusion characteristics under each scene dimension according to the dialogue text data, the answer text data and the priori knowledge under each scene dimension;
and performing time sequence coding processing on the semantic fusion features under each scene dimension according to the time sequence information appearing in each scene dimension in the dialogue audio data so as to determine a first classification result of the target object on the time sequence coding dimension.
3. The method of claim 2, wherein the step of determining semantic fusion features for each scene dimension based on the dialog text data, the answer text data, and the a priori knowledge for each scene dimension comprises:
respectively carrying out feature coding processing on the dialogue text data, the answer text data and the priori knowledge under each scene dimension to obtain dialogue text features, answer text features and priori knowledge features;
performing fusion processing on the priori knowledge characteristics, the dialogue text characteristics and the answer text characteristics to obtain fusion characteristics under each scene dimension;
and performing semantic processing on the fusion features by using an attention mechanism to obtain semantic fusion features under each scene dimension.
4. The method according to claim 3, wherein the step of fusing the a priori knowledge feature, the dialogue text feature and the answer text feature to obtain a fused feature in each scene dimension comprises:
performing interactive processing on the prior knowledge characteristic and the dialogue text characteristic to obtain a text knowledge interactive characteristic under each scene dimension;
and performing first fusion processing on the text knowledge interaction features and the answer text features under each scene dimension to obtain fusion features under each scene dimension.
5. The method according to claim 4, wherein the a priori knowledge characteristics in each scene dimension include a plurality of knowledge characteristics, and the step of interactively processing the a priori knowledge characteristics and the dialog text characteristics to obtain text knowledge interactive characteristics in each scene dimension includes:
determining a relationship weight between each knowledge feature in each scene dimension and each text feature in the dialog text features;
determining a weight matrix between the prior knowledge characteristic and the dialogue text characteristic under each scene dimension according to the relation weight;
and processing the dialog text features by using the weight matrix to obtain text knowledge interaction features of the dialog text features fused with the prior knowledge.
6. The method according to claim 2, wherein the step of performing time-series encoding processing on the semantic fusion features in each scene dimension according to the time-series information of occurrence of each scene dimension in the dialogue audio data to determine the first classification result of the target object in the time-series encoding dimension comprises:
performing time sequence series coding processing on the semantic fusion features under each scene dimension according to the time sequence information appearing in each scene dimension in the dialogue audio data to obtain global semantic representation information comprising context information;
and carrying out semantic classification processing on the global semantic representation information to obtain a first classification result of the target object on a time sequence coding dimension.
7. The method according to claim 1, wherein the step of determining a second classification result of the target object in a preset scale dimension according to the dialog text data, the answer text data and the a priori knowledge in each scene dimension comprises:
determining a scene score under each scene dimension according to the dialogue text data, the answer text data and the priori knowledge under each scene dimension;
and determining a second classification result of the target object on the preset scale dimension according to the scene score under each scene dimension.
8. The method of claim 7, wherein the step of determining a scene score for each scene dimension based on the dialog text data, the answer text data, and the a priori knowledge for each scene dimension comprises:
respectively carrying out feature coding processing on the dialog text data, the answer text data and the prior knowledge under each scene dimension to obtain a dialog text feature, an answer text feature and a prior knowledge feature;
performing fusion processing on the priori knowledge characteristics, the dialogue text characteristics and the answer text characteristics to obtain fusion characteristics under each scene dimension;
and carrying out normalization processing on the fusion characteristics to obtain a scene score under each scene dimension.
9. The method according to claim 1, wherein the step of performing recognition processing on the dialogue audio data to obtain a plurality of scene dimensions, dialogue text data in each scene dimension, and answer text data in each scene dimension, which are involved in the dialogue audio data, comprises:
performing role separation processing and role classification processing on the conversation audio data to obtain at least two different role objects, wherein the at least two different role objects comprise target objects;
performing voice recognition processing on the conversation audio data to obtain conversation text data corresponding to at least two different role objects and answer text data of the target object;
and carrying out scene dimension identification on the dialogue text data corresponding to at least two different role objects to obtain a plurality of scene dimensions, dialogue text data under each scene dimension and answer text data under each scene dimension related in the dialogue audio data.
10. A tag generation method, comprising:
acquiring training dialogue audio data corresponding to a training target object and a first label result of the training target object on a time sequence coding dimension, and acquiring a training preset table of the training target object, wherein the training preset table comprises training prior knowledge under each scene dimension of a plurality of scene dimensions, and the training prior knowledge is predetermined based on the training audio data;
performing recognition processing on the training dialogue audio data to obtain a plurality of scene dimensions, training dialogue text data under each scene dimension, and training answer text data under each scene dimension, which are related in the training dialogue audio data;
obtaining an initial classification label generation model;
according to the training dialogue text data, the training answer text data and the training priori knowledge under each scene dimension and the time sequence information appearing in each scene dimension in the training dialogue audio data, a model is generated by utilizing the initial classification label, and a training first classification result of the training target object on the time sequence coding dimension is determined;
according to the training dialogue text data, the training answer text data and the training priori knowledge under each scene dimension, a model is generated by utilizing the initial classification label, and a training scene score of the training target object under each scene dimension is determined;
determining a total loss value according to the training first classification result, the first label result, the training scene score and the training prior knowledge;
and updating the training parameters in the initial classification label generation model according to the total loss value until the total loss value meets a preset condition so as to obtain a trained classification label generation model.
11. The method of claim 10, wherein the step of determining an overall loss value based on the training first classification result, the first label result, the training scenario score, and the training prior knowledge comprises:
determining a first loss value according to the training first classification result and the first label result;
determining a second loss value according to the training scene score and the training priori knowledge under each scene dimension;
determining an overall loss value from the first loss value and the second loss value.
12. The method of claim 10, wherein the step of determining a second loss value based on the training scene score and the training prior knowledge for each scene dimension comprises:
determining a scene loss value under each scene dimension according to the training scene score under each scene dimension and training priori knowledge under the corresponding scene dimension;
and determining a second loss value according to the scene loss value under each scene dimension.
13. A label generation apparatus, comprising:
an acquisition module, configured to acquire dialogue audio data corresponding to a target object;
a recognition module, configured to perform recognition processing on the dialogue audio data to obtain a plurality of scene dimensions involved in the dialogue audio data, dialogue text data under each scene dimension, and answer text data under each scene dimension;
the acquisition module being further configured to acquire a preset scale corresponding to the target object, wherein the preset scale comprises prior knowledge under each scene dimension, and the prior knowledge is predetermined based on the dialogue audio data;
a first determining module, configured to determine a first classification result of the target object on a time sequence coding dimension according to the dialogue text data, the answer text data and the prior knowledge under each scene dimension, and time sequence information of the occurrence of each scene dimension in the dialogue audio data;
a second determining module, configured to determine a second classification result of the target object on the preset scale dimension according to the dialogue text data, the answer text data and the prior knowledge under each scene dimension;
and a generation module, configured to generate a target classification label of the target object according to the first classification result and the second classification result.
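The apparatus of claim 13 is the inference-time counterpart of the method: its modules form a short pipeline from dialogue audio to the target classification label. The class below is a purely structural sketch of that pipeline; the class name, attribute names and method names are illustrative and are not taken from the patent.

```python
# Structural sketch of the claim-13 apparatus; all identifiers are illustrative.
class LabelGenerationApparatus:
    def __init__(self, acquisition, recognition, first_determiner,
                 second_determiner, generator):
        self.acquisition = acquisition              # fetches dialogue audio and the preset scale
        self.recognition = recognition              # audio -> scene dimensions, dialogue/answer text
        self.first_determiner = first_determiner    # result on the time sequence coding dimension
        self.second_determiner = second_determiner  # result on the preset scale dimension
        self.generator = generator                  # merges both results into the target label

    def run(self, target_object):
        audio = self.acquisition.get_dialogue_audio(target_object)
        scenes, dialogue_text, answer_text, time_seq = self.recognition.process(audio)
        prior = self.acquisition.get_preset_scale(target_object)
        first = self.first_determiner(dialogue_text, answer_text, prior, time_seq)
        second = self.second_determiner(dialogue_text, answer_text, prior)
        return self.generator.generate(first, second)
```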
14. A label generation apparatus, comprising:
a training acquisition module, configured to acquire training dialogue audio data corresponding to a training target object and a first label result of the training target object on a time sequence coding dimension, and to acquire a training preset scale of the training target object, wherein the training preset scale comprises training prior knowledge under each of a plurality of scene dimensions, and the training prior knowledge is predetermined based on the training dialogue audio data;
a training recognition module, configured to perform recognition processing on the training dialogue audio data to obtain a plurality of scene dimensions involved in the training dialogue audio data, training dialogue text data under each scene dimension, and training answer text data under each scene dimension;
the training acquisition module being further configured to acquire an initial classification label generation model;
a first training determination module, configured to determine, by using the initial classification label generation model, a training first classification result of the training target object on the time sequence coding dimension according to the training dialogue text data, the training answer text data and the training prior knowledge under each scene dimension, and time sequence information of the occurrence of each scene dimension in the training dialogue audio data;
a second training determination module, configured to determine, by using the initial classification label generation model, a training scene score of the training target object under each scene dimension according to the training dialogue text data, the training answer text data and the training prior knowledge under each scene dimension;
a loss value determination module, configured to determine a total loss value according to the training first classification result, the first label result, the training scene score and the training prior knowledge;
and an update module, configured to update the training parameters in the initial classification label generation model according to the total loss value until the total loss value meets a preset condition, so as to obtain the trained classification label generation model.
15. A computer-readable storage medium, characterized in that it stores a computer program adapted to be loaded by a processor for performing the steps of the label generation method according to any one of claims 1-12.
16. A computer device, characterized in that the computer device comprises a memory in which a computer program is stored and a processor which performs the steps in the label generation method according to any one of claims 1-12 by calling the computer program stored in the memory.
CN202211236358.5A 2022-10-10 2022-10-10 Label generation method and device, computer readable storage medium and computer equipment Pending CN115617992A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211236358.5A CN115617992A (en) 2022-10-10 2022-10-10 Label generation method and device, computer readable storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211236358.5A CN115617992A (en) 2022-10-10 2022-10-10 Label generation method and device, computer readable storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN115617992A true CN115617992A (en) 2023-01-17

Family

ID=84861845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211236358.5A Pending CN115617992A (en) 2022-10-10 2022-10-10 Label generation method and device, computer readable storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN115617992A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 230088 floor 23-24, building A5, No. 666, Wangjiang West Road, high tech Zone, Hefei, Anhui Province

Applicant after: IFLYTEK Medical Technology Co.,Ltd.

Applicant after: Anhui Xunfei Medical Co.,Ltd. Wuhan Branch

Address before: 230088 floor 23-24, building A5, No. 666, Wangjiang West Road, high tech Zone, Hefei, Anhui Province

Applicant before: Anhui Xunfei Medical Co.,Ltd.

Applicant before: Anhui Xunfei Medical Co.,Ltd. Wuhan Branch
