CN115063155B - Data labeling method, device, computer equipment and storage medium - Google Patents

Data labeling method, device, computer equipment and storage medium

Info

Publication number
CN115063155B
CN115063155B · Application CN202210731923.9A
Authority
CN
China
Prior art keywords
emotion
audio
data
text
marking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210731923.9A
Other languages
Chinese (zh)
Other versions
CN115063155A (en)
Inventor
陈杭
陈子意
朱益兴
于欣璐
李骁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Bank Co Ltd filed Critical Ping An Bank Co Ltd
Priority to CN202210731923.9A priority Critical patent/CN115063155B/en
Publication of CN115063155A publication Critical patent/CN115063155A/en
Application granted granted Critical
Publication of CN115063155B publication Critical patent/CN115063155B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F40/00 Handling natural language data
                    • G06F40/30 Semantic analysis
            • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
                • G06Q30/00 Commerce
                    • G06Q30/01 Customer relationship services
                • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
                    • G06Q40/02 Banking, e.g. interest calculation or account maintenance
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00 Speech recognition
                    • G10L15/26 Speech to text systems
                • G10L17/00 Speaker identification or verification techniques
                    • G10L17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Finance (AREA)
  • Marketing (AREA)
  • Acoustics & Sound (AREA)
  • Strategic Management (AREA)
  • Computational Linguistics (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Human Computer Interaction (AREA)
  • General Business, Economics & Management (AREA)
  • Multimedia (AREA)
  • Technology Law (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application discloses a data labeling method, a device, computer equipment and a storage medium. The method includes: selecting first data to be marked and second data to be marked from historical call data of customer service agents and customers; manually marking the first data to be marked to obtain a first marking audio and a first marking audio text with emotion labels; acquiring first sound features and first text features based on the positions of the emotion labels in the audio and the text; performing emotion recognition on the first sound features and first text features through an emotion analysis model and training the model based on the recognition results and the emotion marks; segmenting the second data to be marked and acquiring second sound features and second text features based on the segmentation results; and performing emotion recognition based on the second sound features and second text features through the emotion analysis model and automatically applying emotion labels. In this way, semi-supervised data labeling is achieved and labor cost is saved.

Description

Data labeling method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data labeling method, a data labeling device, a computer device, and a storage medium.
Background
With the development of the economy and of financial technology, ever higher demands are placed on the service quality of bank customer service agents. Banks therefore deploy a corresponding monitoring model: during a call, in addition to the conversation between the agent and the customer, the monitoring model collects the information input by the customer in order to analyze and judge the customer's emotion, and prompts the agent according to the analysis result. This prevents an agent's inexperience or personal mood from affecting the customer's emotion and, in turn, the quality of the call.
Before the monitoring model can be used, it must be trained to improve its analysis and judgment capability. Training requires a training set, which is usually built from historical call data.
At present, part of the historical call data is selected from a database and marked manually to obtain the training set, which requires considerable labor cost.
Disclosure of Invention
The embodiment of the application provides a data labeling method, a data labeling device, computer equipment and a storage medium, which are intended to solve the problem described in the background section.
In a first aspect, an embodiment of the present application provides a data labeling method, where the method includes:
Selecting first data to be marked and second data to be marked from historical call data of a customer service agent and a customer, and sending the first data to be marked to a manual end for manual marking to obtain first marking data, wherein the first marking data comprises a first marking audio and a first marking audio text with emotion labels, each emotion label comprises an emotion mark and the position information of the audio or audio text marked by that emotion mark, and the emotion marks recorded in the emotion labels of the first marking audio and the first marking audio text are identical;
acquiring an emotion marking audio segment in the first marking audio based on the position information in the emotion label of the first marking audio, and acquiring a first sound characteristic of the emotion marking audio segment;
Acquiring emotion marking sentence segments of the first marking data based on the position information in the emotion label of the first marking audio text, and acquiring first text characteristics of the emotion marking sentence segments;
carrying out emotion recognition based on the first sound feature and the first text feature through an emotion analysis model to obtain an emotion recognition result, and training the emotion analysis model based on the emotion recognition result and the emotion label;
Performing audio segment segmentation on the second data to be marked to obtain multiple segments of audio data, and obtaining corresponding audio texts based on the multiple segments of audio data;
and acquiring second sound characteristics of the audio data and second text characteristics of the text data, carrying out emotion recognition on the basis of the second sound characteristics and the second text characteristics through the emotion analysis model, and carrying out automatic labeling of emotion labels on the audio data and the text data on the basis of recognition results.
In some embodiments, the obtaining the emotion marking audio segment in the first marked audio based on the position information in the emotion tag of the first marked audio, and obtaining the first sound feature of the emotion marking audio segment include:
classifying the emotion marking audio segments based on the emotion identifications;
And acquiring a sound spectrogram of the emotion marking audio segments of the same category by using a preset audio algorithm, and acquiring a first sound feature for representing emotion based on the sound spectrogram.
In some embodiments, the obtaining the emotion markup sentence segment of the first markup data based on the location information in the emotion tag of the first markup audio text, and obtaining the first text feature of the emotion markup sentence segment includes:
Classifying the emotion marking sentence segments based on the emotion identifications;
And acquiring emotion feature words of the emotion marking sentence segments in the same category by using a preset text algorithm, and acquiring first text features for representing emotion based on the emotion feature words.
In some embodiments, the acquiring the second sound feature of the audio data and the second text feature of the text data, performing emotion recognition based on the second sound feature and the second text feature through the emotion analysis model, labeling the audio data and the text data with emotion labels based on recognition results, includes:
Setting an audio weight and a text weight in the emotion analysis model;
and when the emotion identifications recognized from a second sound feature and its corresponding second text feature are different, determining the emotion identification of the second sound feature and the corresponding second text feature based on the weight ratio of the audio weight to the text weight.
In some embodiments, the selecting the first to-be-marked data and the second to-be-marked data from the historical call data of the customer service agent and the customer, and sending the first to-be-marked data to the manual end for manual marking to obtain first marked data, where the first marked data includes a first marked audio with an emotion tag and a first marked audio text, the emotion tag includes an emotion identifier and position information of the audio or audio text marked by the emotion identifier, and the emotion identifiers recorded in the emotion tags of the first marked audio and the first marked audio text are the same, and the method includes:
Selecting first to-be-marked data from historical call data of a customer service agent and a customer of a bank, wherein the first to-be-marked data comprises first to-be-marked audio;
inputting the first audio to be marked into a voice separation model, and separating and marking the first audio to be marked according to voiceprint characteristics of different speakers by the voice separation model;
inputting the processed first audio to be annotated into a text recognition model to obtain a corresponding first audio text to be annotated;
And selecting second data to be marked from the rest historical call data, and sending the processed first audio to be marked and the first audio text to be marked to a manual terminal for manual marking.
In some embodiments, the inputting the processed first to-be-annotated audio into the text recognition model to obtain a first to-be-annotated text corresponding to the first to-be-annotated audio includes:
Inputting the first audio to be marked into a text recognition model, recognizing, by the text recognition model, the semantics of the speech segments in the first audio to be marked, and determining the blank positions of the blank segments in the first audio to be marked, wherein a blank segment is a silent segment in the first audio to be marked;
Based on the semantic recognition result and the blank position, obtaining an initial text;
And inputting the initial text into a deep neural network model, determining which blank positions are punctuation mark positions, automatically marking the punctuation marks, and connecting the sentences adjacent to the remaining blank positions to obtain the first audio text to be marked.
In some embodiments, after the first data to be marked and the second data to be marked are selected from the historical call data of the customer service agent and the customer, the method further includes:
And preprocessing the first data to be marked and the second data to be marked, wherein the preprocessing comprises noise reduction processing.
In a second aspect, an embodiment of the present application provides a data labeling apparatus, where the apparatus includes:
The manual marking unit is used for selecting first data to be marked and second data to be marked from historical call data of a customer service agent and a customer, and sending the first data to be marked to a manual end for manual marking to obtain first marking data, wherein the first marking data comprises a first marking audio and a first marking audio text with emotion labels, each emotion label comprises an emotion mark and the position information of the audio or audio text marked by that emotion mark, and the emotion marks recorded in the emotion labels of the first marking audio and the first marking audio text are identical;
the sound characteristic acquisition unit is used for acquiring an emotion marking audio segment in the first marked audio based on the position information in the emotion tag of the first marked audio and acquiring a first sound characteristic of the emotion marking audio segment;
The text feature acquisition unit is used for acquiring emotion marking sentence segments of the first marking data based on the position information in the emotion labels of the first marking audio text and acquiring first text features of the emotion marking sentence segments;
The model training unit is used for carrying out emotion recognition based on the first sound features and the first text features through an emotion analysis model to obtain an emotion recognition result, and training the emotion analysis model based on the emotion recognition result and the emotion label;
the segmentation processing unit is used for segmenting the audio segments of the second data to be marked to obtain multiple segments of audio data, and obtaining corresponding audio texts based on the multiple segments of audio data;
And the automatic labeling unit is used for acquiring a second sound characteristic of the audio data and a second text characteristic of the text data, carrying out emotion recognition on the basis of the second sound characteristic and the second text characteristic through the emotion analysis model, and carrying out automatic labeling on emotion labels on the audio data and the text data on the basis of recognition results.
In a third aspect, an embodiment of the present application further provides a computer device, including a memory and a processor, where the memory is configured to store instructions and data, and the processor is configured to perform the data labeling method described above.
In a fourth aspect, embodiments of the present application further provide a storage medium having stored therein a plurality of instructions adapted to be loaded by a processor to perform the data tagging method described above.
According to the data labeling method, first data to be marked and second data to be marked are selected from historical call data of a customer service agent and a customer, and the first data to be marked is sent to the manual end for manual marking. Features are obtained from the manual marking result, the model is trained based on these features, and the trained model then automatically recognizes and labels the second data to be marked. A semi-supervised data labeling mode is thereby achieved, and labor cost is saved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a data labeling method according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a data labeling device according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
The embodiment of the application provides a data labeling method, a device, computer equipment and a storage medium, wherein a model is trained through manually labeled data, and the model after training automatically identifies and labels the data to realize a semi-supervised data labeling mode.
Referring to fig. 1, fig. 1 is a flowchart of a data labeling method according to an embodiment of the present application, where the method includes the following steps:
101. The method comprises the steps of selecting first to-be-marked data and second to-be-marked data from historical call data of a customer service agent and a customer, and sending the first to-be-marked data to a manual end for manual marking to obtain first marking data, wherein the first marking data comprises a first marking audio and a first marking audio text with emotion labels, each emotion label comprises an emotion mark and the position information of the audio or audio text marked by that emotion mark, and the emotion marks recorded in the emotion labels of the first marking audio and the first marking audio text are identical.
When a customer service agent and a customer call, the system usually records and stores corresponding data, the data can be used as historical call data of the customer service agent and the customer, the historical call data are usually stored in a database, and the historical call data can be called out from the database when required.
When the first data to be marked and the second data to be marked are selected, the first data to be marked can be selected from the historical call data first and sent to the manual end, after which the second data to be marked is selected from the remaining historical call data. Alternatively, the first data to be marked and the second data to be marked can both be selected directly from the historical call data.
After the first data to be marked and the second data to be marked are selected, preprocessing is performed on the first data to be marked and the second data to be marked, wherein the preprocessing can comprise noise reduction processing and also can comprise mute filtering processing, and the preprocessing is not limited.
It can be understood that each first data to be marked is data of a bank customer service agent and a customer in a call, and each second data to be marked is data of a bank customer service agent and a customer in a call.
After the first data to be marked is sent to the artificial end, the first data to be marked is marked manually by a staff of the artificial end, and when the first data to be marked is marked manually, the staff can mark all sentence segments of the first data to be marked and can also select part of sentence segments to be marked.
The first data to be marked after manual marking is first marked data, the first marked data comprises first marked audio and first marked audio text, the first marked audio and the first marked audio text are provided with marked emotion labels, and the emotion labels are used for representing emotion of a user corresponding to the first data to be marked.
Optionally, the first data to be marked includes a first audio to be marked; the first audio to be marked is sent to a worker at the manual end for manual marking, and text transcription is performed on the first audio to be marked, so that the corresponding first audio text to be marked is obtained.
Optionally, the first data to be marked includes a first audio to be marked and a corresponding first audio text to be marked, and both are sent to a worker at the manual end for manual marking, so that the first marking audio and the first marking audio text are obtained.
In some embodiments, first to-be-marked data is selected from the historical call data of a customer service agent and a customer, where the first to-be-marked data includes a first audio to be marked. The first audio to be marked is input into a voice separation model, and the voice separation model separates and marks the first audio to be marked according to the voiceprint features of the different speakers. The processed first audio to be marked is input into a text recognition model to obtain the corresponding first audio text to be marked. Second to-be-marked data is then selected from the remaining historical call data, and the processed first audio to be marked and the first audio text to be marked are sent to the manual end for manual marking.
During a call, the data stored in the system at least comprises the audio data of the customer service agent and the audio data of the customer, so to improve labeling accuracy the data is preferably subjected to audio separation first. During audio separation, a sound spectrogram can be generated through the model and segmented, and the segments of the sound spectrogram are input into the recognition model to obtain the voiceprint features of the different speakers. The speakers in the audio are recognized according to the voiceprint features, the audio is separated according to the recognition result to obtain the audio corresponding to each speaker, and the separated audio streams are marked so that they can be distinguished.
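For illustration only, the sketch below shows one way such voiceprint-based separation could look in Python. The MFCC statistics and k-means clustering are stand-ins for the unspecified spectrogram and recognition models of this embodiment, and the window length, sample rate and speaker count are assumptions.

```python
# Sketch of speaker separation: crude per-window voiceprint features clustered
# into two speakers (agent/customer). Not the patent's actual voice separation model.
import librosa
import numpy as np
from sklearn.cluster import KMeans

def separate_speakers(path, win_s=1.0, n_speakers=2, sr=16000):
    y, sr = librosa.load(path, sr=sr)
    hop = int(win_s * sr)
    feats, spans = [], []
    for start in range(0, len(y) - hop, hop):
        frame = y[start:start + hop]
        mfcc = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=20)
        feats.append(mfcc.mean(axis=1))            # per-window voiceprint-like feature
        spans.append((start / sr, (start + hop) / sr))
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(np.array(feats))
    # each window is marked with a speaker id; which id is agent vs customer is unknown here
    return [{"start": s, "end": e, "speaker": int(l)} for (s, e), l in zip(spans, labels)]
```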
Further, the first audio to be marked is input into a text recognition model; the text recognition model recognizes the semantics of the speech segments in the first audio to be marked and determines the blank positions of the blank segments, a blank segment being a silent segment in the first audio to be marked. An initial text is obtained based on the semantic recognition result and the blank positions. The initial text is then input into a deep neural network model, which determines which blank positions are punctuation mark positions, automatically marks the punctuation, and connects the sentences adjacent to the remaining blank positions to obtain the first audio text to be marked.
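A minimal sketch of how the recognized sentences and the punctuation decisions at the blank positions could be assembled; `punct_model` is a hypothetical stand-in for the deep neural network, whose real inputs and outputs are not specified here.

```python
# Illustrative assembly of the first audio text to be marked from ASR output.
# `punct_model` is hypothetical; it decides the punctuation for each silent gap.
def assemble_text(asr_segments, punct_model):
    """asr_segments: list of (sentence_text, gap_after_seconds) pairs from speech recognition."""
    pieces = []
    for text, gap in asr_segments:
        mark = punct_model.predict(text, gap)  # e.g. "。", "，", or "" when no punctuation is placed
        pieces.append(text + mark)             # "" means this sentence is connected to the next one
    return "".join(pieces)
```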
In the setting of the emotion label, the emotion label may include position information in the first marking audio or the first marking audio text, and the emotion mark may be set as excited emotion, neutral emotion, depressed emotion, and the like. Since the first marking audio corresponds to the first marking audio text, the emotion marks of the parts of the first marking audio and the first marking audio text that correspond to each other in semantics and position are the same.
One emotion tag may be marked in one sentence segment in the emotion tags of the first marked audio and the first marked audio text, and a plurality of emotion tags may also be marked in one sentence segment, which is not limited herein.
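For illustration, an emotion label of this kind could be represented by a small structure such as the following; the field names are assumptions, not the patent's schema.

```python
# Minimal sketch of an emotion label: an emotion mark plus the position information
# (a time range) of the audio or audio text it marks.
from dataclasses import dataclass

@dataclass
class EmotionTag:
    emotion: str    # e.g. "excited", "neutral", "depressed"
    start_s: float  # start of the marked span, in seconds
    end_s: float    # end of the marked span, in seconds

labels = [EmotionTag("excited", 12.4, 18.9), EmotionTag("neutral", 31.0, 35.2)]
```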
102. And acquiring an emotion marking audio segment in the first marking audio based on the position information in the emotion label of the first marking audio, and acquiring a first sound characteristic of the emotion marking audio segment.
In the embodiment of the application, each emotion label records an emotion mark and the position information of that mark in the audio. The position information can be represented by time, that is, the emotion mark corresponds to the audio segment within a certain time range of the audio and is used to represent the emotion of the speaker within that time range; the emotion mark can be set as excited emotion, neutral emotion, depressed emotion, and the like.
Audio segments of the first marking audio are extracted through the position information of the emotion labels; each extracted audio segment corresponds to one emotion label, and these audio segments are the emotion marking audio segments.
In addition to this extraction mode, the audio segments without emotion labels in the first marking audio can instead be filtered out, and the remaining audio segments are then the emotion marking audio segments.
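A minimal sketch of this extraction step, assuming the position information is a time range and the marked audio is available as a waveform array:

```python
# Sketch: slice the first marking audio into emotion marking audio segments
# by the time ranges recorded in the emotion labels.
def extract_tagged_segments(y, sr, tags):
    """tags: iterable of (emotion, start_s, end_s) triples derived from the emotion labels."""
    return [(emotion, y[int(start_s * sr):int(end_s * sr)]) for emotion, start_s, end_s in tags]
```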
In some embodiments, the emotion marking audio segments are classified based on emotion identification, sound spectrograms of the emotion marking audio segments of the same category are obtained by using a preset audio algorithm, and first sound features for representing emotion are obtained based on the sound spectrograms.
For example, if the emotion mark has categories such as excited emotion, neutral emotion and depressed emotion, the emotion marking audio segments are classified according to these categories to obtain an excited category, a neutral category and a depressed category. The sound spectrograms of the emotion marking audio segments in each of the three categories are obtained, and the first sound feature corresponding to each category is obtained by extracting features from the sound spectrograms of that category, which benefits the accuracy of model training.
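The "preset audio algorithm" is not specified; the sketch below assumes a mel spectrogram as the sound spectrogram and simple per-band statistics as the first sound feature of each category, consuming (emotion, waveform) pairs such as those produced by the extraction sketch above.

```python
# Sketch of per-category first sound features from sound spectrograms.
import librosa
import numpy as np

def sound_features_by_category(segments, sr=16000):
    """segments: list of (emotion, waveform) pairs taken from the emotion marking audio segments."""
    by_emotion = {}
    for emotion, y in segments:
        spec_db = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))
        by_emotion.setdefault(emotion, []).append(spec_db.mean(axis=1))  # per-band energy profile
    # one averaged feature vector per emotion category
    return {emotion: np.mean(vectors, axis=0) for emotion, vectors in by_emotion.items()}
```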
It can be understood that a plurality of emotion tags may be marked in a section of audio, when a section of audio is marked with a plurality of emotion tags, the emotion identification corresponding to each emotion tag is determined, and the final emotion identification of the section of audio is determined by setting the priority of the emotion identifications.
For example, a section of speech is marked with a first emotion label, a second emotion label and a third emotion label, where the first emotion mark is neutral emotion, the second is excited emotion and the third is neutral emotion. If the priority of the emotion marks set by the system is, from high to low, excited, depressed, neutral, then the final emotion mark of this section of speech is excited emotion.
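A sketch of this priority rule, with the priority order taken from the example above (excited over depressed over neutral); the order itself is a system setting.

```python
# Resolve the final emotion mark of a segment that carries several emotion labels.
PRIORITY = {"excited": 0, "depressed": 1, "neutral": 2}  # lower value = higher priority

def final_emotion(tag_emotions):
    return min(tag_emotions, key=lambda e: PRIORITY[e])

assert final_emotion(["neutral", "excited", "neutral"]) == "excited"  # matches the example above
```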
103. And acquiring emotion marking sentence segments of the first marking data based on the position information in the emotion labels of the first marking audio texts, and acquiring first text characteristics of the emotion marking sentence segments.
In the embodiment of the application, each emotion label records an emotion mark and the position information of that mark in the audio text. The position information can be represented by time, that is, the emotion mark corresponds to the section of audio text within a certain time range and is used to represent the emotion of the speaker within that time range; the emotion mark can be set as excited emotion, neutral emotion, depressed emotion, and the like.
And extracting audio text segments of the first marked audio text through the position information of the emotion tags, wherein the extracted audio text segments are audio text segments corresponding to each emotion tag, and the audio text segments are emotion marked sentence segments.
In addition to this extraction mode, the audio text segments without emotion labels in the first marking audio text can instead be filtered out, and the remaining audio text segments are then the emotion marking sentence segments.
In some embodiments, the emotion marking sentence segments are classified based on emotion identification, emotion feature words of the emotion marking sentence segments of the same category are obtained by using a preset text algorithm, and first text features for representing emotion are obtained based on the emotion feature words.
For example, if the emotion mark has categories such as excited emotion, neutral emotion and depressed emotion, the emotion marking sentence segments are classified according to these categories to obtain an excited category, a neutral category and a depressed category. The emotion feature words of the emotion marking sentence segments in each of the three categories are obtained, and the first text feature corresponding to each category is obtained by extracting features from the emotion feature words of that category, which benefits the accuracy of model training.
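The "preset text algorithm" is likewise unspecified; as one possible stand-in, the sketch below extracts high-weight TF-IDF terms per category from jieba-segmented sentence segments.

```python
# Sketch of per-category emotion feature words via word segmentation plus TF-IDF.
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

def feature_words_by_category(sentence_segments, top_k=20):
    """sentence_segments: list of (emotion, text) pairs from the emotion marking sentence segments."""
    by_emotion = {}
    for emotion, text in sentence_segments:
        by_emotion.setdefault(emotion, []).append(" ".join(jieba.lcut(text)))
    features = {}
    for emotion, docs in by_emotion.items():
        vectorizer = TfidfVectorizer()
        scores = vectorizer.fit_transform(docs).sum(axis=0).A1  # total TF-IDF weight per term
        terms = vectorizer.get_feature_names_out()
        top = sorted(zip(terms, scores), key=lambda pair: -pair[1])[:top_k]
        features[emotion] = [term for term, _ in top]           # candidate emotion feature words
    return features
```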
It can be understood that a plurality of emotion labels may be marked in a text segment, when a plurality of emotion labels are marked in the text segment, the emotion label corresponding to each emotion label is determined, and the final emotion label of the text segment is determined through setting the priority of the emotion labels.
For example, a section of text is marked with a first emotion label, a second emotion label and a third emotion label, where the first emotion mark is neutral emotion, the second is excited emotion and the third is neutral emotion. If the priority of the emotion marks set by the system is, from high to low, excited, depressed, neutral, then the final emotion mark of this section of text is excited emotion.
104. And carrying out emotion recognition based on the first sound feature and the first text feature through an emotion analysis model to obtain an emotion recognition result, and training the emotion analysis model based on the emotion recognition result and the emotion label.
In practice, the emotion analysis model is preset, and the first sound feature and the first text feature are used as training data and input into the emotion analysis model for training, so that the trained emotion analysis model has the corresponding recognition capability.
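The architecture of the emotion analysis model is not disclosed; the following is a minimal training sketch under the assumption that it is a simple classifier over concatenated sound and text feature vectors, trained against the manually marked emotion marks. Feature sizes and the dummy batch are illustrative.

```python
# Minimal training sketch; not the patent's actual emotion analysis model.
import torch
import torch.nn as nn

EMOTIONS = ["excited", "neutral", "depressed"]
sound_dim, text_dim = 128, 20  # assumed feature sizes

model = nn.Sequential(nn.Linear(sound_dim + text_dim, 128), nn.ReLU(), nn.Linear(128, len(EMOTIONS)))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# placeholder loader: one random batch standing in for (first sound feature, first text feature, emotion mark)
train_loader = [(torch.randn(8, sound_dim), torch.randn(8, text_dim), torch.randint(0, len(EMOTIONS), (8,)))]

for sound_feat, text_feat, label_idx in train_loader:
    logits = model(torch.cat([sound_feat, text_feat], dim=-1))  # emotion recognition result
    loss = loss_fn(logits, label_idx)                           # compared against the emotion marks
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```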
In some embodiments, the audio weight and the text weight are set in the emotion analysis model, and when the emotion identifications of the second sound feature and the corresponding second text feature are recognized to be different, the emotion identifications of the second sound feature and the corresponding second text feature are determined based on the weight ratio of the audio weight and the text weight.
For example, the audio weight is set to be greater than the text weight in the emotion analysis model. If the emotion identified from a second sound feature is excited emotion while the emotion identified from its corresponding second text feature is neutral emotion, then excited emotion is used as the emotion mark of both the second sound feature and the corresponding second text feature.
The weights are set so that an emotion mark can still be selected when the recognition results of a sound feature and its corresponding text feature are inconsistent, because when a person speaks, sound features such as tone and intonation may reflect the person's emotion better than the words themselves.
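A sketch of this weighting rule; the concrete weight values are assumptions.

```python
# When the two modalities disagree, the emotion from the modality with the larger weight wins.
AUDIO_WEIGHT, TEXT_WEIGHT = 0.6, 0.4  # assumed values; audio is weighted higher, as described above

def fuse(audio_emotion, text_emotion):
    if audio_emotion == text_emotion:
        return audio_emotion
    return audio_emotion if AUDIO_WEIGHT >= TEXT_WEIGHT else text_emotion

assert fuse("excited", "neutral") == "excited"  # matches the example above
```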
105. And performing audio segment segmentation on the second data to be marked to obtain multiple segments of audio data, and obtaining corresponding audio texts based on the multiple segments of audio data.
Since a person usually speaks one sentence at a time, especially in a conversation, when the second data to be marked is segmented into audio segments, the audio can be split by taking the continuous audio within a preset time as the segmentation basis.
After obtaining the multi-segment audio data, the corresponding audio text can be obtained through voice recognition, voice translation and other technologies.
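A sketch of pause-based segmentation, assuming the preset time refers to the minimum pause length between segments; the thresholds are illustrative, and the speech recognition step is left as a placeholder.

```python
# Sketch: split the second to-be-marked audio into segments wherever a pause
# longer than max_pause_s occurs; each segment can then be transcribed.
import librosa

def split_on_pauses(y, sr, max_pause_s=0.8, top_db=30):
    intervals = librosa.effects.split(y, top_db=top_db)  # non-silent [start, end] sample intervals
    segments, cur_start, cur_end = [], None, None
    for start, end in intervals:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif (start - cur_end) / sr <= max_pause_s:       # short pause: same utterance, keep merging
            cur_end = end
        else:                                             # long pause: close the current audio segment
            segments.append(y[cur_start:cur_end])
            cur_start, cur_end = start, end
    if cur_start is not None:
        segments.append(y[cur_start:cur_end])
    return segments

# each segment is then passed to speech recognition to obtain its audio text, e.g.:
# texts = [speech_to_text(segment) for segment in segments]   # speech_to_text is a placeholder
```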
106. And acquiring second sound characteristics of the audio data and second text characteristics of the text data, carrying out emotion recognition on the basis of the second sound characteristics and the second text characteristics through the emotion analysis model, and carrying out automatic labeling of emotion labels on the audio data and the text data on the basis of recognition results.
The second sound feature may be obtained in the same way as the first sound feature, and correspondingly, the second text feature may be obtained in the same way as the first text feature.
After the second sound feature and the second text feature are obtained, the second sound feature and the second text feature are input into an emotion analysis model, and the features are identified and automatically marked through the trained emotion analysis model.
Optionally, if second sound features and second text features appear that the emotion analysis model cannot recognize and label, these features may be handed over to the manual end for processing.
By setting a timer task for the emotion analysis model, the stored unrecognized second sound features and second text features are detected periodically and sent to the manual end for processing.
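A sketch of such a timer task; `fetch_unlabeled` and `send_to_manual_end` are hypothetical system hooks, and the check interval is an assumption.

```python
# Periodically hand unrecognized features over to the manual end.
import threading

CHECK_INTERVAL_S = 3600  # assumed check period

def check_unlabeled():
    pending = fetch_unlabeled()        # hypothetical hook: stored features the model could not label
    if pending:
        send_to_manual_end(pending)    # hypothetical hook: forward them for manual marking
    threading.Timer(CHECK_INTERVAL_S, check_unlabeled).start()  # reschedule the next check

threading.Timer(CHECK_INTERVAL_S, check_unlabeled).start()
```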
The data labeling method comprises the steps of selecting first data to be marked and second data to be marked from historical call data of a customer service agent and a customer, and sending the first data to be marked to a manual end for manual marking to obtain first marking data, wherein the first marking data comprises a first marking audio and a first marking audio text with emotion labels, each emotion label comprises an emotion mark and the position information of the audio or audio text marked by that emotion mark, and the emotion marks recorded in the emotion labels of the first marking audio and the first marking audio text are identical; acquiring an emotion marking audio segment in the first marking audio based on the position information in the emotion labels of the first marking audio, and acquiring a first sound feature of the emotion marking audio segment; acquiring emotion marking sentence segments of the first marking data based on the position information in the emotion labels of the first marking audio text, and acquiring first text features of the emotion marking sentence segments; performing emotion recognition based on the first sound features and the first text features through an emotion analysis model to obtain an emotion recognition result, and training the emotion analysis model based on the emotion recognition result and the emotion labels; segmenting the second data to be marked into multiple segments of audio data, and obtaining the corresponding audio texts based on the multiple segments of audio data; and acquiring second sound features of the audio data and second text features of the text data, performing emotion recognition based on the second sound features and the second text features through the emotion analysis model, and automatically labeling the audio data and the text data with emotion labels based on the recognition results. In this way, semi-supervised data labeling is realized, and labor cost is saved.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a data labeling device according to an embodiment of the application, and the data labeling device 200 includes the following units:
201. The manual marking unit is used for selecting first data to be marked and second data to be marked from historical call data of a customer service agent and a customer, and sending the first data to be marked to a manual end for manual marking to obtain first marking data, wherein the first marking data comprises a first marking audio and a first marking audio text with emotion labels, each emotion label comprises an emotion mark and the position information of the audio or audio text marked by that emotion mark, and the emotion marks recorded in the emotion labels of the first marking audio and the first marking audio text are identical.
202. The sound feature acquisition unit is used for acquiring the emotion marking audio segment in the first marking audio based on the position information in the emotion label of the first marking audio, and acquiring the first sound feature of the emotion marking audio segment.
203. The text feature acquisition unit is used for acquiring emotion marking sentence segments of the first marking data based on the position information in the emotion labels of the first marking audio texts and acquiring first text features of the emotion marking sentence segments.
204. The model training unit is used for carrying out emotion recognition based on the first sound features and the first text features through the emotion analysis model to obtain an emotion recognition result, and training the emotion analysis model based on the emotion recognition result and the emotion label.
205. The segmentation processing unit is used for segmenting the audio segments of the second data to be marked to obtain multiple segments of audio data, and obtaining corresponding audio texts based on the multiple segments of audio data.
206. And the automatic labeling unit is used for acquiring second sound characteristics of the audio data and second text characteristics of the text data, carrying out emotion recognition on the basis of the second sound characteristics and the second text characteristics through the emotion analysis model, and carrying out automatic labeling on emotion labels on the audio data and the text data on the basis of recognition results.
The data labeling device 200 of the embodiment of the application comprises: a manual marking unit 201, used for selecting first data to be marked and second data to be marked from historical call data of a customer service agent and a customer, and sending the first data to be marked to a manual end for manual marking to obtain first marking data, wherein the first marking data comprises a first marking audio and a first marking audio text with emotion labels, each emotion label comprises an emotion mark and the position information of the audio or audio text marked by that emotion mark, and the emotion marks recorded in the emotion labels of the first marking audio and the first marking audio text are identical; a sound feature acquisition unit 202, used for acquiring an emotion marking audio segment in the first marking audio based on the position information in the emotion label of the first marking audio, and acquiring a first sound feature of the emotion marking audio segment; a text feature acquisition unit 203, used for acquiring emotion marking sentence segments of the first marking data based on the position information in the emotion labels of the first marking audio text, and acquiring first text features of the emotion marking sentence segments; a model training unit 204, used for performing emotion recognition based on the first sound features and the first text features through an emotion analysis model to obtain an emotion recognition result, and training the emotion analysis model based on the emotion recognition result and the emotion labels; a segmentation processing unit 205, used for segmenting the second data to be marked into multiple segments of audio data, and obtaining the corresponding audio texts based on the multiple segments of audio data; and an automatic labeling unit 206, used for acquiring second sound features of the audio data and second text features of the text data, performing emotion recognition based on the second sound features and the second text features through the emotion analysis model, and automatically labeling the audio data and the text data with emotion labels based on the recognition results. By means of the device, semi-supervised data labeling is achieved, and labor cost is saved.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application, where the computer device 300 includes a processor 301 with one or more processing cores, a memory 302 with one or more computer readable storage media, and a computer program stored in the memory 302 and capable of running on the processor 301. The processor 301 is electrically connected to the memory 302.
It will be appreciated by those skilled in the art that the computer device structure shown in the figures is not limiting of the computer device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
Processor 301 is a control center of computer device 300 and utilizes various interfaces and lines to connect various portions of the overall computer device 300, and to perform various functions of computer device 300 and process data by running or loading software programs and/or modules stored in memory 302 and invoking data stored in memory 302, thereby performing overall monitoring of computer device 300.
In the embodiment of the present application, the processor 301 in the computer device 300 loads the instructions corresponding to the processes of one or more application programs into the memory 302, and the processor 301 runs the application programs stored in the memory 302, so as to implement the following functions:
Selecting first data to be marked and second data to be marked from historical call data of a customer service agent and a customer, sending the first data to be marked to a manual terminal for manual marking to obtain first marking data, wherein the first marking data comprises first marking audio with an emotion label and first marking audio text, the emotion label comprises emotion marks and position information of audio or audio text marked by the emotion marks, and the emotion marks recorded in the emotion labels of the first marking audio and the first marking audio text are the same;
Acquiring an emotion marking audio segment in the first marked audio based on the position information in the emotion tag of the first marked audio, and acquiring a first sound feature of the emotion marking audio segment;
Acquiring emotion marking sentence segments of the first marking data based on position information in emotion labels of the first marking audio texts, and acquiring first text features of the emotion marking sentence segments;
carrying out emotion recognition based on the first sound features and the first text features through an emotion analysis model to obtain an emotion recognition result, and training the emotion analysis model based on the emotion recognition result and the emotion labels;
performing audio segment segmentation on the second data to be marked to obtain multiple segments of audio data, and obtaining corresponding audio texts based on the multiple segments of audio data;
and acquiring second voice characteristics of the audio data and second text characteristics of the text data, carrying out emotion recognition on the basis of the second voice characteristics and the second text characteristics through an emotion analysis model, and carrying out automatic labeling on emotion labels on the audio data and the text data on the basis of recognition results.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Optionally, the computer device 300 further includes an audio module 303 and a text module 304, where the audio module 303 and the text module 304 are electrically connected to the processor 301, the audio module 303 is configured to receive input audio data, and the text module 304 is configured to display text data corresponding to the input audio data. Those skilled in the art will appreciate that the computer device structure shown in FIG. 3 is not limiting of the computer device and may include more or fewer components than shown, or may be combined with certain components, or a different arrangement of components.
Although not shown in fig. 3, the computer device 300 may also include a display module and other electronic structures, which are not described in detail herein.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
As can be seen from the foregoing, the computer device provided in this embodiment selects first data to be marked and second data to be marked from historical call data of a customer service agent and a customer, and sends the first data to be marked to the manual end for manual marking to obtain first marking data, wherein the first marking data comprises a first marking audio and a first marking audio text with emotion labels, each emotion label comprises an emotion mark and the position information of the audio or audio text marked by that emotion mark, and the emotion marks recorded in the emotion labels of the first marking audio and the first marking audio text are identical; acquires an emotion marking audio segment in the first marking audio based on the position information in the emotion labels of the first marking audio, and acquires a first sound feature of the emotion marking audio segment; acquires emotion marking sentence segments of the first marking data based on the position information in the emotion labels of the first marking audio text, and acquires first text features of the emotion marking sentence segments; performs emotion recognition based on the first sound features and the first text features through an emotion analysis model to obtain an emotion recognition result, and trains the emotion analysis model based on the emotion recognition result and the emotion labels; segments the second data to be marked into multiple segments of audio data and obtains the corresponding audio texts; and acquires second sound features of the audio data and second text features of the text data, performs emotion recognition based on them through the emotion analysis model, and automatically labels the audio data and the text data with emotion labels based on the recognition results. Semi-supervised data labeling is thereby realized, and labor cost is saved.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application provides a computer readable storage medium having stored therein a plurality of computer programs that can be loaded by a processor to perform the steps of any of the data labeling methods provided by the embodiments of the present application. For example, the computer program may perform the steps of:
Selecting first data to be marked and second data to be marked from historical call data of a customer service agent and a customer, sending the first data to be marked to a manual terminal for manual marking to obtain first marking data, wherein the first marking data comprises first marking audio with an emotion label and first marking audio text, the emotion label comprises emotion marks and position information of audio or audio text marked by the emotion marks, and the emotion marks recorded in the emotion labels of the first marking audio and the first marking audio text are the same;
Acquiring an emotion marking audio segment in the first marked audio based on the position information in the emotion tag of the first marked audio, and acquiring a first sound feature of the emotion marking audio segment;
Acquiring emotion marking sentence segments of the first marking data based on position information in emotion labels of the first marking audio texts, and acquiring first text features of the emotion marking sentence segments;
carrying out emotion recognition based on the first sound features and the first text features through an emotion analysis model to obtain an emotion recognition result, and training the emotion analysis model based on the emotion recognition result and the emotion labels;
performing audio segment segmentation on the second data to be marked to obtain multiple segments of audio data, and obtaining corresponding audio texts based on the multiple segments of audio data;
and acquiring second voice characteristics of the audio data and second text characteristics of the text data, carrying out emotion recognition on the basis of the second voice characteristics and the second text characteristics through an emotion analysis model, and carrying out automatic labeling on emotion labels on the audio data and the text data on the basis of recognition results.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Wherein the storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic disk or optical disk, and the like.
The steps of any data labeling method provided by the embodiment of the present application can be executed by the computer program stored in the storage medium, so that the beneficial effects of any data labeling method provided by the embodiment of the present application can be achieved, and detailed descriptions of the foregoing embodiments are omitted.
The foregoing has described in detail the data labeling method, apparatus, computer device and storage medium provided by the embodiments of the present application. Specific examples have been used to illustrate the principles and implementations of the present application, and these examples are only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific implementations and the application scope in light of the ideas of the present application. In summary, the contents of this description should not be construed as limiting the present application.

Claims (7)

1. A method of labeling data, the method comprising:
Selecting first to-be-marked data from historical call data of a customer service agent and a customer of a bank, wherein the first to-be-marked data comprises first to-be-marked audio;
inputting the first audio to be marked into a voice separation model, and separating and marking the first audio to be marked according to voiceprint characteristics of different speakers by the voice separation model;
Inputting the first audio to be marked into a text recognition model, recognizing the semantics of the voice frequency band in the first audio to be marked by the text recognition model, and determining the blank position of the blank frequency band in the first audio to be marked, wherein the blank frequency band is a silent frequency band in the first audio to be marked;
Based on the semantic recognition result and the blank position, obtaining an initial text;
Inputting the initial text into a deep neural network model, determining punctuation mark positions in the blank positions, automatically marking the punctuation marks, and connecting the rest sentences adjacent to the blank positions in front and behind to obtain a first audio text to be marked;
Selecting second data to be marked from the rest historical call data, and sending the processed first audio to be marked and the first audio text to be marked to a manual end for manual marking to obtain first marking data, wherein the first marking data comprises first marking audio with emotion labels and first marking audio text, the emotion labels comprise emotion marks and position information of the audio or audio text marked by the emotion marks, and the emotion marks recorded in the emotion labels of the first marking audio and the first marking audio text are the same;
acquiring an emotion marking audio segment in the first marking audio based on the position information in the emotion label of the first marking audio, and acquiring a first sound characteristic of the emotion marking audio segment;
Acquiring emotion marking sentence segments of the first marking data based on the position information in the emotion label of the first marking audio text, and acquiring first text characteristics of the emotion marking sentence segments;
carrying out emotion recognition based on the first sound feature and the first text feature through an emotion analysis model to obtain an emotion recognition result, and training the emotion analysis model based on the emotion recognition result and the emotion label;
Performing audio segment segmentation on the second data to be marked to obtain multiple segments of audio data, and obtaining corresponding audio texts based on the multiple segments of audio data;
Setting an audio weight and a text weight in the emotion analysis model;
And when the emotion identifications of the second sound characteristic and the second text characteristic of the corresponding audio text of the audio data are recognized to be different, determining the emotion identifications of the second sound characteristic and the second text characteristic based on the weight ratio of the audio weight to the text weight.
2. The data labeling method according to claim 1, wherein acquiring the emotion-labeled audio segment in the first labeled audio based on the position information in the emotion label of the first labeled audio, and acquiring the first sound feature of the emotion-labeled audio segment, comprises:
classifying the emotion-labeled audio segments based on their emotion identifiers; and
acquiring, with a preset audio algorithm, a sound spectrogram of the emotion-labeled audio segments in the same category, and obtaining the first sound feature representing emotion based on the sound spectrogram.
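Claim 2 leaves the "preset audio algorithm" open. One plausible realization is a log-mel spectrogram summarized into a fixed-length vector; the use of librosa and the specific mel parameters below are assumptions for illustration, not part of the claim.

```python
# A possible realization of the "preset audio algorithm" in claim 2:
# a log-mel spectrogram collapsed into a fixed-length feature vector.

import numpy as np
import librosa

def sound_features(wav_path: str, sr: int = 16000, n_mels: int = 64) -> np.ndarray:
    """Compute a simple emotion-oriented sound feature for one labeled segment."""
    y, sr = librosa.load(wav_path, sr=sr)                      # load the emotion-labeled segment
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                         # the "sound spectrogram"
    # Collapse the time axis: per-band mean and standard deviation as a
    # crude stand-in for the first sound feature.
    return np.concatenate([log_mel.mean(axis=1), log_mel.std(axis=1)])
```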
3. The data labeling method according to claim 1, wherein acquiring the emotion-labeled sentence segments of the first labeled data based on the position information in the emotion label of the first labeled audio text, and acquiring the first text feature of the emotion-labeled sentence segments, comprises:
classifying the emotion-labeled sentence segments based on their emotion identifiers; and
acquiring, with a preset text algorithm, the emotion feature words of the emotion-labeled sentence segments in the same category, and obtaining the first text feature representing emotion based on the emotion feature words.
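Claim 3 likewise leaves the "preset text algorithm" open. A hedged sketch using TF-IDF to surface words characteristic of each emotion category is shown below; the use of scikit-learn and simple whitespace tokenization are illustrative assumptions.

```python
# A possible "preset text algorithm" for claim 3: rank the words most
# specific to each emotion category with TF-IDF.

from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer

def emotion_feature_words(segments, top_k=10):
    """segments: iterable of (emotion_id, sentence) pairs.

    Returns {emotion_id: [up to top_k characteristic words]}.
    """
    grouped = defaultdict(list)
    for emotion_id, sentence in segments:      # classify by emotion identifier
        grouped[emotion_id].append(sentence)

    # One "document" per emotion category, so TF-IDF highlights words
    # frequent in that category but rare in the others.
    labels = list(grouped)
    docs = [" ".join(grouped[label]) for label in labels]
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)
    vocab = vectorizer.get_feature_names_out()

    result = {}
    for i, label in enumerate(labels):
        row = tfidf[i].toarray().ravel()
        top = row.argsort()[::-1][:top_k]
        result[label] = [vocab[j] for j in top if row[j] > 0]
    return result
```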
4. The data labeling method according to claim 1, further comprising:
preprocessing the first data to be labeled and the second data to be labeled, wherein the preprocessing comprises noise reduction.
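Claim 4 only requires that preprocessing include noise reduction; it does not name a method. As one hedged example, a zero-phase high-pass filter can strip low-frequency hum from call recordings before labeling. The cutoff frequency and filter order below are illustrative choices, not values taken from the patent.

```python
# One simple noise-reduction option for the preprocessing of claim 4:
# a zero-phase high-pass Butterworth filter.

import numpy as np
from scipy.signal import butter, filtfilt

def reduce_low_frequency_noise(samples: np.ndarray, sr: int,
                               cutoff_hz: float = 100.0,
                               order: int = 4) -> np.ndarray:
    """Remove low-frequency noise from one call recording."""
    b, a = butter(order, cutoff_hz, btype="highpass", fs=sr)
    return filtfilt(b, a, samples)
```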
5. A data labeling device, the device comprising:
a manual labeling unit, configured to select first data to be labeled from historical call data between a bank's customer service agents and customers, wherein the first data to be labeled comprises first audio to be labeled; input the first audio to be labeled into a speech separation model, the speech separation model separating and marking the first audio to be labeled according to the voiceprint features of the different speakers; input the first audio to be labeled into a text recognition model, the text recognition model recognizing the semantics of the speech segments in the first audio to be labeled and determining the blank positions of the blank segments in the first audio to be labeled, wherein a blank segment is a silent segment of the first audio to be labeled; obtain an initial text based on the semantic recognition result and the blank positions; input the initial text into a deep neural network model, determine which blank positions correspond to punctuation marks, automatically insert those punctuation marks, and join the sentences before and after the remaining blank positions to obtain a first audio text to be labeled; and select second data to be labeled from the remaining historical call data, and send the processed first audio to be labeled and the first audio text to be labeled to a manual labeling terminal for manual labeling to obtain first labeled data, wherein the first labeled data comprises first labeled audio and first labeled audio text each carrying emotion labels, an emotion label comprises an emotion identifier and the position information of the audio or audio text marked by that identifier, and the emotion identifiers recorded in the emotion labels of the first labeled audio and the first labeled audio text are the same;
a sound feature acquisition unit, configured to acquire an emotion-labeled audio segment in the first labeled audio based on the position information in the emotion label of the first labeled audio, and acquire a first sound feature of the emotion-labeled audio segment;
a text feature acquisition unit, configured to acquire emotion-labeled sentence segments of the first labeled data based on the position information in the emotion label of the first labeled audio text, and acquire a first text feature of the emotion-labeled sentence segments;
a model training unit, configured to perform emotion recognition based on the first sound feature and the first text feature through an emotion analysis model to obtain an emotion recognition result, and train the emotion analysis model based on the emotion recognition result and the emotion labels;
a segmentation processing unit, configured to segment the second data to be labeled into multiple pieces of audio data, and obtain the corresponding audio texts based on the multiple pieces of audio data; and
an automatic labeling unit, configured to set an audio weight and a text weight in the emotion analysis model, and, when the emotion identifier recognized from the second sound feature of a piece of the audio data differs from the emotion identifier recognized from the second text feature of the corresponding audio text, determine the final emotion identifier based on the weight ratio of the audio weight to the text weight.
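Claim 5 mirrors claim 1 as a set of cooperating units. Purely as an illustration of that structure, the device could be wired as sketched below; the unit interfaces are hypothetical and are not defined by the patent.

```python
# Hypothetical wiring of the units named in claim 5. The method names on
# the injected units are assumptions made for illustration only.

class DataLabelingDevice:
    def __init__(self, manual_labeling_unit, sound_feature_unit,
                 text_feature_unit, model_training_unit,
                 segmentation_unit, automatic_labeling_unit):
        self.manual_labeling_unit = manual_labeling_unit
        self.sound_feature_unit = sound_feature_unit
        self.text_feature_unit = text_feature_unit
        self.model_training_unit = model_training_unit
        self.segmentation_unit = segmentation_unit
        self.automatic_labeling_unit = automatic_labeling_unit

    def run(self, historical_calls):
        # First batch: manual labeling produces training material.
        first_labeled, second_to_label = self.manual_labeling_unit.label(historical_calls)
        sound_feats = self.sound_feature_unit.extract(first_labeled)
        text_feats = self.text_feature_unit.extract(first_labeled)
        model = self.model_training_unit.train(sound_feats, text_feats, first_labeled)
        # Second batch: segmented and then labeled automatically.
        segments = self.segmentation_unit.split(second_to_label)
        return self.automatic_labeling_unit.label(model, segments)
```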
6. A computer device, comprising a memory for storing instructions and data and a processor for performing the data labeling method of any one of claims 1 to 4.
7. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the data labeling method of any one of claims 1 to 4.
CN202210731923.9A 2022-06-25 2022-06-25 Data labeling method, device, computer equipment and storage medium Active CN115063155B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210731923.9A CN115063155B (en) 2022-06-25 2022-06-25 Data labeling method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115063155A CN115063155A (en) 2022-09-16
CN115063155B true CN115063155B (en) 2024-05-24

Family

ID=83202414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210731923.9A Active CN115063155B (en) 2022-06-25 2022-06-25 Data labeling method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115063155B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620722B (en) * 2022-12-15 2023-03-31 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800720A (en) * 2019-01-23 2019-05-24 平安科技(深圳)有限公司 Emotion identification model training method, Emotion identification method, apparatus, equipment and storage medium
CN112527994A (en) * 2020-12-18 2021-03-19 平安银行股份有限公司 Emotion analysis method, emotion analysis device, emotion analysis equipment and readable storage medium
CN112966082A (en) * 2021-03-05 2021-06-15 北京百度网讯科技有限公司 Audio quality inspection method, device, equipment and storage medium
CN112992147A (en) * 2021-02-26 2021-06-18 平安科技(深圳)有限公司 Voice processing method, device, computer equipment and storage medium
CN112990301A (en) * 2021-03-10 2021-06-18 深圳市声扬科技有限公司 Emotion data annotation method and device, computer equipment and storage medium
CN114218427A (en) * 2021-12-13 2022-03-22 平安银行股份有限公司 Voice quality inspection analysis method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN112804400B (en) Customer service call voice quality inspection method and device, electronic equipment and storage medium
US10657969B2 (en) Identity verification method and apparatus based on voiceprint
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
CN108074576B (en) Speaker role separation method and system under interrogation scene
CN111128223B (en) Text information-based auxiliary speaker separation method and related device
CN105096941B (en) Audio recognition method and device
CN110473566A (en) Audio separation method, device, electronic equipment and computer readable storage medium
CN107305541A (en) Speech recognition text segmentation method and device
CN107657017A (en) Method and apparatus for providing voice service
CN108447471A (en) Audio recognition method and speech recognition equipment
CN108257592A (en) A kind of voice dividing method and system based on shot and long term memory models
CN112233680B (en) Speaker character recognition method, speaker character recognition device, electronic equipment and storage medium
Kopparapu Non-linguistic analysis of call center conversations
CN112966082A (en) Audio quality inspection method, device, equipment and storage medium
CN110782902A (en) Audio data determination method, apparatus, device and medium
CN115063155B (en) Data labeling method, device, computer equipment and storage medium
CN110797032A (en) Voiceprint database establishing method and voiceprint identification method
CN114420169B (en) Emotion recognition method and device and robot
CN113505606B (en) Training information acquisition method and device, electronic equipment and storage medium
CN113255362A (en) Method and device for filtering and identifying human voice, electronic device and storage medium
CN115527551A (en) Voice annotation quality evaluation method and device, electronic equipment and storage medium
CN115022471A (en) Intelligent robot voice interaction system and method
CN114974255A (en) Hotel scene-based voiceprint recognition method, system, equipment and storage medium
CN117392984A (en) Voice recognition method and device
CN117976001A (en) Method, system, electronic equipment and storage medium for detecting quality of voice book

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant