CN115063155A - Data labeling method and device, computer equipment and storage medium - Google Patents

Data labeling method and device, computer equipment and storage medium

Info

Publication number
CN115063155A
CN115063155A · Application CN202210731923.9A
Authority
CN
China
Prior art keywords
emotion
audio
data
text
labeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210731923.9A
Other languages
Chinese (zh)
Inventor
陈杭
陈子意
朱益兴
于欣璐
李骁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Bank Co Ltd filed Critical Ping An Bank Co Ltd
Priority to CN202210731923.9A
Publication of CN115063155A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/01Customer relationship services
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02Banking, e.g. interest calculation or account maintenance
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/26Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices

Abstract

The embodiment of the application discloses a data labeling method and device, computer equipment and a storage medium. The method comprises: selecting first data to be labeled and second data to be labeled from historical call data of a bank customer service seat and a customer; manually labeling the first data to be labeled to obtain a first labeled audio with an emotion label and a first labeled audio text; acquiring a first sound feature and a first text feature based on the positions of the emotion labels in the audio and the text; performing emotion recognition on the first sound feature and the first text feature through an emotion analysis model, and training the model based on the recognition result and the emotion identifier; segmenting the second data to be labeled and acquiring a second sound feature and a second text feature based on the segmentation result; and performing emotion recognition and automatic labeling of emotion labels through the emotion analysis model based on the second sound feature and the second text feature. In this way, semi-supervised data labeling is realized and labor cost is saved.

Description

Data labeling method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data annotation method and apparatus, a computer device, and a storage medium.
Background
With the development of the social economy and of financial technology, people's requirements on the service quality of bank customer service seats are getting higher and higher. Therefore, a bank can set up a corresponding monitoring model: during a call, in addition to the conversation between the customer service seat and the customer, the monitoring model collects information input from the customer side so as to analyze and judge the customer's emotion, and at the same time gives corresponding prompts to the customer service seat according to the analysis result, so as to avoid the customer's emotion being affected by problems such as the customer service seat's lack of experience or personal mood, which would in turn affect the call quality.
When the monitoring model is used, model training is required to improve its analysis and judgment capability, and model training needs a training set, which is usually obtained from historical call data.
At present, part of the historical call data is selected from a database and manually labeled to obtain the training set, and this approach consumes a large amount of labor cost.
Disclosure of Invention
Embodiments of the present application provide a data annotation method, apparatus, computer device, and storage medium, so as to solve the problems in the background art.
In a first aspect, an embodiment of the present application provides a data annotation method, where the method includes:
selecting first data to be labeled and second data to be labeled from historical call data of a bank customer service seat and a customer, sending the first data to be labeled to a manual end for manual labeling to obtain first labeled data, wherein the first labeled data comprises first labeled audio with an emotion label and first labeled audio text, the emotion label comprises an emotion identifier and position information of the audio or audio text labeled by the emotion identifier, and the emotion identifiers recorded in the emotion labels of the first labeled audio and the first labeled audio text are the same;
acquiring an emotion label audio segment in the first label audio based on the position information in the emotion label of the first label audio, and acquiring a first sound characteristic of the emotion label audio segment;
acquiring an emotion labeling sentence segment of the first labeling data based on position information in an emotion label of the first labeling audio text, and acquiring a first text characteristic of the emotion labeling sentence segment;
performing emotion recognition on the basis of the first sound characteristic and the first text characteristic through an emotion analysis model to obtain an emotion recognition result, and training the emotion analysis model on the basis of the emotion recognition result and the emotion label;
carrying out audio segment segmentation on the second data to be labeled to obtain a plurality of segments of audio data, and obtaining a corresponding audio text based on the plurality of segments of audio data;
and acquiring a second sound characteristic of the audio data and a second text characteristic of the text data, performing emotion recognition on the basis of the second sound characteristic and the second text characteristic through the emotion analysis model, and performing automatic annotation of emotion labels on the audio data and the text data on the basis of recognition results.
In some embodiments, the obtaining an emotional tag audio segment in the first tagged audio based on the location information in the emotional tag of the first tagged audio, obtaining a first sound feature of the emotional tag audio segment, comprises:
classifying the emotion labeled audio segment based on the emotion identification;
and acquiring sound frequency spectrograms of the emotion marking audio bands in the same category by using a preset audio algorithm, and acquiring a first sound characteristic for representing emotion based on the sound frequency spectrograms.
In some embodiments, the obtaining an emotion labeling sentence segment of the first labeling data based on position information in an emotion label of the first labeled audio text, and obtaining a first text feature of the emotion labeling sentence segment, includes:
classifying the emotion labeling sentence segments based on the emotion identifications;
and acquiring emotion characteristic words of the emotion labeling sentence segments in the same category by using a preset text algorithm, and acquiring first text characteristics for representing emotion based on the emotion characteristic words.
In some embodiments, the obtaining a second sound feature of the audio data and a second text feature of the text data, performing emotion recognition based on the second sound feature and the second text feature through the emotion analysis model, and labeling emotion labels for the audio data and the text data based on recognition results includes:
setting audio weight and text weight in the emotion analysis model;
and when the second sound characteristic and the emotion identification of the second text characteristic corresponding to the second sound characteristic are different, determining the emotion identification of the second sound characteristic and the second text characteristic corresponding to the second sound characteristic based on the weight ratio of the audio weight and the text weight.
In some embodiments, the selecting first to-be-labeled data and second to-be-labeled data from historical call data of a bank customer service agent and a customer, and sending the first to-be-labeled data to a manual end for manual labeling to obtain first labeled data, where the first labeled data includes a first labeled audio with an emotion label and a first labeled audio text, the emotion label includes an emotion identifier and position information of an audio or audio text labeled by the emotion identifier, and the emotion identifiers recorded in the emotion labels of the first labeled audio and the first labeled audio text are the same, includes:
selecting first data to be labeled from historical call data of a bank customer service seat and a customer, wherein the first data to be labeled comprises first audio to be labeled;
inputting the first audio to be labeled into a voice separation model, wherein the voice separation model performs separation and labeling processing on the first audio to be labeled according to the voiceprint characteristics of different speakers;
inputting the processed first audio to be labeled into a text recognition model to obtain a corresponding first audio text to be labeled;
and selecting second data to be marked from the rest historical call data, and sending the processed first audio to be marked and the first audio text to be marked to a manual end for manual marking.
In some embodiments, the inputting the processed first audio to be annotated into a text recognition model to obtain a first text to be annotated corresponding to the first audio to be annotated includes:
inputting the first audio to be labeled into a text recognition model, wherein the text recognition model recognizes the semantics of the voiced audio segment in the first audio to be labeled and determines the blank position of the blank frequency band in the first audio to be labeled, and the blank frequency band is a silent frequency band in the first audio to be labeled;
obtaining an initial text based on the semantic recognition result and the blank position;
inputting the initial text into a deep neural network model, determining the position of a punctuation mark in the blank position, automatically marking the punctuation mark, and connecting the adjacent sentences in front of and behind the remaining blank position to obtain a first text to be marked.
In some embodiments, the selecting a first to-be-labeled data and a second to-be-labeled data from historical conversation data of a bank customer service agent and a customer, sending the first to-be-labeled data to a manual end for manual labeling, and obtaining a first labeled data, where the first labeled data includes a first labeled audio with an emotion label and a first labeled audio text, includes:
and preprocessing the first data to be marked and the second data to be marked, wherein the preprocessing comprises noise reduction processing.
In a second aspect, an embodiment of the present application provides a data annotation device, where the device includes:
the manual labeling unit is used for selecting first data to be labeled and second data to be labeled from historical call data of a bank customer service seat and a customer, and sending the first data to be labeled to a manual end for manual labeling to obtain first labeled data, wherein the first labeled data comprises first labeled audio with an emotion label and a first labeled audio text, the emotion label comprises an emotion identifier and position information of the audio or audio text labeled by the emotion identifier, and the emotion identifiers recorded in the emotion labels of the first labeled audio and the first labeled audio text are the same;
a sound characteristic obtaining unit, configured to obtain an emotion tagged audio segment in the first tagged audio based on position information in an emotion tag of the first tagged audio, and obtain a first sound characteristic of the emotion tagged audio segment;
a text feature obtaining unit, configured to obtain an emotion markup sentence segment of the first markup data based on position information in an emotion tag of the first markup audio text, and obtain a first text feature of the emotion markup sentence segment;
the model training unit is used for carrying out emotion recognition on the basis of the first sound characteristic and the first text characteristic through an emotion analysis model to obtain an emotion recognition result, and training the emotion analysis model on the basis of the emotion recognition result and the emotion label;
the segmentation processing unit is used for segmenting audio segments of the second data to be labeled to obtain multiple segments of audio data and obtaining corresponding audio texts based on the multiple segments of audio data;
and the automatic labeling unit is used for acquiring a second sound characteristic of the audio data and a second text characteristic of the text data, performing emotion recognition on the basis of the second sound characteristic and the second text characteristic through the emotion analysis model, and performing automatic labeling of emotion labels on the audio data and the text data on the basis of a recognition result.
In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory and a processor, where the memory is used to store instructions and data, and the processor is used to execute the data annotation method described above.
In a fourth aspect, an embodiment of the present application further provides a storage medium, where a plurality of instructions are stored in the storage medium, and the instructions are adapted to be loaded by a processor to perform the data annotation method described above.
According to the data labeling method in the embodiment of the application, first data to be labeled and second data to be labeled are selected from historical call data of a bank customer service seat and a customer, the first data to be labeled is sent to a manual end for manual labeling, features are obtained from the results of the manual labeling, the model is trained based on these features to obtain a trained model, and the second data to be labeled is automatically recognized and labeled by the trained model. A semi-supervised data labeling mode is thereby achieved, and labor cost is saved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a data annotation method according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a data annotation device according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a data labeling method, a data labeling device, computer equipment and a storage medium, wherein a model is trained through manually labeled data, and the trained model automatically identifies and labels the data to realize a semi-supervised data labeling mode.
Referring to fig. 1, fig. 1 is a flowchart of a data annotation method according to an embodiment of the present application, where the method includes the following steps:
101. Selecting first data to be labeled and second data to be labeled from historical call data of a bank customer service seat and a customer, and sending the first data to be labeled to a manual end for manual labeling to obtain first labeled data, where the first labeled data includes first labeled audio with an emotion label and a first labeled audio text, the emotion label includes an emotion identifier and position information of the audio or the audio text labeled by the emotion identifier, and the emotion identifiers recorded in the emotion labels of the first labeled audio and the first labeled audio text are the same.
When a bank customer service seat and a customer are on a call, the system usually records and stores the corresponding data. Such data can be used as historical call data of the bank customer service seat and the customer; the historical call data is usually stored in a database and can be retrieved from the database when needed.
When selecting the first data to be labeled and the second data to be labeled, the first data to be labeled can be selected from the historical call data and sent to the manual end first, and the second data to be labeled can then be selected from the remaining historical call data. Alternatively, the first data to be labeled and the second data to be labeled can be selected from the historical call data directly.
After the first data to be labeled and the second data to be labeled are selected, preprocessing is performed on the first data to be labeled and the second data to be labeled, where the preprocessing may include noise reduction processing, and may also include silence filtering processing, which is not limited herein.
It can be understood that each piece of the first data to be labeled is the data of one call between a bank customer service seat and a customer, and likewise each piece of the second data to be labeled is the data of one call between a bank customer service seat and a customer.
After the first data to be labeled is sent to the manual end, a worker at the manual end manually labels it; during manual labeling, the worker can label all segments of the first data to be labeled or select only part of the segments to label.
The first to-be-labeled data after the artificial labeling is first labeled data, the first labeled data comprises first labeled audio and first labeled audio texts, the first labeled audio and the first labeled audio texts both have labeled emotion labels, and the emotion labels are used for representing the emotion of the user corresponding to the first to-be-labeled data.
Optionally, the first to-be-labeled data includes a first to-be-labeled audio, the first to-be-labeled audio is sent to a worker at the manual end for manual labeling, a first labeled audio is obtained, and then text translation is performed on the first labeled audio, so that a first labeled audio text is obtained.
Optionally, the first to-be-labeled data includes a first to-be-labeled audio and a corresponding first to-be-labeled audio text, and the first to-be-labeled audio text are both subjected to manual labeling by a worker at the manual end, so as to obtain a first labeled audio and a first labeled audio text.
In some embodiments, first data to be labeled is selected from the historical call data of a bank customer service seat and a customer, where the first data to be labeled includes first audio to be labeled. The first audio to be labeled is input into a voice separation model, and the voice separation model separates and marks the first audio to be labeled according to the voiceprint features of different speakers. The processed first audio to be labeled is input into a text recognition model to obtain a corresponding first audio text to be labeled. Second data to be labeled is then selected from the remaining historical call data, and the processed first audio to be labeled and the first audio text to be labeled are sent to the manual end for manual labeling.
During a call, the data stored in the system at least includes the audio of the bank customer service seat and the audio of the customer, so in order to improve the correctness of the labeling, audio separation is preferably performed on the data first. When audio separation is carried out, a sound spectrogram can be generated by a model; the spectrogram is segmented, and the segments are input into a recognition model to obtain the voiceprint features of different speakers; the speakers in the audio are identified according to the voiceprint features; the audio is separated according to the recognition results to obtain the audio corresponding to each speaker; and the separated audio is marked so that the speakers can be distinguished.
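The patent does not give code for its voice separation model; the following is only a minimal sketch, assuming librosa and scikit-learn are available, of one common way to approximate this step: windowed MFCC features serve as a crude voiceprint and are clustered into two speakers. All function and parameter names are illustrative assumptions, not the patent's implementation.

```python
# Minimal sketch (not the patent's actual voice separation model): cluster
# windowed MFCC "voiceprint" features into two speakers and tag each window.
import librosa
import numpy as np
from sklearn.cluster import KMeans

def separate_speakers(wav_path, n_speakers=2, win_s=1.0):
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(win_s * sr)
    windows = [y[i:i + hop] for i in range(0, len(y) - hop, hop)]
    # One MFCC-based feature vector per window, used as a crude voiceprint.
    feats = np.array([librosa.feature.mfcc(y=w, sr=sr, n_mfcc=20).mean(axis=1)
                      for w in windows])
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(feats)
    # Return (start_time, end_time, speaker_id) triples for later marking.
    return [(i * win_s, (i + 1) * win_s, int(lab)) for i, lab in enumerate(labels)]
```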
Further, the first audio to be labeled is input into a text recognition model. The text recognition model recognizes the semantics of the voiced audio segments in the first audio to be labeled and determines the blank positions of the blank frequency bands (i.e., the silent bands) in the first audio to be labeled. An initial text is obtained based on the semantic recognition results and the blank positions. The positions of punctuation marks among the blank positions are then determined, the punctuation marks are automatically inserted, and the sentences adjacent to the remaining blank positions are connected to obtain the first text to be labeled.
When setting the emotion label, the emotion label may include position information in the first labeled audio and the first labeled audio text, as well as an emotion identifier, and the emotion identifier may be set to excited emotion, neutral emotion, low emotion, and the like. Since the first labeled audio corresponds to the first labeled audio text, the emotion identifiers of the semantically and positionally corresponding sentence segments of the first labeled audio and the first labeled audio text are the same.
In the emotion labels of the first tagged audio and the first tagged audio text, one sentence segment may be labeled with one emotion label, and one sentence segment may also be labeled with a plurality of emotion labels, which is not limited herein.
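For illustration only, an emotion label as described above (an emotion identifier plus the position information of the labeled audio or audio text) could be represented by a structure along the following lines; the field names are assumptions, not the patent's data format.

```python
from dataclasses import dataclass

@dataclass
class EmotionTag:
    """Illustrative structure for an emotion label: an emotion identifier plus
    the position (time range) of the labeled audio segment or text segment."""
    emotion_id: str        # e.g. "excited", "neutral", "low"
    start_s: float         # start of the labeled range, in seconds
    end_s: float           # end of the labeled range, in seconds
    target: str = "audio"  # "audio" or "text", i.e. what the label is attached to
```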
102. And acquiring an emotion label audio segment in the first label audio based on the position information in the emotion label of the first label audio, and acquiring a first sound characteristic of the emotion label audio segment.
In the embodiment of the application, the position information of the emotion label in the audio and an emotion identifier are recorded in the emotion label. The position information can be represented by time, that is, the emotion identifier is attached to the audio segment corresponding to a certain time range in the audio and is used for representing the person's emotion within that time range; the emotion identifier can be set to excited emotion, neutral emotion, low emotion, and the like.
Audio segments of the first labeled audio are extracted based on the position information of the emotion labels; each extracted audio segment corresponds to an emotion label, and these audio segments are the emotion labeled audio segments.
In addition to the above extraction method, the audio segments that are not labeled with an emotion label in the first labeled audio can also be selected and filtered out, and the remaining audio segments are then the emotion labeled audio segments.
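A minimal sketch of the first extraction mode (cutting out the audio segment recorded in each label's position information), assuming the labels carry start and end times in seconds as in the hypothetical EmotionTag structure above:

```python
import librosa

def extract_tagged_segments(wav_path, tags):
    """Cut out the audio segment covered by each emotion label's time range."""
    y, sr = librosa.load(wav_path, sr=16000)
    segments = []
    for tag in tags:
        start, end = int(tag.start_s * sr), int(tag.end_s * sr)
        segments.append((tag.emotion_id, y[start:end]))
    return segments  # list of (emotion identifier, waveform slice)
```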
In some embodiments, the emotion markup audio segments are classified based on emotion identifications, sound frequency spectrograms of the emotion markup audio segments of the same category are obtained by using a preset audio algorithm, and first sound features used for representing emotions are obtained based on the sound frequency spectrograms.
For example, the emotion identification is provided with categories such as excited emotion, neutral emotion and low emotion, the emotion labeled audio frequency segment is classified according to the identification categories to obtain excited emotion identification categories, neutral emotion identification categories and low emotion identification categories, the sound frequency spectrograms of the emotion labeled audio frequency segments of the three categories are respectively obtained, and the first sound feature corresponding to each category is obtained by extracting the feature of the sound frequency spectrogram of each category, so that the accuracy of model training is facilitated.
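The patent does not specify the "preset audio algorithm"; as one hedged example, a mel spectrogram could be computed per emotion labeled audio segment and summarized into a fixed-length first sound feature, grouped by emotion category. The function names and feature choice are assumptions.

```python
import librosa
import numpy as np

def sound_feature(segment, sr=16000):
    """Mel spectrogram summarized into a fixed-length vector (illustrative only)."""
    mel = librosa.feature.melspectrogram(y=segment, sr=sr, n_mels=64)
    log_mel = librosa.power_to_db(mel)
    # Mean and std over time as a crude emotion-bearing sound feature.
    return np.concatenate([log_mel.mean(axis=1), log_mel.std(axis=1)])

def features_by_category(tagged_segments, sr=16000):
    """Group first sound features by emotion identifier (e.g. excited/neutral/low)."""
    grouped = {}
    for emotion_id, segment in tagged_segments:
        grouped.setdefault(emotion_id, []).append(sound_feature(segment, sr))
    return grouped
```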
It can be understood that a plurality of emotion labels may be marked in a segment of audio segment, when a plurality of emotion labels are marked in a segment of audio segment, an emotion identifier corresponding to each emotion label is determined, and a final emotion identifier of the segment of audio segment is determined through setting of priority of the emotion identifiers.
For example, a first emotion label, a second emotion label and a third emotion label are marked in an audio segment; the emotion identifier corresponding to the first emotion label is a neutral emotion, the emotion identifier corresponding to the second emotion label is an excited emotion, and the emotion identifier corresponding to the third emotion label is a neutral emotion; the priorities of the emotion identifiers set by the system from high to low are excited emotion, low emotion and neutral emotion, so the final emotion identifier of the audio segment is the excited emotion.
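A small sketch of this priority rule; the ordering (excited above low above neutral) is taken from the example, while the function itself is only illustrative.

```python
# Priority from high to low, as in the example: excited > low > neutral.
PRIORITY = {"excited": 2, "low": 1, "neutral": 0}

def final_emotion(emotion_ids):
    """Pick the highest-priority emotion identifier among a segment's labels."""
    return max(emotion_ids, key=lambda e: PRIORITY.get(e, -1))

# final_emotion(["neutral", "excited", "neutral"]) -> "excited"
```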
103. And acquiring the emotion marking sentence segment of the first marking data and acquiring a first text characteristic of the emotion marking sentence segment based on the position information in the emotion label of the first marking audio text.
In the embodiment of the application, the position information of the emotion label in the audio text and an emotion identifier are recorded in the emotion label. The position information can be represented by time, that is, the emotion identifier is attached to the audio text segment corresponding to a certain time range in the audio text and is used for representing the person's emotion within that time range; the emotion identifier can be set to excited emotion, neutral emotion, low emotion, and the like.
Audio text segments of the first labeled audio text are extracted based on the position information of the emotion labels; each extracted audio text segment corresponds to an emotion label, and these audio text segments are the emotion labeling sentence segments.
In addition to the above extraction method, the audio text segments that are not labeled with an emotion label in the first labeled audio text may also be selected and filtered out, and the remaining audio text segments are then the emotion labeling sentence segments.
In some embodiments, the emotion labeling sentence segments are classified based on emotion identifications, emotion feature words of the emotion labeling sentence segments in the same category are obtained by using a preset text algorithm, and first text features used for representing emotions are obtained based on the emotion feature words.
For example, the emotion identifiers are divided into categories such as excited emotion, neutral emotion and low emotion; the emotion labeling sentence segments are classified according to the identifier categories to obtain the excited, neutral and low categories; the emotion feature words of the emotion labeling sentence segments of the three categories are obtained respectively; and the first text feature corresponding to each category is obtained by extracting features from the emotion feature words of each category, which benefits the accuracy of model training.
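The "preset text algorithm" is likewise unspecified; a minimal sketch, assuming the sentence segments are already word-tokenized into whitespace-separated strings, could score candidate emotion feature words per category with TF-IDF and keep the top-scoring words. Everything here is an illustrative assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def emotion_feature_words(sentences_by_category, top_k=20):
    """For each emotion category, pick the top TF-IDF words as emotion feature words.
    Assumes each sentence segment is a whitespace-separated string of tokens."""
    result = {}
    for emotion_id, sentences in sentences_by_category.items():
        vec = TfidfVectorizer(token_pattern=r"\S+")
        tfidf = vec.fit_transform(sentences)
        scores = tfidf.sum(axis=0).A1          # total weight of each word in this category
        words = vec.get_feature_names_out()
        top = scores.argsort()[::-1][:top_k]
        result[emotion_id] = [words[i] for i in top]
    return result
```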
It can be understood that a plurality of emotion labels may be marked in a section of text, when a plurality of emotion labels are marked in a section of text, an emotion identifier corresponding to each emotion label is determined, and a final emotion identifier of the section of text is determined by setting a priority of the emotion identifiers.
For example, a first emotion label, a second emotion label and a third emotion label are marked in a piece of text; the emotion identifier corresponding to the first emotion label is a neutral emotion, the emotion identifier corresponding to the second emotion label is an excited emotion, and the emotion identifier corresponding to the third emotion label is a neutral emotion; the priorities of the emotion identifiers set by the system from high to low are excited emotion, low emotion and neutral emotion, so the final emotion identifier of the piece of text is the excited emotion.
104. And performing emotion recognition on the basis of the first sound characteristic and the first text characteristic through an emotion analysis model to obtain an emotion recognition result, and training the emotion analysis model on the basis of the emotion recognition result and the emotion label.
In practice, based on a preset emotion analysis model, the first sound feature and the first text feature are used as training data and input into the emotion analysis model for training, so that the trained emotion analysis model has the corresponding recognition capability.
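As an illustration of this training step (not the patent's actual emotion analysis model), the first sound feature and first text feature could be concatenated and fed to a simple classifier, with the emotion identifiers from the labels acting as supervision; the helper below is a hypothetical sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_emotion_model(sound_feats, text_feats, emotion_ids):
    """Train a toy emotion analysis model on concatenated sound + text features.
    sound_feats and text_feats are lists of 1-D arrays aligned with emotion_ids."""
    X = np.array([np.concatenate([s, t]) for s, t in zip(sound_feats, text_feats)])
    model = LogisticRegression(max_iter=1000)
    model.fit(X, emotion_ids)   # emotion identifiers act as the training labels
    return model
```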
In some embodiments, an audio weight and a text weight are set in the emotion analysis model, and when the emotion identifications of the second sound feature and the corresponding second text feature are different, the emotion identifications of the second sound feature and the corresponding second text feature are determined based on the weight ratio of the audio weight and the text weight.
For example, the audio weight is set to be greater than the text weight in the emotion analysis model, and the emotion identifier recognized by the second sound feature is an excited emotion, and the emotion identifier recognized by the corresponding second text feature is a neutral emotion, then the emotion identifier corresponding to the excited emotion is used as the emotion identifier of the second sound feature and the corresponding second text feature.
The weights are set in order to choose the emotion identifier when the recognition results of a sound feature and its corresponding text feature are inconsistent, because when a person speaks, sound features such as tone and intonation may reflect the person's emotion better than the words themselves.
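A sketch of this weighting rule, with illustrative numbers (audio weighted above text, as in the example above); the function name and default weights are assumptions.

```python
def resolve_emotion(audio_emotion, text_emotion, audio_weight=0.6, text_weight=0.4):
    """When the sound feature and its corresponding text feature disagree,
    keep the emotion identifier from the side with the larger weight."""
    if audio_emotion == text_emotion:
        return audio_emotion
    return audio_emotion if audio_weight >= text_weight else text_emotion

# resolve_emotion("excited", "neutral") -> "excited" with the default weights
```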
105. And carrying out audio segment segmentation on the second data to be labeled to obtain a plurality of segments of audio data, and obtaining a corresponding audio text based on the plurality of segments of audio data.
Since a person usually speaks one sentence at a time, especially during a call with back-and-forth communication, when audio segmentation is performed on the second data to be labeled, continuous audio segments within a preset time interval can be used as the basis for segmentation.
After obtaining multiple segments of audio data, corresponding audio texts can be obtained through technologies such as speech recognition and speech translation.
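A minimal sketch of such pause-based segmentation, assuming librosa: non-silent intervals separated by pauses longer than a preset duration are kept as separate utterances. The speech-to-text step is left to an external recognizer, since the patent does not name a specific one.

```python
import librosa

def split_by_pauses(wav_path, top_db=30, min_pause_s=0.5):
    """Split audio into utterances: non-silent intervals separated by pauses
    longer than min_pause_s are kept as separate segments (illustrative)."""
    y, sr = librosa.load(wav_path, sr=16000)
    intervals = librosa.effects.split(y, top_db=top_db)  # non-silent [start, end) samples
    segments, current = [], None
    for start, end in intervals:
        if current is None:
            current = [start, end]
        elif (start - current[1]) / sr < min_pause_s:
            current[1] = end          # gap shorter than the pause threshold: merge
        else:
            segments.append(y[current[0]:current[1]])
            current = [start, end]
    if current is not None:
        segments.append(y[current[0]:current[1]])
    return segments  # each element can then be passed to a speech recognizer
```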
106. And acquiring a second sound characteristic of the audio data and a second text characteristic of the text data, performing emotion recognition on the audio data and the text data through the emotion analysis model based on the second sound characteristic and the second text characteristic, and performing automatic annotation of emotion labels on the audio data and the text data based on recognition results.
The second sound feature may be obtained in the same way as the first sound feature, and correspondingly, the second text feature may be obtained in the same way as the first text feature.
And after the second sound characteristic and the second text characteristic are obtained, inputting the second sound characteristic and the second text characteristic into an emotion analysis model, and identifying and automatically labeling the characteristics through the trained emotion analysis model.
Optionally, if there are second sound features and second text features that the emotion analysis model fails to recognize and label, these features can be processed by the manual end.
A timer task can be set in the emotion analysis model to periodically detect the stored second sound features and second text features that have not been recognized and labeled, and to send these features to the manual end for processing.
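As one hedged way to realize such a timer task (the patent does not specify the mechanism), a periodic job could collect items the model left unlabeled and push them to a manual-review queue; the two hook functions here are hypothetical.

```python
import threading

def schedule_manual_review(fetch_unlabeled, send_to_manual_end, interval_s=3600.0):
    """Periodically collect stored, still-unlabeled items and push them to the
    manual end. fetch_unlabeled and send_to_manual_end are hypothetical hooks."""
    def tick():
        pending = fetch_unlabeled()
        if pending:
            send_to_manual_end(pending)
        # Re-arm the timer so the check repeats every interval_s seconds.
        threading.Timer(interval_s, tick).start()

    threading.Timer(interval_s, tick).start()
```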
The data annotation method comprises the steps of selecting first data to be annotated and second data to be annotated from historical call data of a bank customer service seat and a customer, sending the first data to be annotated to a manual end for manual annotation to obtain first annotation data, wherein the first annotation data comprises first annotation audio with emotion labels and first annotation audio texts, the emotion labels comprise emotion identifiers and position information of audio or audio texts annotated by the emotion identifiers, and the emotion identifiers recorded in the emotion labels of the first annotation audio and the first annotation audio texts are the same; acquiring an emotion label audio segment in the first label audio based on the position information in the emotion label of the first label audio, and acquiring first sound characteristics of the emotion label audio segment; acquiring an emotion marking sentence segment of first marking data and acquiring a first text characteristic of the emotion marking sentence segment based on position information in an emotion label of the first marking audio text; performing emotion recognition on the basis of the first sound characteristic and the first text characteristic through an emotion analysis model to obtain an emotion recognition result, and training the emotion analysis model on the basis of the emotion recognition result and an emotion label; carrying out audio segment segmentation on the second data to be labeled to obtain a plurality of segments of audio data, and obtaining corresponding audio texts based on the plurality of segments of audio data; and acquiring a second sound characteristic of the audio data and a second text characteristic of the text data, performing emotion recognition on the basis of the second sound characteristic and the second text characteristic through the emotion analysis model, and performing automatic annotation of emotion labels on the audio data and the text data on the basis of recognition results. Through the mode, semi-supervised data labeling is realized, and labor cost is saved.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a data annotation device 200 according to an embodiment of the present application, which includes the following units:
201. And the manual labeling unit is used for selecting first data to be labeled and second data to be labeled from historical call data of a bank customer service seat and a customer, sending the first data to be labeled to a manual end for manual labeling to obtain first labeled data, wherein the first labeled data comprises first labeled audio with an emotion label and a first labeled audio text, the emotion label comprises an emotion identifier and position information of the audio or the audio text labeled by the emotion identifier, and the emotion identifiers recorded in the emotion labels of the first labeled audio and the first labeled audio text are the same.
202. And the sound characteristic acquisition unit is used for acquiring the emotion label audio segment in the first label audio based on the position information in the emotion label of the first label audio and acquiring the first sound characteristic of the emotion label audio segment.
203. And the text characteristic acquisition unit is used for acquiring the emotion marking sentence segment of the first marking data and acquiring the first text characteristic of the emotion marking sentence segment based on the position information in the emotion label of the first marking audio text.
204. And the model training unit is used for performing emotion recognition on the basis of the first sound characteristic and the first text characteristic through the emotion analysis model to obtain an emotion recognition result, and training the emotion analysis model on the basis of the emotion recognition result and the emotion label.
205. And the segmentation processing unit is used for segmenting audio segments of the second data to be labeled to obtain multiple segments of audio data, and obtaining corresponding audio texts based on the multiple segments of audio data.
206. And the automatic labeling unit is used for acquiring a second sound characteristic of the audio data and a second text characteristic of the text data, performing emotion recognition on the basis of the second sound characteristic and the second text characteristic through an emotion analysis model, and performing automatic labeling of emotion labels on the audio data and the text data on the basis of recognition results.
The data annotation device 200 of the embodiment of the application comprises a manual annotation unit 201, which is used for selecting first data to be annotated and second data to be annotated from historical call data of a bank customer service seat and a customer, sending the first data to be annotated to a manual end for manual annotation, and obtaining first annotation data, wherein the first annotation data comprises first annotation audio with an emotion label and a first annotation audio text, the emotion label comprises an emotion identifier and position information of audio or audio text annotated by the emotion identifier, and the emotion identifiers recorded in the emotion labels of the first annotation audio and the first annotation audio text are the same; the sound feature obtaining unit 202 is configured to obtain an emotion labeled audio segment in the first labeled audio based on the position information in the emotion label of the first labeled audio, and obtain a first sound feature of the emotion labeled audio segment; a text feature obtaining unit 203, configured to obtain an emotion markup sentence segment of the first markup data based on the position information in the emotion tag of the first markup audio text, and obtain a first text feature of the emotion markup sentence segment; the model training unit 204 is used for performing emotion recognition on the basis of the first sound characteristic and the first text characteristic through the emotion analysis model to obtain an emotion recognition result, and training the emotion analysis model on the basis of the emotion recognition result and the emotion label; the segmentation processing unit 205 is configured to perform audio segment segmentation on the second data to be labeled to obtain multiple segments of audio data, and obtain a corresponding audio text based on the multiple segments of audio data; and the automatic labeling unit 206 is configured to obtain a second sound feature of the audio data and a second text feature of the text data, perform emotion recognition based on the second sound feature and the second text feature through an emotion analysis model, and perform automatic labeling of emotion labels on the audio data and the text data based on a recognition result. Through this device, semi-supervised data labeling is realized, so that labor cost is saved.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a computer device 300 according to an embodiment of the present application, where the computer device includes a processor 301 having one or more processing cores, a memory 302 having one or more computer-readable storage media, and a computer program stored in the memory 302 and capable of running on the processor 301. The processor 301 is electrically connected to the memory 302.
Those skilled in the art will appreciate that the computer device configurations illustrated in the figures are not meant to be limiting of computer devices, and may include more or fewer components than those illustrated, or combinations of certain components, or different arrangements of components.
The processor 301 is a control center of the computer apparatus 300, connects various parts of the entire computer apparatus 300 by various interfaces and lines, performs various functions of the computer apparatus 300 and processes data by running or loading software programs and/or modules stored in the memory 302, and calling data stored in the memory 302, thereby monitoring the computer apparatus 300 as a whole.
In this embodiment, the processor 301 in the computer device 300 loads instructions corresponding to processes of one or more application programs into the memory 302, and the processor 301 executes the application programs stored in the memory 302 according to the following steps, so as to implement various functions:
selecting first data to be labeled and second data to be labeled from historical call data of a bank customer service seat and a customer, sending the first data to be labeled to a manual end for manual labeling to obtain first labeled data, wherein the first labeled data comprises first labeled audio with an emotion label and first labeled audio text, the emotion label comprises an emotion identifier and position information of the audio or the audio text labeled by the emotion identifier, and the emotion identifiers recorded in the emotion labels of the first labeled audio and the first labeled audio text are the same;
acquiring an emotion label audio segment in the first label audio based on the position information in the emotion label of the first label audio, and acquiring first sound characteristics of the emotion label audio segment;
acquiring an emotion labeling sentence segment of the first labeling data based on position information in an emotion label of the first labeling audio text, and acquiring a first text characteristic of the emotion labeling sentence segment;
performing emotion recognition on the basis of the first sound characteristic and the first text characteristic through an emotion analysis model to obtain an emotion recognition result, and training the emotion analysis model on the basis of the emotion recognition result and an emotion label;
carrying out audio segment segmentation on the second data to be labeled to obtain a plurality of segments of audio data, and obtaining corresponding audio texts based on the plurality of segments of audio data;
and acquiring a second sound characteristic of the audio data and a second text characteristic of the text data, performing emotion recognition on the basis of the second sound characteristic and the second text characteristic through an emotion analysis model, and performing automatic annotation of emotion labels on the audio data and the text data on the basis of recognition results.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Optionally, the computer device 300 further includes an audio module 303 and a text module 304, the audio module 303 and the text module 304 are both electrically connected to the processor 301, the audio module 303 is configured to receive input audio data, and the text module 304 is configured to display text data corresponding to the input audio data. Those skilled in the art will appreciate that the computer device configuration illustrated in FIG. 3 does not constitute a limitation of computer devices, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components.
Although not shown in fig. 3, the computer device 300 may also include a display module and other electronic structures, which are not described in detail herein.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
As can be seen from the above, the computer device provided in this embodiment selects first data to be labeled and second data to be labeled from historical call data of a bank customer service seat and a customer, and sends the first data to be labeled to a manual end for manual labeling to obtain first labeled data, where the first labeled data includes a first labeled audio with an emotion label and a first labeled audio text, the emotion label includes an emotion identifier and position information of the audio or audio text labeled by the emotion identifier, and the emotion identifiers recorded in the emotion labels of the first labeled audio and the first labeled audio text are the same. Based on the position information in the emotion label of the first labeled audio, the emotion labeled audio segment in the first labeled audio is obtained, and the first sound feature of the emotion labeled audio segment is obtained. Based on the position information in the emotion label of the first labeled audio text, the emotion labeling sentence segment of the first labeled data is obtained, and the first text feature of the emotion labeling sentence segment is obtained. Emotion recognition is performed based on the first sound feature and the first text feature through an emotion analysis model to obtain an emotion recognition result, and the emotion analysis model is trained based on the emotion recognition result and the emotion label. Audio segment segmentation is performed on the second data to be labeled to obtain multiple segments of audio data, and corresponding audio texts are obtained based on the multiple segments of audio data. Finally, the second sound feature of the audio data and the second text feature of the text data are obtained, emotion recognition is performed based on the second sound feature and the second text feature through the emotion analysis model, and automatic labeling of emotion labels is performed on the audio data and the text data based on the recognition results.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the present application provides a computer-readable storage medium, in which a plurality of computer programs are stored, and the computer programs can be loaded by a processor to execute the steps in any data annotation method provided by the present application. For example, the computer program may perform the steps of:
selecting first data to be labeled and second data to be labeled from historical call data of a bank customer service seat and a customer, sending the first data to be labeled to a manual end for manual labeling to obtain first labeled data, wherein the first labeled data comprises first labeled audio with an emotion label and first labeled audio text, the emotion label comprises an emotion identifier and position information of the audio or the audio text labeled by the emotion identifier, and the emotion identifiers recorded in the emotion labels of the first labeled audio and the first labeled audio text are the same;
acquiring an emotion label audio segment in the first label audio based on the position information in the emotion label of the first label audio, and acquiring first sound characteristics of the emotion label audio segment;
acquiring an emotion labeling sentence segment of the first labeling data based on position information in an emotion label of the first labeling audio text, and acquiring a first text characteristic of the emotion labeling sentence segment;
performing emotion recognition on the basis of the first sound characteristic and the first text characteristic through an emotion analysis model to obtain an emotion recognition result, and training the emotion analysis model on the basis of the emotion recognition result and an emotion label;
carrying out audio segment segmentation on the second data to be labeled to obtain a plurality of segments of audio data, and obtaining corresponding audio texts based on the plurality of segments of audio data;
and acquiring a second sound characteristic of the audio data and a second text characteristic of the text data, performing emotion recognition on the basis of the second sound characteristic and the second text characteristic through an emotion analysis model, and performing automatic annotation of emotion labels on the audio data and the text data on the basis of recognition results.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein, the storage medium may include: a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
Since the computer program stored in the storage medium can execute the steps in any data annotation method provided in the embodiments of the present application, the beneficial effects that can be achieved by any data annotation method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The data annotation method, apparatus, computer device and storage medium provided in the embodiments of the present application are introduced in detail above, and a specific example is applied in the present application to explain the principle and implementation manner of the present application, and the description of the above embodiments is only used to help understand the method and core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A method for annotating data, the method comprising:
selecting first data to be labeled and second data to be labeled from historical call data of a bank customer service seat and a customer, sending the first data to be labeled to a manual end for manual labeling to obtain first labeled data, wherein the first labeled data comprises first labeled audio with an emotion label and first labeled audio text, the emotion label comprises an emotion identifier and position information of the audio or audio text labeled by the emotion identifier, and the emotion identifiers recorded in the emotion labels of the first labeled audio and the first labeled audio text are the same;
acquiring an emotion label audio segment in the first label audio based on the position information in the emotion label of the first label audio, and acquiring a first sound characteristic of the emotion label audio segment;
acquiring an emotion labeling sentence segment of the first labeling data based on position information in an emotion label of the first labeling audio text, and acquiring a first text characteristic of the emotion labeling sentence segment;
performing emotion recognition on the basis of the first sound characteristic and the first text characteristic through an emotion analysis model to obtain an emotion recognition result, and training the emotion analysis model on the basis of the emotion recognition result and the emotion label;
carrying out audio segment segmentation on the second data to be labeled to obtain a plurality of segments of audio data, and obtaining a corresponding audio text based on the plurality of segments of audio data;
and acquiring a second sound characteristic of the audio data and a second text characteristic of the text data, performing emotion recognition on the basis of the second sound characteristic and the second text characteristic through the emotion analysis model, and performing automatic annotation of emotion labels on the audio data and the text data on the basis of recognition results.
2. The data annotation method of claim 1, wherein the obtaining of the emotion annotation audio segment in the first annotation audio and the first sound feature of the emotion annotation audio segment based on the position information in the emotion tag of the first annotation audio comprises:
classifying the emotion labeled audio segment based on the emotion identification;
and acquiring sound frequency spectrograms of the emotion marking audio bands in the same category by using a preset audio algorithm, and acquiring a first sound characteristic for representing emotion based on the sound frequency spectrograms.
3. The data annotation method of claim 1, wherein the obtaining of the emotion markup sentence fragment of the first annotation data and the obtaining of the first text feature of the emotion markup sentence fragment based on the position information in the emotion tag of the first annotated audio text comprises:
classifying the emotion marking sentence segments based on the emotion identifications;
and acquiring emotion characteristic words of the emotion labeling sentence segments in the same category by using a preset text algorithm, and acquiring first text characteristics for representing emotion based on the emotion characteristic words.
4. The data labeling method of claim 1, wherein the acquiring of the second sound feature of the audio data and the second text feature of the text data, the performing of emotion recognition based on the second sound feature and the second text feature through the emotion analysis model, and the annotating of the audio data and the text data with emotion labels based on the recognition result, comprises:
setting an audio weight and a text weight in the emotion analysis model; and
when the emotion identifier recognized from the second sound feature differs from the emotion identifier recognized from the corresponding second text feature, determining the emotion identifier for the second sound feature and its corresponding second text feature based on the ratio of the audio weight to the text weight.
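A minimal sketch of this weight-based conflict resolution: when the audio branch and the text branch disagree, the final emotion identifier follows the configured audio/text weight ratio. The 0.6/0.4 split and the dictionary-of-confidences interface are assumptions made for illustration only.

def fuse_emotion(audio_probs, text_probs, audio_weight=0.6, text_weight=0.4):
    """audio_probs / text_probs: dict mapping emotion_id -> branch confidence."""
    audio_label = max(audio_probs, key=audio_probs.get)
    text_label = max(text_probs, key=text_probs.get)
    if audio_label == text_label:
        return audio_label
    # Disagreement: combine the two branches with the preset weights.
    combined = {
        e: audio_weight * audio_probs.get(e, 0.0) + text_weight * text_probs.get(e, 0.0)
        for e in set(audio_probs) | set(text_probs)
    }
    return max(combined, key=combined.get)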
5. The data labeling method of claim 1, wherein the selecting of the first data to be labeled and the second data to be labeled from historical call data of a bank customer service agent and a customer, the sending of the first data to be labeled to a manual end for manual labeling, and the obtaining of the first labeled data, the first labeled data comprising a first labeled audio and a first labeled audio text each carrying emotion labels, wherein each emotion label comprises an emotion identifier and position information of the audio or audio text marked by the emotion identifier, and the emotion identifiers recorded in the emotion labels of the first labeled audio and the first labeled audio text are the same, comprises:
selecting the first data to be labeled from the historical call data of the bank customer service agent and the customer, wherein the first data to be labeled comprises a first audio to be labeled;
inputting the first audio to be labeled into a voice separation model, wherein the voice separation model separates and tags the first audio to be labeled according to the voiceprint features of different speakers;
inputting the processed first audio to be labeled into a text recognition model to obtain a corresponding first audio text to be labeled; and
selecting the second data to be labeled from the remaining historical call data, and sending the processed first audio to be labeled and the first audio text to be labeled to the manual end for manual labeling.
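A sketch of the preparation flow in claim 5. The voiceprint-based voice separation model and the text recognition (speech-to-text) model are treated as opaque components; SpeakerSeparationModel and SpeechToTextModel are hypothetical names, not a specific library's API.

def prepare_first_data(call_audio):
    separator = SpeakerSeparationModel()   # voiceprint-based separation (hypothetical)
    asr = SpeechToTextModel()              # text recognition model (hypothetical)
    # Split the call by speaker and tag each segment (agent vs. customer).
    speaker_segments = separator.separate(call_audio)
    transcript = []
    for segment in speaker_segments:
        transcript.append((segment.speaker_id, asr.transcribe(segment.audio)))
    # The processed audio and its transcript are then sent to the manual end.
    return speaker_segments, transcript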
6. The data labeling method of claim 5, wherein the inputting of the processed first audio to be labeled into the text recognition model to obtain the corresponding first audio text to be labeled comprises:
inputting the first audio to be labeled into the text recognition model, wherein the text recognition model recognizes the semantics of the voiced audio segments in the first audio to be labeled and determines the blank positions of the blank segments in the first audio to be labeled, a blank segment being a silent segment in the first audio to be labeled;
obtaining an initial text based on the semantic recognition result and the blank positions; and
inputting the initial text into a deep neural network model, determining which blank positions should carry punctuation marks, automatically inserting those punctuation marks, and joining the sentences adjacent to each remaining blank position to obtain the first audio text to be labeled.
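The sketch below mirrors the flow of claim 6: transcribe the voiced segments, keep the position of each silent gap, and let a punctuation model decide which gaps become sentence boundaries; gaps given no punctuation simply join the neighbouring sentences. SpeechToTextModel, PunctuationModel and detect_voiced_and_gaps are hypothetical placeholders.

def build_audio_text(audio):
    asr = SpeechToTextModel()              # hypothetical ASR component
    punct_model = PunctuationModel()       # the deep neural network of the claim
    # Recognize voiced segments and record the position of each silent gap.
    voiced_segments, gap_positions = detect_voiced_and_gaps(audio)
    texts = [asr.transcribe(seg) for seg in voiced_segments]
    pieces = []
    for i, text in enumerate(texts):
        pieces.append(text)
        if i < len(gap_positions):
            # Predict the punctuation mark for this gap; an empty string means
            # the two adjacent sentences are joined directly.
            pieces.append(punct_model.predict(text, gap_positions[i]))
    return "".join(pieces)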
7. The data labeling method of claim 1, wherein the selecting of the first data to be labeled and the second data to be labeled from historical call data of a bank customer service agent and a customer, and the sending of the first data to be labeled to a manual end for manual labeling to obtain first labeled data, the first labeled data comprising a first labeled audio and a first labeled audio text each carrying emotion labels, further comprises:
preprocessing the first data to be labeled and the second data to be labeled, wherein the preprocessing comprises noise reduction.
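Claim 7 only states that the preprocessing includes noise reduction, without naming a technique. The sketch below assumes spectral-gating denoising via the third-party noisereduce package as one possible choice; the file paths and 16 kHz rate are illustrative.

import librosa
import noisereduce as nr      # spectral-gating denoiser (assumed choice)
import soundfile as sf

def denoise_call(in_path, out_path, sr=16000):
    """Noise-reduction preprocessing for one call recording."""
    y, sr = librosa.load(in_path, sr=sr)
    y_clean = nr.reduce_noise(y=y, sr=sr)   # spectral gating noise reduction
    sf.write(out_path, y_clean, sr)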
8. A data labeling device, the device comprising:
a manual labeling unit, configured to select first data to be labeled and second data to be labeled from historical call data of a bank customer service agent and a customer, and send the first data to be labeled to a manual end for manual labeling to obtain first labeled data, wherein the first labeled data comprises a first labeled audio and a first labeled audio text each carrying emotion labels, each emotion label comprises an emotion identifier and position information of the audio or audio text marked by the emotion identifier, and the emotion identifiers recorded in the emotion labels of the first labeled audio and the first labeled audio text are the same;
a sound feature acquiring unit, configured to acquire an emotion-labeled audio segment in the first labeled audio based on the position information in the emotion label of the first labeled audio, and acquire a first sound feature of the emotion-labeled audio segment;
a text feature acquiring unit, configured to acquire an emotion-labeled sentence segment of the first labeled data based on the position information in the emotion label of the first labeled audio text, and acquire a first text feature of the emotion-labeled sentence segment;
a model training unit, configured to perform emotion recognition based on the first sound feature and the first text feature through an emotion analysis model to obtain an emotion recognition result, and train the emotion analysis model based on the emotion recognition result and the emotion label;
a segmentation processing unit, configured to perform audio segmentation on the second data to be labeled to obtain a plurality of pieces of audio data, and obtain corresponding audio texts based on the plurality of pieces of audio data; and
an automatic labeling unit, configured to acquire a second sound feature of the audio data and a second text feature of the text data, perform emotion recognition based on the second sound feature and the second text feature through the emotion analysis model, and automatically annotate the audio data and the text data with emotion labels based on the recognition result.
9. A computer device, comprising a memory for storing instructions and data, and a processor for performing the data labeling method of any one of claims 1 to 7.
10. A storage medium having stored therein a plurality of instructions adapted to be loaded by a processor to perform the data labeling method of any one of claims 1 to 7.
CN202210731923.9A 2022-06-25 2022-06-25 Data labeling method and device, computer equipment and storage medium Pending CN115063155A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210731923.9A CN115063155A (en) 2022-06-25 2022-06-25 Data labeling method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210731923.9A CN115063155A (en) 2022-06-25 2022-06-25 Data labeling method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115063155A true CN115063155A (en) 2022-09-16

Family

ID=83202414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210731923.9A Pending CN115063155A (en) 2022-06-25 2022-06-25 Data labeling method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115063155A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620722A (en) * 2022-12-15 2023-01-17 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium


Similar Documents

Publication Publication Date Title
CN112804400B (en) Customer service call voice quality inspection method and device, electronic equipment and storage medium
US10950241B2 (en) Diarization using linguistic labeling with segmented and clustered diarized textual transcripts
CN111128223B (en) Text information-based auxiliary speaker separation method and related device
CN110473566A (en) Audio separation method, device, electronic equipment and computer readable storage medium
CN107562760B (en) Voice data processing method and device
CN107305541A (en) Speech recognition text segmentation method and device
CN108536654A (en) Identify textual presentation method and device
CN109448704A (en) Construction method, device, server and the storage medium of tone decoding figure
Kopparapu Non-linguistic analysis of call center conversations
CN112233680A (en) Speaker role identification method and device, electronic equipment and storage medium
CN110782902A (en) Audio data determination method, apparatus, device and medium
CN115063155A (en) Data labeling method and device, computer equipment and storage medium
US10522135B2 (en) System and method for segmenting audio files for transcription
CN113255362A (en) Method and device for filtering and identifying human voice, electronic device and storage medium
CN115527551A (en) Voice annotation quality evaluation method and device, electronic equipment and storage medium
CN112241467A (en) Audio duplicate checking method and device
CN115862635B (en) Data processing method, electronic equipment and storage medium
Gereg et al. Semi-automatic processing and annotation of meeting audio recordings
CN116206593A (en) Voice quality inspection method, device and equipment
CN117976001A (en) Method, system, electronic equipment and storage medium for detecting quality of voice book
CN116996620A (en) Processing method and device for outbound interception and electronic equipment
CN115223542A (en) Dialect semantic quick transcription technology
CN114925159A (en) User emotion analysis model training method and device, electronic equipment and storage medium
CN117668150A (en) Dialogue quality inspection method, medium and equipment
CN115240657A (en) Voice processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination