CN111326173B - Voice information processing method and device, electronic equipment and readable storage medium


Info

Publication number
CN111326173B
CN111326173B (application CN201811544377.8A)
Authority
CN
China
Prior art keywords
information
voice
user
segment
sound
Prior art date
Legal status
Active
Application number
CN201811544377.8A
Other languages
Chinese (zh)
Other versions
CN111326173A (en)
Inventor
张凌宇
Current Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN201811544377.8A
Publication of CN111326173A
Application granted
Publication of CN111326173B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a voice information processing method and apparatus, an electronic device and a readable storage medium. The method comprises: determining at least one sound production user identification in voice information, the sound production voice section corresponding to each sound production user identification, and the time information of the sound production voice sections in the voice information; matching each sound production user identification with the text information segment corresponding to each sound production voice section according to the time information of the sound production voice sections in the voice information to obtain dialogue information; and identifying the emotion type of the user according to the dialogue information. According to the embodiments of the application, the dialogue information among the sound-producing users is generated from the voice information, and the emotion type of each sound-producing user is determined from the dialogue information, which avoids determining the emotion type from the voice information of a single sound-producing user alone and improves the accuracy and flexibility of determining the emotion types of the sound-producing users.

Description

Voice information processing method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of audio technologies, and in particular, to a method and an apparatus for processing voice information, an electronic device, and a readable storage medium.
Background
With the continuous development of artificial intelligence, a terminal can analyze the current emotion of a user from text, and can also analyze the emotion of the user from voice information uttered by the user.
In the related art, the terminal can acquire the voice information uttered by the user, analyze and process it to obtain the corresponding text information, and analyze that text information to determine the current emotion of the user.
However, because the terminal analyzes the voice information of only a single user, the obtained result contains errors, and the emotion type of the user cannot be accurately determined.
Disclosure of Invention
In view of the above, an object of the embodiments of the present application is to provide a method, an apparatus, an electronic device, and a computer-readable storage medium for processing voice information, in which at least one uttering user identifier, an uttering voice segment corresponding to each uttering user identifier, and time information of the uttering voice segment in the voice information are determined, each uttering user identifier is matched with a text information segment corresponding to each uttering voice segment according to the time information of the uttering voice segment in the voice information to obtain dialog information, and an emotion type of a user is identified according to the dialog information. By generating the dialogue information between the sound-producing users according to the voice information and determining the emotion types of the sound-producing users according to the dialogue information, the situation that the emotion types are determined according to the voice information of a single sound-producing user is avoided, and the accuracy and the flexibility of determining the emotion types of the sound-producing users are improved.
In a first aspect, an embodiment of the present application provides a method for processing voice information, including:
determining at least one sound production user identification in voice information, sound production voice sections corresponding to the sound production user identifications and time information of the sound production voice sections in the voice information;
matching each voice production user identification with a text information segment corresponding to each voice production segment according to the time information of the voice production segments in the voice information to obtain dialogue information;
and identifying the emotion type of the user according to the conversation information.
Optionally, before determining at least one utterance user identifier, an utterance speech segment corresponding to each utterance user identifier, and time information of the utterance speech segment in the speech information, the method includes:
acquiring audio data;
before the matching of each of the uttered user identifications and the text information segment corresponding to each of the uttered voice segments according to the time information of the uttered voice segment in the voice information, the method includes:
and converting the audio data into text information.
Optionally, the acquiring the audio data includes:
detecting whether the current environment comprises voice information sent by at least one sound-emitting user;
and if the voice information sent by any sound-producing user is detected, acquiring the audio data in the current environment.
Optionally, the converting the audio data into text information includes:
denoising the audio data to obtain the voice information;
and converting the voice information into the text information through a preset voice recognition model.
Optionally, after the converting the voice information into the text information through the preset voice recognition model, the method further includes:
determining time information corresponding to each text information segment included in the text information;
the matching of each voice-producing user identifier with the text information segment corresponding to each voice-producing speech segment according to the time information of the voice-producing speech segment in the speech information includes:
and matching a target sound production user identifier corresponding to a target time period and a target text information segment corresponding to the target time period according to the time information corresponding to each text information segment in the text information and the time information of the sound production voice segment in the voice information to obtain the dialogue information.
Optionally, the determining at least one utterance user identifier in the speech information includes:
recognizing the voice information to obtain user characteristics of at least one sound-producing user, wherein the user characteristics comprise at least one characteristic of tone, frequency and voiceprint;
and generating at least one sound-producing user identifier according to the number of the user features, wherein the number of the user features is consistent with that of the sound-producing user identifiers.
Optionally, the voice information is stored in a block storage manner.
Optionally, the recognizing the emotion type of the user according to the dialog information includes:
and analyzing the dialogue information through a preset emotion recognition model to obtain the emotion type of the user.
In a second aspect, an embodiment of the present application provides a speech information processing apparatus, including:
a first determining module, configured to determine at least one sound production user identification in voice information, the sound production voice section corresponding to each sound production user identification, and the time information of the sound production voice sections in the voice information;
the matching module is used for matching each sound production user identification with the text information segment corresponding to each sound production voice segment according to the time information of the sound production voice segment in the voice information to obtain dialogue information;
and the identification module is used for identifying the emotion type of the user according to the conversation information.
Optionally, the method further includes:
the acquisition module is used for acquiring audio data;
and the conversion module is used for converting the audio data into text information.
Optionally, the obtaining module is specifically configured to detect whether the current environment includes voice information sent by at least one voice-emitting user; and if the voice information sent by any sound-producing user is detected, acquiring the audio data in the current environment.
Optionally, the conversion module is specifically configured to perform denoising processing on the audio data to obtain the voice information; and converting the voice information into the text information through a preset voice recognition model.
Optionally, the method further includes:
the second determining module is used for determining time information corresponding to each text information segment included in the text information;
the matching module is specifically configured to match a target voice-generating user identifier corresponding to a target time period and a target text information segment corresponding to the target time period according to time information corresponding to each text information segment in the text information and time information of the voice-generating voice segment in the voice information, so as to obtain the dialog information.
Optionally, the first determining module is specifically configured to identify the voice information to obtain a user characteristic of at least one voice-generating user, where the user characteristic includes at least one characteristic of tone, frequency, and voiceprint; and generating at least one sounding user identifier according to the number of the user characteristics, wherein the number of the user characteristics is consistent with that of the sounding user identifiers.
Optionally, the voice information is stored in a block storage manner.
Optionally, the recognition module is specifically configured to analyze the dialog information through a preset emotion recognition model to obtain an emotion type of the user.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a storage medium and a bus, wherein the storage medium stores machine-readable instructions executable by the processor, when the electronic device runs, the processor and the storage medium communicate through the bus, and the processor executes the machine-readable instructions to execute the steps of the voice information processing method according to any one of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the voice information processing method according to any one of the above first aspects.
In the embodiment of the application, by determining at least one sounding user identifier in the voice information, the sounding voice segment corresponding to each sounding user identifier and the time information of the sounding voice segment in the voice information, according to the time information of the sounding voice segment in the voice information, each sounding user identifier is matched with the text information segment corresponding to each sounding voice segment to obtain the dialogue information, and according to the dialogue information, the emotion type of the user is identified. By generating the dialogue information between the sound-producing users according to the voice information and determining the emotion types of the sound-producing users according to the dialogue information, the situation that the emotion types are determined according to the voice information of the single sound-producing users is avoided, and the accuracy and flexibility of determining the emotion types of the sound-producing users are improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 illustrates a scene schematic diagram related to a voice information processing method provided by an embodiment of the present application;
FIG. 2 illustrates a schematic diagram of exemplary hardware and software components of an electronic device 200 of some embodiments of the present application, which may implement the concepts of the present application;
fig. 3 is a schematic flowchart illustrating a method for processing voice information according to an embodiment of the present application;
FIG. 4 is a flow chart of another speech information processing method provided in the embodiment of the present application;
fig. 5 is a block diagram illustrating a speech information processing apparatus according to an embodiment of the present application;
FIG. 6 is a block diagram of another speech information processing apparatus provided in an embodiment of the present application;
FIG. 7 is a block diagram of another speech information processing apparatus provided in an embodiment of the present application;
fig. 8 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application. Additionally, it should be understood that the schematic drawings are not necessarily drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. One skilled in the art, under the guidance of this application, may add one or more other operations to, or remove one or more operations from, the flowchart.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as presented in the figures, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that in the embodiments of the present application, the term "comprising" is used to indicate the presence of the features stated hereinafter, but does not exclude the addition of further features.
Fig. 1 illustrates a scene schematic diagram related to a voice information processing method provided by an embodiment of the present application; as shown in fig. 1, the scenario involved in the speech information processing method may include: vehicle 110, terminal 120, and at least one vocalizing user 130.
When vehicle 110 is being driven, at least one user 130 (the driver and/or a passenger) in vehicle 110 may utter voice information. Terminal 120 may acquire the voice information of users 130, convert and recognize it to obtain the text information and the user identifiers corresponding to the voice information, and match the user identifiers with the text information to determine the user identifier corresponding to each text information segment, that is, to determine which user 130 uttered each sentence in the text information, so as to form the dialogue information between users 130. Finally, the emotion type of each user 130 is determined according to the dialogue information, for example, whether the emotion of each user 130 is in a positive, neutral, or negative state.
In the process of converting and recognizing the voice information, the terminal 120 may perform the operations of converting and recognizing the voice information at the same time.
For example, the terminal 120 may convert the voice information into text information in a machine-readable format through a preset voice recognition system, and meanwhile, the terminal 120 may extract a plurality of features included in the voice information through a preset speaker recognition system to obtain a plurality of features such as voice, emotion, specific information of the uttering user 130, and so on, thereby determining the number of the uttering users 130, generating an uttering user identifier, and finally matching the uttering user identifier with the text information to obtain the dialog information.
It should be noted that, in practical applications, the terminal 120 may further send the acquired voice information to the server, and the server may receive the voice information, perform the above processing on the voice information, and determine the emotion type of the speaking user 130, which is not limited in this embodiment of the application.
In addition, in practical applications, the voice information processing method provided by the present application may be applied to a plurality of scenarios in which the voice information of the uttering user 130 can be acquired, and the present embodiment is only described by taking a scenario in which the vehicle 110 travels as an example, and the application scenario of the voice information processing method provided by the present application is not limited.
Fig. 2 illustrates a schematic diagram of exemplary hardware and software components of an electronic device 200 of some embodiments of the present application, which may implement the concepts of the present application. For example, a processor on the electronic device 200 may be used to perform the functions described herein.
The electronic device 200 may be a general purpose computer or a special purpose computer, both of which may be used to implement the voice information processing method of the present application. Although only a single computer is shown, for convenience, the functions described herein may be implemented in a distributed fashion across multiple similar platforms to balance processing loads.
For example, the electronic device 200 may include a network port 210 connected to a network, one or more processors 220 for executing program instructions, a communication bus 230, and a different form of storage medium 240, such as a disk, ROM, or RAM, or any combination thereof. Illustratively, the computer platform may also include program instructions stored in ROM, RAM, or other types of non-transitory storage media, or any combination thereof. The method of the present application may be implemented in accordance with these program instructions. The electronic device 200 also includes an Input/Output (I/O) interface 250 between the computer and other Input/Output devices (e.g., keyboard, display screen).
For ease of illustration, only one processor is depicted in the electronic device 200. However, it should be noted that the electronic device 200 in the present application may also comprise a plurality of processors, and thus the steps performed by one processor described in the present application may also be performed jointly or separately by a plurality of processors. For example, if the processor of the electronic device 200 executes steps A and B, it should be understood that steps A and B may also be executed by two different processors jointly or separately: a first processor performs step A and a second processor performs step B, or the first processor and the second processor perform steps A and B together.
In order to improve the accuracy of determining the emotion types of the users, the judgment can be made according to the conversation among the users to determine whether a user's emotion is in an agitated state, so that an early warning can be issued according to the determined emotion, conflicts or safety accidents caused by emotional agitation can be avoided, and the personal safety of the users can be improved.
For example, if the driver and the passenger are located in the vehicle during the driving of the vehicle, whether the driver or the passenger is in an emotional state can be determined according to the conversation between the driver and the passenger, so as to avoid the collision between the driver and the passenger.
For the sake of simple description, the present application is only described by taking an environment in which a driver and a passenger are located in a vehicle during the driving of the vehicle as an example, and of course, the present application is also applicable to other scenes in which the emotion of the user can be determined according to voice, and the present application is not limited to this.
Fig. 3 is a flowchart illustrating a voice information processing method according to an embodiment of the present application. The execution subject of the method may be a terminal or the like as shown in fig. 1, and is not limited herein. As shown in fig. 3, the method includes:
s301, determining at least one voice user identification in the voice information, a voice section corresponding to each voice user identification and time information of the voice section in the voice information.
Wherein the voice information is audio data comprising the voice of at least one speaking user.
In order to determine whether the emotion of the user is in an excited state, the voice of the sound-producing user can be recognized, and different voices and sentences produced by different sound-producing users can be determined, so that the emotion type of each user can be determined according to conversation among the sound-producing users.
Therefore, the method can firstly identify according to the acquired voice information, and determine the sound production user identifications of different sound production users, the sound production voice sections corresponding to each sound production user identification, and the time information of each sound production voice section in the voice information.
Each sound production user identification is used for indicating a different sound-producing user, the sound production voice segment corresponding to a sound production user identification represents the sound made by the user indicated by that identification, and the time information of a sound production voice segment in the voice information represents the time period occupied by that voice segment within the voice information.
For example, suppose the voice information is a conversation between sound-producing user A and sound-producing user B, where user A speaks at 0 to 5 s (seconds), 12 to 20 s, and 30 to 35 s, and user B speaks at 6 to 9 s, 22 to 30 s, and 36 to 40 s. After the voice information is recognized, the identification a indicating user A, the identification b indicating user B, the sound production voice segments at 0 to 5 s, 12 to 20 s, and 30 to 35 s corresponding to identification a, and the sound production voice segments at 6 to 9 s, 22 to 30 s, and 36 to 40 s corresponding to identification b can be obtained, where each of these time periods is the time information corresponding to the respective sound production voice segment.
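To make the relationship between identifications, voice segments and time information concrete, the example above could be represented by a structure like the following minimal Python sketch; the class and field names are illustrative assumptions and are not part of the patented method.

```python
from dataclasses import dataclass

@dataclass
class UtteredSegment:
    """One sound production voice segment attributed to a sound production user identification."""
    user_id: str    # sound production user identification, e.g. "a" or "b"
    start_s: float  # start time of the segment within the voice information, in seconds
    end_s: float    # end time of the segment within the voice information, in seconds

# Recognition result for the example above: user A speaks at 0-5 s, 12-20 s and
# 30-35 s, and user B speaks at 6-9 s, 22-30 s and 36-40 s.
segments = [
    UtteredSegment("a", 0, 5),   UtteredSegment("b", 6, 9),
    UtteredSegment("a", 12, 20), UtteredSegment("b", 22, 30),
    UtteredSegment("a", 30, 35), UtteredSegment("b", 36, 40),
]
```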
And S302, matching each voice-producing user identification with the text information segment corresponding to each voice-producing segment according to the time information of the voice-producing segments in the voice information to obtain the dialogue information.
The text information segments are obtained by splitting each dialogue sentence in the text information, and the text information is obtained by converting the voice information.
After the voice information is recognized, matching can be performed according to the recognized identifications, voice segments and time information, so that the matched text information segments form the dialogue between the sound-producing users.
Specifically, the sound production voice segment corresponding to each sound production user identification can be determined first, together with the time information corresponding to each sound production voice segment and the time information corresponding to each text information segment in the text information. Each sound production voice segment can then be matched against each text information segment: if the time information corresponding to a sound production voice segment is consistent with the time information corresponding to a text information segment, the two are determined to match, and the text information segment is marked with the sound production user identification to which that voice segment belongs, indicating that the text information segment was uttered by the sound-producing user corresponding to that identification.
After each voice section and each text information section are matched and a voice user identifier is added to each text information section, the dialogue information among different voice users can be generated according to the sequence of each text information section.
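The matching just described amounts to pairing each text information segment with the sound production voice segment whose time period it coincides with, and then ordering the labelled segments. The following minimal Python sketch shows one way to do this under the assumption that both lists carry explicit start and end times; the tolerance value and the tuple layout are illustrative only.

```python
def build_dialog(voice_segments, text_segments, tolerance_s=0.5):
    """voice_segments: (user_id, start_s, end_s) tuples from recognizing the voice information.
    text_segments:  (sentence, start_s, end_s) tuples from the converted text information.
    Returns the dialogue information as (user_id, sentence) pairs in utterance order."""
    dialog = []
    for sentence, t_start, t_end in sorted(text_segments, key=lambda seg: seg[1]):
        for user_id, v_start, v_end in voice_segments:
            # A text information segment matches a voice segment when their time
            # information is (approximately) consistent.
            if abs(t_start - v_start) <= tolerance_s and abs(t_end - v_end) <= tolerance_s:
                dialog.append((user_id, sentence))
                break
    return dialog
```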
And S303, identifying the emotion type of the user according to the conversation information.
After the dialogue information among different sound-producing users is obtained, the obtained dialogue information can be analyzed through a preset emotion recognition model, the emotion state of each sound-producing user is judged, and the emotion type of each sound-producing user is determined.
Specifically, the dialogue information may be input into the emotion recognition model, so that the emotion recognition model processes each sentence in the dialogue information to obtain a word vector, then performs processing such as removing stop words, summing vectors, taking an average value, setting a label, and the like on the word vector, and finally outputs an emotion type of each uttering user, for example, it is determined that the emotion of each uttering user is in each of different types such as positive, negative, or neutral.
For example, the dialogue information includes: User A: "Driver, please go to the Digital Science and Technology Park, and please hurry, I'm going to be late for work!" User B: "All right, but it's the morning rush hour now, it won't be fast!" User A: "I checked the navigation, the roads around there aren't that congested!" After the above dialogue information is input into the emotion recognition model, the model can determine from user A's sentences that user A is currently in an emotionally agitated state, that is, a state of positive emotion, and, since user B's sentences do not allow the emotion to be determined as positive or negative, that user B is in a state of neutral emotion.
In summary, the voice information processing method provided by the embodiment of the present application matches each of the voice user identifiers with the text information segment corresponding to each of the voice segments of the utterances according to the time information of the voice segments of the utterances in the voice information by determining at least one of the voice user identifiers in the voice information, the uttered voice segment corresponding to each of the voice user identifiers, and the time information of the uttered voice segment in the voice information, so as to obtain the dialog information, and identifies the emotion type of the user according to the dialog information. By generating the dialogue information between the sound-producing users according to the voice information and determining the emotion types of the sound-producing users according to the dialogue information, the situation that the emotion types are determined according to the voice information of a single sound-producing user is avoided, and the accuracy and the flexibility of determining the emotion types of the sound-producing users are improved.
Fig. 4 is a schematic flow chart illustrating another speech information processing method according to an embodiment of the present application. The execution subject of the method may be a terminal or the like as shown in fig. 1, without limitation. As shown in fig. 4, the method includes:
s401, audio data are obtained.
In order to improve the accuracy of determining the emotion of the user, it may be judged whether the emotion of the speaking user is excited by the voice of the speaking user. Accordingly, audio data including the voice of the sound-emitting user and the conversation between the sound-emitting user and other sound-emitting users can be acquired by the terminal, so that in the subsequent step, the emotion type of the sound-emitting user can be determined from the audio data.
Furthermore, in order to reduce the computation load on the terminal and prevent it from storing redundant information, the terminal can detect the sound in the current environment and acquire the audio data of the current environment only when a sound-producing user is detected to be speaking.
Optionally, the terminal may detect whether the current environment includes voice information sent by at least one sound-producing user, and may acquire the audio data in the current environment if the voice information sent by any sound-producing user is detected.
Specifically, the terminal may detect the sound in the current environment, and if it is detected that the frequency, amplitude, and timbre of the sound in the current environment are similar to the features of the dialogue sound, it may be determined that the speaking user in the current environment is speaking, so as to obtain the audio data in the current environment.
It should be noted that the terminal may filter out the time periods in which no sound-producing user is speaking in the current environment by using Voice Activity Detection (VAD); of course, other methods may also be used to filter out the non-speech periods in the audio data, which is not limited in this embodiment of the present application.
For example, the terminal may acquire audio data of the current environment in real time and identify the periods in which no user is speaking in a VAD manner, so as to filter out those periods and retain the periods in which a user is speaking.
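As a concrete illustration of such VAD-based filtering, the sketch below uses the third-party webrtcvad package to keep only the frames classified as speech; the package choice, the 16 kHz 16-bit mono PCM input, the frame length and the aggressiveness level are all assumptions for illustration rather than requirements of the method.

```python
import webrtcvad  # third-party package; pip install webrtcvad

def keep_speech_frames(pcm16: bytes, sample_rate: int = 16000, frame_ms: int = 30) -> bytes:
    """Drop the periods of the audio data in which no sound-producing user is speaking,
    keeping only the frames that the voice activity detector classifies as speech."""
    vad = webrtcvad.Vad(2)  # aggressiveness 0-3; 2 is a middle-ground assumption
    bytes_per_frame = int(sample_rate * frame_ms / 1000) * 2  # 16-bit mono PCM
    speech = bytearray()
    for i in range(0, len(pcm16) - bytes_per_frame + 1, bytes_per_frame):
        frame = pcm16[i:i + bytes_per_frame]
        if vad.is_speech(frame, sample_rate):
            speech.extend(frame)
    return bytes(speech)
```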
S402, converting the audio data into text information.
After the audio data are obtained, the terminal can analyze and process the audio data to obtain text information corresponding to the audio data. However, the current environment includes not only the voice information uttered by the user, but also noise in the current environment.
Therefore, in order to improve the accuracy of the text information, noise in the audio data can be filtered to obtain voice information, and then the voice information is converted to obtain the text information.
Optionally, the terminal may perform denoising processing on the audio data to obtain voice information, and convert the voice information into text information through a preset voice recognition model.
Specifically, the terminal may perform denoising processing on the audio data by using a preset algorithm, removing the sounds in the audio data whose frequencies are not consistent with those of dialogue sounds, to obtain the voice information. The voice information is then input into a preset voice recognition model, so that the voice recognition model converts the voice information into the corresponding text information.
For example, the voice information may be converted by using Sphinx-4 (a Java speech recognition library), the Bing Speech API (Microsoft's speech interface), or the Google Speech API (Google's speech interface), which is not limited in this embodiment.
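A minimal sketch of the denoise-and-convert step is given below using the third-party SpeechRecognition package, whose recognizer wraps engines such as CMU Sphinx and the Google Speech API mentioned above; the ambient-noise adjustment standing in for the denoising step and the choice of the offline Sphinx engine are illustrative assumptions.

```python
import speech_recognition as sr  # third-party package; pip install SpeechRecognition

def audio_file_to_text(path: str) -> str:
    """Denoise the recorded audio data and convert the remaining voice
    information into text information with a speech recognition engine."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        # Simple ambient-noise compensation stands in for the denoising step here.
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        audio = recognizer.record(source)
    # Any of the engines mentioned above could be used; Sphinx runs offline.
    return recognizer.recognize_sphinx(audio)
```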
Furthermore, the text information is divided sentence by sentence to obtain a plurality of text information segments, so that the dialogue information among different sound-producing users can be formed from those segments. Accordingly, the terminal can determine the time information corresponding to each text information segment included in the text information.
Specifically, each sentence in the text information may be divided out according to a sentence division algorithm, so as to obtain a plurality of text information segments each composed of one or more sentences, and the time period corresponding to each sentence is determined, thereby determining the time information corresponding to each text information segment.
For example, after the text information segments are determined, the time period corresponding to each sentence in the text information may be determined to obtain the start time and the end time of each sentence; a hash algorithm is then applied to the start time and the end time to obtain a hash code for each sentence, which serves as the time information corresponding to each text information segment.
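One possible reading of the hashing step is sketched below: each sentence is keyed by a code derived from its start and end times, and a lookup table keeps the code-to-time mapping so the times can be recovered during the matching in S404. The use of MD5 over a formatted time string is purely an assumption; the patent does not prescribe a particular hash function.

```python
import hashlib

def time_code(start_s: float, end_s: float) -> str:
    """Compute a hash code from the start time and end time of a segment."""
    return hashlib.md5(f"{start_s:.3f}-{end_s:.3f}".encode("utf-8")).hexdigest()

# Example: sentences with their time periods, keyed by hash code so that the same
# (start, end) pair computed for a voice segment yields the same code.
sentences = [("Sentence one.", 0.0, 5.0), ("Sentence two.", 6.0, 9.0)]
code_to_times = {}
coded_text_segments = []
for text, start, end in sentences:
    code = time_code(start, end)
    code_to_times[code] = (start, end)   # kept so the times can be recovered in S404
    coded_text_segments.append((code, text))
```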
It should be noted that, when the terminal stores the voice information, the voice information may be stored in a block storage manner, so that in the subsequent step, each storage block may be identified, and the sound-generating user identifier corresponding to each storage block may be determined.
S403, determining at least one sound-producing user identification in the voice information, a sound-producing voice segment corresponding to each sound-producing user identification and time information of the sound-producing voice segment in the voice information.
S403 is similar to S301, and is not described herein again.
However, in the process of determining at least one voice user identifier in the voice information, the terminal may identify the voice information, obtain the user characteristics of at least one user, and generate at least one voice user identifier according to the number of the user characteristics.
Wherein the user characteristics may include at least one of tone, frequency, and voiceprint, and the number of user characteristics is consistent with the number of spoken user identifications.
For example, the identification of each sound-producing user can be obtained by combining the Mel-Frequency Cepstral Coefficients (MFCC) corresponding to each voice in the voice information with a Dynamic Time Warping (DTW) algorithm, and performing recognition with a Euclidean, correlation, or Canberra distance for feature matching.
It should be noted that, after determining each of the vocalizing user identifiers, the storage blocks corresponding to the vocalizing voice segments may be marked according to the time information to which the vocalizing voice segments corresponding to each of the vocalizing user identifiers belong, so that in the subsequent step, the dialog information may be generated according to the vocalizing user identifiers corresponding to each of the storage blocks.
Further, similar to S402, when determining the time information of each sounding speech segment corresponding to each sounding user identifier, a hash algorithm may also be used to calculate the start time and the end time corresponding to each sounding speech segment to obtain a hash code, so as to obtain the time information corresponding to each sounding speech segment.
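A rough sketch of the MFCC-plus-DTW feature matching mentioned above is given below using the third-party librosa package: MFCC features are extracted per sound production voice segment and compared with a DTW alignment cost, and segments whose cost to an existing reference stays under a threshold share one sound production user identification. The threshold value, the Euclidean frame cost and the package choice are illustrative assumptions.

```python
import numpy as np
import librosa  # third-party package; pip install librosa

def segment_mfcc(y: np.ndarray, sr: int, start_s: float, end_s: float) -> np.ndarray:
    """MFCC features of one sound production voice segment of the voice information."""
    clip = y[int(start_s * sr):int(end_s * sr)]
    return librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=13)

def dtw_distance(mfcc_a: np.ndarray, mfcc_b: np.ndarray) -> float:
    """DTW alignment cost between two MFCC sequences (lower means more similar)."""
    cost, _ = librosa.sequence.dtw(X=mfcc_a, Y=mfcc_b, metric="euclidean")
    return float(cost[-1, -1]) / (mfcc_a.shape[1] + mfcc_b.shape[1])

def assign_user_ids(y, sr, segments, threshold=50.0):
    """Group segments by feature similarity and generate one identification per group;
    the number of identifications equals the number of distinct feature groups."""
    ids, references = [], []   # references: one representative MFCC per user
    for start_s, end_s in segments:
        feat = segment_mfcc(y, sr, start_s, end_s)
        match = next((i for i, ref in enumerate(references)
                      if dtw_distance(feat, ref) < threshold), None)
        if match is None:
            references.append(feat)
            match = len(references) - 1
        ids.append(f"user_{match}")
    return ids
```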
In addition, S402 and S403 may be executed simultaneously, or S403 may be executed first and then S402 may be executed, which is not limited in this embodiment.
S404, matching the target voice-producing user identification corresponding to the target time period and the target text information segment corresponding to the target time period according to the time information corresponding to each text information segment in the text information and the time information of the voice-producing voice segment in the voice information to obtain the dialogue information.
After the terminal converts and identifies the voice information, the text information obtained by conversion and the voice user identification obtained by identification can be matched, so that each text information segment in the text information is marked with the corresponding voice user identification, and therefore the dialogue information is formed.
Specifically, the terminal may obtain time information corresponding to the target text information segment, match a target time period indicated by the time information with a time period indicated by time information corresponding to each vocalizing voice segment, and if the target time period corresponding to the target text information segment is consistent with the time period corresponding to the target vocalizing voice segment, may use a vocalizing user identifier to which the target vocalizing voice segment belongs as the target vocalizing user identifier matched with the target text information segment.
The target text information segment is any one of a plurality of text information segments in the text information.
After each text information segment in the text information is matched in the above manner, the sound-producing user identifier corresponding to each text information segment is determined, and then the dialog information can be generated according to the sound-producing user identifier corresponding to each text information segment.
It should be noted that, because the hash code indicating the time information can be obtained by using a hash algorithm in S402 and S403, in the process of matching the sound-producing user identifier and the text information segment, the hash code can be analyzed first to obtain the corresponding start time and end time, so as to perform matching according to each start time and each end time and determine the sound-producing user identifier corresponding to each text information segment.
S405, analyzing the dialogue information through a preset emotion recognition model to obtain the emotion type of the user.
S405 is similar to S303 and will not be described herein.
It should be noted that the preset emotion recognition model may be constructed by using a Long Short-Term Memory network (LSTM), or may be constructed by using other types of neural networks, which is not limited in this embodiment of the present application.
Moreover, in the process of analyzing the dialogue information, the dialogue information can be preprocessed. For example, word2vec (a model for generating word vectors) may be used to process the dialogue information to obtain a plurality of word vectors; the word vectors are then processed through operations such as stop-word removal, vector summation, mean-value taking and label setting to obtain processed vectors, and finally the processed vectors are input into the preset emotion recognition model to obtain the emotion type of the user.
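The preprocessing pipeline just described could look roughly like the sketch below, assuming a pre-trained gensim word2vec model, a stop-word list, an already trained classifier exposing a scikit-learn-style predict method, and simple whitespace tokenization (real Chinese dialogue would additionally need a word segmenter); none of these artifacts come from the patent itself.

```python
import numpy as np
from gensim.models import KeyedVectors  # third-party package; pip install gensim

def sentence_vector(sentence: str, kv: KeyedVectors, stop_words: set) -> np.ndarray:
    """Turn one utterance of the dialogue information into a single vector:
    look up word vectors, drop stop words, sum the vectors and take the mean."""
    tokens = [w for w in sentence.split() if w not in stop_words and w in kv]
    if not tokens:
        return np.zeros(kv.vector_size)
    return np.mean([kv[w] for w in tokens], axis=0)

def classify_dialog(dialog, kv, stop_words, emotion_model):
    """dialog: (user_id, sentence) pairs forming the dialogue information.
    Averages each user's sentence vectors and asks a pre-trained classifier for
    one emotion label (e.g. positive / neutral / negative) per sound-producing user."""
    per_user = {}
    for user_id, sentence in dialog:
        per_user.setdefault(user_id, []).append(sentence_vector(sentence, kv, stop_words))
    return {uid: emotion_model.predict(np.mean(vecs, axis=0)[np.newaxis, :])[0]
            for uid, vecs in per_user.items()}
```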
In summary, the voice information processing method provided by the embodiment of the present application matches each of the uttered user identifiers with the text information segment corresponding to each of the uttered voice segments according to the time information of the uttered voice segments in the voice information by determining at least one of the uttered user identifiers in the voice information, the uttered voice segment corresponding to each of the uttered user identifiers, and the time information of the uttered voice segment in the voice information, so as to obtain the dialog information, and identifies the emotion type of the user according to the dialog information. By generating the dialogue information between the sound-producing users according to the voice information and determining the emotion types of the sound-producing users according to the dialogue information, the situation that the emotion types are determined according to the voice information of a single sound-producing user is avoided, and the accuracy and the flexibility of determining the emotion types of the sound-producing users are improved.
Fig. 5 is a block diagram illustrating a speech information processing apparatus provided in an embodiment of the present application, where the functions implemented by the speech information processing apparatus correspond to the steps executed by the method. The apparatus may be understood as a terminal as shown in fig. 1, and as shown in the figure, the voice information processing apparatus may include:
a first determining module 501, configured to determine at least one voice generating user identifier in the voice information, a voice generating speech segment corresponding to each voice generating user identifier, and time information of the voice generating speech segment in the voice information;
a matching module 502, configured to match each of the uttered user identifiers with a text information segment corresponding to each of the uttered voice segments according to time information of the uttered voice segment in the voice information, so as to obtain dialog information;
and the identifying module 503 is configured to identify an emotion type of the user according to the dialog information.
Optionally, referring to fig. 6, the apparatus may further include:
an obtaining module 504, configured to obtain audio data;
a conversion module 505, configured to convert the audio data into text information.
Optionally, the obtaining module 504 is specifically configured to detect whether the current environment includes voice information sent by at least one voice-emitting user; and if the voice information sent by any sound-producing user is detected, acquiring the audio data in the current environment.
Optionally, the conversion module 505 is specifically configured to perform denoising processing on the audio data to obtain the voice information; and converting the voice information into the text information through a preset voice recognition model.
Optionally, referring to fig. 7, the apparatus may further include:
a second determining module 506, configured to determine time information corresponding to each text information segment included in the text information;
the matching module 502 is specifically configured to match a target utterance user identifier corresponding to a target time period and a target text information segment corresponding to the target time period according to time information corresponding to each text information segment in the text information and time information of the utterance speech segment in the speech information, so as to obtain the dialog information.
Optionally, the first determining module 501 is specifically configured to identify the voice information to obtain a user characteristic of at least one voice-generating user, where the user characteristic includes at least one characteristic of tone, frequency, and voiceprint; and generating at least one sound-producing user identifier according to the number of the user features, wherein the number of the user features is consistent with the number of the sound-producing user identifiers.
Optionally, the voice information is stored in a block storage manner.
Optionally, the recognition module 503 is specifically configured to analyze the dialog information through a preset emotion recognition model to obtain an emotion type of the user.
To sum up, the speech information processing apparatus provided in the embodiment of the present application matches each utterance user identifier with a text information segment corresponding to each utterance speech segment according to time information of the utterance speech segment in the speech information by determining at least one utterance user identifier in the speech information, the utterance speech segment corresponding to each utterance user identifier, and time information of the utterance speech segment in the speech information, so as to obtain dialog information, and identifies an emotion type of a user according to the dialog information. By generating the dialogue information between the sound-producing users according to the voice information and determining the emotion types of the sound-producing users according to the dialogue information, the situation that the emotion types are determined according to the voice information of a single sound-producing user is avoided, and the accuracy and the flexibility of determining the emotion types of the sound-producing users are improved.
The modules may be connected or in communication with each other via a wired or wireless connection. The wired connection may include a metal cable, an optical cable, a hybrid cable, etc., or any combination thereof. The wireless connection may comprise a connection over a LAN, WAN, Bluetooth, ZigBee, NFC, or the like, or any combination thereof. Two or more modules may be combined into a single module, and any one module may be divided into two or more units.
As shown in fig. 8, a schematic structural diagram of an electronic device provided in an embodiment of the present application includes: a processor 801, a memory 802, and a bus 803.
The storage medium stores machine-readable instructions executable by the processor. When the electronic device is operated, the processor and the storage medium communicate via the bus, and the machine-readable instructions, when executed by the processor 801, perform the following processing:
determining at least one sound production user identification in voice information, sound production voice sections corresponding to the sound production user identifications and time information of the sound production voice sections in the voice information;
matching each voice production user identification with a text information segment corresponding to each voice production segment according to the time information of the voice production segments in the voice information to obtain dialogue information;
and identifying the emotion type of the user according to the dialogue information.
In a specific implementation, the processing executed by the processor 801 includes, before determining at least one utterance user id, an utterance speech segment corresponding to each utterance user id, and time information of the utterance speech segment in the speech information:
acquiring audio data;
before the matching of each of the uttered user identifications and the text information segment corresponding to each of the uttered voice segments according to the time information of the uttered voice segment in the voice information, the method includes:
the audio data is converted into text information.
In a specific implementation, in the processing performed by the processor 801, the acquiring the audio data includes:
detecting whether the current environment comprises voice information sent by at least one sound-emitting user;
and if the voice information sent by any sound-producing user is detected, acquiring the audio data in the current environment.
In a specific implementation, in the processing performed by the processor 801, the converting the audio data into text information includes:
denoising the audio data to obtain the voice information;
and converting the voice information into the text information through a preset voice recognition model.
In a specific implementation, in the processing executed by the processor 801, after the converting the speech information into the text information by the preset speech recognition model, the method further includes:
determining time information corresponding to each text information segment included in the text information;
the matching of each voice-producing user identifier with the text information segment corresponding to each voice-producing speech segment according to the time information of the voice-producing speech segment in the speech information includes:
and matching a target sound production user identifier corresponding to a target time period and a target text information segment corresponding to the target time period according to the time information corresponding to each text information segment in the text information and the time information of the sound production voice segment in the voice information to obtain the dialogue information.
In a specific implementation, in the processing performed by the processor 801, the determining at least one utterance user identifier in the speech information includes:
recognizing the voice information to obtain user characteristics of at least one sound-producing user, wherein the user characteristics comprise at least one characteristic of tone, frequency and voiceprint;
and generating at least one sound-producing user identifier according to the number of the user features, wherein the number of the user features is consistent with that of the sound-producing user identifiers.
In a specific implementation, the processor 801 executes the processing to store the voice information in a block storage manner.
In a specific implementation, in the processing performed by the processor 801, the identifying the emotion type of the user according to the dialog information includes:
and analyzing the dialogue information through a preset emotion recognition model to obtain the emotion type of the user.
A computer-readable storage medium has a computer program stored thereon, and the computer program, when executed by a processor, performs the steps of the voice information processing method according to any one of the embodiments.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working process of the system and the apparatus described above may refer to the corresponding process in the method embodiment, and is not described in detail in this application. In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and there may be other divisions in actual implementation, and for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or modules through some communication interfaces, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (18)

1. A method for processing speech information, comprising:
determining at least one sound-producing user identifier in voice information, a sound-producing speech segment corresponding to each sound-producing user identifier, and time information of each sound-producing speech segment in the voice information;
matching each sound-producing user identifier with a text information segment corresponding to each sound-producing speech segment according to the time information of the sound-producing speech segments in the voice information to obtain dialogue information, wherein the time information corresponding to a sound-producing speech segment is obtained by performing hash calculation on the start time and the end time of the sound-producing speech segment; and
recognizing the emotion type of the user according to the dialogue information; wherein the matching of each sound-producing user identifier with the text information segment corresponding to each sound-producing speech segment according to the time information of the sound-producing speech segments in the voice information comprises:
matching a target sound-producing user identifier corresponding to a target time period with a target text information segment corresponding to the target time period according to the time information corresponding to each text information segment and the time information of the sound-producing speech segments in the voice information to obtain the dialogue information, wherein the time information corresponding to a text information segment is obtained by performing hash calculation on the start time and the end time of the text information segment.
2. The method according to claim 1, wherein before the determining of the at least one sound-producing user identifier, the sound-producing speech segment corresponding to each sound-producing user identifier, and the time information of each sound-producing speech segment in the voice information, the method comprises:
acquiring audio data;
and before the matching of each sound-producing user identifier with the text information segment corresponding to each sound-producing speech segment according to the time information of the sound-producing speech segments in the voice information, the method comprises:
converting the audio data into text information.
3. The method of claim 2, wherein the obtaining audio data comprises:
detecting whether a current environment includes voice information uttered by at least one sound-producing user;
and if the voice information uttered by any sound-producing user is detected, acquiring the audio data in the current environment.
4. The method of claim 2, wherein converting the audio data into text information comprises:
denoising the audio data to obtain the voice information;
and converting the voice information into the text information through a preset voice recognition model.
5. The method according to claim 4, wherein after converting the speech information into the text information through a preset speech recognition model, the method further comprises:
and determining time information corresponding to each text information segment included in the text information.
6. The method according to claim 1, wherein the determining of the at least one sound-producing user identifier in the voice information comprises:
recognizing the voice information to obtain user features of at least one sound-producing user, wherein the user features comprise at least one of tone, frequency and voiceprint;
and generating at least one sound-producing user identifier according to the number of the user features, wherein the number of sound-producing user identifiers is consistent with the number of user features.
7. The method according to any one of claims 1 to 6, wherein the voice information is stored in a block storage manner.
8. The method according to any one of claims 1 to 6, wherein the identifying the emotion type of the user according to the dialogue information comprises:
and analyzing the dialogue information through a preset emotion recognition model to obtain the emotion type of the user.
9. A speech information processing apparatus characterized by comprising:
a first determining module, configured to determine at least one sound-producing user identifier in voice information, a sound-producing speech segment corresponding to each sound-producing user identifier, and time information of each sound-producing speech segment in the voice information;
a matching module, configured to match each sound-producing user identifier with a text information segment corresponding to each sound-producing speech segment according to the time information of the sound-producing speech segments in the voice information to obtain dialogue information, wherein the time information corresponding to a sound-producing speech segment is obtained by performing hash calculation on the start time and the end time of the sound-producing speech segment;
and a recognition module, configured to recognize the emotion type of the user according to the dialogue information;
wherein the matching module is specifically configured to match a target sound-producing user identifier corresponding to a target time period with a target text information segment corresponding to the target time period according to the time information corresponding to each text information segment in the text information and the time information of the sound-producing speech segments in the voice information, so as to obtain the dialogue information, wherein the time information is generated according to hash codes respectively corresponding to sentences in the text information segment, and the hash codes are obtained by performing hash calculation on the start time and the end time respectively corresponding to the sentences in the text information segment.
10. The apparatus of claim 9, further comprising:
the acquisition module is used for acquiring audio data;
and the conversion module is used for converting the audio data into text information.
11. The apparatus according to claim 10, wherein the acquisition module is specifically configured to detect whether a current environment includes voice information uttered by at least one sound-producing user; and if the voice information uttered by any sound-producing user is detected, acquire the audio data in the current environment.
12. The apparatus according to claim 10, wherein the conversion module is specifically configured to perform denoising processing on the audio data to obtain the speech information; and converting the voice information into the text information through a preset voice recognition model.
13. The apparatus of claim 12, further comprising:
and the second determining module is used for determining the time information corresponding to each text information segment included in the text information.
14. The apparatus according to claim 9, wherein the first determining module is specifically configured to recognize the voice information to obtain user features of at least one sound-producing user, wherein the user features comprise at least one of tone, frequency and voiceprint; and generate at least one sound-producing user identifier according to the number of the user features, wherein the number of sound-producing user identifiers is consistent with the number of user features.
15. The apparatus according to any one of claims 9 to 14, wherein the voice information is stored in a block storage manner.
16. The apparatus according to any one of claims 9 to 14, wherein the recognition module is specifically configured to analyze the dialogue information through a preset emotion recognition model to obtain the emotion type of the user.
17. An electronic device, comprising: a processor, a storage medium and a bus, wherein the storage medium stores machine-readable instructions executable by the processor, when the electronic device runs, the processor and the storage medium communicate with each other through the bus, and the processor executes the machine-readable instructions to execute the steps of the voice information processing method according to any one of claims 1 to 8.
18. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, performs the steps of the voice information processing method according to any one of claims 1 to 8.
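For illustration of the time-information matching recited in claim 1, the sketch below hashes the start and end time of each segment and pairs a sound-producing user identifier with the text information segment whose time hash coincides; the choice of MD5, the data layout, and all names are assumptions for the example only.

```python
import hashlib

def time_key(start_ms: int, end_ms: int) -> str:
    """Hash of a segment's start/end time, used as its time information."""
    return hashlib.md5(f"{start_ms}-{end_ms}".encode("utf-8")).hexdigest()

def build_dialogue(voiced_segments, text_segments):
    """voiced_segments: list of dicts {"speaker": id, "start": ms, "end": ms}
    text_segments:   list of dicts {"text": str, "start": ms, "end": ms}
    Segments whose time hashes coincide are treated as the same target time
    period, pairing a speaker identifier with a text information segment."""
    text_by_key = {time_key(t["start"], t["end"]): t["text"] for t in text_segments}
    dialogue = []
    for seg in voiced_segments:
        key = time_key(seg["start"], seg["end"])
        if key in text_by_key:
            dialogue.append((seg["speaker"], text_by_key[key]))
    return dialogue
```

In a full pipeline, the segment times fed to this sketch would come from the speaker segmentation and speech-to-text conversion steps, and the resulting speaker-tagged dialogue would be passed to the emotion recognition step.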
CN201811544377.8A 2018-12-17 2018-12-17 Voice information processing method and device, electronic equipment and readable storage medium Active CN111326173B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811544377.8A CN111326173B (en) 2018-12-17 2018-12-17 Voice information processing method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811544377.8A CN111326173B (en) 2018-12-17 2018-12-17 Voice information processing method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111326173A CN111326173A (en) 2020-06-23
CN111326173B (en) 2023-03-24

Family

ID=71166134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811544377.8A Active CN111326173B (en) 2018-12-17 2018-12-17 Voice information processing method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111326173B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115096325A (en) * 2022-05-26 2022-09-23 北京百度网讯科技有限公司 Route navigation method, device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001117581A (en) * 1999-10-22 2001-04-27 Alpine Electronics Inc Feeling recognition device
CN107464573A (en) * 2017-09-06 2017-12-12 竹间智能科技(上海)有限公司 A kind of new customer service call quality inspection system and method
CN108922564B (en) * 2018-06-29 2021-05-07 北京百度网讯科技有限公司 Emotion recognition method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN111326173A (en) 2020-06-23

Similar Documents

Publication Publication Date Title
JP6303971B2 (en) Speaker change detection device, speaker change detection method, and computer program for speaker change detection
JP6284462B2 (en) Speech recognition method and speech recognition apparatus
JP2009169139A (en) Voice recognizer
JP2018072650A (en) Voice interactive device and voice interactive method
CN111415654B (en) Audio recognition method and device and acoustic model training method and device
CN110570853A (en) Intention recognition method and device based on voice data
CN110473533B (en) Voice conversation system, voice conversation method, and program
CN112750441B (en) Voiceprint recognition method and device, electronic equipment and storage medium
JP6797338B2 (en) Information processing equipment, information processing methods and programs
CN111326173B (en) Voice information processing method and device, electronic equipment and readable storage medium
CN110737422B (en) Sound signal acquisition method and device
JP6468258B2 (en) Voice dialogue apparatus and voice dialogue method
JP2016177045A (en) Voice recognition device and voice recognition program
CN112420021A (en) Learning method, speaker recognition method, and recording medium
KR20210130024A (en) Dialogue system and method of controlling the same
JP4440502B2 (en) Speaker authentication system and method
CN111429882A (en) Method and device for playing voice and electronic equipment
JP2010060846A (en) Synthesized speech evaluation system and synthesized speech evaluation method
JP2018132623A (en) Voice interaction apparatus
JP5315976B2 (en) Speech recognition apparatus, speech recognition method, and program
JP2020008730A (en) Emotion estimation system and program
JP2014130211A (en) Speech output device, speech output method, and program
JP2005091758A (en) System and method for speaker recognition
WO2019030810A1 (en) Speech recognition device and speech recognition method
KR100506662B1 (en) The Speech Database Construction Method Based on Online Speech Verification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant