CN117688344A - Multi-mode fine granularity trend analysis method and system based on large model - Google Patents

Multi-mode fine granularity trend analysis method and system based on large model

Info

Publication number
CN117688344A
Authority
CN
China
Prior art keywords
data
tendency
emotion
sub
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410159489.0A
Other languages
Chinese (zh)
Other versions
CN117688344B (en)
Inventor
邓雅月
许墨寒
唐遥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN202410159489.0A priority Critical patent/CN117688344B/en
Publication of CN117688344A publication Critical patent/CN117688344A/en
Application granted granted Critical
Publication of CN117688344B publication Critical patent/CN117688344B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of voice and text analysis, and provides a multi-modal fine-grained tendency analysis method and system based on a large model. The method comprises the following steps: acquiring voice data, dividing the voice data into a plurality of voice sub-data, and encoding each voice sub-data into a voice sub-vector; acquiring text sub-data corresponding to each voice sub-data, and encoding each text sub-data into a text sub-vector based on a preset encoder; inputting the voice sub-vector into a pre-trained first model, the first model outputting a first emotion tendency vector; inputting the text sub-vector into a pre-trained second model, the second model outputting a second emotion tendency vector; the first emotion tendency vector and the second emotion tendency vector each comprising a dovish emotion value and a hawkish emotion value, and determining the emotional tendency of the mutually corresponding voice sub-data and text sub-data based on the first emotion tendency vector and the second emotion tendency vector.

Description

Multi-mode fine granularity trend analysis method and system based on large model
Technical Field
The invention relates to the technical field of voice text analysis, in particular to a multi-mode fine granularity trend analysis method and system based on a large model.
Background
Emotion is fundamental to human experience and affects many daily tasks, such as cognition and perception. In artificial intelligence research, the ability to recognize and analyze emotion is one of the indispensable capabilities.
The emotion in a person's expression refers to the emotional state conveyed while speaking. Appropriate emotional expression can make a speech more vivid and engaging, and the expressed emotion strongly influences how both the speaker and the audience feel. A speaker's emotion usually corresponds to the speaker's actual opinion tendency, and that opinion tendency can in turn carry social influence.
In prior-art speech-text analysis schemes, however, usually only emotion types are classified, not the speaker's actual opinion tendency.
Disclosure of Invention
In view of the foregoing, embodiments of the present invention provide a large model-based multi-modal fine-grained trend analysis method that obviates or mitigates one or more of the disadvantages of the prior art.
One aspect of the present invention provides a multi-modal fine-grained trend analysis method based on a large model, the steps of the method comprising:
acquiring voice data, dividing the voice data into a plurality of voice sub-data, and encoding each voice sub-data into voice sub-vectors based on a preset encoder;
acquiring text sub-data corresponding to each voice sub-data, and encoding each text sub-data into a text sub-vector based on a preset encoder;
inputting the speech sub-vector into a pre-trained first model, the first model outputting a first emotional tendency vector;
inputting the text sub-vector into a pre-trained second model, the second model outputting a second emotional tendency vector;
the first emotion tendency vector and the second emotion tendency vector comprise a dovish emotion value and a hawkish emotion value, and the emotional tendency of the mutually corresponding voice sub-data and text sub-data is determined based on the first emotion tendency vector and the second emotion tendency vector.
By adopting this scheme, the limitations of existing methods in capturing subtle and implicit information are addressed. Unlike traditional language emotion analysis methods, the scheme analyzes not only text materials but also acoustic materials for multi-modal analysis using a strong pre-trained language model, and performs fine-grained analysis at the sentence level to capture more information. Subtle emotional tendencies in spoken communication can therefore be captured and quantified more accurately, the actual opinion tendency of each detected sentence can be judged more accurately, and more valuable insight is provided for market participants.
In some embodiments of the present invention, in the step of encoding each of the voice sub-data into a voice sub-vector based on a preset encoder, the voice sub-data is mel-converted to obtain a mel spectrum, the mel spectrum is constructed as a mel image, and the mel image is encoded into the voice sub-vector by the encoder.
In some embodiments of the present invention, in the step of encoding each of the text sub-data into a text sub-vector based on a preset encoder, each of the text sub-data is encoded into a text sub-vector using a preset text encoder.
In some embodiments of the present invention, in the step of determining the emotional tendency of the mutually corresponding voice sub-data and text sub-data based on the first emotion tendency vector and the second emotion tendency vector, emotion tendency values are calculated based on the dovish emotion value and the hawkish emotion value in the first emotion tendency vector and the second emotion tendency vector respectively, and the emotional tendency is determined based on the emotion tendency values obtained from the first emotion tendency vector and the second emotion tendency vector.
In some embodiments of the present invention, in the step of calculating the emotion tendency values based on the dovish emotion value and the hawkish emotion value in the first emotion tendency vector and the second emotion tendency vector respectively, the emotion tendency values are calculated using the following formula:

S = \frac{d}{d + h}

wherein S represents the emotion tendency value, d represents the dovish emotion value, and h represents the hawkish emotion value.
In some embodiments of the invention, the steps of the method further comprise:
acquiring a first emotion tendency vector and a second emotion tendency vector corresponding to each voice sub-data in the voice data, acquiring emotion tendency values corresponding to the first emotion tendency vector and the second emotion tendency vector, and calculating a combined tendency value based on the two emotion tendency values;
acquiring second emotion tendency vectors corresponding to a preceding preset number of voice sub-data and a following preset number of voice sub-data of each voice sub-data, calculating a pre-tendency value based on the second emotion tendency vectors corresponding to the preceding preset number of voice sub-data, and calculating a post-tendency value based on the second emotion tendency vectors corresponding to the following preset number of voice sub-data;
mapping the combined tendency value, the pre-tendency value and the post-tendency value into pixel change values respectively based on a preset mapping comparison table;
acquiring a preset template image, wherein an image area is arranged in the template image for each voice sub-data; modifying the pixel values of the image area corresponding to each voice sub-data based on the pixel change values corresponding to its combined tendency value, pre-tendency value and post-tendency value, thereby modifying the template image into a judgment image; and determining the emotional tendency corresponding to the voice data based on the judgment image.
In some embodiments of the present invention, in the step of calculating the pre-tendency value based on the second emotion tendency vectors corresponding to the preceding preset number of voice sub-data and the step of calculating the post-tendency value based on the second emotion tendency vectors corresponding to the following preset number of voice sub-data:
a weighted average is calculated based on the emotion tendency values of the second emotion tendency vectors corresponding to the preceding preset number of voice sub-data to obtain the pre-tendency value;
and a weighted average is calculated based on the emotion tendency values of the second emotion tendency vectors corresponding to the following preset number of voice sub-data to obtain the post-tendency value.
In some embodiments of the present invention, in the steps of obtaining the pre-tendency value and the post-tendency value by calculating the weighted averages described above, the weight parameter of each voice sub-data is determined based on its distance from the currently determined voice sub-data.
In some embodiments of the present invention, in the step of determining the emotional tendency corresponding to the voice data based on the judgment image, the judgment image is input into a pre-trained third model to obtain the emotional tendency corresponding to the voice data.
A second aspect of the present invention further provides a multi-modal fine-grained tendency analysis system based on a large model. The system comprises a computer device including a processor and a memory, the memory storing computer instructions and the processor being configured to execute the computer instructions stored in the memory; when the computer instructions are executed by the processor, the system implements the steps of the method described above.
A third aspect of the present invention further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the foregoing large-model-based multi-modal fine-grained tendency analysis method.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to the above-described specific ones, and that the above and other objects that can be achieved with the present invention will be more clearly understood from the following detailed description.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate and together with the description serve to explain the invention.
FIG. 1 is a schematic diagram of one embodiment of a multi-modal fine-grained trend analysis method based on a large model according to the invention;
FIG. 2 is a schematic diagram of an embodiment of a multi-modal fine-grained trend analysis method based on a large model according to the invention;
FIG. 3 is a schematic diagram of another embodiment of the multi-modal fine-grained trend analysis method based on a large model according to the invention.
Detailed Description
The present invention will be described in further detail with reference to the following embodiments and the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent. The exemplary embodiments of the present invention and the descriptions thereof are used herein to explain the present invention, but are not intended to limit the invention.
It should be noted here that, in order to avoid obscuring the present invention due to unnecessary details, only structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, while other details not greatly related to the present invention are omitted.
In a specific implementation process, the scheme is used to comprehensively analyze the influence of public speeches on the market, with particular focus on the non-textual emotional information conveyed by the voice data of a speech, which is ignored by traditional rule-based or dictionary-based methods.
In practice, some public utterances have a significant impact on the market. Traditional analysis methods, such as dictionary-based emotion analysis, provide some assistance but have limitations in capturing subtle, implicit information. Furthermore, existing studies tend to focus only on reporting classification accuracy. The present approach therefore provides a finer and more comprehensive analysis framework that makes up for the deficiencies of existing methods and offers a deeper analysis of market response.
The method specifically comprises the following steps:
as shown in fig. 1 and 2, the present invention proposes a multi-modal fine-grained tendency analysis method based on a large model, the steps of the method comprising:
step S100, voice data is obtained, the voice data is divided into a plurality of voice sub-data, and each voice sub-data is encoded into a voice sub-vector based on a preset encoder;
In the implementation process, the voice data may be the audio of a spoken speech, and each voice sub-data is the audio obtained by dividing the voice data into sentence units.
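One simple way to obtain sentence-sized voice sub-data, offered here only as an illustrative sketch and not as the patent's prescribed segmentation method, is to split the recording at pauses; the pydub library and the threshold values below are assumptions for demonstration. A production system could instead cut on true sentence boundaries using ASR word timestamps.

```python
# Illustrative sketch: approximate sentence-level segmentation of a speech
# recording by splitting on silences. Thresholds are assumptions, not values
# taken from the patent.
from pydub import AudioSegment
from pydub.silence import split_on_silence

def split_speech_into_sentences(path: str):
    """Split an audio file into rough sentence-sized chunks (voice sub-data)."""
    audio = AudioSegment.from_file(path)
    chunks = split_on_silence(
        audio,
        min_silence_len=500,             # a pause of >= 0.5 s is treated as a boundary
        silence_thresh=audio.dBFS - 16,  # silence threshold relative to average loudness
        keep_silence=200,                # keep a short margin around each chunk
    )
    return chunks                        # each chunk is one voice sub-datum
```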
Step S200, text sub-data corresponding to each voice sub-data is obtained, and each text sub-data is encoded into a text sub-vector based on a preset encoder;
in some embodiments of the present invention, the text sub-data is text data corresponding to voice sub-data.
Step S300, inputting the voice sub-vector into a pre-trained first model, and outputting a first emotion tendency vector by the first model;
step S400, inputting the text sub-vector into a pre-trained second model, and outputting a second emotion tendency vector by the second model;
in some embodiments of the invention, the first and second models may be a wav2vec2.0 model comprising multiple convolution layers.
In a specific implementation process, the first emotion tendency vector and the second emotion tendency vector each include a dovish emotion value, a hawkish emotion value and a neutral emotion value.
Step S500, the first emotion tendency vector and the second emotion tendency vector each include a dovish emotion value and a hawkish emotion value, and the emotional tendency of the mutually corresponding voice sub-data and text sub-data is determined based on the first emotion tendency vector and the second emotion tendency vector.
In some embodiments of the present invention, the emotional tendency of the mutually corresponding voice sub-data and text sub-data may be determined by separately averaging the dovish emotion values and the hawkish emotion values of the first emotion tendency vector and the second emotion tendency vector, and comparing the magnitudes of the two averages.
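A small sketch of this comparison, under the assumption that each tendency vector is ordered (dovish, hawkish, neutral); the function name and layout are illustrative only.

```python
# Hedged sketch: average the dovish and hawkish scores across the speech and
# text tendency vectors, then compare the two averages.
def sentence_tendency(first_vec, second_vec):
    dove = (first_vec[0] + second_vec[0]) / 2.0  # mean dovish score over both modalities
    hawk = (first_vec[1] + second_vec[1]) / 2.0  # mean hawkish score over both modalities
    if dove > hawk:
        return "dovish"
    if hawk > dove:
        return "hawkish"
    return "neutral"
```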
By adopting this scheme, the limitations of existing methods in capturing subtle and implicit information are addressed. Unlike traditional language emotion analysis methods, the scheme analyzes not only text materials but also acoustic materials for multi-modal analysis using a strong pre-trained language model, and performs fine-grained analysis at the sentence level to capture more information. Subtle emotional tendencies in spoken communication can therefore be captured and quantified more accurately, the actual opinion tendency of each detected sentence can be judged more accurately, and more valuable insight is provided for market participants.
In some embodiments of the present invention, in the step of encoding each of the voice sub-data into a voice sub-vector based on a preset encoder, the voice sub-data is mel-converted to obtain a mel spectrum, the mel spectrum is constructed as a mel image, and the mel image is encoded into the voice sub-vector by the encoder.
In a specific implementation process, mel conversion maps the frequency content of an audio signal onto the mel scale, a scale that describes how the human ear perceives sound frequency and therefore matches human auditory characteristics. On the mel scale, frequencies are spaced according to how well the human ear distinguishes them.
In a specific implementation process, in the step of constructing the mel spectrum into a mel image, the mel spectrum is constructed into a spectrogram, that is, the mel image.
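A minimal sketch of this step, assuming librosa for the mel transform and Pillow for the image; the parameter values (n_mels, hop_length) are illustrative assumptions rather than values specified by the patent.

```python
# Compute a mel spectrogram, convert it to decibels, and save it as a
# grayscale "mel image" that an image encoder can consume.
import numpy as np
import librosa
from PIL import Image

def mel_image(waveform: np.ndarray, sr: int = 16000) -> Image.Image:
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=128, hop_length=256)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # log-compressed mel spectrum
    scaled = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)  # normalize to [0, 1]
    return Image.fromarray((scaled * 255).astype(np.uint8))  # 2-D grayscale mel image
```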
In some embodiments of the present invention, in the step of encoding each of the text sub-data into a text sub-vector based on a preset encoder, each of the text sub-data is encoded into a text sub-vector using a preset text encoder.
In a specific implementation, each of the text sub-data is encoded into a text sub-vector using word embedding, a technique that represents each word in a vocabulary as a real-valued vector learned by training a neural network to capture word senses and contextual relationships.
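One possible realization of the preset text encoder, sketched here with mean-pooled contextual embeddings from a pre-trained language model; the checkpoint "bert-base-chinese" and the pooling strategy are assumptions, not details given in the patent.

```python
# Encode one text sub-datum into a fixed-size text sub-vector by mean-pooling
# the last hidden states of a pre-trained language model.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
text_encoder = AutoModel.from_pretrained("bert-base-chinese")

def text_sub_vector(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = text_encoder(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
    return hidden.mean(dim=1).squeeze(0)                   # sentence-level text sub-vector
```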
In some embodiments of the present invention, in the step of determining the emotional tendency of the mutually corresponding voice sub-data and text sub-data based on the first emotion tendency vector and the second emotion tendency vector, emotion tendency values are calculated based on the dovish emotion value and the hawkish emotion value in the first emotion tendency vector and the second emotion tendency vector respectively, and the emotional tendency is determined based on the emotion tendency values obtained from the first emotion tendency vector and the second emotion tendency vector.
In some embodiments of the invention, the dovish emotion value and the hawkish emotion value are the values of two dimensions of the first emotion tendency vector and the second emotion tendency vector.
In some embodiments of the present invention, in the step of calculating the emotion tendency values based on the dovish emotion value and the hawkish emotion value in the first emotion tendency vector and the second emotion tendency vector respectively, the emotion tendency values are calculated using the following formula:

S = \frac{d}{d + h}

wherein S represents the emotion tendency value, d represents the dovish emotion value, and h represents the hawkish emotion value.
By adopting the above scheme, an emotion tendency value closer to 1 indicates a more dovish tendency, while a value closer to 0 indicates a more hawkish tendency. With the above formula, the actual opinion tendency of each sentence in the voice data can be distinguished. In practical applications, for public speeches in certain scenarios, the actions corresponding to a speaker's dovish or hawkish tendency can assist staff in making judgments; specifically, regression analysis can further be used to analyze the influence of the emotion tendency values on the market.
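A one-line helper implementing the ratio form given above; the handling of the degenerate case where both emotion values are zero is an additional assumption.

```python
def emotional_tendency_value(dove: float, hawk: float) -> float:
    """Return S = d / (d + h); values near 1 indicate a dovish tendency, near 0 a hawkish one."""
    denom = dove + hawk
    return 0.5 if denom == 0.0 else dove / denom
```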
As shown in fig. 3, in some embodiments of the invention, the steps of the method further include:
step S600, a first emotion tendency vector and a second emotion tendency vector corresponding to each voice sub-data in the voice data are obtained, emotion tendency values corresponding to the first emotion tendency vector and the second emotion tendency vector are obtained, and a combined tendency value is calculated based on the two emotion tendency values;
in a specific implementation process, the combined tendency value may adopt a mode of calculating an average value of the emotion tendency values corresponding to the first emotion tendency vector and the second emotion tendency vector.
Step S700, second emotion tendency vectors corresponding to the preceding preset number and the following preset number of voice sub-data of each voice sub-data are obtained, a pre-tendency value is calculated based on the second emotion tendency vectors corresponding to the preceding preset number of voice sub-data, and a post-tendency value is calculated based on the second emotion tendency vectors corresponding to the following preset number of voice sub-data;
in the specific implementation process, the pre-preset number of voice sub-data is the preset number of voice sub-data before the time point corresponding to the voice sub-data currently determined in the voice data, and the post-preset number of voice sub-data is the preset number of voice sub-data after the time point corresponding to the voice sub-data currently determined in the voice data.
In the specific implementation process, in the step of calculating the pre-tendency value based on the second emotion tendency vectors corresponding to the preceding preset number of voice sub-data, the pre-tendency value is calculated based on the emotion tendency values corresponding to those second emotion tendency vectors; in the step of calculating the post-tendency value based on the second emotion tendency vectors corresponding to the following preset number of voice sub-data, the post-tendency value is calculated based on the emotion tendency values corresponding to those second emotion tendency vectors.
Step S800, mapping the combined tendency value, the pre-tendency value and the post-tendency value into pixel change values respectively based on a preset mapping comparison table;
in a specific implementation process, the mapping comparison table stores mapping relations of the merging tendency value, the pre-tendency value, the post-tendency value and the pixel change value.
Step S900, obtaining a preset template image, wherein an image area is arranged in the template image for each voice sub-data; modifying the pixel values of the image area corresponding to each voice sub-data based on the pixel change values corresponding to its combined tendency value, pre-tendency value and post-tendency value, thereby modifying the template image into a judgment image; and determining the emotional tendency corresponding to the voice data based on the judgment image.
In the implementation process, the template image is a preset image in which every pixel initially has the same pixel value.
In a specific implementation process, an image area is set in the template image for each voice sub-data; each image area comprises at least three sub-areas, which correspond respectively to the pre-tendency value, the combined tendency value and the post-tendency value, and each sub-area comprises at least one pixel.
In the implementation process, the positions of the image areas corresponding to the voice sub-data are arranged in the template image in order of the time positions of the voice sub-data in the voice data, so that these time positions are reflected in the template image and the data in the final judgment image are comprehensive.
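A hedged sketch of the judgment-image construction described above: one image area per voice sub-datum, laid out in time order, each holding three sub-areas for the pre-tendency, combined tendency and post-tendency values. The region sizes, the base pixel value and the linear value-to-pixel mapping stand in for the patent's preset mapping comparison table, whose concrete contents are not published.

```python
import numpy as np

def build_judgment_image(pre_vals, merged_vals, post_vals, base=128, cell=4):
    """Turn per-sentence tendency values into a grayscale judgment image."""
    n = len(merged_vals)
    img = np.full((cell, n * 3 * cell), base, dtype=np.uint8)  # uniform template image
    to_pixel = lambda v: np.uint8(np.clip(base + (v - 0.5) * 200, 0, 255))  # assumed mapping
    for i, (p, m, q) in enumerate(zip(pre_vals, merged_vals, post_vals)):
        x = i * 3 * cell                                   # areas ordered by time position
        img[:, x:x + cell] = to_pixel(p)                   # sub-area 1: pre-tendency value
        img[:, x + cell:x + 2 * cell] = to_pixel(m)        # sub-area 2: combined tendency value
        img[:, x + 2 * cell:x + 3 * cell] = to_pixel(q)    # sub-area 3: post-tendency value
    return img
```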
By adopting the above scheme, the step of judging the emotional tendency of the whole voice data takes into account the fact that a speaker's emotion may lead or lag the corresponding text: when the image area of each voice sub-data is processed, the preceding and following preset numbers of voice sub-data are also considered, so any lead or lag of the voice relative to the text is reflected and the data in the final judgment image are more comprehensive and accurate. Further, because the template image contains an image area for every voice sub-data, the whole voice data can be represented in the judgment image and finally judged as a whole.
In some embodiments of the present invention, in the step of calculating the pre-tendency value based on the second emotion tendency vectors corresponding to the preceding preset number of voice sub-data and the step of calculating the post-tendency value based on the second emotion tendency vectors corresponding to the following preset number of voice sub-data:
a weighted average is calculated based on the emotion tendency values of the second emotion tendency vectors corresponding to the preceding preset number of voice sub-data to obtain the pre-tendency value;
and a weighted average is calculated based on the emotion tendency values of the second emotion tendency vectors corresponding to the following preset number of voice sub-data to obtain the post-tendency value.
In a specific implementation process, in the step of calculating the weighted average based on the emotion tendency values of the second emotion tendency vectors corresponding to the preceding preset number of voice sub-data, the weighted average is calculated using the following formula:

\bar{E}_{\mathrm{pre}} = \frac{w_{-1}E_{-1} + w_{-2}E_{-2} + \cdots + w_{-n}E_{-n}}{w_{-1} + w_{-2} + \cdots + w_{-n}}

wherein \bar{E}_{\mathrm{pre}} represents the weighted average of the emotion tendency values of the second emotion tendency vectors corresponding to the preceding preset number of voice sub-data; E_{-1} represents the emotion tendency value corresponding to the second emotion tendency vector of the voice sub-data at the time position immediately before the currently determined voice sub-data; E_{-2} represents that of the voice sub-data two time positions before; E_{-n} represents that of the voice sub-data n time positions before; and w_{-1}, w_{-2} and w_{-n} represent the weight parameters corresponding to those voice sub-data respectively.
In a specific implementation process, in the step of calculating the weighted average based on the emotion tendency values of the second emotion tendency vectors corresponding to the following preset number of voice sub-data, the weighted average is calculated using the following formula:

\bar{E}_{\mathrm{post}} = \frac{w_{+1}E_{+1} + w_{+2}E_{+2} + \cdots + w_{+m}E_{+m}}{w_{+1} + w_{+2} + \cdots + w_{+m}}

wherein \bar{E}_{\mathrm{post}} represents the weighted average of the emotion tendency values of the second emotion tendency vectors corresponding to the following preset number of voice sub-data; E_{+1} represents the emotion tendency value corresponding to the second emotion tendency vector of the voice sub-data at the time position immediately after the currently determined voice sub-data; E_{+2} represents that of the voice sub-data two time positions after; E_{+m} represents that of the voice sub-data m time positions after; and w_{+1}, w_{+2} and w_{+m} represent the weight parameters corresponding to those voice sub-data respectively.
In the implementation process, the closer a voice sub-data is to the currently determined voice sub-data, the larger its corresponding weight parameter, that is, w_{-1} > w_{-2} > \cdots > w_{-n} and w_{+1} > w_{+2} > \cdots > w_{+m}.
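As an illustration of such distance-weighted averaging, the sketch below uses 1/(distance) weights, which satisfy the ordering above; the exact weight schedule is an assumption, since the text only requires that nearer sentences receive larger weights.

```python
def context_tendency(values):
    """Distance-weighted average; values[0] is the nearest neighbouring sentence."""
    weights = [1.0 / (i + 1) for i in range(len(values))]  # closer sentence -> larger weight
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

pre_val = context_tendency([0.62, 0.55, 0.40])   # e.g. the three preceding sentences
post_val = context_tendency([0.70, 0.48])        # e.g. the two following sentences
```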
in some embodiments of the present invention, a weighted average is calculated based on emotion tendency values of a second emotion tendency vector corresponding to a pre-preset number of speech sub-data, to obtain the pre-tendency value; and calculating a weighted average value based on emotion tendency values of second emotion tendency vectors corresponding to the post-preset number of voice sub-data, and determining a weight parameter based on the distance between each voice sub-data and the determined voice sub-data in the step of obtaining the post-tendency value.
By adopting the above scheme, the positional relation between each voice sub-data and the currently determined voice sub-data is taken into account when calculating the pre-tendency value and the post-tendency value; voice sub-data closer to the currently determined voice sub-data are more likely to be strongly correlated with it, so the scheme weights them more heavily and reflects this positional relation in the construction of the judgment image.
In some embodiments of the present invention, in the step of determining the emotional tendency corresponding to the voice data based on the judgment image, the judgment image is input into a pre-trained third model to obtain the emotional tendency corresponding to the voice data.
In a specific implementation process, the third model is a pre-trained graph neural network model, and the emotional tendency is output through a classification layer, the categories being hawkish, dovish and neutral.
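The text states only that the third model is pre-trained and ends in a three-way classification layer; the sketch below therefore uses a small off-the-shelf image classifier (ResNet-18) as a stand-in, which is an assumption rather than the architecture named above, and the weights would need training on labelled judgment images.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

third_model = models.resnet18(weights=None)
third_model.fc = nn.Linear(third_model.fc.in_features, 3)  # hawkish / dovish / neutral head
third_model.eval()

prep = transforms.Compose([
    transforms.ToTensor(),                        # uint8 H x W array -> float tensor in [0, 1]
    transforms.Resize((224, 224), antialias=True),
])

def speech_level_tendency(judgment_image) -> str:
    x = prep(judgment_image)                      # (1, 224, 224) for a grayscale array
    x = x.repeat(3, 1, 1).unsqueeze(0)            # replicate to 3 channels, add batch dim
    with torch.no_grad():
        pred = third_model(x).argmax(dim=-1).item()
    return ["hawkish", "dovish", "neutral"][pred]
```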
The beneficial effect of this scheme includes:
1. The scheme improves the accuracy and fineness of emotion analysis; by adopting a strong pre-trained model, it can more accurately capture and quantify subtle emotions and stances in communication, going beyond the limitations of traditional dictionary-based methods;
2. The scheme analyzes not only text materials but also acoustic materials for multi-modal analysis, and performs fine-grained analysis at the sentence level to capture more information, unlike approaches limited to coarse-grained analysis at the dialogue level.
The embodiment of the invention also provides a multi-modal fine-grained tendency analysis system based on a large model. The system comprises a computer device including a processor and a memory, the memory storing computer instructions and the processor being configured to execute the computer instructions stored in the memory; when the computer instructions are executed by the processor, the system implements the steps of the method described above.
The embodiment of the invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the foregoing large-model-based multi-modal fine-grained tendency analysis method. The computer-readable storage medium may be a tangible storage medium such as a random access memory (RAM), a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, registers, a floppy disk, a hard disk, a removable storage disk, a CD-ROM, or any other form of storage medium known in the art.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein can be implemented as hardware, software, or a combination of both. Whether a particular implementation is hardware or software depends on the specific application of the solution and the design constraints. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an application-specific integrated circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted over transmission media or communication links by a data signal carried in a carrier wave.
It should be understood that the invention is not limited to the particular arrangements and instrumentality described above and shown in the drawings. For the sake of brevity, a detailed description of known methods is omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and shown, and those skilled in the art can make various changes, modifications and additions, or change the order between steps, after appreciating the spirit of the present invention.
In this disclosure, features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and various modifications and variations can be made to the embodiments of the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A multi-modal fine-grained trend analysis method based on a large model, characterized in that the steps of the method comprise:
acquiring voice data, dividing the voice data into a plurality of voice sub-data, and encoding each voice sub-data into voice sub-vectors based on a preset encoder;
acquiring text sub-data corresponding to each voice sub-data, and encoding each text sub-data into a text sub-vector based on a preset encoder;
inputting the speech sub-vector into a pre-trained first model, the first model outputting a first emotional tendency vector;
inputting the text sub-vector into a pre-trained second model, the second model outputting a second emotional tendency vector;
the first emotion tendency vector and the second emotion tendency vector comprise a dovish emotion value and a hawkish emotion value, and the emotional tendency of the mutually corresponding voice sub-data and text sub-data is determined based on the first emotion tendency vector and the second emotion tendency vector.
2. The large model-based multi-modal fine-grained tendency analysis method according to claim 1, wherein in the step of encoding each of the speech sub-data into a speech sub-vector based on a preset encoder, the speech sub-data is mel-converted to obtain a mel spectrum, the mel spectrum is constructed as a mel image, and the mel image is encoded into the speech sub-vector by the encoder.
3. The large model based multi-modal fine grain trend analysis method of claim 1, wherein in the step of encoding each of the text sub-data into text sub-vectors based on a preset encoder, each of the text sub-data is encoded into text sub-vectors using a preset text encoder.
4. The large-model-based multi-modal fine-grained tendency analysis method according to claim 1, wherein in the step of determining the emotional tendency of the mutually corresponding voice sub-data and text sub-data based on the first emotion tendency vector and the second emotion tendency vector, emotion tendency values are calculated based on the dovish emotion value and the hawkish emotion value in the first emotion tendency vector and the second emotion tendency vector respectively, and the emotional tendency is determined based on the emotion tendency values obtained from the first emotion tendency vector and the second emotion tendency vector.
5. The large-model-based multi-modal fine-grained tendency analysis method according to claim 4, wherein in the step of calculating the emotion tendency values based on the dovish emotion value and the hawkish emotion value in the first emotion tendency vector and the second emotion tendency vector respectively, the emotion tendency values are calculated using the following formula:

S = \frac{d}{d + h}

wherein S represents the emotion tendency value, d represents the dovish emotion value, and h represents the hawkish emotion value.
6. The large model based multi-modal fine grain trend analysis method of claim 4, wherein the steps of the method further comprise:
acquiring a first emotion tendency vector and a second emotion tendency vector corresponding to each voice sub-data in the voice data, acquiring emotion tendency values corresponding to the first emotion tendency vector and the second emotion tendency vector, and calculating a combined tendency value based on the two emotion tendency values;
acquiring second emotion tendency vectors corresponding to a preceding preset number of voice sub-data and a following preset number of voice sub-data of each voice sub-data, calculating a pre-tendency value based on the second emotion tendency vectors corresponding to the preceding preset number of voice sub-data, and calculating a post-tendency value based on the second emotion tendency vectors corresponding to the following preset number of voice sub-data;
mapping the combined tendency value, the pre-tendency value and the post-tendency value into pixel change values respectively based on a preset mapping comparison table;
acquiring a preset template image, wherein an image area is arranged in the template image for each voice sub-data; modifying the pixel values of the image area corresponding to each voice sub-data based on the pixel change values corresponding to its combined tendency value, pre-tendency value and post-tendency value, thereby modifying the template image into a judgment image; and determining the emotional tendency corresponding to the voice data based on the judgment image.
7. The method of claim 6, wherein, in the step of calculating the pre-tendency value based on the second emotion tendency vectors corresponding to the preceding preset number of voice sub-data and the step of calculating the post-tendency value based on the second emotion tendency vectors corresponding to the following preset number of voice sub-data:
a weighted average is calculated based on the emotion tendency values of the second emotion tendency vectors corresponding to the preceding preset number of voice sub-data to obtain the pre-tendency value;
and a weighted average is calculated based on the emotion tendency values of the second emotion tendency vectors corresponding to the following preset number of voice sub-data to obtain the post-tendency value.
8. The large-model-based multi-modal fine-grained tendency analysis method according to claim 7, wherein, in the steps of obtaining the pre-tendency value and the post-tendency value by calculating the weighted averages, the weight parameter of each voice sub-data is determined based on its distance from the currently determined voice sub-data.
9. The large-model-based multi-modal fine-grained tendency analysis method according to claim 7 or 8, wherein, in the step of determining the emotional tendency corresponding to the voice data based on the judgment image, the judgment image is input into a pre-trained third model to obtain the emotional tendency corresponding to the voice data.
10. A multimodal fine grain trend analysis system based on a large model, characterized in that the system comprises a computer device comprising a processor and a memory, said memory having stored therein computer instructions, said processor being adapted to execute the computer instructions stored in said memory, the system realizing the steps of the method according to any of claims 1-9 when said computer instructions are executed by the processor.
CN202410159489.0A 2024-02-04 2024-02-04 Multi-mode fine granularity trend analysis method and system based on large model Active CN117688344B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410159489.0A CN117688344B (en) 2024-02-04 2024-02-04 Multi-mode fine granularity trend analysis method and system based on large model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410159489.0A CN117688344B (en) 2024-02-04 2024-02-04 Multi-mode fine granularity trend analysis method and system based on large model

Publications (2)

Publication Number Publication Date
CN117688344A true CN117688344A (en) 2024-03-12
CN117688344B CN117688344B (en) 2024-05-07

Family

ID=90130572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410159489.0A Active CN117688344B (en) 2024-02-04 2024-02-04 Multi-mode fine granularity trend analysis method and system based on large model

Country Status (1)

Country Link
CN (1) CN117688344B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20210059586A (en) * 2019-11-15 2021-05-25 한국과학기술원 Method and Apparatus for Emotional Voice Conversion using Multitask Learning with Text-to-Speech
CN111081279A (en) * 2019-12-24 2020-04-28 深圳壹账通智能科技有限公司 Voice emotion fluctuation analysis method and device
WO2023163383A1 (en) * 2022-02-28 2023-08-31 에스케이텔레콤 주식회사 Multimodal-based method and apparatus for recognizing emotion in real time
CN115083434A (en) * 2022-07-22 2022-09-20 平安银行股份有限公司 Emotion recognition method and device, computer equipment and storage medium
CN116011457A (en) * 2022-12-08 2023-04-25 山东大学 Emotion intelligent recognition method based on data enhancement and cross-modal feature fusion
CN116778967A (en) * 2023-08-28 2023-09-19 清华大学 Multi-mode emotion recognition method and device based on pre-training model

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DATAFUNTALK: LU Yu, "Bimodal emotion analysis based on text and speech" (陆昱:基于文本和语音的双模态情感分析), HTTPS://ROLL.SOHU.COM/A/517035996_121124371, 16 January 2022 (2022-01-16), pages 1 - 40 *
DENGYAYUE 等: "FRAME-LEVEL EMOTIONAL STATE ALIGNMENT METHOD FOR SPEECH EMOTION RECOGNITION", 《ARXIV:2312.16383》, 27 December 2023 (2023-12-27), pages 1 - 5 *
JINLONG XUE 等: "M2-CTTS: End-to-End Multi-Scale Multi-Modal Conversational Text-to-Speech Synthesis", 《ICASSP 2023 - 2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)》, 5 May 2023 (2023-05-05) *
SEUNGHYUN YOON 等: "MULTIMODAL SPEECH EMOTION RECOGNITION USING AUDIO AND TEXT", 《ARXIV:1810.04635V1》, 10 October 2018 (2018-10-10), pages 1 - 7 *

Also Published As

Publication number Publication date
CN117688344B (en) 2024-05-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant