CN113808577A - Intelligent extraction method and device of voice abstract, electronic equipment and storage medium

Intelligent extraction method and device of voice abstract, electronic equipment and storage medium

Info

Publication number
CN113808577A
CN113808577A (application CN202111098139.0A)
Authority
CN
China
Prior art keywords
voice
speech
text
preset
recognition model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111098139.0A
Other languages
Chinese (zh)
Inventor
陈杭
史文鑫
李骁
黄荣丽
王泽世
赖众程
张茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Bank Co Ltd filed Critical Ping An Bank Co Ltd
Priority to CN202111098139.0A priority Critical
Publication of CN113808577A publication Critical
Pending legal-status Critical Current

Classifications

    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G06F 40/30 Handling natural language data; Semantic analysis
    • G06N 3/044 Neural networks; Recurrent networks, e.g. Hopfield networks
    • G10L 13/08 Speech synthesis; Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 15/26 Speech to text systems
    • G10L 25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L 25/63 Speech or voice analysis techniques specially adapted for comparison or discrimination, for estimating an emotional state

Abstract

The invention relates to the field of artificial intelligence, and discloses an intelligent extraction method and device for a voice abstract, electronic equipment and a storage medium. The method comprises the following steps: acquiring user voice, performing signal extraction on the user voice to obtain a voice signal, and extracting the frequency spectrum characteristics of the voice signal; performing text conversion on the frequency spectrum characteristics by using a preset voice recognition model to obtain a voice text; recognizing emotion characteristics of the voice text by using a preset emotion recognition model, and extracting a first key sentence of the emotion characteristics from the voice text; selecting a second key sentence which accords with a preset service rule from the voice text; and combining the first key sentence and the second key sentence as the key abstract sentence of the user voice. In addition, the invention also relates to blockchain technology, and the emotional features can be stored in a blockchain. The invention can improve the accuracy of voice abstract extraction.

Description

Intelligent extraction method and device of voice abstract, electronic equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to an intelligent extraction method and device of a voice abstract, electronic equipment and a computer readable storage medium.
Background
Voice abstract extraction refers to the process of automatically extracting user requirement information from a section of voice text, and can be applied to fields such as customer service, finance and securities. For example, in the field of agent customer service, voice abstract extraction can extract key user requirements from user voice to help customer service personnel quickly locate customer demands.
At present, traditional speech abstract extraction is usually realized by training an automatic abstract extraction model with natural language processing (NLP) technology. However, in actual business scenarios, a user's speech carries certain emotional characteristics; if the automatic abstract model is trained only with natural language processing technology, information such as the user's emotion cannot be accurately identified, so some important information in the user's speech may be omitted, and the accuracy of speech abstract extraction is low.
Disclosure of Invention
The invention provides an intelligent extraction method and device of a voice abstract, electronic equipment and a computer readable storage medium, and mainly aims to improve the extraction accuracy of the voice abstract.
In order to achieve the above object, the present invention provides an intelligent method for extracting a speech abstract, comprising:
acquiring user voice, performing signal extraction on the user voice to obtain a voice signal, and extracting the frequency spectrum characteristic of the voice signal;
performing text conversion on the frequency spectrum characteristics by using a preset voice recognition model to obtain a voice text;
recognizing emotion characteristics of the voice text by using a preset emotion recognition model, and extracting a first key sentence of the emotion characteristics from the voice text;
selecting a second key sentence which accords with a preset service rule from the voice text;
and combining the first key sentence and the second key sentence to be used as the key abstract sentence of the user voice.
Optionally, the extracting the user speech to obtain a speech signal includes:
carrying out audio segmentation on the user voice to obtain a plurality of segmented audios;
detecting voice energy information of each segmented audio, and screening the segmented audio meeting preset conditions from the multiple segmented audios according to each voice energy information;
and performing signal enhancement on the screened segmented audio to obtain a voice signal.
Optionally, the detecting voice energy information of each of the segmented audios includes:
measuring the voice energy information of each segmented audio by using the following formula:
E_n = Σ_m [x(m) · w(n − m)]²
where E_n represents the speech energy information, n represents the time at which the segmented audio is located, m represents the sequence number of the segmented audio, x(m) represents the short-time average energy of the segmented audio, and w(m) represents the window function of the segmented audio.
Optionally, the extracting the spectral feature of the speech signal includes:
performing frequency domain conversion on the voice signal to obtain a frequency domain signal;
and carrying out Mel spectrum filtering on the frequency domain signal, and carrying out cepstrum analysis on the frequency domain signal subjected to the Mel spectrum filtering to obtain the frequency spectrum characteristics of the voice signal.
Optionally, the performing text conversion on the spectrum feature by using a preset speech recognition model to obtain a speech text includes:
calculating the phoneme sequence probability of the spectrum characteristics by using an acoustic network in the preset speech recognition model;
and recognizing the character sequence of the frequency spectrum characteristic by using a language network in the preset voice recognition model according to the phoneme sequence probability, and generating a voice text according to the character sequence.
Optionally, before recognizing the emotion feature of the speech text by using the preset emotion recognition model, the method further includes:
acquiring character vectors and corresponding labels in a training voice text, and calculating the state values of the character vectors in the training voice text by using an input gate in a pre-constructed emotion recognition model;
calculating an activation value of a character vector in the training voice text by using a forgetting gate in the pre-constructed emotion recognition model;
calculating a state updating value of a character vector in the training voice text according to the state value and the activation value, and calculating a state sequence of the state updating value by using an output gate in the pre-constructed emotion recognition model;
calculating the emotion category probability of the state sequence by utilizing a classification layer in the pre-constructed emotion recognition model to obtain the predicted emotion characteristics of the character vector;
calculating a loss value of the predicted emotional characteristic and the label;
if the loss value is larger than a preset threshold value, adjusting the parameters of the pre-constructed emotion recognition model, returning to the step of calculating the state value of the character vector by using an input gate in the pre-constructed emotion recognition model,
and if the loss value is not greater than the preset threshold value, obtaining the preset emotion recognition model.
Optionally, the loss function comprises:
Loss = −(1/k) Σ_{i=1..k} y_i · log(y_i')
where Loss represents the loss function value, k is the number of training speech texts collected in advance, i represents the serial number of the label and the training value, y_i represents the label, and y_i' denotes the training value.
In order to solve the above problem, the present invention further provides an intelligent speech abstraction extraction device, which includes:
the characteristic extraction module is used for acquiring user voice, extracting signals of the user voice to obtain voice signals and extracting frequency spectrum characteristics of the voice signals;
the voice recognition module is used for performing text conversion on the frequency spectrum characteristics by using a preset voice recognition model to obtain a voice text;
the key sentence recognition module is used for recognizing the emotion characteristics of the voice text by using a preset emotion recognition model and extracting a first key sentence of the emotion characteristics from the voice text;
the key sentence recognition module is also used for selecting a second key sentence which accords with a preset service rule from the voice text;
and the abstract statement generating module is used for merging the first key statement and the second key statement to be used as the key abstract statement of the user voice.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to implement the above-mentioned intelligent speech summarization method.
In order to solve the above problem, the present invention further provides a computer-readable storage medium, in which at least one computer program is stored, and the at least one computer program is executed by a processor in an electronic device to implement the above-mentioned intelligent extraction method for a speech abstract.
It can be seen that, in the embodiment of the present invention, signal extraction and spectral feature extraction are first performed on the user voice, so that speech information other than the sound emitted by the user can be filtered out and the spectral feature vector of the user voice is obtained, which ensures the accuracy of subsequent speech processing; text conversion is then performed on the spectral features with the preset speech recognition model to obtain the speech text of the user voice, which establishes the precondition for subsequent speech abstract extraction. Secondly, the emotion characteristics of the speech text are recognized with the preset emotion recognition model and the first key sentences of the emotion characteristics are extracted from the speech text, which avoids information omission in the subsequent abstract extraction process and improves the accuracy of abstract extraction from the user voice; the second key sentences which conform to the preset business rules are selected from the speech text, which ensures the comprehensiveness of the abstract extraction and further guarantees the accuracy of key abstract extraction. Furthermore, the first key sentences and the second key sentences are combined as the key abstract sentences of the user voice, so that the user's appeal in the user voice can be located accurately and comprehensively, ensuring accurate extraction of the speech abstract.
Drawings
Fig. 1 is a schematic flow chart of a method for intelligently extracting a speech abstract according to an embodiment of the present invention;
FIG. 2 is a block diagram of an apparatus for intelligently extracting a speech abstract according to an embodiment of the present invention;
fig. 3 is a schematic diagram of an internal structure of an electronic device implementing an intelligent speech summarization method according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides an intelligent voice abstract extracting method. The execution subject of the intelligent extraction method of the voice abstract includes, but is not limited to, at least one of electronic devices that can be configured to execute the method provided by the embodiments of the present application, such as a server, a terminal, and the like. In other words, the intelligent extraction method of the speech abstract may be performed by software or hardware installed in the terminal device or the server device, and the software may be a block chain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
Fig. 1 is a schematic flow chart of an intelligent speech summarization method according to an embodiment of the present invention. In the embodiment of the invention, the intelligent extraction method of the voice abstract comprises the following steps:
s1, obtaining user voice, extracting the user voice to obtain a voice signal, and extracting the frequency spectrum characteristic of the voice signal.
In the embodiment of the invention, the user voice is generated in different service scenarios; for example, in an agent service scenario, the user voice may be complaint voice, consultation voice and the like. It should be understood that the user voice contains a lot of background sound, such as external ambient noise.
As an embodiment of the present invention, the extracting the user speech to obtain a speech signal includes: the method comprises the steps of carrying out audio segmentation on user voice to obtain a plurality of segmented audios, detecting voice energy information of each segmented audio, screening the segmented audio meeting preset conditions from the segmented audios according to each voice energy information, and carrying out signal enhancement on the screened segmented audio to obtain a voice signal.
The audio segmentation is to perform signal framing on the user speech, and usually takes 10-30ms as one frame, so as to divide the audio with indefinite length in the user speech into small segments with fixed length, thereby avoiding the difficulty of speech processing caused by different audio lengths.
In an alternative embodiment, the speech energy information of each of the segmented audio is measured using the following formula:
E_n = Σ_m [x(m) · w(n − m)]²
where E_n represents the speech energy information, n represents the time at which the segmented audio is located, m represents the sequence number of the segmented audio, x(m) represents the short-time average energy of the segmented audio, and w(m) represents the window function of the segmented audio.
In an optional embodiment, the preset condition may be that the energy value of the speech energy information is greater than 0; that is, if the energy value of the speech energy information is 0, the corresponding segmented audio is deleted, and if the energy value is greater than 0, the segmented audio is screened out (retained).
In an alternative embodiment, the filtered segmented audio is signal enhanced using the following formula:
S(n)=S(n-1)+q*rand()
where S(n) represents the speech signal, S(n−1) represents the screened segmented audio, q represents the signal enhancement coefficient, and rand() represents the signal enhancement range.
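By way of illustration, the following is a minimal numpy sketch of the preprocessing described above, assuming mono PCM input; the frame length, Hamming window, enhancement coefficient q and random range are illustrative choices rather than values prescribed by the patent.

```python
import numpy as np

def extract_speech_signal(audio, sample_rate=16000, frame_ms=25, q=0.01):
    """Sketch of S1: segment the audio into frames, keep frames with
    non-zero short-time energy, and apply a small random enhancement."""
    frame_len = int(sample_rate * frame_ms / 1000)        # 10-30 ms frames
    n_frames = len(audio) // frame_len
    frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)

    window = np.hamming(frame_len)                        # w(m), window function
    energies = np.sum((frames * window) ** 2, axis=1)     # E_n = sum_m [x(m) w(n-m)]^2

    kept = frames[energies > 0]                           # preset condition: E_n > 0

    # Signal enhancement: S(n) = S(n-1) + q * rand(); q and the random
    # range are illustrative assumptions, not patent values.
    enhanced = kept + q * np.random.rand(*kept.shape)
    return enhanced.reshape(-1)

# Usage: speech_signal = extract_speech_signal(np.random.randn(16000))
```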
Furthermore, the invention extracts the frequency spectrum feature of the voice signal to obtain the frequency spectrum feature vector of the voice signal, thereby realizing the premise of subsequent voice conversion. As an embodiment of the present invention, the extracting a spectral feature of the speech signal includes: and performing frequency domain conversion on the voice signal to obtain a frequency domain signal, performing Mel spectrum filtering on the frequency domain signal, and performing cepstrum analysis on the frequency domain signal subjected to the Mel spectrum filtering to obtain the frequency spectrum characteristics of the voice signal.
The frequency domain conversion is to convert a time domain speech signal into a frequency domain signal, the mel spectrum filtering is used for shielding a sound signal which does not conform to a preset frequency range in the frequency domain signal so as to obtain a spectrogram which conforms to the hearing habit of human ears, and the cepstrum analysis is to perform secondary spectrum analysis on the frequency domain signal after the mel spectrum filtering so as to extract contour information of the frequency domain signal and obtain characteristic data of the frequency domain signal.
In an alternative embodiment, the speech signal is frequency domain converted using the following equation:
F(ω) = ∫ f(t) · e^(−iωt) dt
where F(ω) represents the frequency domain signal, f(t) represents the voice signal, and e represents the natural constant (an infinite non-repeating decimal).
In an alternative embodiment, the Mel-spectrum filtering of the frequency domain signal is performed by a Mel filter, and the predetermined frequency range is 200 Hz to 500 Hz.
In an alternative embodiment, the mel-spectrum filtered frequency domain signal is cepstrum analyzed using the following formula:
f_mel = 2595 · lg(1 + f/700)
where f_mel represents the spectral features of the speech signal, lg represents the logarithmic function, and f represents the Mel-spectrum-filtered frequency domain signal.
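To make the S1 feature-extraction chain concrete, here is a hedged sketch using librosa and scipy; the FFT size, number of Mel bands and cepstral coefficients are illustrative assumptions, and the library calls stand in for the generic frequency-domain conversion, Mel filtering and cepstrum analysis described above.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def spectral_features(speech_signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    # Frequency-domain conversion: framed short-time Fourier transform
    power_spectrum = np.abs(librosa.stft(speech_signal, n_fft=n_fft)) ** 2

    # Mel-spectrum filtering: project the spectrum onto a Mel filter bank
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_spectrum = mel_fb @ power_spectrum

    # Cepstrum analysis: logarithm followed by a discrete cosine transform
    log_mel = np.log10(mel_spectrum + 1e-10)
    return dct(log_mel, axis=0, norm="ortho")[:n_ceps]   # shape: (n_ceps, n_frames)
```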
And S2, performing text conversion on the frequency spectrum characteristics by using a preset speech recognition model to obtain a speech text.
The embodiment of the invention performs text conversion on the frequency spectrum features through a preset voice recognition model to acquire the text data of the user voice, which establishes the precondition for subsequent voice abstract extraction. The acoustic network may be constructed by a hidden Markov algorithm, and the language network may be constructed by an N-Gram algorithm.
As an embodiment of the present invention, the performing text conversion on the spectrum feature by using a preset speech recognition model to obtain a speech text includes: calculating the phoneme sequence probability of the frequency spectrum feature by using the acoustic network in the preset voice recognition model, recognizing the character sequence of the frequency spectrum feature by using the language network in the preset voice recognition model according to the phoneme sequence probability, and generating a voice text according to the character sequence.
Wherein, the phoneme sequence probability refers to the probability of the syllables from which the text is generated. For example, if the text is "safe" (ping an), its syllables include "ping" and "an"; calculating the phoneme sequence probability of the frequency spectrum features clarifies the syllables from which the characters can subsequently be generated, thereby obtaining the character sequence of the frequency spectrum features. The character sequence refers to the information relation of the characters generated from the phoneme sequence and is used for generating the audio recognition result of the frequency spectrum features.
It should be noted that, in the embodiment of the present invention, the preset speech recognition model refers to a model trained in advance, and has stronger robustness and speech recognition capability.
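The patent relies on an HMM acoustic network and an N-Gram language network; the toy beam search below only illustrates how phoneme-sequence probabilities from an acoustic network and bigram scores from a language network can be combined into a character sequence. The lexicon, bigram table and probabilities are all made-up illustrative values.

```python
import math

# Toy lexicon mapping syllables to candidate characters (illustrative only)
LEXICON = {"ping": ["平", "凭"], "an": ["安", "按"]}

# Toy bigram language-network scores P(next_char | prev_char) (illustrative)
BIGRAM = {("<s>", "平"): 0.6, ("<s>", "凭"): 0.2,
          ("平", "安"): 0.7, ("平", "按"): 0.1,
          ("凭", "安"): 0.2, ("凭", "按"): 0.3}

def decode(phoneme_posteriors):
    """phoneme_posteriors: list of dicts, one per position, mapping a
    syllable to its acoustic-network probability."""
    beams = [("<s>", 0.0, "")]                      # (last char, log prob, text)
    for frame in phoneme_posteriors:
        new_beams = []
        for last, logp, text in beams:
            for syllable, p_acoustic in frame.items():
                for char in LEXICON.get(syllable, []):
                    p_lm = BIGRAM.get((last, char), 1e-4)
                    new_beams.append((char,
                                      logp + math.log(p_acoustic) + math.log(p_lm),
                                      text + char))
        beams = sorted(new_beams, key=lambda b: b[1], reverse=True)[:5]  # beam width 5
    return beams[0][2]

# Example: two positions whose acoustic network favours "ping" then "an"
print(decode([{"ping": 0.8, "pin": 0.2}, {"an": 0.9, "ang": 0.1}]))  # -> 平安
```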
S3, recognizing the emotion characteristics of the voice text by using a preset emotion recognition model, and extracting a first key sentence of the emotion characteristics from the voice text.
In the embodiment of the present invention, the preset emotion recognition model may be constructed from a Long Short-Term Memory (LSTM) network, which is used to solve the long-term dependence problem of recurrent neural networks; in the present invention, the LSTM network is used to recognize the emotion characteristics of the speech text. It should be noted that, in the embodiment of the present invention, the preset emotion recognition model is obtained by collecting a large number of speech texts and corresponding labels in advance for training, and the labels are used for representing the real emotion characteristics of the collected speech texts, such as peace, excitement, happiness, sadness, and the like.
Furthermore, in the embodiment of the invention, the emotion characteristics of the voice text are identified through the preset emotion recognition model, so that information omission in the subsequent abstract extraction process is avoided and the accuracy of abstract extraction from the voice text is improved. The preset emotion recognition model comprises an input gate, a forgetting gate, an output gate and a classification layer, wherein the input gate is used for receiving and storing the input voice text information, the forgetting gate is used for recording and updating the voice text information, the output gate is used for calculating the state information of the voice text, and the classification layer is used for calculating the emotion category probability of each word in the voice text so as to output the emotion characteristics of the voice text.
Further, before the recognizing the emotion characteristics of the voice text by using the preset emotion recognition model, the embodiment of the present invention further includes: acquiring character vectors and corresponding labels in a training voice text, and calculating the state values of the character vectors in the training voice text by using an input gate in a pre-constructed emotion recognition model; calculating an activation value of a character vector in the training voice text by using a forgetting gate in the pre-constructed emotion recognition model; calculating a state updating value of a character vector in the training voice text according to the state value and the activation value, and calculating a state sequence of the state updating value by using an output gate in the pre-constructed emotion recognition model; calculating the emotion category probability of the state sequence by utilizing a classification layer in the pre-constructed emotion recognition model to obtain the predicted emotion characteristics of the character vector; calculating a loss value of the predicted emotional characteristic and the label; and if the loss value is not greater than the preset threshold value, obtaining the preset emotion recognition model.
In an alternative embodiment, the state value of the character vector is calculated using the following formula:
i_t = σ(w_i · [h_{t−1}, x_t] + b_i)
where i_t represents the state value, σ is the activation function of the input gate, w_i denotes the activation factor of the input gate, h_{t−1} represents the peak value of the character vector at time t−1 of the input gate, x_t represents the character vector at time t, and b_i represents the bias (offset) of the cell unit in the input gate.
In an alternative embodiment, the activation value of the character vector is calculated using the following formula:
f_t = σ(w_f · [h_{t−1}, x_t] + b_f)
where f_t represents the activation value, σ is the activation function of the forgetting gate, w_f denotes the activation factor of the forgetting gate, h_{t−1} represents the peak value of the character vector at time t−1 of the forgetting gate, x_t represents the character vector input at time t, and b_f represents the bias of the cell unit in the forgetting gate.
In an alternative embodiment, the state update value of the character vector is calculated using the following formula:
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
where c_t represents the state update value, f_t and i_t are the activation value and state value above, c_{t−1} is the state at the previous time step, c̃_t is the candidate state computed from the peak values h_{t−1} of the character vector at time t−1 of the input gate and the forgetting gate, and ⊙ denotes element-wise multiplication.
In an alternative embodiment, the state sequence of state update values is calculated using the following formula:
o_t = tanh(c_t)
where o_t represents the state sequence, tanh represents the activation function of the output gate, and c_t represents the state update value.
In an alternative embodiment, the calculation of the emotion classification probability for the sequence of states may be performed by an activation function in the classification layer, such as a softmax function.
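The sketch below expresses the gate computations above with PyTorch's built-in nn.LSTM (which implements the input, forgetting and output gates internally) followed by a softmax classification layer; the vocabulary size, dimensions and number of emotion categories are illustrative assumptions rather than values from the patent.

```python
import torch
import torch.nn as nn

class EmotionRecognizer(nn.Module):
    """Sketch of the pre-constructed emotion recognition model: an LSTM
    producing the state sequence o_t, then a softmax classification layer."""
    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=256, n_emotions=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)      # character vectors
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, n_emotions)   # classification layer

    def forward(self, char_ids):
        x = self.embed(char_ids)                  # (batch, seq_len, embed_dim)
        states, _ = self.lstm(x)                  # state sequence per character
        logits = self.classifier(states[:, -1])   # last state summarises the text
        return torch.softmax(logits, dim=-1)      # emotion category probabilities

# Usage: probs = EmotionRecognizer()(torch.randint(0, 5000, (1, 20)))
```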
In an alternative embodiment, the loss value of the predicted emotional characteristic and the tag is calculated using the following formula:
LC = m_g · log(m_p) + (1 − m_g) · log(1 − m_p)
where LC represents the loss value, m_g represents the predicted emotion characteristic, and m_p represents the label. Optionally, the preset threshold is 0.1, and may also be set according to the actual service scenario.
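A hedged sketch of one training iteration for the model above follows: torch's NLLLoss on the softmax output is used here as a stand-in for the loss formulas given in the text, and the 0.1 threshold follows the optional value mentioned above.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, char_ids, labels, loss_threshold=0.1):
    """Adjust parameters while the loss still exceeds the preset threshold."""
    probs = model(char_ids)                                  # predicted emotion probabilities
    loss = nn.NLLLoss()(torch.log(probs + 1e-10), labels)    # loss between prediction and label
    if loss.item() > loss_threshold:                         # above threshold: keep training
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return False                                         # not converged yet
    return True                                              # threshold reached: model is ready
```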
Further, to ensure privacy and security of the emotional features, the emotional features may also be stored in a blockchain node.
Further, the embodiment of the invention extracts the first key sentence of the emotion feature from the voice text, that is, extracts the sentence at the position where the emotion feature occurs in the voice text, so as to ensure the accuracy of the subsequent key abstract extraction of the user voice. For example, if the emotion features recognized in the voice text by the preset emotion recognition model include agitation and excitement, the characteristic words of these emotion features are searched in the voice text, and the sentences corresponding to the characteristic words are extracted as the first key sentences.
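A simple sketch of locating the first key sentences: sentences of the speech text that contain characteristic words of the recognized emotion features are kept. The emotion-to-word mapping here is purely illustrative; in practice it would come from the business configuration.

```python
import re

# Illustrative mapping from recognized emotion features to characteristic words
EMOTION_WORDS = {
    "excitement": ["immediately", "right now", "can't wait"],
    "agitation": ["complaint", "unacceptable", "ridiculous"],
}

def extract_first_key_sentences(speech_text, emotion_features):
    """Keep the sentences of the speech text that contain characteristic
    words of the recognized emotion features (the first key sentences)."""
    feature_words = [w for f in emotion_features for w in EMOTION_WORDS.get(f, [])]
    sentences = re.split(r"[.!?。！？]", speech_text)
    return [s.strip() for s in sentences if any(w in s for w in feature_words)]
```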
And S4, selecting a second key sentence which accords with a preset business rule from the voice text.
In the embodiment of the present invention, the preset service rule is set based on different service scenarios. For example, in the agent service scenario, the service rule may be set by combining one or more of the following modes: mode one, select sentences containing keywords such as (want to look at | want to ask | want to know | consult | complaint | reflect); mode two, select sentences containing expressions such as (it is like this | in this way); mode three, select sentences containing patterns such as (it is the XXX problem | it is the XXX fact | before I | about | why | but).
It should be noted that, if no key sentence is extracted from the voice text according to the preset service rule, the embodiment of the invention takes the first three rounds of dialogue in the voice text as the second key sentence. Based on this screening of key sentences, the position of the user's key appeal in the voice text can be located, so that the accuracy of the subsequent key abstract extraction of the user voice can be guaranteed.
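The rule-based selection and the first-three-rounds fallback can be sketched as follows; the regular expressions are illustrative English stand-ins for the keyword sets listed above.

```python
import re

# Illustrative patterns corresponding to the three rule modes; the actual
# keyword lists depend on the business scenario.
RULE_PATTERNS = [
    r"(want to look at|want to ask|want to know|consult|complaint|reflect)",  # mode one
    r"(it is like this|in this way)",                                          # mode two
    r"(is .* problem|is .* fact|before I|about|why|but)",                      # mode three
]

def select_second_key_sentences(speech_text, dialogue_rounds):
    """Select sentences matching any preset service rule; if none match,
    fall back to the first three rounds of dialogue."""
    sentences = re.split(r"[.!?。！？]", speech_text)
    matched = [s.strip() for s in sentences
               if any(re.search(p, s, re.IGNORECASE) for p in RULE_PATTERNS)]
    return matched if matched else dialogue_rounds[:3]
```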
And S5, combining the first key sentence and the second key sentence to be used as the key abstract sentence of the user voice.
According to the embodiment of the invention, the first key sentence and the second key sentence are combined to be used as the key abstract sentence of the user voice, so that the user appeal of the user voice can be accurately and comprehensively positioned, and the accurate extraction of the user voice abstract can be ensured.
It can be seen that, in the embodiment of the present invention, signal extraction and spectral feature extraction are first performed on the user voice, so that speech information other than the sound emitted by the user can be filtered out and the spectral feature vector of the user voice is obtained, which ensures the accuracy of subsequent speech processing; text conversion is then performed on the spectral features with the preset speech recognition model to obtain the speech text of the user voice, which establishes the precondition for subsequent speech abstract extraction. Secondly, the emotion characteristics of the speech text are recognized with the preset emotion recognition model and the first key sentences of the emotion characteristics are extracted from the speech text, which avoids information omission in the subsequent abstract extraction process and improves the accuracy of abstract extraction from the user voice; the second key sentences which conform to the preset business rules are selected from the speech text, which ensures the comprehensiveness of the abstract extraction and further guarantees the accuracy of key abstract extraction. Furthermore, the first key sentences and the second key sentences are combined as the key abstract sentences of the user voice, so that the user's appeal in the user voice can be located accurately and comprehensively, ensuring accurate extraction of the speech abstract.
Fig. 2 is a functional block diagram of the intelligent speech abstraction apparatus according to the present invention.
The intelligent voice abstract extracting device 100 can be installed in electronic equipment. According to the realized functions, the intelligent extraction device for the voice abstract can comprise a feature extraction module 101, a voice recognition module 102, a key sentence recognition module 103 and an abstract sentence generation module 104. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and can perform a fixed function, and is stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the feature extraction module 101 is configured to obtain a user voice, perform signal extraction on the user voice to obtain a voice signal, and extract a spectrum feature of the voice signal;
the speech recognition module 102 is configured to perform text conversion on the spectrum feature by using a preset speech recognition model to obtain a speech text;
the key sentence recognition module 103 is configured to recognize an emotion feature of the voice text by using a preset emotion recognition model, and extract a first key sentence of the emotion feature from the voice text;
the key sentence recognition module 103 is further configured to select a second key sentence which meets a preset service rule from the voice text;
the abstract statement generating module 104 is configured to combine the first key statement and the second key statement to obtain a key abstract statement of the user speech.
In detail, when the modules in the intelligent speech abstract extraction apparatus 100 according to the embodiment of the present invention are used, the same technical means as the above-described intelligent speech abstract extraction method shown in fig. 1 are adopted, and the same technical effects can be produced, which is not described herein again.
Fig. 3 is a schematic structural diagram of an electronic device 1 for implementing the intelligent speech summarization method according to the present invention.
The electronic device 1 may include a processor 10, a memory 11, a communication bus 12, and a communication interface 13, and may further include a computer program, such as an intelligent extraction program of a speech digest, stored in the memory 11 and executable on the processor 10.
In some embodiments, the processor 10 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same function or different functions, and includes one or more Central Processing Units (CPUs), a microprocessor, a digital Processing chip, a graphics processor, a combination of various control chips, and the like. The processor 10 is a Control Unit (Control Unit) of the electronic device 1, connects various components of the electronic device 1 by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (e.g., an intelligent extraction program for performing a speech digest, etc.) stored in the memory 11 and calling data stored in the memory 11.
The memory 11 includes at least one type of readable storage medium including flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of an intelligent extraction program of a speech digest, etc., but also to temporarily store data that has been output or is to be output.
The communication bus 12 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
The communication interface 13 is used for communication between the electronic device 1 and other devices, and includes a network interface and an employee interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., WI-FI interface, bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices 1. The employee interface may be a Display (Display), an input unit, such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visual staff interface.
Fig. 3 shows only the electronic device 1 with components, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The intelligent extraction program of the speech summary stored in the memory 11 of the electronic device 1 is a combination of a plurality of computer programs, which when executed in the processor 10, can realize:
acquiring user voice, performing signal extraction on the user voice to obtain a voice signal, and extracting the frequency spectrum characteristic of the voice signal;
performing text conversion on the frequency spectrum characteristics by using a preset voice recognition model to obtain a voice text;
recognizing emotion characteristics of the voice text by using a preset emotion recognition model, and extracting a first key sentence of the emotion characteristics from the voice text;
selecting a second key sentence which accords with a preset service rule from the voice text;
and combining the first key sentence and the second key sentence to be used as the key abstract sentence of the user voice.
Specifically, the processor 10 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1 for a specific implementation method of the computer program, which is not described herein again.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a non-volatile computer-readable storage medium. The computer readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of an electronic device 1, may implement:
acquiring user voice, performing signal extraction on the user voice to obtain a voice signal, and extracting the frequency spectrum characteristic of the voice signal;
performing text conversion on the frequency spectrum characteristics by using a preset voice recognition model to obtain a voice text;
recognizing emotion characteristics of the voice text by using a preset emotion recognition model, and extracting a first key sentence of the emotion characteristics from the voice text;
selecting a second key sentence which accords with a preset service rule from the voice text;
and combining the first key sentence and the second key sentence to be used as the key abstract sentence of the user voice.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. An intelligent extraction method of a voice abstract is characterized by comprising the following steps:
acquiring user voice, performing signal extraction on the user voice to obtain a voice signal, and extracting the frequency spectrum characteristic of the voice signal;
performing text conversion on the frequency spectrum characteristics by using a preset voice recognition model to obtain a voice text;
recognizing emotion characteristics of the voice text by using a preset emotion recognition model, and extracting a first key sentence of the emotion characteristics from the voice text;
selecting a second key sentence which accords with a preset service rule from the voice text;
and combining the first key sentence and the second key sentence to be used as the key abstract sentence of the user voice.
2. The intelligent method for extracting speech abstract according to claim 1, wherein said extracting the user speech signal to obtain the speech signal comprises:
carrying out audio segmentation on the user voice to obtain a plurality of segmented audios;
detecting voice energy information of each segmented audio, and screening the segmented audio meeting preset conditions from the multiple segmented audios according to each voice energy information;
and performing signal enhancement on the screened segmented audio to obtain a voice signal.
3. The intelligent extraction method of speech abstract of claim 2, wherein said detecting speech energy information of each of said segmented audios comprises:
measuring the voice energy information of each segmented audio by using the following formula:
E_n = Σ_m [x(m) · w(n − m)]²
wherein E_n represents the speech energy information, n represents the time at which the segmented audio is located, m represents the sequence number of the segmented audio, x(m) represents the short-time average energy of the segmented audio, and w(m) represents the window function of the segmented audio.
4. The intelligent method for extracting speech excerpt according to claim 1, wherein said extracting spectral features of the speech signal comprises:
performing frequency domain conversion on the voice signal to obtain a frequency domain signal;
and carrying out Mel spectrum filtering on the frequency domain signal, and carrying out cepstrum analysis on the frequency domain signal subjected to the Mel spectrum filtering to obtain the frequency spectrum characteristics of the voice signal.
5. The method for intelligently extracting a speech abstract according to any one of claims 1 to 4, wherein the performing text conversion on the spectral features by using a preset speech recognition model to obtain a speech text comprises:
calculating the phoneme sequence probability of the spectrum characteristics by using an acoustic network in the preset speech recognition model;
and recognizing the character sequence of the frequency spectrum characteristic by using a language network in the preset voice recognition model according to the phoneme sequence probability, and generating a voice text according to the character sequence.
6. The method for intelligently extracting a speech abstract according to claim 1, wherein before recognizing the emotion feature of the speech text by using the preset emotion recognition model, the method further comprises:
acquiring character vectors and corresponding labels in a training voice text, and calculating the state values of the character vectors in the training voice text by using an input gate in a pre-constructed emotion recognition model;
calculating an activation value of a character vector in the training voice text by using a forgetting gate in the pre-constructed emotion recognition model;
calculating a state updating value of a character vector in the training voice text according to the state value and the activation value, and calculating a state sequence of the state updating value by using an output gate in the pre-constructed emotion recognition model;
calculating the emotion category probability of the state sequence by utilizing a classification layer in the pre-constructed emotion recognition model to obtain the predicted emotion characteristics of the character vector;
calculating a loss value of the predicted emotional characteristic and the label;
if the loss value is larger than a preset threshold value, adjusting the parameters of the pre-constructed emotion recognition model, returning to the step of calculating the state value of the character vector by using an input gate in the pre-constructed emotion recognition model,
and if the loss value is not greater than the preset threshold value, obtaining the preset emotion recognition model.
7. The intelligent extraction method of speech excerpt of claim 6, wherein the loss function comprises:
Loss = −(1/k) Σ_{i=1..k} y_i · log(y_i')
wherein Loss represents the loss function value, k is the number of training speech texts collected in advance, i represents the serial number of the label and the training value, y_i denotes the label, and y_i' represents the training value.
8. An intelligent speech abstraction apparatus, comprising:
the characteristic extraction module is used for acquiring user voice, extracting signals of the user voice to obtain voice signals and extracting frequency spectrum characteristics of the voice signals;
the voice recognition module is used for performing text conversion on the frequency spectrum characteristics by using a preset voice recognition model to obtain a voice text;
the key sentence recognition module is used for recognizing the emotion characteristics of the voice text by using a preset emotion recognition model and extracting a first key sentence of the emotion characteristics from the voice text;
the key sentence recognition module is also used for selecting a second key sentence which accords with a preset service rule from the voice text;
and the abstract statement generating module is used for merging the first key statement and the second key statement to be used as the key abstract statement of the user voice.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the intelligent method of speech summarization of any of claims 1-7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a method for intelligent extraction of a speech summary according to any one of claims 1 to 7.
CN202111098139.0A 2021-09-18 2021-09-18 Intelligent extraction method and device of voice abstract, electronic equipment and storage medium Pending CN113808577A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111098139.0A CN113808577A (en) 2021-09-18 2021-09-18 Intelligent extraction method and device of voice abstract, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111098139.0A CN113808577A (en) 2021-09-18 2021-09-18 Intelligent extraction method and device of voice abstract, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113808577A true CN113808577A (en) 2021-12-17

Family

ID=78896075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111098139.0A Pending CN113808577A (en) 2021-09-18 2021-09-18 Intelligent extraction method and device of voice abstract, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113808577A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115334367A (en) * 2022-07-11 2022-11-11 北京达佳互联信息技术有限公司 Video summary information generation method, device, server and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018061839A1 (en) * 2016-09-29 2018-04-05 株式会社村田製作所 Transmission device, transmission method, and transmission program
WO2018121275A1 (en) * 2016-12-29 2018-07-05 北京奇虎科技有限公司 Method and apparatus for error connection of voice recognition in smart hardware device
CN109767791A (en) * 2019-03-21 2019-05-17 中国—东盟信息港股份有限公司 A kind of voice mood identification and application system conversed for call center
CN112017632A (en) * 2020-09-02 2020-12-01 浪潮云信息技术股份公司 Automatic conference record generation method
CN112269875A (en) * 2020-10-23 2021-01-26 中国平安人寿保险股份有限公司 Text classification method and device, electronic equipment and storage medium
CN112466337A (en) * 2020-12-15 2021-03-09 平安科技(深圳)有限公司 Audio data emotion detection method and device, electronic equipment and storage medium
CN112651342A (en) * 2020-12-28 2021-04-13 中国平安人寿保险股份有限公司 Face recognition method and device, electronic equipment and storage medium
CN112732915A (en) * 2020-12-31 2021-04-30 平安科技(深圳)有限公司 Emotion classification method and device, electronic equipment and storage medium
CN113095076A (en) * 2021-04-20 2021-07-09 平安银行股份有限公司 Sensitive word recognition method and device, electronic equipment and storage medium
CN113327586A (en) * 2021-06-01 2021-08-31 深圳市北科瑞声科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018061839A1 (en) * 2016-09-29 2018-04-05 株式会社村田製作所 Transmission device, transmission method, and transmission program
WO2018121275A1 (en) * 2016-12-29 2018-07-05 北京奇虎科技有限公司 Method and apparatus for error connection of voice recognition in smart hardware device
CN109767791A (en) * 2019-03-21 2019-05-17 中国—东盟信息港股份有限公司 A kind of voice mood identification and application system conversed for call center
CN112017632A (en) * 2020-09-02 2020-12-01 浪潮云信息技术股份公司 Automatic conference record generation method
CN112269875A (en) * 2020-10-23 2021-01-26 中国平安人寿保险股份有限公司 Text classification method and device, electronic equipment and storage medium
CN112466337A (en) * 2020-12-15 2021-03-09 平安科技(深圳)有限公司 Audio data emotion detection method and device, electronic equipment and storage medium
CN112651342A (en) * 2020-12-28 2021-04-13 中国平安人寿保险股份有限公司 Face recognition method and device, electronic equipment and storage medium
CN112732915A (en) * 2020-12-31 2021-04-30 平安科技(深圳)有限公司 Emotion classification method and device, electronic equipment and storage medium
CN113095076A (en) * 2021-04-20 2021-07-09 平安银行股份有限公司 Sensitive word recognition method and device, electronic equipment and storage medium
CN113327586A (en) * 2021-06-01 2021-08-31 深圳市北科瑞声科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115334367A (en) * 2022-07-11 2022-11-11 北京达佳互联信息技术有限公司 Video summary information generation method, device, server and storage medium
CN115334367B (en) * 2022-07-11 2023-10-17 北京达佳互联信息技术有限公司 Method, device, server and storage medium for generating abstract information of video

Similar Documents

Publication Publication Date Title
WO2020232861A1 (en) Named entity recognition method, electronic device and storage medium
CN112001175B (en) Flow automation method, device, electronic equipment and storage medium
CN111613212A (en) Speech recognition method, system, electronic device and storage medium
CN114007131B (en) Video monitoring method and device and related equipment
CN113420556B (en) Emotion recognition method, device, equipment and storage medium based on multi-mode signals
CN113064994A (en) Conference quality evaluation method, device, equipment and storage medium
CN112863529A (en) Speaker voice conversion method based on counterstudy and related equipment
CN113327586A (en) Voice recognition method and device, electronic equipment and storage medium
CN113704410A (en) Emotion fluctuation detection method and device, electronic equipment and storage medium
CN113807103B (en) Recruitment method, device, equipment and storage medium based on artificial intelligence
CN113205814B (en) Voice data labeling method and device, electronic equipment and storage medium
CN113808577A (en) Intelligent extraction method and device of voice abstract, electronic equipment and storage medium
CN113254814A (en) Network course video labeling method and device, electronic equipment and medium
CN112201253A (en) Character marking method and device, electronic equipment and computer readable storage medium
CN116705034A (en) Voiceprint feature extraction method, speaker recognition method, model training method and device
CN113903363B (en) Violation behavior detection method, device, equipment and medium based on artificial intelligence
CN115525750A (en) Robot phonetics detection visualization method and device, electronic equipment and storage medium
CN115631748A (en) Emotion recognition method and device based on voice conversation, electronic equipment and medium
CN113221990B (en) Information input method and device and related equipment
CN115510188A (en) Text keyword association method, device, equipment and storage medium
CN114186028A (en) Consult complaint work order processing method, device, equipment and storage medium
CN114401346A (en) Response method, device, equipment and medium based on artificial intelligence
CN113808616A (en) Voice compliance detection method, device, equipment and storage medium
CN113870478A (en) Rapid number-taking method and device, electronic equipment and storage medium
CN113889145A (en) Voice verification method and device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination