CN114373448B - Topic detection method and device, electronic equipment and storage medium - Google Patents

Topic detection method and device, electronic equipment and storage medium

Info

Publication number
CN114373448B
Authority
CN
China
Prior art keywords
voice
text
topic
topic detection
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210279908.5A
Other languages
Chinese (zh)
Other versions
CN114373448A (en)
Inventor
刘磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wofeng Times Data Technology Co ltd
Original Assignee
Beijing Wofeng Times Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wofeng Times Data Technology Co ltd
Priority to CN202210279908.5A
Publication of CN114373448A
Application granted
Publication of CN114373448B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Abstract

The invention provides a topic detection method and device, electronic equipment and a storage medium. The method comprises: acquiring a target voice block based on voice data to be detected; and inputting the target voice block into a topic detection model to obtain a detection result output by the topic detection model. The target voice block is generated from the voice data to be detected and input into the topic detection model for processing. The model performs automatic voice recognition on the target voice block to obtain the corresponding voice discrete representation and text content, applies natural language processing to the text content, and fuses the result with the voice discrete representation to perform topic detection. Because the voice discrete representation supplements the text obtained by automatic voice recognition, loss of voice information is avoided and the recognition accuracy of topic detection is improved to a certain extent.

Description

Topic detection method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of audio processing technologies, and in particular, to a topic detection method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of artificial intelligence technology, artificial intelligence products are increasingly pervasive in daily life. Voice interaction is widely used in scenarios such as smart homes, smart cars, intelligent customer service, and telephone/video conferences, and the contradiction between information surplus and knowledge shortage is becoming increasingly prominent.
Most traditional topic detection operates on plain text and relies on keywords, tokens and similar processing, which does not adequately capture the semantics and distribution of real voice communication. In the prior art, topic detection on speech is therefore performed by first transcribing a text sequence with an independent Automatic Speech Recognition (ASR) system and then receiving and processing that text sequence with an independent Natural Language Processing (NLP) system. As a result, the performance of the ASR system seriously affects the final detection result of the NLP system. In particular, when the user has a heavy accent or the environmental noise is high, the accuracy of topic detection on speech is low.
Disclosure of Invention
The invention provides a topic detection method and device, electronic equipment and a storage medium, which address the defect in the prior art that too much information is lost during topic detection on voice data. Text content is obtained by automatic voice recognition and, while natural language processing is performed on it, voice content with discrete features is supplied to topic detection as a supplement, which improves the recognition accuracy of topic detection on voice data.
The invention provides a topic detection method, which comprises the following steps:
acquiring a target voice block based on voice data to be detected;
inputting the target voice block to a topic detection model, and obtaining a detection result output by the topic detection model;
the topic detection model is obtained by training based on sample text data, a labeled topic corresponding to the sample text data, sample voice data and a labeled topic corresponding to the sample voice data; the sample text data comprises a field corpus and a general corpus;
the topic detection model is used for carrying out voice recognition on the target voice block to obtain a voice discrete representation and text content, fusing a natural language processing result of the text content with the voice discrete representation, and carrying out topic detection to obtain the detection result.
According to the topic detection method provided by the invention, the topic detection model comprises a voice recognition layer, a subject word recognition extraction layer and a topic detection layer;
the voice recognition layer is used for carrying out voice recognition on the target voice block; the subject term identification and extraction layer is used for extracting subject terms from the identified text content; and the topic detection layer is used for carrying out topic detection after the output contents of the voice recognition layer and the subject term recognition extraction layer are fused.
According to the topic detection method provided by the present invention, inputting the target voice block to the topic detection model and obtaining the detection result output by the topic detection model comprises:
inputting the target voice block into the voice recognition layer to obtain voice representation and text representation;
inputting the text representation into the subject word recognition and extraction layer to obtain a subject word set;
and inputting the voice representation, the text representation and the topic word set into the topic detection layer to obtain a detection result.
According to the topic detection method provided by the present invention, inputting the target voice block into the voice recognition layer to obtain a voice representation and a text representation comprises:
using a first language model and a second language model respectively, each combined with an acoustic model, to perform a quantization operation on the target voice block to obtain a first voice representation and a second voice representation;
using the first language model and the second language model respectively, each combined with the acoustic model, to perform text recognition conversion on the target voice block to obtain a corresponding first text representation and a corresponding second text representation;
the first language model is obtained by fusing a language model trained on the field corpus and a language model trained on the general corpus, and the second language model is obtained by fusing a language model trained on the syllables of the field corpus and a language model trained on the syllables of the general corpus.
According to the topic detection method provided by the invention, the step of inputting the text representation into the topic word recognition extraction layer to obtain a topic word set comprises the following steps:
respectively identifying and extracting subject words of the first text representation and the second text representation, and combining the subject words into a subject word set;
and receiving a subject text input by a user, and adding the subject text to the subject word set.
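The merging described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_topic_word_set` and the choice of a plain set of strings are hypothetical, and the patent does not specify how the topic word set is stored.

```python
from typing import Iterable, Set


def build_topic_word_set(
    first_words: Iterable[str],
    second_words: Iterable[str],
    user_topics: Iterable[str] = (),
) -> Set[str]:
    """Merge topic words from both text representations with user-supplied topics."""
    # Words extracted from the first and second text representations are combined.
    merged = set(first_words) | set(second_words)
    # User-entered topic text is added after trimming; empty entries are ignored.
    merged.update(t.strip() for t in user_topics if t.strip())
    return merged


if __name__ == "__main__":
    print(sorted(build_topic_word_set(["refund", "invoice"], ["refund"], [" delivery "])))
```

Using a set means a word found in both text representations is kept only once.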
According to the topic detection method provided by the invention, the first text representation comprises a first target representation and a first candidate representation, and the second text representation comprises a second target representation and a second candidate representation.
According to the topic detection method provided by the present invention, inputting the voice representation, the text representation and the topic word set into the topic detection layer to obtain a detection result comprises: inputting the first voice representation, the second voice representation, the first target representation, the first candidate representation, the second target representation, the second candidate representation and the topic word set into the topic detection layer, and performing topic detection to obtain the detection result.
The present invention also provides a topic detection apparatus, comprising:
the voice blocking module is used for acquiring a target voice block based on the voice data to be detected;
the topic detection module inputs the target voice block to a topic detection model and obtains a detection result output by the topic detection model;
the topic detection model is obtained by training based on sample text data, a labeled topic corresponding to the sample text data, sample voice data and a labeled topic corresponding to the sample voice data; the sample text data comprises a field corpus and a general corpus;
the topic detection model is used for carrying out voice recognition on the target voice block to obtain a voice discrete representation and text content, fusing a natural language processing result of the text content with the voice discrete representation, and carrying out topic detection to obtain the detection result.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the program to realize the topic detection method.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a topic detection method as described in any one of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the topic detection method as described in any one of the above.
According to the topic detection method and device, the electronic equipment and the storage medium provided by the invention, the target voice block is generated from the voice data to be detected and input into the topic detection model for processing. The model performs automatic voice recognition on the target voice block to obtain the corresponding voice discrete representation and text content, applies natural language processing to the text content, and fuses the result with the voice discrete representation to perform topic detection. Because the voice discrete representation supplements the text obtained by automatic voice recognition, loss of voice information is avoided and the recognition accuracy of topic detection is improved to a certain extent.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a topic detection method provided by the present invention;
FIG. 2 is a schematic structural diagram of a topic detection device provided by the present invention;
FIG. 3 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms first, second and the like in the description and in the claims of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that embodiments of the application may be practiced in sequences other than those illustrated or described herein, and that the terms "first," "second," and the like are generally used herein in a generic sense and do not limit the number of terms, e.g., the first term can be one or more than one.
It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
FIG. 1 is a schematic flow chart of the topic detection method provided by the present invention. As shown in FIG. 1, the topic detection method provided in the embodiment of the present invention includes: Step 101, acquiring a target voice block based on voice data to be detected.
The topic detection method provided in the embodiment of the present invention is executed by a topic detection device.
The topic detection method provided by the embodiment of the invention applies to scenarios in which enterprises and organizations communicate by voice or video. Effective topic knowledge is obtained from a large number of voice call records through structured analysis, for subsequent examination and review.
The topic detection method provided by the embodiment of the application is suitable for the topic detection of any voice data by a user through electronic equipment.
The electronic device described above may be implemented in various forms. For example, the electronic devices described in the embodiments of the present application may include mobile terminals such as a mobile phone, a smart phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a navigation device, a smart band, a smart hand, a digital camera, and the like, and fixed terminals such as a digital TV, a desktop computer, and the like. In the following, it is assumed that the electronic device is a mobile terminal. However, it will be understood by those skilled in the art that the configuration according to the embodiment of the present application can be applied to a fixed type terminal in addition to elements particularly for moving purposes.
Specifically, in step 101, the topic detection apparatus partitions the voice data to be detected into blocks according to a preset interval, and acquires a target voice block.
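As a minimal sketch of the blocking in step 101, the following Python function cuts a mono sample sequence into consecutive blocks of a fixed duration. The function name, the parameters `sample_rate` and `interval_s`, and the decision to keep a shorter final block are all assumptions made for illustration; the patent only states that blocks are cut at a preset interval.

```python
from typing import List, Sequence


def split_into_blocks(
    samples: Sequence[float], sample_rate: int, interval_s: float
) -> List[Sequence[float]]:
    """Cut a mono sample sequence into consecutive blocks of interval_s seconds."""
    if sample_rate <= 0 or interval_s <= 0:
        raise ValueError("sample_rate and interval_s must be positive")
    block_len = max(1, int(round(sample_rate * interval_s)))
    # The final block may be shorter than the others; it is kept, not dropped.
    return [samples[i:i + block_len] for i in range(0, len(samples), block_len)]


if __name__ == "__main__":
    audio = [0.0] * 40000  # 2.5 s of silence at 16 kHz
    blocks = split_into_blocks(audio, sample_rate=16000, interval_s=1.0)
    print([len(b) for b in blocks])  # [16000, 16000, 8000]
```

Each element of the returned list would then play the role of one target voice block.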
Step 102, inputting the target voice block into the topic detection model and obtaining a detection result output by the topic detection model.
The topic detection model is obtained by training based on the sample text data, the labeled topic corresponding to the sample text data, the sample voice data and the labeled topic corresponding to the sample voice data. The sample text data includes a domain corpus and a general corpus.
The topic detection model is used for performing voice recognition on the target voice block to obtain a voice discrete representation and text content, fusing the natural language processing result of the text content with the voice discrete representation, and performing topic detection to obtain the detection result.
It should be noted that the topic detection model is obtained by semantically fusing the sample text data, the topics pre-labeled for the sample text data, the sample voice data, and the topics pre-labeled for the sample voice data into an overall sample set, and training on that sample set.
The topic detection model can be an artificial intelligence model, and the embodiment of the invention does not specifically limit the model type.
For example, the topic detection model can be a neural network model consisting of an input layer, a hidden layer, and an output layer. The hidden layer comprises at least an ASR module and an NLP module. The ASR module processes the speech to obtain the corresponding discrete features and text content, and the NLP module preliminarily extracts subject words from the text content. Finally, topic detection is performed after the discrete features of the voice data, the recognized text content and the subject words are fused.
The sample data includes sample voice data, the corresponding sample text data, and the corresponding label content. The sample data is divided into a training set and a test set; the embodiment of the invention does not specifically limit the content or proportion of the samples in the training set and the test set.
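One possible way to divide the samples is a seeded random split, sketched below. The 80/20 proportion, the seed, and the function name `split_samples` are illustrative assumptions only, since the embodiment leaves the content and proportion of the two sets open.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(
    samples: Sequence[T], test_ratio: float = 0.2, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the samples reproducibly and return (training set, test set)."""
    if not 0.0 <= test_ratio <= 1.0:
        raise ValueError("test_ratio must lie in [0, 1]")
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded, so the split is repeatable
    n_test = int(len(items) * test_ratio)
    return items[n_test:], items[:n_test]


if __name__ == "__main__":
    train, test = split_samples(list(range(10)))
    print(len(train), len(test))  # 8 2
```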
Preferably, the sample text data contains a domain corpus and a general corpus, and human annotators manually label the topic content contained in each item of sample text data.
The domain corpus refers to text content with strong industry characteristics, such as domain questions and scripts, knowledge bases, FAQs and work orders.
The general corpus refers to text content covering common fields.
The sample voice data comprises a voice corpus, and a plurality of human annotators manually label the topic content contained in each item of the voice corpus.
Specifically, in step 102, the topic detection device configures the topic detection model with the trained model parameters and then uses the model to perform topic detection on the target voice block generated in step 101, obtaining the detection result corresponding to that target voice block.
The detection result may be a combination of probability values or keyword text content; the embodiment of the present invention does not specifically limit its form.
If the detection result is a combination of probability values, each probability value indicates the probability that a given keyword is present in the target voice block.
If the detection result is keyword text content, an intermediate numerical result is first obtained through the model; if the numerical result meets a preset judgment condition, the corresponding keyword text content is assigned to the target voice block.
For example, the topic detection model produces an intermediate numerical result representing the probability that a certain keyword is present in the target voice block. According to the preset judgment condition, when the probability is greater than a preset threshold, a label indicating that the target voice block contains the keyword is generated, and the detection text content corresponding to the label is assigned to the target voice block.
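The threshold rule in this example can be sketched as follows. The function name, the dictionary of per-keyword probabilities, and the default threshold of 0.5 are assumptions for illustration; the patent only requires that the probability exceed a preset threshold.

```python
from typing import Dict, List


def keywords_above_threshold(probs: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the keywords whose probability exceeds the threshold, most probable first."""
    hits = [(kw, p) for kw, p in probs.items() if p > threshold]
    hits.sort(key=lambda item: item[1], reverse=True)
    return [kw for kw, _ in hits]


if __name__ == "__main__":
    scores = {"refund": 0.91, "invoice": 0.12, "delivery": 0.67}
    print(keywords_above_threshold(scores))  # ['refund', 'delivery']
```

Here the dictionary stands in for the probability-combination form of the detection result, and the returned list stands in for the keyword text content form.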
Conventionally, for topic detection on voice data, an independent ASR system converts the voice data into the corresponding text data, and an independent NLP system then performs topic detection on the text data.
In the embodiment of the invention, the ASR processing system built into the topic detection model converts the voice data into the corresponding discrete features and text data, and the NLP processing system then extracts subject words from the text data output by the ASR. The extracted subject words are fused, by splicing, with the discrete features of the voice data and the text data, and the fused result serves as the input for topic detection. The ASR and NLP processing systems can thus be tightly coupled, their processing results carry richer semantic information, and the accuracy of topic detection is improved.
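A minimal sketch of fusion by splicing, assuming every input has already been mapped to integer ids, is shown below. The id encoding, the separator value `-1`, the `vocab` lookup table and the function name `splice` are hypothetical; the patent does not describe the encoding used for the spliced input.

```python
from typing import Dict, Iterable, List


def splice(
    voice_ids: List[int],
    text_ids: List[int],
    topic_words: Iterable[str],
    vocab: Dict[str, int],
    sep: int = -1,
) -> List[int]:
    """Concatenate voice ids, text ids and topic-word ids into one input sequence."""
    # Topic words are sorted so the spliced sequence does not depend on set order.
    topic_ids = [vocab[w] for w in sorted(topic_words) if w in vocab]
    fused: List[int] = []
    for part in (voice_ids, text_ids, topic_ids):
        fused.extend(part)
        fused.append(sep)  # a separator marks the end of each source
    return fused


if __name__ == "__main__":
    print(splice([3, 7], [12, 5], {"refund"}, {"refund": 42}))
    # [3, 7, -1, 12, 5, -1, 42, -1]
```

Keeping a separator between the parts lets a downstream detector tell which source each id came from.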
The embodiment of the invention generates the target voice block from the voice data to be detected and inputs it into the topic detection model for processing. The model performs automatic voice recognition on the target voice block to obtain the corresponding voice discrete representation and text content, applies natural language processing to the text content, and fuses the result with the voice discrete representation to perform topic detection. Because the voice discrete representation supplements the text obtained by automatic voice recognition, loss of voice information is avoided and the recognition accuracy of topic detection is improved to a certain extent.
On the basis of any embodiment, the topic detection model comprises a voice recognition layer, a subject word recognition extraction layer and a topic detection layer.
The voice recognition layer is used for performing voice recognition on the target voice block. The subject word recognition extraction layer is used for extracting subject words from the recognized text content. The topic detection layer is used for performing topic detection after the outputs of the voice recognition layer and the subject word recognition extraction layer are fused.
Specifically, the topic detection model built into the topic detection device is composed of an input layer, a hidden layer and an output layer. Wherein:
The input layer, located at the front of the network, directly receives the target voice block generated in step 101.
The hidden layer performs automatic voice recognition on the target voice block to obtain the discrete features and text content of the corresponding voice. It then performs natural language processing on the text content, splices and fuses the processing result with the discrete representation of the voice, and performs topic detection on the fused information to obtain a detection result containing the topics in the voice data.
The output layer is the last layer and outputs the detection result on the topics contained in the target voice block. The type of the detection result depends on the requirements: it may be a classification vector, a continuous value such as that produced by linear regression, or another more complex type of value or vector, which is not specifically limited in the embodiment of the present invention.
The structure of the hidden layer is not particularly limited in the embodiments of the present invention.
Preferably, the hidden layer comprises at least three layers, namely a voice recognition layer, a subject word recognition extraction layer and a topic detection layer. Wherein:
The voice recognition layer is used for performing voice recognition on the target voice block to obtain the corresponding discrete features and the transcribed text content.
The subject word recognition extraction layer is used for performing preliminary topic extraction on the text content transcribed by the voice recognition layer and storing the keywords contained in the text into a topic word set.
The topic detection layer is used for fusing the voice discrete features and transcribed text content output by the voice recognition layer with the topic word set output by the subject word recognition extraction layer, performing topic detection, and obtaining the keywords contained in the voice, namely the topics in the domain.
The training process of the topic detection model in the embodiment of the present invention is not particularly limited.
Illustratively, the specific implementation process of the training topic detection model is as follows:
(1) Mine, collect and clean domain-specific nouns and new words as part of a system dictionary (namely, the domain corpus) for subsequent recognition.
(2) Clean historical text data with strong industry characteristics, such as domain questions and scripts, knowledge bases, FAQs and work orders, as the domain corpus.
(3) Train a domain language model on the domain corpus, and interpolate and fuse it with the built-in general language model to generate the first language model LM_word.
(4) Convert the cleaned domain corpus and the general corpus into syllables, use each as a training set to train a language model, and fuse the two trained language models into the second language model LM_syllabe.
(5) Using the first language model LM_word and the second language model LM_syllabe, each combined with the acoustic model, perform the quantization operation on the sample voice data to obtain the discrete representations SL1 and SL2.
(6) Convert the sample text data M into syllables S; randomly replace the entity words in M with homophones/similar words and denote the replaced text as J; convert J into syllables and denote them as K.
(7) Quantize M, S, J and K to obtain the text representation ML.
(8) Perform subject word extraction on the text representation ML, and take all the obtained subject words as the topic word set T.
(9) Fuse and splice the discrete representations SL1, SL2 and ML with the topic word set T, and use the fused result as the input of the upstream task to train the topic detection model.
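Step (6) above can be sketched in Python as follows. The pronunciation lexicon, the homophone table, the toy English tokens and the function names are hypothetical stand-ins chosen so the example is self-contained; the patent does not say how syllable conversion or homophone selection is implemented.

```python
import random
from typing import Dict, List, Sequence


def to_syllables(tokens: Sequence[str], lexicon: Dict[str, List[str]]) -> List[str]:
    """Map each token to its syllables; unknown tokens are passed through unchanged."""
    out: List[str] = []
    for tok in tokens:
        out.extend(lexicon.get(tok, [tok]))
    return out


def replace_entities(
    tokens: Sequence[str], homophones: Dict[str, List[str]], rng: random.Random
) -> List[str]:
    """Randomly swap each entity word that has homophones for one of them."""
    return [rng.choice(homophones[t]) if t in homophones else t for t in tokens]


def build_variants(
    tokens: Sequence[str],
    lexicon: Dict[str, List[str]],
    homophones: Dict[str, List[str]],
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Produce the four views M, S, J, K named in step (6)."""
    rng = random.Random(seed)
    m = list(tokens)                          # M: original text
    s = to_syllables(m, lexicon)              # S: syllables of M
    j = replace_entities(m, homophones, rng)  # J: text with homophone substitutions
    k = to_syllables(j, lexicon)              # K: syllables of J
    return {"M": m, "S": s, "J": j, "K": k}


if __name__ == "__main__":
    lexicon = {"call": ["kol"], "wright": ["rait"], "write": ["rait"]}
    homophones = {"wright": ["write"]}
    print(build_variants(["call", "wright"], lexicon, homophones))
```

In this toy case S and K are identical, which illustrates why the syllable views are useful: the substituted text J differs from M, but its pronunciation does not.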
In the embodiment of the invention, the voice recognition layer performs voice recognition on the target voice block, and the subject word recognition extraction layer extracts subject words from the text content output by the voice recognition layer. The topic detection layer then splices and fuses the voice discrete features and text content output by the voice recognition layer with the extraction result of the subject word recognition extraction layer, and performs topic detection. Because the voice discrete representation supplements the text obtained by automatic voice recognition, loss of voice information is avoided and the recognition accuracy of topic detection is improved to a certain extent.
On the basis of any of the above embodiments, inputting the target voice block into the topic detection model and obtaining the detection result output by the topic detection model includes: inputting the target voice block into the voice recognition layer to obtain a voice representation and a text representation.
Specifically, in step 102, the voice recognition layer of the topic detection model receives the target voice block from step 101, quantizes it to obtain a discrete voice representation, and obtains the recognized text representation.
The text representation is input into the topic word recognition and extraction layer to obtain a topic word set.
Specifically, the topic word recognition and extraction layer of the topic detection model receives the text representation output by the voice recognition layer, recognizes and extracts the topic words, and combines them into a topic word set.
The voice representation, the text representation and the topic word set are input into the topic detection layer to obtain the detection result.
Specifically, the topic detection layer of the topic detection model receives the voice representation and the text representation output by the voice recognition layer and the topic word set output by the topic word recognition extraction layer, splices and fuses them, and then performs topic detection to obtain the corresponding detection result.
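The data flow through the three layers can be sketched as below. This is only an illustration of how the outputs are passed along: the class and field names are hypothetical, and the three toy functions are placeholders that stand in for trained components rather than implementations of them.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Set, Tuple

Vector = List[float]


@dataclass
class TopicDetectionModel:
    # Each layer is injected as a callable so the wiring can be shown in isolation.
    speech_recognition: Callable[[Sequence[float]], Tuple[Vector, str]]
    topic_word_extraction: Callable[[str], Set[str]]
    topic_detection: Callable[[Vector, str, Set[str]], List[str]]

    def detect(self, block: Sequence[float]) -> List[str]:
        # Layer 1: voice block -> (voice representation, text representation)
        voice_repr, text_repr = self.speech_recognition(block)
        # Layer 2: text representation -> topic word set
        topic_words = self.topic_word_extraction(text_repr)
        # Layer 3: all three outputs are passed on together for topic detection
        return self.topic_detection(voice_repr, text_repr, topic_words)


def toy_asr(block: Sequence[float]) -> Tuple[Vector, str]:
    # Placeholder: a fixed transcript and a one-number "representation".
    return [float(len(block))], "refund for late delivery"


def toy_extract(text: str) -> Set[str]:
    known = {"refund", "delivery"}  # placeholder topic vocabulary
    return {w for w in text.split() if w in known}


def toy_detect(voice_repr: Vector, text_repr: str, topic_words: Set[str]) -> List[str]:
    return sorted(topic_words)  # placeholder decision rule


if __name__ == "__main__":
    model = TopicDetectionModel(toy_asr, toy_extract, toy_detect)
    print(model.detect([0.1, 0.2]))  # ['delivery', 'refund']
```

Passing the voice representation to the last layer alongside the text is the point of the design: the detector is not limited to what survived transcription.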
In the embodiment of the invention, the voice recognition layer obtains the voice representation and the text representation corresponding to the target voice block, the subject word recognition and extraction layer extracts the subject word set from the text representation, and the topic detection layer splices and fuses the voice representation, the text representation and the subject word set and then performs topic detection. Because the voice discrete representation supplements the text obtained by automatic voice recognition, loss of voice information is avoided and the recognition accuracy of topic detection is improved to a certain extent.
On the basis of any of the above embodiments, inputting the target speech block into the speech recognition layer to obtain the speech representation and the text representation includes: performing a quantization operation on the target voice block using the first language model and the second language model respectively, each in combination with the acoustic model, to obtain a first voice representation and a second voice representation.
The first language model is obtained by fusing a language model based on field corpus training and a language model based on general corpus training, and the second language model is obtained by fusing a language model based on field corpus syllable training and a language model based on general corpus syllable training.
The first language model is generated by interpolating and fusing a domain language model trained using the domain corpus and a built-in general language model (or a general language model trained using the general corpus).
The first language model is used to perform speech to text conversion.
The second language model is generated by interpolating and fusing the language model trained by the syllables converted by the domain corpus and the language model trained by the syllables converted by the general corpus.
The second language model is used to perform speech to syllable conversion.
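The text states that each language model is produced by interpolating and fusing a domain model with a general model, but it does not give the formula. A common choice is linear interpolation, P(w) = λ·P_domain(w) + (1 − λ)·P_general(w); the sketch below assumes that form and uses unigram models for brevity. The function names, the weight `lam` and the toy corpora are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def unigram_lm(tokens: Iterable[str]) -> Dict[str, float]:
    """Maximum-likelihood unigram probabilities from a token stream."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def interpolate(domain: Dict[str, float], general: Dict[str, float], lam: float) -> Dict[str, float]:
    """Linear interpolation: lam * P_domain + (1 - lam) * P_general."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    vocab = set(domain) | set(general)
    return {
        tok: lam * domain.get(tok, 0.0) + (1.0 - lam) * general.get(tok, 0.0)
        for tok in vocab
    }


# Word-level tokens sketch the first language model; feeding syllable
# tokens through the same functions would sketch the second one.
domain_lm = unigram_lm("personal insurance premium insurance".split())
general_lm = unigram_lm("hello insurance weather hello".split())
fused = interpolate(domain_lm, general_lm, lam=0.5)
```

Because both inputs are probability distributions and the weights sum to one, the fused model is again a distribution over the union vocabulary.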
Specifically, in the processing process of the voice recognition layer, the topic detection device performs quantization operation on the target voice block by combining the first language model with the acoustic model, and obtains a discretized first voice representation. Similarly, the second language model is combined with the acoustic model to perform quantization operation on the target speech block, and a discretized second speech representation is obtained.
The first speech representation refers to the discrete representation corresponding to the target speech block in the speech-to-text conversion process. Illustratively, taking the model training process described above as an example, the first speech representation is denoted SL1.
The second speech representation refers to the discrete representation corresponding to the target speech block in the speech-to-syllable conversion process. Illustratively, taking the model training process described above as an example, the second speech representation is denoted SL2.
And respectively carrying out text recognition conversion on the target voice block by utilizing the first language model and the second language model and combining the acoustic model to obtain a corresponding first text representation and a corresponding second text representation.
Specifically, in the process that the speech recognition layer adopts the first language model and the second language model and is respectively combined with the acoustic model for processing, the target speech block is further recognized and processed to obtain a first text representation and a second text representation corresponding to the target speech block.
The first text representation refers to the result of speech-to-text conversion, namely the text content corresponding to the target voice block.
The second text representation refers to the result of speech-to-syllable conversion, namely the syllable content corresponding to the target voice block. Illustratively, taking the above model training process as an example, the first text representation and the second text representation may together constitute ML, on which the subject word recognition extraction layer performs subject extraction.
The embodiment of the invention processes the target speech block with the first language model and the second language model built into the speech recognition layer, each combined with the acoustic model, and obtains the first speech representation and the first text representation, as well as the second speech representation and the second text representation, corresponding to the target speech block. Text content is obtained through automatic speech recognition while the phonetic structure is also taken into account, which avoids information loss during speech recognition and in turn improves the recognition accuracy of topic detection.
On the basis of any of the above embodiments, inputting the text representation into the topic word recognition extraction layer to obtain the topic word set includes: identifying and extracting subject words from the first text representation and the second text representation respectively, and combining the subject words into a subject word set.
Specifically, in the processing process of the subject word recognition and extraction layer, the topic detection device splices and fuses a first text representation and a second text representation corresponding to a target speech block output by the speech recognition layer, extracts a subject word, and adds a main feature component (i.e., a subject word) of the extracted data to a subject word set corresponding to the target speech block.
The subject term extraction algorithm includes, but is not limited to, Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation model (LDA), Latent Semantic Analysis (LSA), and other subject model algorithms.
And receiving a subject text input by a user, and adding the subject text to the subject word set.
Specifically, after the topic word recognition and extraction layer obtains the topic word set, the information input by the user can be received, and the topic text extracted from the input information is added to the topic word set.
The embodiment of the invention does not specifically limit the way of artificially expanding the subject word set.
Alternatively, the user may enter custom topic text and add it as a set of seed words to the set of topic words.
Optionally, the user may determine a topic according to task requirements, and add the topic word set corresponding to that topic as a seed word set.
It can be understood that topic-related scripts updated in real time in the field can, after cleaning, be added to the general corpus in the sample text data, and the topic detection model can be retrained with them as conditional feedback to form a closed loop, so that the range of detectable topic types grows and the model performance improves continuously.
The embodiment of the invention carries out feature extraction on the fused first text representation and second text representation based on the subject word recognition and extraction layer to generate a subject word set. The subject word set is further expanded with subject text received from the user. This improves both the breadth and the accuracy of topic detection.
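The merging and user expansion of the subject word set can be sketched as follows. The extraction step itself (PLSA, LDA, LSA or another topic model) is deliberately left out; the sketch assumes subject words have already been extracted from each text representation, and the function name `merge_topic_words` and the sample inputs are hypothetical.

```python
from typing import Iterable, List


def merge_topic_words(
    from_text: Iterable[str],
    from_syllables: Iterable[str],
    user_seeds: Iterable[str] = (),
) -> List[str]:
    """Combine extracted subject words with user-supplied seed words.

    Entries are normalized (trimmed, lower-cased) and de-duplicated while
    keeping first-seen order, so extracted words come before seed words.
    """
    merged: List[str] = []
    seen = set()
    for word in [*from_text, *from_syllables, *user_seeds]:
        key = word.strip().lower()
        if key and key not in seen:
            seen.add(key)
            merged.append(key)
    return merged


topic_words = merge_topic_words(
    from_text=["personal insurance"],
    from_syllables=["ren|shen|bao|xian"],
    user_seeds=["Personal Insurance", "claims"],  # custom topic text from the user
)
```

Keeping the result ordered rather than as a plain set makes the output reproducible, which is convenient when the set is later turned into model input.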
On the basis of any of the above embodiments, the first text representation comprises a first target representation and a first candidate representation, and the second text representation comprises a second target representation and a second candidate representation.
Specifically, in the process that the speech recognition layer adopts the first language model and combines with the acoustic model for processing, the target speech block is further recognized and processed, and the obtained first text representation comprises a first target representation and a plurality of first candidate representations.
The first target representation refers to an optimal recognition result converted from voice to text.
The first candidate representation refers to a suboptimal recognition result of converting speech into text. Suboptimal results are kept as candidate recognition results to guard against cases in which environmental factors mislead recognition and the result ranked as optimal is itself wrong.
Similarly, during the processing of the speech recognition layer using the second language model in combination with the acoustic model, the target speech block is further subjected to recognition processing, and the obtained second text representation includes a second target representation and a plurality of second candidate representations.
The second target representation refers to an optimal recognition result converted from voice into syllables.
The second candidate representation refers to a suboptimal recognition result of converting speech into syllables. Suboptimal results are kept as candidate recognition results to guard against cases in which environmental factors mislead recognition and the result ranked as optimal is itself wrong.
Exemplarily, taking as an example a target speech block in which the user says "May I ask which kinds of personal insurance are available" (qing wen you na xie ren shen bao xian), a specific implementation process of the speech recognition layer is given below:
The target speech block is recognized by combining the first language model LM_word built into the speech recognition layer with the acoustic model, and a first target representation R1 and a plurality of first candidate representations M1 are obtained, wherein:
R1: ask for which kinds of personal insurance
M1: asking for which kinds of personal insurance
    asking about what kind of ginseng is fresh
    ask about what kind of personal insurance
Correspondingly, the target speech block is recognized by combining the second language model LM_syllabe built into the speech recognition layer with the acoustic model, and a second target representation R2 and a plurality of second candidate representations M2 are obtained, wherein:
R2: qing|wen|you|na|xie|ren|shen|bao|xian
M2: qing|wen|you|na|xie|ren|shen|bao|xian
    qing|wen|you|na|xie|ren|shen|tou|xian
    qin|wen|you|na|xie|ren|shen|bao|xian
The embodiment of the invention outputs, based on the voice recognition layer, the optimal first target representation and second target representation together with a plurality of first candidate representations and second candidate representations as alternatives, so that the subject word recognition extraction layer can perform comprehensive subject word extraction on the rich output of the voice recognition layer, which in turn improves the breadth of topic detection and the accuracy of topic recognition.
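One way to hold a target representation together with its candidates is a small n-best container, sketched below. The class `TextRepresentation`, the helper `split_nbest` and the scores are hypothetical; the patent does not specify how hypotheses are scored, so the sketch simply assumes that a higher score means a better hypothesis and treats only the lower-ranked hypotheses as candidates.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TextRepresentation:
    target: str                                           # optimal recognition result
    candidates: List[str] = field(default_factory=list)   # suboptimal results

    def all_hypotheses(self) -> List[str]:
        """Target first, then candidates, without duplicates."""
        out = [self.target]
        for cand in self.candidates:
            if cand not in out:
                out.append(cand)
        return out


def split_nbest(nbest: List[Tuple[str, float]]) -> TextRepresentation:
    """Split scored hypotheses into the best one and the remaining ones."""
    if not nbest:
        raise ValueError("nbest must contain at least one hypothesis")
    ranked = sorted(nbest, key=lambda pair: pair[1], reverse=True)
    return TextRepresentation(ranked[0][0], [hyp for hyp, _ in ranked[1:]])


# Syllable hypotheses from the example above; the scores are made up.
second_text = split_nbest([
    ("qing|wen|you|na|xie|ren|shen|tou|xian", -14.1),
    ("qing|wen|you|na|xie|ren|shen|bao|xian", -12.3),
    ("qin|wen|you|na|xie|ren|shen|bao|xian", -15.0),
])
```

Downstream layers can call `all_hypotheses()` to consume the target and the candidates in rank order.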
On the basis of any one of the above embodiments, inputting the voice representation, the text representation and the topic word set into the topic detection layer to obtain a detection result, including: and inputting the first voice characterization, the second voice characterization, the first target characterization, the first candidate characterization, the second target characterization, the second candidate characterization and the topic word set into a topic detection layer, and performing topic detection to obtain a detection result.
Specifically, in the processing of the topic detection layer, the topic detection device splices and fuses the first voice representation and the second voice representation obtained by the voice recognition layer quantizing the target voice block, the first target representation, the first candidate representation, the second target representation and the second candidate representation output by the voice recognition layer, and the topic word set output by the topic word recognition extraction layer. The fused result is used as the input of the topic detection layer for topic detection, and the detected keyword is used as the detection result.
According to the embodiment of the invention, after the first voice representation, the second voice representation, the first target representation, the first candidate representation, the second target representation, the second candidate representation and the subject word set which are output by the voice recognition layer and the subject word recognition extraction layer are fused based on the topic detection layer, the topic detection layer can comprehensively detect topics according to rich input information, and further the topic detection breadth and the recognition accuracy rate can be improved.
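The patent does not say how the inputs are spliced and fused. One simple reading is feature concatenation, sketched below under that assumption: each discrete voice representation becomes a normalized code histogram, the text hypotheses become a bag-of-words vector, and the topic word set becomes an indicator vector over the same vocabulary. All names, the codebook size and the vocabulary are illustrative.

```python
from typing import Iterable, List, Sequence, Set


def code_histogram(codes: Sequence[int], codebook_size: int) -> List[float]:
    """Normalized histogram of discrete speech codes."""
    hist = [0.0] * codebook_size
    for code in codes:
        hist[code] += 1.0
    n = len(codes) or 1  # avoid division by zero for an empty sequence
    return [h / n for h in hist]


def bag_of_words(texts: Iterable[str], vocab: Sequence[str]) -> List[float]:
    """Binary indicator of which vocabulary words occur in any hypothesis."""
    tokens = {w for t in texts for w in t.lower().split()}
    return [1.0 if w in tokens else 0.0 for w in vocab]


def fuse(
    sl1: Sequence[int],
    sl2: Sequence[int],
    text_hypotheses: Iterable[str],
    topic_words: Set[str],
    vocab: Sequence[str],
    codebook_size: int,
) -> List[float]:
    """Concatenate all feature groups into one input vector."""
    return (
        code_histogram(sl1, codebook_size)
        + code_histogram(sl2, codebook_size)
        + bag_of_words(text_hypotheses, vocab)
        + [1.0 if w in topic_words else 0.0 for w in vocab]
    )


features = fuse(
    sl1=[0, 1, 1, 3],
    sl2=[2, 2],
    text_hypotheses=["ask for which kinds of personal insurance"],
    topic_words={"insurance"},
    vocab=["insurance", "loan"],
    codebook_size=4,
)
```

The resulting vector has a fixed length of `2 * codebook_size + 2 * len(vocab)`, so it could be fed to any standard classifier acting as the topic detection layer.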
Fig. 2 is a schematic structural diagram of the topic detection device provided in the present invention. On the basis of any of the above embodiments, as shown in fig. 2, the apparatus includes: a voice chunking module 210 and a topic detection module 220, wherein:
The voice blocking module 210 is configured to obtain a target voice block based on the voice data to be detected.
The topic detection module 220 inputs the target speech block to the topic detection model to obtain a detection result output by the topic detection model.
The topic detection model is obtained by training based on the sample text data, the labeled topic corresponding to the sample text data, the sample voice data and the labeled topic corresponding to the sample voice data. The sample text data includes a domain corpus and a general corpus.
And the topic detection model is used for carrying out voice recognition on the target voice block to obtain a voice discrete representation and text content, fusing a natural language processing result of the text content with the voice discrete representation, and then carrying out topic detection to obtain a detection result.
Specifically, the voice chunking module 210 and the topic detection module 220 are electrically connected in sequence.
The voice blocking module 210 blocks the voice data to be detected according to the preset interval to obtain a target voice block.
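Blocking by a preset interval can be sketched as fixed-length, non-overlapping slicing of the sample sequence. The interpretation of the interval as a duration in seconds, and the names `chunk_speech`, `sample_rate` and `interval_s`, are assumptions made for illustration.

```python
from typing import List, Sequence


def chunk_speech(
    samples: Sequence[float], sample_rate: int, interval_s: float
) -> List[Sequence[float]]:
    """Split audio samples into consecutive blocks of interval_s seconds.

    The last block may be shorter when the audio length is not an exact
    multiple of the block size.
    """
    if sample_rate <= 0 or interval_s <= 0:
        raise ValueError("sample_rate and interval_s must be positive")
    size = max(1, int(round(sample_rate * interval_s)))
    return [samples[i:i + size] for i in range(0, len(samples), size)]


# Ten samples at 4 samples per second, blocked every 1.0 second.
blocks = chunk_speech(list(range(10)), sample_rate=4, interval_s=1.0)
```

Each element of `blocks` would then serve as one target voice block for the topic detection module.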
After configuring the topic detection model with the trained model parameters, the topic detection module 220 uses the model to perform topic detection on the target speech block generated by the voice blocking module 210, so as to obtain the detection result corresponding to the target speech block to be detected.
Optionally, the topic detection model comprises a speech recognition layer, a subject word recognition extraction layer and a topic detection layer.
The voice recognition layer is used for carrying out voice recognition on the target voice block. And the subject term identification and extraction layer is used for extracting the subject terms from the identified text content. And the topic detection layer is used for carrying out topic detection after the output contents of the voice recognition layer and the subject word recognition extraction layer are fused.
Optionally, the topic detection module 220 comprises a speech recognition unit, a natural language processing unit and a topic detection unit, wherein:
and the voice recognition unit is used for inputting the target voice block into the voice recognition layer to obtain voice representation and text representation.
And the natural language processing unit is used for inputting the text representation into the subject word recognition and extraction layer to obtain a subject word set.
And the topic detection unit is used for inputting the voice representation, the text representation and the topic word set into the topic detection layer to obtain a detection result.
Optionally, the speech recognition unit comprises a quantization subunit and a recognition subunit, wherein:
and the quantization subunit is used for performing a quantization operation on the target speech block by respectively utilizing the first language model and the second language model, each combined with the acoustic model, to acquire a first speech representation and a second speech representation.
And the recognition subunit is used for respectively performing text recognition conversion on the target speech block by respectively utilizing the first language model and the second language model and combining the acoustic model to acquire a corresponding first text representation and a corresponding second text representation.
The first language model is obtained by fusing a language model based on field corpus training and a language model based on general corpus training, and the second language model is obtained by fusing a language model based on field corpus syllable training and a language model based on general corpus syllable training.
Optionally, the natural language processing unit includes an extraction subunit and an expansion subunit, wherein:
and the extraction subunit is used for identifying and extracting subject words from the first text representation and the second text representation respectively, and combining the subject words into a subject word set.
And the expansion subunit is used for receiving the subject text input by the user and adding the subject text to the subject word set.
Optionally, the first text representation comprises a first target representation and a first candidate representation, and the second text representation comprises a second target representation and a second candidate representation.
Optionally, the topic detection unit is specifically configured to input the first voice representation, the second voice representation, the first target representation, the first candidate representation, the second target representation, the second candidate representation and the topic word set into the topic detection layer, perform topic detection, and acquire a detection result.
The topic detection device provided in the embodiment of the present invention is configured to execute the topic detection method provided in the present invention, and an implementation manner of the topic detection device is consistent with an implementation manner of the topic detection method provided in the present invention, and the same beneficial effects can be achieved, and details are not described here again.
The embodiment of the invention generates the target voice block based on the voice data to be detected, inputs the target voice block into the topic detection model for optimization processing, obtains the corresponding voice discrete representation and text content after performing automatic voice recognition on the target voice block, and performs topic detection by fusing the voice discrete representation after performing natural language processing on the text content. The method and the device can perform automatic voice recognition to obtain text content, perform natural language processing, and simultaneously supplement voice discrete representation information, thereby avoiding voice information loss and improving the recognition accuracy of topic detection to a certain extent.
Fig. 3 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 3: a processor 310, a communication interface 320, a memory 330 and a communication bus 340, wherein the processor 310, the communication interface 320 and the memory 330 communicate with each other via the communication bus 340. The processor 310 may invoke logic instructions in the memory 330 to perform a topic detection method comprising: acquiring a target voice block based on voice data to be detected; inputting the target voice block to the topic detection model to obtain a detection result output by the topic detection model; the topic detection model is obtained by training based on sample text data, labeled topics corresponding to the sample text data, sample voice data and labeled topics corresponding to the sample voice data; the sample text data comprises a field corpus and a general corpus; and the topic detection model is used for carrying out voice recognition on the target voice block to obtain a voice discrete representation and text content, fusing a natural language processing result of the text content with the voice discrete representation, and then carrying out topic detection to obtain a detection result.
In addition, the logic instructions in the memory 330 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium, the computer program, when executed by a processor, being capable of executing the topic detection method provided by the above methods, the method comprising: acquiring a target voice block based on voice data to be detected; inputting a target voice block to the topic detection model to obtain a detection result output by the topic detection model; the topic detection model is obtained by training based on sample text data, labeled topics corresponding to the sample text data, sample voice data and labeled topics corresponding to the sample voice data; the sample text data comprises a field corpus and a general corpus; and the topic detection model is used for carrying out voice recognition on the target voice block to obtain a voice discrete representation and text content, fusing a natural language processing result of the text content with the voice discrete representation, and then carrying out topic detection to obtain a detection result.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the topic detection method provided by the above methods, the method comprising: acquiring a target voice block based on voice data to be detected; inputting the target voice block to the topic detection model to obtain a detection result output by the topic detection model; the topic detection model is obtained by training based on sample text data, labeled topics corresponding to the sample text data, sample voice data and labeled topics corresponding to the sample voice data; the sample text data comprises a field corpus and a general corpus; and the topic detection model is used for carrying out voice recognition on the target voice block to obtain a voice discrete representation and text content, fusing a natural language processing result of the text content with the voice discrete representation, and then carrying out topic detection to obtain a detection result.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (6)

1. A topic detection method, comprising:
acquiring a target voice block based on voice data to be detected;
inputting the target voice block to a topic detection model, and obtaining a detection result output by the topic detection model;
the topic detection model is obtained by training based on sample text data, labeled topics corresponding to the sample text data, sample voice data and labeled topics corresponding to the sample voice data; the sample text data comprises a field corpus and a general corpus;
the topic detection model is used for carrying out voice recognition on the target voice block to obtain a voice discrete representation and text content, fusing a natural language processing result of the text content with the voice discrete representation, and then carrying out topic detection to obtain a detection result;
the topic detection model comprises a voice recognition layer, a subject word recognition extraction layer and a topic detection layer;
the voice recognition layer is used for performing voice recognition on the target voice block; the subject term identification and extraction layer is used for extracting subject terms from the identified text content; the topic detection layer is used for carrying out topic detection after the output contents of the voice recognition layer and the subject term recognition extraction layer are fused;
The inputting the target voice block to a topic detection model, and obtaining a detection result output by the topic detection model, including:
inputting the target voice block into the voice recognition layer to obtain voice representation and text representation;
inputting the text representation into the subject word recognition and extraction layer to obtain a subject word set;
inputting the voice representation, the text representation and the topic word set into the topic detection layer to obtain a detection result;
inputting the target speech block into the speech recognition layer to obtain speech representation and text representation, including:
respectively utilizing a first language model and a second language model, and combining an acoustic model to carry out quantization operation on the target voice block to obtain a first voice representation and a second voice representation;
respectively performing text recognition conversion on the target voice block by respectively utilizing the first language model and the second language model and combining an acoustic model to obtain a corresponding first text representation and a corresponding second text representation;
the first language model is obtained by fusing a language model trained on the basis of the domain corpus and a language model trained on the basis of the general corpus, and the second language model is obtained by fusing a language model trained on the basis of syllables of the domain corpus and a language model trained on the basis of syllables of the general corpus;
Inputting the text representation into the topic word recognition extraction layer to obtain a topic word set, including:
respectively identifying and extracting subject words of the first text representation and the second text representation, and combining the subject words into a subject word set;
receiving a subject text input by a user, and adding the subject text to the subject word set;
wherein the first text representation is the result of speech-to-text conversion and the second text representation is the result of speech-to-syllable conversion.
2. The topic detection method of claim 1, wherein the first text representation comprises a first target representation and a first candidate representation, and the second text representation comprises a second target representation and a second candidate representation.
3. The topic detection method according to claim 2, wherein the inputting the voice representation, the text representation and the topic word set into the topic detection layer to obtain a detection result comprises: inputting the first voice characterization, the second voice characterization, the first target characterization, the first candidate characterization, the second target characterization, the second candidate characterization and the topic word set into the topic detection layer, performing topic detection, and obtaining the detection result.
4. A topic detection device, comprising:
the voice blocking module is used for acquiring a target voice block based on the voice data to be detected;
the topic detection module inputs the target voice block to a topic detection model and obtains a detection result output by the topic detection model;
the topic detection model is obtained by training based on sample text data, labeled topics corresponding to the sample text data, sample voice data and labeled topics corresponding to the sample voice data; the sample text data comprises a field corpus and a general corpus;
the topic detection model is used for carrying out voice recognition on the target voice block to obtain a voice discrete representation and text content, fusing a natural language processing result of the text content with the voice discrete representation, and then carrying out topic detection to obtain a detection result;
the topic detection model comprises a voice recognition layer, a subject word recognition extraction layer and a topic detection layer;
the voice recognition layer is used for performing voice recognition on the target voice block; the subject term identification and extraction layer is used for extracting subject terms from the identified text content; the topic detection layer is used for carrying out topic detection after the output contents of the voice recognition layer and the subject term recognition extraction layer are fused;
The topic detection module comprises a voice recognition unit, a natural language processing unit and a topic detection unit, wherein:
the voice recognition unit is used for inputting the target voice block into the voice recognition layer to obtain a voice representation and a text representation;
the natural language processing unit is used for inputting the text representation into the subject word recognition and extraction layer to obtain a subject word set;
the topic detection unit is used for inputting the voice representation, the text representation and the topic word set into the topic detection layer to obtain a detection result;
the voice recognition unit is specifically configured to perform quantization operation on the target voice block by using a first language model and a second language model respectively and combining an acoustic model to obtain a first voice characterization and a second voice characterization;
respectively performing text recognition conversion on the target voice block by respectively utilizing the first language model and the second language model and combining an acoustic model to obtain a corresponding first text representation and a corresponding second text representation;
the first language model is obtained by fusing a language model trained on the basis of the domain corpus and a language model trained on the basis of the general corpus, and the second language model is obtained by fusing a language model trained on the basis of syllables of the domain corpus and a language model trained on the basis of syllables of the general corpus;
the natural language processing unit is specifically configured to perform recognition and extraction on the first text representation and the second text representation respectively, and to combine the extraction results into a topic word set;
and to receive a subject text input by a user and add the subject text to the topic word set;
wherein the first text representation is the result of converting speech into text, and the second text representation is the result of converting speech into syllables.
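The data flow recited in claim 4 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the claim does not state how the domain and general language models are fused, so linear interpolation of unigram probabilities is used as a stand-in, and a simple lexicon lookup stands in for the subject word recognition and extraction layer. The names interpolate_lm, extract_topic_words and build_topic_word_set are hypothetical and do not come from the patent.

```python
from typing import Dict, Iterable, List, Set


def interpolate_lm(domain_lm: Dict[str, float],
                   general_lm: Dict[str, float],
                   weight: float = 0.5) -> Dict[str, float]:
    """Fuse two unigram models by linear interpolation.

    ASSUMPTION: the claim only says the models are "fused"; linear
    interpolation is one common choice, not the patent's stated method.
    """
    vocab = set(domain_lm) | set(general_lm)
    return {
        unit: weight * domain_lm.get(unit, 0.0)
        + (1.0 - weight) * general_lm.get(unit, 0.0)
        for unit in vocab
    }


def extract_topic_words(text_repr: List[str], lexicon: Set[str]) -> Set[str]:
    """Hypothetical stand-in for the subject word recognition and extraction layer."""
    return {unit for unit in text_repr if unit in lexicon}


def build_topic_word_set(first_text: List[str],
                         second_text: List[str],
                         lexicon: Set[str],
                         user_subject_text: Iterable[str] = ()) -> Set[str]:
    """Merge subject words from the word-level and syllable-level text
    representations, then add any subject text supplied by the user."""
    words = extract_topic_words(first_text, lexicon)
    words |= extract_topic_words(second_text, lexicon)
    words.update(user_subject_text)
    return words


# Toy usage: word-level and syllable-level models are fused the same way,
# each from a domain-corpus model and a general-corpus model.
first_lm = interpolate_lm({"refund": 0.6, "order": 0.4},
                          {"order": 0.5, "hello": 0.5}, weight=0.7)
topic_words = build_topic_word_set(
    first_text=["i", "want", "a", "refund"],   # speech converted to text
    second_text=["tui", "kuan"],               # speech converted to syllables
    lexicon={"refund", "tui"},
    user_subject_text=["invoice"],
)
```

With a weight between 0 and 1, the interpolated distribution still sums to 1 when both inputs do, which is why interpolation is a convenient placeholder for the fusion step. The topic detection layer itself is not sketched here, since the claim gives no detail on how it combines the voice representation, text representation and topic word set.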
5. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the topic detection method as claimed in any one of claims 1 to 3.
6. A non-transitory computer readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the topic detection method according to any one of claims 1 to 3.
CN202210279908.5A 2022-03-22 2022-03-22 Topic detection method and device, electronic equipment and storage medium Active CN114373448B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210279908.5A CN114373448B (en) 2022-03-22 2022-03-22 Topic detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210279908.5A CN114373448B (en) 2022-03-22 2022-03-22 Topic detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114373448A CN114373448A (en) 2022-04-19
CN114373448B CN114373448B (en) 2022-06-14

Family

ID=81145676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210279908.5A Active CN114373448B (en) 2022-03-22 2022-03-22 Topic detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114373448B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112307948A (en) * 2020-10-29 2021-02-02 北京嘀嘀无限科技发展有限公司 Feature fusion method, device and storage medium
CN112287675B (en) * 2020-12-29 2021-04-30 南京新一代人工智能研究院有限公司 Intelligent customer service intention understanding method based on text and voice information fusion
CN113257237B (en) * 2021-06-25 2021-10-22 北京沃丰时代数据科技有限公司 Voice interaction intention recognition method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114373448A (en) 2022-04-19

Similar Documents

Publication Publication Date Title
CN110096570B (en) Intention identification method and device applied to intelligent customer service robot
Macary et al. On the use of self-supervised pre-trained acoustic and linguistic features for continuous speech emotion recognition
CN110136749A (en) The relevant end-to-end speech end-point detecting method of speaker and device
CN111081280B (en) Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN112233680B (en) Speaker character recognition method, speaker character recognition device, electronic equipment and storage medium
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
CN112348111A (en) Multi-modal feature fusion method and device in video, electronic equipment and medium
CN111274412A (en) Information extraction method, information extraction model training device and storage medium
CN114360557A (en) Voice tone conversion method, model training method, device, equipment and medium
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
CN115798459B (en) Audio processing method and device, storage medium and electronic equipment
CN114969195B (en) Dialogue content mining method and dialogue content evaluation model generation method
Jia et al. A deep learning system for sentiment analysis of service calls
CN114373448B (en) Topic detection method and device, electronic equipment and storage medium
CN111414748A (en) Traffic data processing method and device
EP4024393A2 (en) Training a speech recognition model
CN114528851A (en) Reply statement determination method and device, electronic equipment and storage medium
CN115827831A (en) Intention recognition model training method and device
CN115705705A (en) Video identification method, device, server and storage medium based on machine learning
CN115081459B (en) Spoken language text generation method, device, equipment and storage medium
CN112466286A (en) Data processing method and device and terminal equipment
CN114938462B (en) Intelligent editing method, system, electronic equipment and storage medium of teaching video
CN114818644B (en) Text template generation method, device, equipment and storage medium
CN114420109B (en) Voice gender joint recognition method and device, electronic equipment and storage medium
CN117453895B (en) Intelligent customer service response method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant