CN115019802A - Speech intention recognition method and device, computer equipment and storage medium - Google Patents

Speech intention recognition method and device, computer equipment and storage medium

Info

Publication number
CN115019802A
Authority
CN
China
Prior art keywords
audio data
voiceprint
user
model
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210743461.2A
Other languages
Chinese (zh)
Inventor
南海顺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202210743461.2A
Publication of CN115019802A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G10L 17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 Voice signal separating
    • G10L 21/028 Voice signal separating using properties of sound source
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the field of artificial intelligence and discloses a speech intention recognition method, apparatus, computer device and storage medium. When the method is executed, it is first judged whether the audio data contains the voices of multiple people. If so, voice separation is performed to obtain the individual audio data of the different speakers, and the corresponding voiceprint features are extracted. A target voiceprint feature that is the same as the pre-stored voiceprint feature of the first ID user is then searched for among the extracted voiceprint features, and the individual audio data corresponding to the target voiceprint feature is obtained; this target individual audio data can be regarded as the speech of the first ID user. Finally, text conversion and intention recognition are performed on the target individual audio data, so that the real intention of the first ID user is obtained after the interference of other people's audio has been eliminated, which improves the accuracy of intention recognition when an intelligent outbound robot encounters multiple voices.

Description

Speech intention recognition method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a computer device, and a storage medium for speech intention recognition.
Background
The intelligent outbound robot is an important communication tool that enterprises use to provide effective service to clients. It is therefore important that an intelligent outbound robot can accurately recognize the intent of the client's speech.
When an existing intelligent outbound robot recognizes a client's intention, the client's speech is first converted into text by a speech-to-text model, and an intention recognition model then analyzes the text corresponding to the client's speech to identify the client's intention.
The inventor found that during speech-to-text conversion the client is often in a crowded public place. Because of other voices in the surroundings, the converted text is mixed with text converted from other people's speech, which causes the intention model to make mistakes when identifying the client's intention. For example, when a client says "help me check my balance" while someone in the background says "I want to watch a movie", the two voices are superimposed and the received speech becomes a garbled mixture of the two sentences, so the intention cannot be recognized. Therefore, how to improve the accuracy of the intention recognition model when the intelligent outbound robot encounters multiple voices is a problem that urgently needs to be solved.
Disclosure of Invention
The main purpose of the present application is to provide a speech intention recognition method, apparatus, computer device and storage medium, so as to solve the technical problem that the accuracy of the intention recognition model is low when an intelligent outbound robot encounters multiple voices.
In order to achieve the above object, the present application provides a method for recognizing a speech intention, the method comprising:
acquiring audio data to be identified sent by a first ID user, and searching a pre-stored voiceprint characteristic corresponding to the first ID user in a preset voiceprint library;
judging whether the audio data contains the pronunciations of a plurality of persons;
if the audio data contains pronunciations of a plurality of people, carrying out voice separation on the audio data through a pre-trained voice separation model to obtain single audio data of a plurality of different people;
respectively analyzing the voiceprint characteristics of the individual audio data of a plurality of different people to obtain the voiceprint characteristics of a plurality of different people;
extracting target voiceprint features which are the same as the pre-stored voiceprint features from the voiceprint features of a plurality of different persons;
extracting individual audio data corresponding to the target voiceprint feature as target individual audio data from the individual audio data of a plurality of different persons;
and converting the target individual audio data into text data, and inputting the text data into a pre-trained intention recognition model to obtain the intention of the audio data to be recognized.
Further, the determining whether the audio data includes multiple human pronunciations includes:
intercepting first audio data of a first time length in the audio data;
dividing the first audio data into second audio data and third audio data;
respectively extracting the voiceprint features of the first audio data, the second audio data and the third audio data to obtain a first voiceprint feature, a second voiceprint feature and a third voiceprint feature;
judging whether the first voiceprint feature, the second voiceprint feature and the third voiceprint feature are the same;
and if not, judging that the audio data contains the pronunciations of a plurality of persons.
Further, before extracting a target voiceprint feature, which is the same as a pre-stored voiceprint feature corresponding to the first ID user, from the voiceprint features of the plurality of different people, the method includes:
judging whether prestored voiceprint characteristics of the first ID user are stored or not;
if yes, executing a step of extracting target voiceprint characteristics which are the same as the prestored voiceprint characteristics corresponding to the first ID user from the voiceprint characteristics of a plurality of different people;
if not, acquiring first historical voice data of the first ID user, establishing pre-stored voiceprint features of the first ID user based on the first historical voice data, and then executing the step of extracting target voiceprint features which are the same as the pre-stored voiceprint features corresponding to the first ID user from the voiceprint features of a plurality of different people.
Further, before the acquiring the audio data to be identified sent by the first ID user, the method includes:
acquiring second historical sound data of each historical user;
extracting the warehousing voiceprint characteristics corresponding to the second historical voice data based on a preset voiceprint registration model;
and performing one-to-one mapping on the ID of each historical user and the corresponding warehousing voiceprint characteristics, and putting the mapped IDs and the corresponding warehousing voiceprint characteristics into the voiceprint library.
Further, the voiceprint registration model is a modified model in which a SpecAug layer is added before the first layer of the ecapa-tdnn model, and the SpecAug layer is used to make a random mask for the input fbank vector.
Further, before the acquiring the audio data to be identified sent by the first ID user, the method includes:
training a voice separation basic model to obtain a pre-trained voice separation model, wherein the voice separation basic model is an improved model in which a SpecAug layer is added before the first layer of the ecapa-tdnn model, and the softmax layer of the ecapa-tdnn model is replaced by a focal loss layer.
Further, the training of the human voice separation basic model, before obtaining the pre-trained human voice separation model, includes:
acquiring sound segments of all historical users with IDs, wherein the sound segments are provided with corresponding ID marks;
splicing every two sound segments to obtain a plurality of spliced sound segments;
slicing the spliced sound segments according to a preset time length to obtain sliced sound samples, wherein each spliced sound segment is divided into at least two sliced sound samples containing independent ID marks and one sliced sound sample containing two ID marks;
and collecting all the sliced sound samples to form a sample set for training the human voice separation basic model.
The present application further provides a device for voice intention recognition, the device comprising:
the acquiring unit is used for acquiring audio data to be identified sent by a first ID user and searching a pre-stored voiceprint characteristic corresponding to the first ID user in a preset voiceprint library;
the judging unit is used for judging whether the audio data contains the pronunciations of a plurality of persons;
the voice separation unit is used for carrying out voice separation on the audio data through a pre-trained voice separation model to obtain separate audio data of a plurality of different persons if the audio data contains pronunciations of a plurality of persons;
the analysis unit is used for respectively analyzing the voiceprint characteristics of the individual audio data of the plurality of different people to obtain the voiceprint characteristics of the plurality of different people;
a first extracting unit, configured to extract a target voiceprint feature that is the same as the pre-stored voiceprint feature from the voiceprint features of a plurality of different people;
a second extraction unit configured to extract, as target individual audio data, individual audio data corresponding to the target voiceprint feature from among the individual audio data of a plurality of different persons;
and the intention recognition unit is used for converting the target individual audio data into text data and inputting the text data into a pre-trained intention recognition model to obtain the intention of the audio data to be recognized.
The present application further provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the above methods when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the above.
According to the speech intention recognition method, apparatus, computer device and storage medium described above, when the method is executed it is first judged whether the audio data contains the voices of multiple people. If so, voice separation is performed to obtain the individual audio data of the different speakers and the corresponding voiceprint features are extracted; the target voiceprint feature that is the same as the pre-stored voiceprint feature of the first ID user is then searched for among the extracted voiceprint features, and finally the target individual audio data corresponding to the target voiceprint feature is obtained, which can be regarded as the speech of the first ID user. Text conversion and intention recognition are then performed on the target individual audio data, so that the real intention of the first ID user is obtained after the interference of other people's audio has been eliminated, which improves the accuracy of intention recognition when the intelligent outbound robot encounters multiple voices.
Drawings
FIG. 1 is a flowchart illustrating a method for recognizing a speech intent according to an embodiment of the present application;
fig. 2 is a schematic flowchart illustrating a process of determining whether audio data includes multiple utterances according to an embodiment of the present application;
FIG. 3 is a block diagram illustrating an exemplary speech intent recognition apparatus according to an embodiment of the present application;
FIG. 4 is a block diagram illustrating a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, an embodiment of the present application provides a method for recognizing a speech intention, including the following steps S1 to S7:
s1, acquiring audio data to be identified sent by a first ID user, and searching a preset voiceprint library for a pre-stored voiceprint feature corresponding to the first ID user;
s2, judging whether the audio data contains the pronunciations of a plurality of persons;
s3, if the audio data contain pronunciations of a plurality of people, carrying out voice separation on the audio data through a pre-trained voice separation model to obtain single audio data of a plurality of different people;
s4, analyzing the voiceprint characteristics of the individual audio data of the plurality of different people respectively to obtain the voiceprint characteristics of the plurality of different people;
s5, extracting target voiceprint characteristics which are the same as the pre-stored voiceprint characteristics from the voiceprint characteristics of a plurality of different people;
s6, extracting individual audio data corresponding to the target voiceprint features from the individual audio data of a plurality of different persons to serve as target individual audio data;
s7, converting the target individual audio data into text data, and inputting the text data into a pre-trained intention recognition model to obtain the intention of the audio data to be recognized.
The speech intention recognition method is applied to artificial intelligence fields such as intelligent speech recognition and intelligent voice response. For example, the present application is applied to an intelligent outbound robot, which is a device or application that receives audio data input by a user and then responds accordingly.
As described in step S1, the first ID user is the user with the first ID, i.e., the user who is currently inputting speech. The speech (the audio data to be recognized) input by this user generally includes the sound made by the user himself as well as sounds from the surrounding environment, such as wind, rain, car engines, or other people speaking. That is, the audio data input by the first ID user may contain a large amount of interfering sound, which is also a major cause of the current inaccuracy in recognizing the intention of the input speech. The preset voiceprint library is a database in which a large number of voiceprint features are stored in advance, and each voiceprint feature in the voiceprint library corresponds to an ID. In the application scenarios of the present application, users generally register an account in advance, and a unique corresponding ID is generated at registration. When the first ID user inputs audio data, the corresponding ID is therefore known, and the corresponding pre-stored voiceprint feature can be found in the voiceprint library.
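As an illustration, the lookup in step S1 can be thought of as a keyed query against the voiceprint library. The following is a minimal sketch assuming the library is an in-memory mapping from user ID to a voiceprint embedding vector; the class name `VoiceprintLibrary`, its methods, and the normalization step are illustrative assumptions, not part of the original disclosure.

```python
from typing import Dict, Optional

import numpy as np


class VoiceprintLibrary:
    """Minimal in-memory voiceprint library: user ID -> voiceprint embedding (hypothetical sketch)."""

    def __init__(self) -> None:
        self._store: Dict[str, np.ndarray] = {}

    def register(self, user_id: str, voiceprint: np.ndarray) -> None:
        # One-to-one mapping between a user ID and its enrolled (L2-normalized) voiceprint feature.
        self._store[user_id] = voiceprint / (np.linalg.norm(voiceprint) + 1e-8)

    def lookup(self, user_id: str) -> Optional[np.ndarray]:
        # Returns the pre-stored voiceprint feature of the first ID user, or None if not enrolled.
        return self._store.get(user_id)
```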
As described in step S2, multiple persons means two or more persons. Whether the audio data contains the voices of multiple people may be determined as follows: the audio data is intercepted into multiple sections, the voiceprint features of the sections are extracted and compared; if they are all the same, it is judged that only one person is speaking, otherwise it is judged that multiple people are speaking. When it is judged that only one person is speaking, the audio data can be regarded as the voice of the first ID user, and text conversion and intention recognition can be performed directly; when it is judged that multiple people are speaking, the audio data needs to be further processed to obtain the individual audio data corresponding to the first ID user.
As described in step S3 above, when it is determined that the audio data to be recognized contains the voices of multiple people, human voice separation is performed, that is, the voices of different people are separated from one another. The voice separation model is obtained by training on a large amount of sample data; the sample data is combined voice data containing the utterances of different people together with corresponding labels, and supervised training is generally adopted. In this embodiment, the human voice separation model is trained based on the ecapa-tdnn model.
As described in steps S4-S6 above, after the individual audio data of the different people are obtained, it is not yet known which of them belongs to the first ID user. Therefore, the voiceprint feature of each piece of individual audio data is extracted and compared with the pre-stored voiceprint feature of the first ID user to obtain the target voiceprint feature; the individual audio data corresponding to the target voiceprint feature is then taken as the target individual audio data, i.e., the individual audio data of the first ID user.
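A minimal sketch of steps S4-S6, assuming each separated stream has already been mapped to a fixed-length voiceprint embedding (e.g., by an ECAPA-TDNN-style extractor) and that "the same voiceprint" is decided by cosine similarity against the pre-stored feature; the function names and the similarity threshold are assumptions for illustration, not values given in the original text.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def select_target_audio(separated_audio, voiceprints, prestored_voiceprint, threshold=0.7):
    """Pick the separated stream whose voiceprint best matches the first ID user's pre-stored feature.

    separated_audio: list of per-speaker waveforms from the voice separation model (step S3)
    voiceprints:     list of embeddings, one per separated stream (step S4)
    """
    scores = [cosine_similarity(v, prestored_voiceprint) for v in voiceprints]
    best = int(np.argmax(scores))
    # The threshold guards against the case where none of the streams belongs to the first ID user.
    if scores[best] < threshold:
        return None  # no matching speaker found
    return separated_audio[best]  # target individual audio data (steps S5-S6)
```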
As described in the above step S7, the target individual audio data is converted into text data, and the text data is input into a pre-trained intention recognition model, so as to obtain the intention of the audio data to be recognized.
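Step S7 is a straightforward two-stage pipeline; a minimal sketch follows, assuming a speech-to-text function and an intent classifier are already available (both names are hypothetical placeholders, not APIs defined by this application).

```python
def recognize_intent(target_audio, speech_to_text, intent_model) -> str:
    """Convert the target individual audio data to text and classify its intention."""
    text = speech_to_text(target_audio)   # speech-to-text model
    return intent_model.predict(text)     # pre-trained intention recognition model
```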
In the speech intention recognition method described above, it is first judged whether the audio data contains the voices of multiple people. If so, voice separation is performed to obtain the individual audio data of the different speakers and the corresponding voiceprint features are extracted; the target voiceprint feature that is the same as the pre-stored voiceprint feature of the first ID user is then searched for among the extracted voiceprint features, and the target individual audio data corresponding to the target voiceprint feature is obtained, which can be regarded as the speech of the first ID user. Finally, text conversion and intention recognition are performed on the target individual audio data, which yields the real intention of the first ID user after the interference of other people's audio has been eliminated and improves the accuracy of intention recognition when the intelligent outbound robot encounters multiple voices.
Referring to fig. 2, in an embodiment, the step S2 of determining whether the audio data includes multiple human utterances includes the following steps S21 to S25:
s21, intercepting first audio data with a first time length from the audio data;
s22, dividing the first audio data into second audio data and third audio data;
s23, extracting the voiceprint features of the first audio data, the second audio data and the third audio data respectively to obtain a first voiceprint feature, a second voiceprint feature and a third voiceprint feature;
s24, judging whether the first voiceprint feature, the second voiceprint feature and the third voiceprint feature are the same;
and S25, if the voice data are different, judging that the voice data contain the pronunciations of a plurality of persons.
As described in steps S21 to S25 above, a segment of first audio data is first intercepted from the audio data, and the first audio data is then divided into second audio data and third audio data. The voiceprint features of the three segments are extracted and compared: if they are all the same, the first audio data was entirely uttered by the same person; if they differ, the first audio data contains the voices of multiple people, which in turn indicates that the audio data to be recognized sent by the first ID user contains the voices of multiple people. In a specific embodiment, the first time length is 3 seconds, and the second audio data and the third audio data are each 1.5 seconds long. The method of this embodiment only needs to analyze a small segment of audio data to judge whether multiple people are speaking, so the judgment is accurate and computing resources are saved. In one embodiment, if the intelligent outbound robot needs to recognize the audio data in real time, the judgment can be made once every 3 seconds to detect whether other people's voices have been inserted into the input audio data.
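A minimal sketch of the check in steps S21-S25, assuming a 3-second window split into two 1.5-second halves and a voiceprint extractor `extract_voiceprint` that returns an embedding; deciding "the same voiceprint" by a cosine-similarity threshold is an assumption, since the original text does not specify how equality is judged.

```python
import numpy as np


def is_multi_speaker(audio: np.ndarray, sample_rate: int, extract_voiceprint, sim_threshold=0.8) -> bool:
    """Return True if the 3-second head of `audio` appears to contain more than one speaker."""
    first = audio[: 3 * sample_rate]             # first audio data (3 s)
    half = len(first) // 2
    second, third = first[:half], first[half:]   # two 1.5 s halves

    e1, e2, e3 = (extract_voiceprint(x) for x in (first, second, third))

    def same(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8) >= sim_threshold

    # If all three voiceprints agree, the window is treated as a single speaker.
    return not (same(e1, e2) and same(e1, e3) and same(e2, e3))
```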
In an embodiment, before the step S4 of extracting a target voiceprint feature that is the same as a pre-stored voiceprint feature corresponding to the first ID user from the voiceprint features of the plurality of different persons, the method includes:
judging whether prestored voiceprint characteristics of the first ID user are stored or not;
if yes, executing the step of extracting target voiceprint characteristics which are the same as the pre-stored voiceprint characteristics corresponding to the first ID user from the voiceprint characteristics of a plurality of different people;
if not, acquiring first historical voice data of the first ID user, establishing pre-stored voiceprint features of the first ID user based on the first historical voice data, and then executing the step of extracting target voiceprint features which are the same as the pre-stored voiceprint features corresponding to the first ID user from the voiceprint features of a plurality of different people.
As described above, the first ID user may be an unregistered user who nevertheless has a history of interactions. If the pre-stored voiceprint feature corresponding to the first ID user is not found in the voiceprint library, a new "ID-voiceprint feature" entry needs to be created in the voiceprint library. This is done by retrieving the user's historical interaction records (voice interaction records), extracting from them a piece of voice data in which only one person is speaking, namely the first historical voice data, and establishing the pre-stored voiceprint feature of the first ID user based on this first historical voice data; step S4 is executed after the feature has been established. The method of steps S21 to S25 described above can be used to determine which data in the voice data of the historical interaction records contains only one person's voice.
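The fallback described above is essentially a get-or-create pattern. A small sketch follows, under the assumption that `library` is the voiceprint library sketched earlier, `fetch_single_speaker_history(user_id)` returns one piece of single-speaker historical voice data, and `extract_voiceprint` is the registration model; all three names are hypothetical.

```python
def get_or_create_prestored_voiceprint(user_id, library, fetch_single_speaker_history, extract_voiceprint):
    """Return the pre-stored voiceprint of the first ID user, enrolling it from history if missing."""
    feature = library.lookup(user_id)
    if feature is not None:
        return feature
    # Not enrolled yet: build the pre-stored voiceprint from first historical voice data
    # (a historical utterance containing only this user's voice), then store it.
    history_audio = fetch_single_speaker_history(user_id)
    feature = extract_voiceprint(history_audio)
    library.register(user_id, feature)
    return feature
```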
In an embodiment, before the obtaining of the audio data to be identified sent by the first ID user, the method includes:
acquiring second historical sound data of each historical user;
extracting the warehousing voiceprint characteristics corresponding to the second historical voice data based on a preset voiceprint registration model;
and performing one-to-one mapping on the ID of each historical user and the corresponding warehousing voiceprint characteristics, and putting the mapped IDs and the corresponding warehousing voiceprint characteristics into the voiceprint library.
As described above, this is the process of creating the voiceprint library. And respectively extracting the voiceprint characteristics of the second historical sound data of all the historical users, and then binding the voiceprint characteristics with corresponding IDs in a one-to-one mode to form a voiceprint library. It should be noted that each of the second historical sound data is also data uttered by one person.
In one embodiment, during ID creation the user can actively speak a piece of single-speaker audio to the intelligent outbound robot, so that the intelligent outbound robot directly obtains the voiceprint feature corresponding to the ID based on that audio and then stores it in the voiceprint library with a one-to-one mapping. The voiceprint registration model is a model for extracting voiceprint features and can be obtained by training an ecapa-tdnn model.
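As a small illustration of the library-building process, the sketch below assumes the `VoiceprintLibrary` shown earlier and a registration model wrapped as `registration_model(audio) -> embedding`; both names are hypothetical.

```python
def build_voiceprint_library(historical_users, registration_model) -> "VoiceprintLibrary":
    """historical_users: iterable of (user_id, second_historical_sound_data) pairs."""
    library = VoiceprintLibrary()
    for user_id, audio in historical_users:
        # Each piece of second historical sound data is assumed to contain a single speaker.
        library.register(user_id, registration_model(audio))
    return library
```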
In one embodiment, the above-described voiceprint registration model is a modified model in which a SpecAug layer is added before the first layer of the ecapa-tdnn model, and the SpecAug layer is used to make a random mask for the input fbank vector.
As described above, when training the voiceprint registration model, the sample data is preprocessed to obtain fbank vectors, which are then input into the SpecAug layer; the SpecAug layer makes a random mask for the input fbank vectors. After the random mask has been applied, the masked fbank vectors are input into the subsequent network of the ecapa-tdnn model. Because random masking effectively expands the originally generated fbank data, the generalization capability of the model is increased.
In one embodiment, preprocessing the sample data to obtain fbank vectors comprises: pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering, and the like.
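A minimal sketch of the preprocessing plus random-mask idea, computed here with torchaudio's Kaldi-style fbank features and a simple SpecAugment-style time/frequency mask; the use of torchaudio, the number of Mel bins, and the mask widths are illustrative assumptions rather than the patent's exact implementation.

```python
import torch
import torchaudio


def fbank_with_specaug(waveform: torch.Tensor, sample_rate: int = 16000,
                       max_time_mask: int = 20, max_freq_mask: int = 10) -> torch.Tensor:
    """Compute fbank features and apply a random time/frequency mask (SpecAug-style).

    waveform: tensor of shape (1, num_samples); mask widths are illustrative.
    """
    # Kaldi-style fbank: pre-emphasis, framing, windowing, FFT and Mel filtering happen internally.
    feats = torchaudio.compliance.kaldi.fbank(waveform, num_mel_bins=80, sample_frequency=sample_rate)
    num_frames, num_bins = feats.shape

    # Random time mask: zero out a contiguous span of frames.
    t = int(torch.randint(0, max_time_mask + 1, (1,)))
    t0 = int(torch.randint(0, max(1, num_frames - t), (1,)))
    feats[t0:t0 + t, :] = 0.0

    # Random frequency mask: zero out a contiguous span of Mel bins.
    f = int(torch.randint(0, max_freq_mask + 1, (1,)))
    f0 = int(torch.randint(0, max(1, num_bins - f), (1,)))
    feats[:, f0:f0 + f] = 0.0

    return feats  # fed into the subsequent ecapa-tdnn layers during training
```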
In an embodiment, before the obtaining of the audio data to be identified sent by the first ID user, the method includes:
acquiring sound clips of all historical users with IDs, wherein the sound clips are provided with corresponding ID marks;
splicing every two sound segments to obtain a plurality of spliced sound segments;
slicing the spliced sound segments according to a preset time length to obtain sliced sound samples, wherein each spliced sound segment is divided into at least two sliced sound samples containing independent ID marks and one sliced sound sample containing two ID marks;
all the obtained fragment sound sample sets form a sample set of a training human voice separation basic model;
and training a voice separation basic model based on the sample set to obtain a pre-trained voice separation model, wherein the voice separation basic model is an improved model in which a SpecAug layer is added before the first layer of the ecapa-tdnn model, and the softmax layer of the ecapa-tdnn model is replaced by a focal loss layer.
As described above, before the voice separation model is obtained through training, a sample data set for voice separation is constructed. When constructing it, sound segments of historical users are selected directly; because each segment belongs to a historical user, it carries a corresponding unique ID label. The sound segments of two different people are then spliced to form a spliced sound segment, and the spliced sound segment is sliced to obtain sliced sound samples. For example, a certain spliced sound segment is formed by splicing sound segments with the IDs No.1 and No.2, and the boundary frame at the splice point of the two segments is denoted as ch. During slicing, the spliced sound segment is cut with a frame length of 25 milliseconds per frame, which yields a number of sliced sound samples labelled No.1, a number of sliced sound samples labelled No.2, and one sliced sound sample that contains both No.1 and No.2, i.e., the sample containing the boundary frame ch. The resulting set of sliced sound samples can be denoted as [No.1, No.1, …, No.1, ch, No.2, …, No.2]. After a number of spliced sound segments have been sliced in this way, a sample data set that is large enough for training the human voice separation model is obtained. As the above shows, the ch-labelled samples are much rarer than the others in the constructed data set, so the original softmax loss function is replaced with focal loss, which handles such imbalanced samples better. Similarly, a SpecAug layer is added before the first layer of the ecapa-tdnn model, because applying a random mask to the generated fbank vectors effectively expands the data relative to the original and increases the generalization capability of the model.
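A minimal sketch of the frame-level sample construction and of focal loss, under the assumption that each 25 ms frame is labelled with the speaker ID it came from and that the frame straddling the splice point is labelled as the boundary class "ch"; this three-class labelling and the focal-loss hyperparameters are illustrative assumptions, not the patent's exact formulation.

```python
import numpy as np
import torch
import torch.nn.functional as F


def build_spliced_samples(seg_a, seg_b, sample_rate=16000, frame_ms=25):
    """Splice two speakers' sound segments and label each 25 ms frame (No.1, No.2 or boundary 'ch')."""
    frame_len = sample_rate * frame_ms // 1000
    spliced = np.concatenate([seg_a, seg_b])
    boundary = len(seg_a)                      # sample index of the splice point
    frames, labels = [], []
    for start in range(0, len(spliced) - frame_len + 1, frame_len):
        end = start + frame_len
        frames.append(spliced[start:end])
        if start < boundary < end:
            labels.append("ch")                # frame straddling the splice point
        elif end <= boundary:
            labels.append("No.1")
        else:
            labels.append("No.2")
    return frames, labels


def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Focal loss replacing softmax cross-entropy, down-weighting easy (majority-class) frames."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                        # probability assigned to the true class
    return ((1.0 - pt) ** gamma * ce).mean()
```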
Referring to fig. 3, an embodiment of the present application further provides an apparatus for recognizing a speech intention, where the apparatus includes:
the acquiring unit 10 is configured to acquire audio data to be identified sent by a first ID user, and search a preset voiceprint library for a pre-stored voiceprint feature corresponding to the first ID user;
a judging unit 20, configured to judge whether the audio data includes vocalizations of multiple persons;
a voice separation unit 30, configured to, if the audio data includes multiple voices of people, perform voice separation on the audio data through a pre-trained voice separation model to obtain separate audio data of multiple different people;
an analyzing unit 40, configured to analyze voiceprint features of individual audio data of a plurality of different people respectively to obtain voiceprint features of the plurality of different people;
a first extracting unit 50 for extracting a target voiceprint feature identical to the pre-stored voiceprint feature from the voiceprint features of a plurality of different persons;
a second extraction unit 60 configured to extract, as target individual audio data, individual audio data corresponding to the target voiceprint feature among the individual audio data of a plurality of different persons;
and the intention recognition unit 70 is used for converting the target individual audio data into text data and inputting the text data into a pre-trained intention recognition model to obtain the intention of the audio data to be recognized.
In an embodiment, the determining unit is specifically configured to: intercepting first audio data of a first time length in the audio data; dividing the first audio data into second audio data and third audio data; respectively extracting the voiceprint features of the first audio data, the second audio data and the third audio data to obtain a first voiceprint feature, a second voiceprint feature and a third voiceprint feature; judging whether the first voiceprint feature, the second voiceprint feature and the third voiceprint feature are the same; and if not, judging that the audio data contains the pronunciations of a plurality of persons.
In one embodiment, the apparatus for recognizing speech intention further includes:
the first judging unit is used for judging whether prestored voiceprint characteristics of the first ID user are stored or not;
an execution unit, configured to, if so, execute the step of extracting a target voiceprint feature that is the same as the pre-stored voiceprint feature corresponding to the first ID user from the voiceprint features of a plurality of different people; and if not, acquire first historical voice data of the first ID user, establish the pre-stored voiceprint feature of the first ID user based on the first historical voice data, and then execute the step of extracting the target voiceprint feature which is the same as the pre-stored voiceprint feature corresponding to the first ID user from the voiceprint features of a plurality of different people.
In one embodiment, the apparatus for recognizing speech intention further includes:
a first acquisition unit configured to acquire second history sound data of each history user;
the third extraction unit is used for extracting the in-storage voiceprint characteristics corresponding to the second historical voice data based on a preset voiceprint registration model;
and the putting unit is used for carrying out one-to-one mapping on the ID of each historical user and the corresponding warehousing voiceprint characteristics and putting the ID and the corresponding warehousing voiceprint characteristics into the voiceprint library.
In one embodiment, the above-described voiceprint registration model is a modified model in which a SpecAug layer is added before the first layer of the ecapa-tdnn model, and the SpecAug layer is used to make a random mask for the input fbank vector.
In an embodiment, the apparatus for recognizing speech intention further includes:
and the voice separation training unit is used for training a voice separation basic model to obtain a pre-trained voice separation model, wherein the voice separation basic model is an improved model which is formed by adding a SpecAug layer in front of the first layer of the ecapa-tdnn model and replacing the softmax layer of the ecapa-tdnn model with a focal loss layer.
In one embodiment, the apparatus for recognizing speech intention further includes:
the data set constructing unit is used for acquiring sound segments of all historical users with IDs, wherein the sound segments are provided with corresponding ID marks; splicing every two sound segments to obtain a plurality of spliced sound segments; slicing the spliced sound segments according to a preset time length to obtain sliced sound samples, wherein each spliced sound segment is divided into at least two sliced sound samples containing independent ID marks and one sliced sound sample containing two ID marks; and collecting all the sliced sound samples to form a sample set for training the human voice separation basic model.
Referring to fig. 4, a computer device is also provided in the embodiments of the present application; the computer device may be a server, and its internal structure may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device is used to provide computation and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store data such as voiceprint features. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements the speech intention recognition method.
The processor executes the method for recognizing the voice intention, and the method comprises the following steps:
acquiring audio data to be identified sent by a first ID user, and searching a pre-stored voiceprint characteristic corresponding to the first ID user in a preset voiceprint library;
judging whether the audio data contains the pronunciations of a plurality of persons;
if the audio data contains pronunciations of a plurality of people, carrying out voice separation on the audio data through a pre-trained voice separation model to obtain single audio data of a plurality of different people;
respectively analyzing the voiceprint characteristics of the individual audio data of a plurality of different people to obtain the voiceprint characteristics of a plurality of different people;
extracting target voiceprint features which are the same as the pre-stored voiceprint features from the voiceprint features of a plurality of different people;
extracting individual audio data corresponding to the target voiceprint feature as target individual audio data from the individual audio data of a plurality of different persons;
and converting the target individual audio data into text data, and inputting the text data into a pre-trained intention recognition model to obtain the intention of the audio data to be recognized.
In one embodiment, the determining whether the audio data includes a plurality of human utterances includes:
intercepting first audio data of a first time length in the audio data;
dividing the first audio data into second audio data and third audio data;
respectively extracting the voiceprint features of the first audio data, the second audio data and the third audio data to obtain a first voiceprint feature, a second voiceprint feature and a third voiceprint feature;
judging whether the first voiceprint feature, the second voiceprint feature and the third voiceprint feature are the same;
and if not, judging that the audio data contains the pronunciations of a plurality of persons.
In one embodiment, before extracting a target voiceprint feature that is the same as a pre-stored voiceprint feature corresponding to the first ID user from the voiceprint features of the plurality of different people, the method includes:
judging whether prestored voiceprint characteristics of the first ID user are stored or not;
if yes, executing the step of extracting target voiceprint characteristics which are the same as the pre-stored voiceprint characteristics corresponding to the first ID user from the voiceprint characteristics of a plurality of different people;
if not, acquiring first historical voice data of the first ID user, establishing pre-stored voiceprint features of the first ID user based on the first historical voice data, and then executing the step of extracting target voiceprint features which are the same as the pre-stored voiceprint features corresponding to the first ID user from the voiceprint features of a plurality of different people.
In one embodiment, before acquiring the audio data to be identified sent by the first ID user, the method includes:
acquiring second historical sound data of each historical user;
extracting the warehousing voiceprint characteristics corresponding to the second historical voice data based on a preset voiceprint registration model;
and performing one-to-one mapping on the ID of each historical user and the corresponding warehousing voiceprint characteristics, and putting the mapped IDs and the corresponding warehousing voiceprint characteristics into the voiceprint library.
In one embodiment, the voiceprint registration model is a modified model that adds a SpecAug layer before the first layer of the ecapa-tdnn model, where the SpecAug layer is used to make a random mask for the input fbank vector.
In one embodiment, before acquiring the audio data to be identified sent by the first ID user, the method includes:
training a voice separation basic model to obtain a pre-trained voice separation model, wherein the voice separation basic model is an improved model in which a SpecAug layer is added before the first layer of the ecapa-tdnn model, and the softmax layer of the ecapa-tdnn model is replaced by a focal loss layer.
In an embodiment, the training of the human voice separation basic model, before obtaining the pre-trained human voice separation model, includes:
acquiring sound segments of all historical users with IDs, wherein the sound segments are provided with corresponding ID marks;
splicing every two sound segments to obtain a plurality of spliced sound segments;
slicing the spliced sound segments according to a preset time length to obtain sliced sound samples, wherein each spliced sound segment is divided into at least two sliced sound samples containing independent ID marks and one sliced sound sample containing two ID marks;
and collecting all the sliced sound samples to form a sample set for training the human voice separation basic model.
Those skilled in the art will appreciate that the architecture shown in fig. 4 is only a block diagram of part of the structure related to the present solution and does not constitute a limitation on the computer device to which the present solution is applied.
According to the computer equipment described above, when the processor executes the speech intention recognition method, the interference of other people's audio data is eliminated before text conversion and intention recognition are performed, so that the real intention of the first ID user is obtained, and the accuracy of intention recognition is improved when the intelligent outbound robot encounters multiple voices.
An embodiment of the present application further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of speech intent recognition, comprising: acquiring audio data to be identified sent by a first ID user, and searching a pre-stored voiceprint characteristic corresponding to the first ID user in a preset voiceprint library; judging whether the audio data contains pronunciations of a plurality of persons; if the audio data contains pronunciations of a plurality of people, carrying out voice separation on the audio data through a pre-trained voice separation model to obtain single audio data of a plurality of different people; respectively analyzing the voiceprint characteristics of the individual audio data of a plurality of different people to obtain the voiceprint characteristics of a plurality of different people; extracting target voiceprint features which are the same as the pre-stored voiceprint features from the voiceprint features of a plurality of different persons; extracting individual audio data corresponding to the target voiceprint feature as target individual audio data from the individual audio data of a plurality of different persons; and converting the target individual audio data into text data, and inputting the text data into a pre-trained intention recognition model to obtain the intention of the audio data to be recognized.
When the processor executes the method, it is first judged whether the audio data contains the voices of multiple people. If so, voice separation is performed to obtain the individual audio data of the different speakers and the corresponding voiceprint features are extracted; the target voiceprint feature that is the same as the pre-stored voiceprint feature of the first ID user is then searched for among the extracted voiceprint features, and the target individual audio data corresponding to the target voiceprint feature is obtained, which can be regarded as the speech of the first ID user. Finally, text conversion and intention recognition are performed on the target individual audio data, so that the real intention of the first ID user is obtained after the interference of other people's audio has been eliminated, which improves the accuracy of intention recognition when the intelligent outbound robot encounters multiple voices.
In one embodiment, the determining whether the audio data includes a plurality of human utterances includes: intercepting first audio data of a first time length in the audio data; dividing the first audio data into second audio data and third audio data; respectively extracting the voiceprint features of the first audio data, the second audio data and the third audio data to obtain a first voiceprint feature, a second voiceprint feature and a third voiceprint feature; judging whether the first voiceprint feature, the second voiceprint feature and the third voiceprint feature are the same; and if not, judging that the audio data contains the pronunciations of a plurality of persons.
In one embodiment, before extracting a target voiceprint feature that is the same as a pre-stored voiceprint feature corresponding to the first ID user from the voiceprint features of the plurality of different people, the method includes: judging whether prestored voiceprint characteristics of the first ID user are stored or not; if yes, executing the step of extracting target voiceprint characteristics which are the same as the pre-stored voiceprint characteristics corresponding to the first ID user from the voiceprint characteristics of a plurality of different people; if not, acquiring first historical voice data of the first ID user, establishing pre-stored voiceprint features of the first ID user based on the first historical voice data, and then executing the step of extracting target voiceprint features which are the same as the pre-stored voiceprint features corresponding to the first ID user from the voiceprint features of a plurality of different people.
In one embodiment, before acquiring the audio data to be identified sent by the first ID user, the method includes: acquiring second historical sound data of each historical user; extracting the warehousing voiceprint characteristics corresponding to the second historical voice data based on a preset voiceprint registration model; and performing one-to-one mapping on the ID of each historical user and the corresponding warehousing voiceprint characteristics, and putting the mapped IDs and the corresponding warehousing voiceprint characteristics into the voiceprint library.
In one embodiment, the voiceprint registration model is a modified model that adds a SpecAug layer before the first layer of the ecapa-tdnn model, where the SpecAug layer is used to make a random mask for the input fbank vector.
In one embodiment, before acquiring the audio data to be identified sent by the first ID user, the method includes: training a voice separation basic model to obtain a pre-trained voice separation model, wherein the voice separation basic model is an improved model in which a SpecAug layer is added before the first layer of the ecapa-tdnn model, and the softmax layer of the ecapa-tdnn model is replaced by a focal loss layer.
In an embodiment, the training of the human voice separation basic model, before obtaining the pre-trained human voice separation model, includes: acquiring sound segments of all historical users with IDs, wherein the sound segments are provided with corresponding ID marks; splicing every two sound segments to obtain a plurality of spliced sound segments; slicing the spliced sound segments according to a preset time length to obtain sliced sound samples, wherein each spliced sound segment is divided into at least two sliced sound samples containing independent ID marks and one sliced sound sample containing two ID marks; and collecting all the sliced sound samples to form a sample set for training the human voice separation basic model.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (10)

1. A method of speech intent recognition, the method comprising:
acquiring audio data to be identified sent by a first ID user, and searching a pre-stored voiceprint characteristic corresponding to the first ID user in a preset voiceprint library;
judging whether the audio data contains the pronunciations of a plurality of persons;
if the audio data contains pronunciations of a plurality of people, carrying out voice separation on the audio data through a pre-trained voice separation model to obtain single audio data of a plurality of different people;
respectively analyzing the voiceprint characteristics of the individual audio data of a plurality of different people to obtain the voiceprint characteristics of a plurality of different people;
extracting target voiceprint features which are the same as the pre-stored voiceprint features from the voiceprint features of a plurality of different persons;
extracting individual audio data corresponding to the target voiceprint feature as target individual audio data from the individual audio data of a plurality of different persons;
and converting the target individual audio data into text data, and inputting the text data into a pre-trained intention recognition model to obtain the intention of the audio data to be recognized.
2. The method of claim 1, wherein the determining whether the audio data contains multiple human utterances comprises:
intercepting first audio data of a first time length in the audio data;
dividing the first audio data into second audio data and third audio data;
respectively extracting the voiceprint features of the first audio data, the second audio data and the third audio data to obtain a first voiceprint feature, a second voiceprint feature and a third voiceprint feature;
judging whether the first voiceprint characteristic, the second voiceprint characteristic and the third voiceprint characteristic are the same;
and if not, judging that the audio data contains the pronunciations of a plurality of persons.
3. The method of voice intent recognition according to claim 1, wherein before extracting a target voiceprint feature that is the same as a pre-stored voiceprint feature corresponding to the first ID user from the voiceprint features of a plurality of different people, comprising:
judging whether prestored voiceprint characteristics of the first ID user are stored or not;
if yes, executing the step of extracting target voiceprint characteristics which are the same as the pre-stored voiceprint characteristics corresponding to the first ID user from the voiceprint characteristics of a plurality of different people;
if not, acquiring first historical voice data of the first ID user, establishing pre-stored voiceprint features of the first ID user based on the first historical voice data, and then executing the step of extracting target voiceprint features which are the same as the pre-stored voiceprint features corresponding to the first ID user from the voiceprint features of a plurality of different people.
4. The method for recognizing voice intention according to claim 1, wherein before the obtaining of the audio data to be recognized sent by the first ID user, the method comprises:
acquiring second historical sound data of each historical user;
extracting the warehousing voiceprint characteristics corresponding to the second historical voice data based on a preset voiceprint registration model;
and performing one-to-one mapping on the ID of each historical user and the corresponding database-entering voiceprint characteristics, and putting the mapped IDs and the corresponding database-entering voiceprint characteristics into the voiceprint database.
5. The method according to claim 4, wherein the voiceprint registration model is a modified model in which a SpecAug layer is added before the first layer of the ecapa-tdnn model, and the SpecAug layer is used to make a random mask for the input fbank vector.
6. The speech intention recognition method according to claim 1, wherein before the obtaining of the audio data to be recognized sent by the first ID user, the method comprises:
training a voice separation basic model to obtain the pre-trained voice separation model, wherein the voice separation basic model is an improved model in which a SpecAug layer is added before the first layer of the ecapa-tdnn model and the softmax layer of the ecapa-tdnn model is replaced with a focal loss layer.
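The replacement of the softmax cross-entropy head by a focal loss layer can be illustrated with the minimal function below; the gamma and alpha values are conventional defaults, not values disclosed in this application.

```python
import numpy as np

def focal_loss(logits: np.ndarray, target: int, gamma: float = 2.0,
               alpha: float = 0.25) -> float:
    """logits: (num_classes,) raw scores; target: index of the true class."""
    z = logits - logits.max()                     # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    p_t = probs[target]
    # down-weight easy examples so hard or under-represented speakers dominate the loss
    return float(-alpha * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12))
```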
7. The speech intention recognition method according to claim 6, wherein the training of the voice separation basic model to obtain the pre-trained voice separation model comprises:
acquiring sound segments of all historical users with IDs, wherein each sound segment carries a corresponding ID label;
splicing the sound segments in pairs to obtain a plurality of spliced sound segments;
slicing the spliced sound segments according to a preset time length to obtain sliced sound samples, wherein each spliced sound segment is divided into at least two sliced sound samples carrying a single ID label and one sliced sound sample carrying two ID labels;
and collecting all the sliced sound samples to form a sample set for training the voice separation basic model.
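The sample construction of claim 7 can be sketched as follows, assuming NumPy arrays for audio and non-overlapping fixed-length slices; the exact slice alignment and the handling of the trailing remainder are simplifications of this illustration.

```python
import numpy as np
from itertools import combinations

def build_separation_samples(segments, slice_len: int):
    """segments: list of (user_id, np.ndarray audio); returns (slice, {ids}) samples."""
    samples = []
    for (id_a, seg_a), (id_b, seg_b) in combinations(segments, 2):
        spliced = np.concatenate([seg_a, seg_b])  # splice two ID-labelled segments
        boundary = len(seg_a)                     # where speaker A ends and B begins
        for start in range(0, len(spliced) - slice_len + 1, slice_len):
            end = start + slice_len
            labels = set()
            if start < boundary:
                labels.add(id_a)                  # slice overlaps speaker A
            if end > boundary:
                labels.add(id_b)                  # slice overlaps speaker B
            samples.append((spliced[start:end], labels))
    return samples
```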
8. A speech intention recognition apparatus, the apparatus comprising:
an acquiring unit, configured to acquire audio data to be recognized sent by a first ID user, and search a preset voiceprint library for a pre-stored voiceprint feature corresponding to the first ID user;
a judging unit, configured to judge whether the audio data contains utterances of a plurality of persons;
a voice separation unit, configured to, if the audio data contains utterances of a plurality of persons, perform voice separation on the audio data through a pre-trained voice separation model to obtain individual audio data of a plurality of different persons;
an analysis unit, configured to analyze voiceprint features of the individual audio data of the plurality of different persons, respectively, to obtain voiceprint features of the plurality of different persons;
a first extraction unit, configured to extract, from the voiceprint features of the plurality of different persons, a target voiceprint feature that is the same as the pre-stored voiceprint feature;
a second extraction unit, configured to extract, from the individual audio data of the plurality of different persons, the individual audio data corresponding to the target voiceprint feature as target individual audio data;
and an intention recognition unit, configured to convert the target individual audio data into text data and input the text data into a pre-trained intention recognition model to obtain the intention of the audio data to be recognized.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN202210743461.2A 2022-06-27 2022-06-27 Speech intention recognition method and device, computer equipment and storage medium Pending CN115019802A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210743461.2A CN115019802A (en) 2022-06-27 2022-06-27 Speech intention recognition method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210743461.2A CN115019802A (en) 2022-06-27 2022-06-27 Speech intention recognition method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115019802A true CN115019802A (en) 2022-09-06

Family

ID=83077846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210743461.2A Pending CN115019802A (en) 2022-06-27 2022-06-27 Speech intention recognition method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115019802A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116796290A (en) * 2023-08-23 2023-09-22 江西尚通科技发展有限公司 Dialog intention recognition method, system, computer and storage medium
CN116796290B (en) * 2023-08-23 2024-03-29 江西尚通科技发展有限公司 Dialog intention recognition method, system, computer and storage medium

Similar Documents

Publication Publication Date Title
CN107644638B (en) Audio recognition method, device, terminal and computer readable storage medium
CN111028827A (en) Interaction processing method, device, equipment and storage medium based on emotion recognition
CN111105782B (en) Session interaction processing method and device, computer equipment and storage medium
US20050049868A1 (en) Speech recognition error identification method and system
CN110689881B (en) Speech recognition method, speech recognition device, computer equipment and storage medium
WO2013110125A1 (en) Voice authentication and speech recognition system and method
AU2013203139A1 (en) Voice authentication and speech recognition system and method
CN110010121B (en) Method, device, computer equipment and storage medium for verifying answering technique
CN112925945A (en) Conference summary generation method, device, equipment and storage medium
CN112541738B (en) Examination and approval method, device, equipment and medium based on intelligent conversation technology
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN110164417B (en) Language vector obtaining and language identification method and related device
CN111883140A (en) Authentication method, device, equipment and medium based on knowledge graph and voiceprint recognition
CN113506574A (en) Method and device for recognizing user-defined command words and computer equipment
CN111627432A (en) Active call-out intelligent voice robot multi-language interaction method and device
CN111986675A (en) Voice conversation method, device and computer readable storage medium
CN113192516A (en) Voice role segmentation method and device, computer equipment and storage medium
CN115019802A (en) Speech intention recognition method and device, computer equipment and storage medium
CN111126233A (en) Call channel construction method and device based on distance value and computer equipment
CN111312286A (en) Age identification method, age identification device, age identification equipment and computer readable storage medium
CN114360522B (en) Training method of voice awakening model, and detection method and equipment of voice false awakening
CN111986651A (en) Man-machine interaction method and device and intelligent interaction terminal
CN116631412A (en) Method for judging voice robot through voiceprint matching
US11615787B2 (en) Dialogue system and method of controlling the same
CN111177353A (en) Text record generation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination