CN112562682A - Identity recognition method, system, equipment and storage medium based on multi-person call - Google Patents


Info

Publication number
CN112562682A
Authority
CN
China
Prior art keywords: audio, sub-audio, text, person, identity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011394092.8A
Other languages
Chinese (zh)
Inventor
李亚枫
任君
罗超
胡泓
李巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Computer Technology Shanghai Co Ltd
Original Assignee
Ctrip Computer Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ctrip Computer Technology Shanghai Co Ltd filed Critical Ctrip Computer Technology Shanghai Co Ltd
Priority to CN202011394092.8A priority Critical patent/CN112562682A/en
Publication of CN112562682A publication Critical patent/CN112562682A/en


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: characterised by the analysis technique
    • G10L25/30: characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides an identity recognition method, system, device and storage medium based on multi-person calls, the method comprising: converting the original dialogue speech to obtain a first text; segmenting the dialogue speech, in which multiple users participate, into a plurality of dialogue-sentence audio clips; cutting each dialogue-sentence audio clip into several sub-audio segments and obtaining the second text corresponding to each; extracting audio features from each sub-audio segment and feeding them into a deep learning network to obtain the segment's voiceprint feature information; and grouping the sub-audio segments into per-user sets based on their voiceprint feature information. The method helps staff organize multi-person call material, reduces the time spent on each call audio, greatly reduces manual effort, and improves working efficiency; it also improves the security of scenarios such as authorization.

Description

Identity recognition method, system, equipment and storage medium based on multi-person call
Technical Field
The invention relates to the field of speech recognition, and in particular to an identity recognition method, system, device and storage medium based on multi-person calls.
Background
Multi-person calls are complex: transfers and other switches occur, so several speakers may appear on the left channel of a single call. Using such data directly for model training seriously degrades model performance, while manual labeling is too labor-intensive, costing both time and effort; the present method greatly reduces the manual work required. In addition, reading the transcribed text of a call takes far less time than listening to the full audio, and in some cases listening in real time is simply not feasible.
Therefore, the invention provides an identity identification method, system, equipment and storage medium based on multi-person communication.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide an identity recognition method, system, device and storage medium based on multi-person calls that overcome the difficulties of the prior art: they help staff organize multi-person call material, reduce the time spent on each call audio, greatly reduce manual effort, and improve working efficiency.
An embodiment of the invention provides an identity recognition method based on multi-person calls, comprising the following steps:
S110, converting the original dialogue speech, in which multiple users participate, from speech to text to obtain a first text;
S120, segmenting the dialogue speech based on silence suppression detection to obtain a plurality of dialogue-sentence audio clips;
S130, cutting each dialogue-sentence audio clip into several sub-audio segments of a preset duration, and cutting the first text along the time sequence of the dialogue-sentence audio to obtain the second text corresponding to each sub-audio segment;
S140, extracting audio features from each sub-audio segment and feeding them into a deep learning network to obtain the segment's voiceprint feature information;
S150, clustering the voiceprint feature information of the sub-audio segments to obtain sub-audio sets belonging to different users;
S160, concatenating the second texts of the sub-audio segments in each sub-audio set into a third text, and feeding the third text into an identity recognition neural network to obtain the preset identity corresponding to that set.
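As a rough orientation only (not part of the claimed method), steps S110-S160 can be sketched as a pipeline in which every component is an injected placeholder; all function and speaker names below are hypothetical.

```python
def identify_speakers(audio, asr, vad, cut, embed, cluster, classify):
    """Sketch of S110-S160; asr/vad/cut/embed/cluster/classify are stand-ins."""
    first_text = asr(audio)                    # S110: speech-to-text
    sentences = vad(audio)                     # S120: silence-based segmentation
    subs, texts = cut(sentences, first_text)   # S130: sub-audio + aligned second texts
    embeddings = [embed(s) for s in subs]      # S140: voiceprint features
    groups = cluster(embeddings)               # S150: {speaker: indices of sub-audio}
    result = {}
    for spk, idxs in groups.items():           # S160: splice texts, classify identity
        third_text = " ".join(texts[i] for i in idxs)
        result[spk] = classify(third_text)
    return result
```

Any concrete ASR, VAD, embedding network, or classifier can be plugged in; the sketch only fixes the data flow between the six steps.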
Preferably, in step S130, any sub-audio segment whose duration falls short of the preset duration is padded by copying part of itself, so that all sub-audio segments reach the preset duration.
Preferably, in step S150, the sub-audio sets belonging to different users are obtained through a UIS-RNN network.
Preferably, the step S160 further includes the following steps:
and S170, obtaining the dialogue voice corresponding to each preset identity and the corresponding third text.
Preferably, the step S170 is followed by the following steps:
s180, when the preset identity comprises a customer and a supplier, extracting at least one keyword with the largest occurrence frequency from a third text corresponding to the conversation voice of the customer, and searching a product containing the keyword in a product library preset by the supplier.
Preferably, the original dialogue speech involves at least one customer and a plurality of suppliers, each supplier having its own product library, and at least one keyword with the highest frequency is extracted from the third text corresponding to the customer's dialogue speech.
An embodiment of the present invention further provides an identity recognition system based on multi-person calls, used to implement the above identity recognition method, the system comprising:
the transcription module, which converts the original dialogue speech of multiple users from speech to text to obtain a first text;
the segmentation module, which segments the dialogue speech, in which multiple users participate, based on silence suppression detection to obtain a plurality of dialogue-sentence audio clips;
the cutting module, which cuts each dialogue-sentence audio clip into several sub-audio segments of a preset duration, and cuts the first text along the time sequence of the dialogue-sentence audio to obtain the second text corresponding to each sub-audio segment;
the feature module, which extracts audio features from each sub-audio segment and feeds them into the deep learning network to obtain the segment's voiceprint feature information;
the clustering module, which clusters the voiceprint feature information of each sub-audio segment to obtain sub-audio sets belonging to different users; and
the identity module, which concatenates the second texts corresponding to the sub-audio segments in each sub-audio set into a third text and feeds it into an identity recognition neural network to obtain the preset identity corresponding to that set.
The embodiment of the invention also provides an identity recognition device based on multi-person communication, which comprises:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the above-mentioned multi-person call based identification method via execution of the executable instructions.
An embodiment of the present invention further provides a computer-readable storage medium for storing a program, where the program is executed to implement the steps of the above-mentioned multi-person conversation-based identity recognition method.
The invention provides an identity recognition method, system, device and storage medium based on multi-person calls that help organize multi-person call material, reduce the time staff spend on each call audio, automatically separate the original dialogue speech of a multi-person call by speaker and label it accurately, facilitate model training, avoid manual labeling, greatly reduce manual effort, and improve working efficiency.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, with reference to the accompanying drawings.
Fig. 1 is a flowchart of an identity recognition method based on multi-person communication according to the present invention.
Fig. 2 to 4 are schematic process diagrams of the identity recognition method based on multi-person conversation according to the present invention.
Fig. 5 is a schematic block diagram of a multi-person call based identification system according to the present invention.
Fig. 6 is a schematic structural diagram of an identification device based on multi-person conversation according to the present invention.
Fig. 7 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar structures, and thus their repetitive description will be omitted.
Fig. 1 is a flowchart of an identity recognition method based on multi-person communication according to the present invention. As shown in fig. 1, an embodiment of the present invention provides an identity recognition method based on multi-person communication, including the following steps:
S110, converting the original dialogue speech, in which multiple users participate, from speech to text to obtain a first text;
S120, segmenting the dialogue speech based on silence suppression detection to obtain a plurality of dialogue-sentence audio clips;
S130, cutting each dialogue-sentence audio clip into several sub-audio segments of a preset duration, and cutting the first text along the time sequence of the dialogue-sentence audio to obtain the second text corresponding to each sub-audio segment;
S140, extracting audio features from each sub-audio segment and feeding them into a deep learning network to obtain the segment's voiceprint feature information;
S150, clustering the voiceprint feature information of the sub-audio segments to obtain sub-audio sets belonging to different users;
S160, concatenating the second texts of the sub-audio segments in each sub-audio set into a third text, and feeding the third text into an identity recognition neural network to obtain the preset identity corresponding to that set.
S170, obtaining the dialogue speech corresponding to each preset identity and the corresponding third text.
S180, when the preset identities include a customer and a supplier, extracting at least one keyword with the highest frequency from the third text corresponding to the customer's dialogue speech, and searching the supplier's preset product library for products containing the keyword.
In a preferred embodiment, in the step S130, sub-audio with a duration not meeting a preset duration is locally copied, so that all the sub-audio meets the preset duration.
In a preferred embodiment, the preset time period ranges from 1 second to 10 seconds, for example: 1 second, 2 seconds, 3 seconds, 4 seconds, 5 seconds, 6 seconds, 7 seconds, 8 seconds, 9 seconds, 10 seconds, etc., but not limited thereto.
In a preferred embodiment, in step S150, the sub-audio sets belonging to different users are obtained through a UIS-RNN network.
In a preferred embodiment, the original dialogue speech has at least one user and a plurality of suppliers, the suppliers have respective product libraries, and at least one keyword with the largest occurrence number is extracted from the third text corresponding to the dialogue speech of the customer.
In a preferred scheme, products containing the keyword are searched in all suppliers' preset product libraries, and after ranking by recommendation value, at least one product with the highest recommendation value is sent to the customer.
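A minimal sketch of this keyword-plus-ranking flow, assuming tokenized customer text and per-supplier product lists with a precomputed recommendation score; all field names here are hypothetical, not taken from the patent.

```python
from collections import Counter

def recommend(third_text_tokens, supplier_catalogs, top_n=1):
    """Pick the most frequent keyword from the customer's third text, search
    every supplier's product library for it, rank by recommendation value,
    and return the top products.  'score' is an assumed precomputed value."""
    keyword, _ = Counter(third_text_tokens).most_common(1)[0]
    hits = []
    for supplier, products in supplier_catalogs.items():
        for p in products:
            if keyword in p["title"]:
                hits.append((p.get("score", 0), supplier, p["title"]))
    hits.sort(reverse=True)                 # highest recommendation value first
    return keyword, hits[:top_n]
```

In a real deployment the keyword count and scoring would come from the trained models described elsewhere in the patent; this sketch only fixes the search-then-rank order of operations.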
The invention addresses the problem of identity recognition in multi-person calls: segmenting the call by speaker and assigning each speaker an identity class, which in the hotel scenario is one of four types: hotel, supplier, guest, or in-company colleague.
The invention provides an identity recognition method based on multi-person communication. Mainly comprises the following steps:
1) Channel separation is performed on the audio, and voice activity detection (VAD) is used to cut it into sentences, so that each segment essentially contains only one speaker.
2) The acoustic features (STFT, short-time Fourier transform) of each audio segment are extracted, and voiceprint features are extracted with a Res-CNN; these serve as the input to UIS-RNN (which may be Google's unbounded interleaved-state recurrent neural network), which assigns the cut speech segments to their speakers.
3) ASR transcription is applied, and the speaker identity of each audio segment is confirmed with a text classification model.
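For illustration only, the STFT feature extraction in step 2) can be sketched as below; the frame length, hop and FFT size are assumed values, not parameters stated in the patent.

```python
import numpy as np

def stft(signal, frame_len=400, hop=160, n_fft=512):
    """Minimal magnitude STFT: window each frame, zero-pad to n_fft, take the
    real FFT.  Frame/hop sizes here are illustrative, not from the patent."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame, n_fft)))
    return np.stack(frames)        # shape: (num_frames, n_fft // 2 + 1)
```

The resulting spectrogram would then be fed to the Res-CNN to produce the voiceprint embedding; the CNN itself is omitted here.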
The invention makes the structure of the ASR transcript clearer by marking the speaker class of each sentence, so staff can understand a conversation directly without listening to the recording. This directly reduces the time spent on each audio, greatly reduces manual effort, and improves working efficiency.
The invention provides an identity recognition method based on multi-person calls comprising four parts: preprocessing, speaker clustering, identity class confirmation for each speaker, and result integration and return. The steps are as follows:
the method comprises the following steps: pre-treatment: most calls are 8k, and calls other than 8k are down-sampled to 8k to prevent special cases. The right channel is unified as the interior service, so the channel separation is carried out and only the left channel audio is selected. For real-time incoming audio streams, the cutting is done by VAD and the silence is removed. Extracting the stft characteristic of each section of audio frequency, and taking the stft characteristic as Res-CNN network input to obtain the voiceprint characteristic
Step two: speaker clustering: and directly dividing each segment of speakers into the same speaker by using the uisrnn network proposed by Google.
Step three: identity confirmation part of each speaker: asr (Automatic Speech Recognition technology) transcription is accessed, asr transcription texts of the same speaker are spliced to form a long text, and the sequential identification of each speaker and the corresponding text are stored in a dictionary form. And obtaining the identity of the section through a text classification model after processing jieba word segmentation (Chinese word segmentation in the ending) and the like, wherein the identity of the section is one of four types of guests, suppliers, hotels and co-workers in the company under the hotel scene. For the case that the identity prediction is the same among different speakers, the display is in the form of "identity _ i", for example, if the identities of two speakers in the speaker clustering part are both predicted as guests, the first speaker appears as "guest _ 1", and the second speaker appears as "guest _ 2".
Step four: and integrating the returned results of the three steps, and returning the final result according to the required format.
The invention provides a multi-caller identity confirmation system suited to a large number of new users calling in every day. It does not need to build a user voiceprint library, which saves a great deal of storage cost and time, and it can be used immediately after going online. Applied to the training-data selection stage and the display page of the Petrel platform, training data can be selected directly without manual annotation, improving the reliability of model results. The identity of each speaker is displayed on the Petrel platform; combined with the ASR transcript, the conversation can be understood clearly without the audio, saving staff the time of listening to each recording, speeding up their work, and saving labor cost.
In one embodiment, the audio cutting module mainly relies on speaker-change detection: cut points are located where the speaker in the audio changes. Existing approaches are mainly distance metrics and model search; both measure the difference between two audio segments based on acoustic features and decide against a threshold. Model search mostly suits a closed set of speakers and is unsuitable when many new users appear every day. The clustering module mostly adopts the supervised clustering technique proposed by Google, the UIS-RNN network, avoiding the traditional clustering problem of estimating the number of speakers. As for the identity confirmation module, current methods mainly target scenarios such as meeting minutes and mostly use closed-set speaker identification, where the speakers in the audio belong to a known speaker set; but when a large number of new users join calls every day, building a voiceprint library for them would require huge storage cost and would contribute little to determining actual identity information.
In a preferred embodiment, the processing flow in the implementation of the present invention is mainly divided into four major parts: preprocessing, speaker clustering, identity class confirmation of each speaker and result integration and return. The specific implementation steps are as follows:
step 1: and (6) performing pretreatment.
Most of the hotel's large volume of daily telephone audio is 8 kHz, but to guard against special cases any non-8 kHz call is down-sampled to 8 kHz. The calls are two-channel: in every case the right channel is the customer service with a single speaker, while the left channel may contain several speakers owing to transfers and similar situations. The channels are therefore separated and the left-channel audio is selected for subsequent processing. After channel separation, the portions where the customer service originally spoke become silence; the audio is cut into sentences based on VAD and the silent parts are deleted. VAD is checked every 50 ms, and if 0.5 s of continuous silence is detected, one utterance is considered finished and the audio is cut there. To improve accuracy, audio segments longer than 4 s are cut into 4 s pieces, and a final piece shorter than 4 s is padded by copying itself; likewise, any remaining fragment shorter than 4 s is replicated up to 4 s. Finally, the STFT features of each segment are fed into the model to obtain 512-dimensional voiceprint features.
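The 4-second cutting and self-copy padding described above can be sketched as follows; this is a minimal sketch assuming NumPy sample arrays at 8 kHz, since the patent does not prescribe an implementation.

```python
import numpy as np

SR = 8000          # all calls resampled to 8 kHz
SEG = 4 * SR       # 4-second sub-audio unit, in samples

def split_and_pad(utterance):
    """Cut one VAD-produced utterance into 4 s sub-audio segments; a trailing
    remainder shorter than 4 s is padded by repeating (copying) itself."""
    segs = [utterance[i:i + SEG] for i in range(0, len(utterance), SEG)]
    last = segs[-1]
    if len(last) < SEG:
        reps = int(np.ceil(SEG / len(last)))
        segs[-1] = np.tile(last, reps)[:SEG]
    return segs
```

Every returned segment then has a uniform length, which is what lets the STFT features be batched into the fixed-shape Res-CNN input.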
Step 2: speaker clustering
Speaker clustering directly adopts the UIS-RNN network structure. The input has dimension N x M x 512, where N is the number of call channels and M is the number of segments per channel; the output is the order-of-appearance identifier of the speaker of each segment.
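UIS-RNN itself is a trained model; purely to illustrate what this step consumes and produces, below is a greedy cosine-similarity stand-in (not the patent's method) that maps a sequence of 512-dimensional voiceprints to order-of-appearance speaker labels. The similarity threshold is an assumption.

```python
import numpy as np

def cluster_embeddings(embs, threshold=0.75):
    """Stand-in for UIS-RNN: assign each voiceprint to the most similar
    existing speaker centroid, or open a new speaker if none is close."""
    centroids, labels = [], []
    for e in embs:
        e = e / np.linalg.norm(e)                  # cosine = dot of unit vectors
        sims = [float(c @ e) for c in centroids]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            centroids.append(e)
            labels.append(len(centroids) - 1)
    return labels
```

Unlike this greedy sketch, UIS-RNN models speaker turns sequentially and is trained with supervision, but the input/output contract (embeddings in, integer speaker labels in order of appearance out) is the same.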
And step 3: per speaker identity class validation
A large number of new users call in every day, and a given user's line tends to be used intensively for a period and then fall silent, so identifying each speaker's specific identity has no practical value. The speaker identities are therefore summarized into four classes: supplier, hotel, guest, and in-company colleague. The invention applies ASR transcription, splices each speaker's texts within a call into one long text, and determines each speaker's identity class by text classification. The speakers and their corresponding texts are kept in dictionary form to make integrating the returned results easier.
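The text classification model is trained in the actual system; as a toy stand-in, the four identity classes could be scored by keyword hits on each speaker's spliced long text. The rule table in the test is invented purely for illustration.

```python
def classify_speaker(long_text, keyword_rules):
    """Toy stand-in for jieba segmentation plus the trained text classifier:
    score each identity class by how often its keywords occur, take the max."""
    scores = {ident: sum(long_text.count(k) for k in kws)
              for ident, kws in keyword_rules.items()}
    return max(scores, key=scores.get)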
And 4, step 4: result integration return
Because audio cutting, speaker clustering, and per-speaker identity results are returned separately, after integration the speaking time periods of each speaker are returned in order of appearance. When the identity prediction is the same for different speakers in a call, it is shown in the form "identity_i"; for example, if two speakers from the speaker clustering part are both predicted to be guests, the first appears as "guest_1" and the second as "guest_2".
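The "identity_i" display rule can be sketched directly: only identities predicted for more than one speaker get a numeric suffix, in order of appearance, as in the guest_1 / guest_2 example.

```python
from collections import Counter

def label_speakers(identities):
    """Disambiguate repeated identity predictions as 'identity_i'; identities
    predicted for exactly one speaker are displayed unchanged."""
    total = Counter(identities)
    seen = Counter()
    labels = []
    for ident in identities:
        if total[ident] > 1:
            seen[ident] += 1
            labels.append(f"{ident}_{seen[ident]}")
        else:
            labels.append(ident)
    return labels
```

Whether a lone speaker also receives a "_1" suffix is not specified by the patent; this sketch assumes it does not.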
Fig. 2 to 4 are schematic process diagrams of the identity recognition method based on multi-person conversation according to the present invention. Referring to fig. 2 to 4, the implementation of the present invention is as follows:
Referring to fig. 2, a piece of original dialogue speech between customer service A and user B is converted from speech to text to obtain a first text: "Do you have any intention to travel? Hello! I want to go skiing in Hokkaido. Roughly when? This winter. Are you going with your family? Right, my wife and I are going together." The dialogue speech, in which multiple users participate, is segmented based on silence suppression detection into six dialogue-sentence audio clips: "Do you have any intention to travel?" (11); "Hello! I want to go skiing in Hokkaido." (12); "Roughly when?" (13); "This winter." (14); "Are you going with your family?" (15); "Right, my wife and I are going together." (16).
Each dialogue-sentence audio clip is then cut into sub-audio segments of 2 seconds each, and the first text is cut along the time sequence of the audio to obtain the second text corresponding to each sub-audio segment. For example, dialogue-sentence audio 12 ("Hello! I want to go skiing in Hokkaido.") is cut every two seconds into three sub-audio segments, 121 (duration 2 seconds), 122 (duration 2 seconds) and 123 (duration 1 second), with the three corresponding second texts "Hello! I", "want to go to Hokkaido", "skiing". Sub-audio 123 (duration 1 second), which falls short of the preset duration, is padded by copying itself, yielding sub-audio 123' ("skiing", duration 2 seconds) that meets the preset duration.
Then, the voiceprint characteristic information of each section of the sub-audio is input into a voiceprint characteristic neural network for clustering to obtain sub-audio sets belonging to different users,
That is, the first user's sub-audio set contains dialogue-sentence audios 11, 13 and 15; the second texts of the sub-audio segments in the set are concatenated into a third text: "Do you have any intention to travel? Roughly when? Are you going with your family?" The third text is fed into the identity recognition neural network, which gives the preset identity of this set as travel-website customer service, and the label "travel-website customer service" is attached to dialogue-sentence audios 11, 13 and 15 to distinguish them from the other dialogue-sentence audios.
Similarly, the second user's sub-audio set contains dialogue-sentence audios 12, 14 and 16, and their second texts are concatenated into a third text: "Hello! I want to go skiing in Hokkaido. This winter. Right, my wife and I are going together." The third text is fed into the identity recognition neural network, which gives the preset identity of this set as travel-product user, and the label "travel-product user" is attached to dialogue-sentence audios 12, 14 and 16 to distinguish them from the other dialogue-sentence audios.
Finally, at least one keyword with the highest frequency is extracted from the third text corresponding to the travel-product user's dialogue speech, products containing the keyword are searched in the product library preset by the travel-website customer service, and the product information is sent to the travel-product user. The travel-website customer service in this embodiment may be a real person or an AI conversation robot, but is not limited thereto.
Fig. 5 is a schematic block diagram of a multi-person call based identification system according to the present invention. As shown in fig. 5, an embodiment of the present invention further provides an identity recognition system based on multi-person communication, for implementing the above identity recognition method based on multi-person communication, where the identity recognition system 5 based on multi-person communication includes:
a transcription module 501, which converts the original dialogue speech of multiple users from speech to text to obtain a first text;
a segmentation module 502, which performs segmentation based on silence suppression detection on the conversational speech with multiple user participation to obtain multiple conversational sentence audios;
the cutting module 503 cuts each dialog sentence audio frequency into a plurality of segments of sub-audio frequencies according to a preset time length, and cuts the first text based on the time sequence of the dialog sentence audio frequency to obtain a second text corresponding to each sub-audio frequency;
the feature module 504 extracts audio features from the sub-audio, inputs the audio features into the deep learning network, and obtains voiceprint feature information of the sub-audio;
the clustering module 505 is used for inputting the voiceprint characteristic information of each section of sub audio frequency into a voiceprint characteristic neural network for clustering to obtain sub audio frequency sets belonging to different users;
the identity module 506 summarizes the second texts corresponding to the sub-audios in each sub-audio set to obtain a third text, and inputs the third text into the identity recognition neural network to obtain the preset identities corresponding to the sub-audio sets.
The integration module 507 obtains a dialog voice corresponding to each preset identity and a corresponding third text.
The push module 508: when the preset identities include a customer and a supplier, it extracts at least one keyword with the highest frequency from the third text corresponding to the customer's dialogue speech, searches the supplier's preset product library for products containing the keyword, and sends the product information to the customer. In a variation, the original dialogue speech may involve at least one customer and a plurality of suppliers, each with its own product library; at least one keyword with the highest frequency is extracted from the third text corresponding to the customer's dialogue speech, products containing the keyword are searched in all the suppliers' preset product libraries, and after ranking by recommendation value, at least one product with the highest recommendation value is sent to the customer, improving the purchase conversion rate of the product information.
In another embodiment, the invention distinguishes the speech segments of different speakers in the collected voice signal stream and labels their identities. The system is divided into four relatively independent modules: 1) an audio cutting module, which removes silence and noise and cuts the audio into segments; 2) a voiceprint feature extraction module, which extracts voiceprint features from the acoustic features (commonly speaker embeddings such as d-vectors); 3) a clustering module, which groups the segments of the same speaker by clustering on the extracted voiceprint features; and 4) an identity confirmation module, which determines the specific identity of each speaker.
The multi-person-call-based identity recognition system can help customers organize multi-person conversation materials, reduce the time staff spend on each call recording, automatically classify and accurately label the original dialogue voice of multi-person calls, facilitate model training, avoid manual labeling, greatly reduce labor, and improve work efficiency.
The embodiment of the invention also provides an identity recognition device based on multi-person calls, comprising a processor and a memory in which executable instructions of the processor are stored, wherein the processor is configured to perform the steps of the multi-person call based identity recognition method by executing the executable instructions.
As shown above, the multi-person-call-based identity recognition system of the embodiment of the present invention can help customers organize multi-person conversation materials, reduce the time staff spend on each call recording, automatically classify and accurately label the original dialogue voice of multi-person calls, facilitate model training, avoid manual labeling, greatly reduce labor, and improve work efficiency.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "platform."
Fig. 6 is a schematic structural diagram of the identity recognition device based on multi-person calls according to the present invention. An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in fig. 6, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one memory unit 620, a bus 630 connecting the different platform components (including the memory unit 620 and the processing unit 610), a display unit 640, etc.
Wherein the storage unit stores program code executable by the processing unit 610, so that the processing unit 610 performs the steps according to various exemplary embodiments of the present invention described in the above multi-person call based identity recognition method section of this specification. For example, the processing unit 610 may perform the steps shown in fig. 1.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.
The memory unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 via the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms, to name a few.
The embodiment of the invention also provides a computer readable storage medium storing a program which, when executed, implements the steps of the multi-person call based identity recognition method. In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product comprising program code which, when the program product is run on a terminal device, causes the terminal device to perform the steps according to various exemplary embodiments of the present invention described in the above multi-person call based identity recognition method section of this specification.
As shown above, the multi-person-call-based identity recognition system of the embodiment of the present invention can help customers organize multi-person conversation materials, reduce the time staff spend on each call recording, automatically classify and accurately label the original dialogue voice of multi-person calls, facilitate model training, avoid manual labeling, greatly reduce labor, and improve work efficiency.
Fig. 7 is a schematic structural diagram of a computer-readable storage medium of the present invention. Referring to fig. 7, a program product 800 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
In summary, the present invention aims to provide an identity recognition method, system, device and storage medium based on multi-person calls. The multi-person-call-based identity recognition system of the present invention can help customers organize multi-person conversation materials, reduce the time staff spend on each call recording, automatically classify and accurately label the original dialogue voice of multi-person calls, facilitate model training, avoid manual labeling, greatly reduce labor, and improve work efficiency.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (9)

1. An identity recognition method based on multi-person calls, characterized by comprising the following steps:
s110, converting original dialogue voice with a plurality of users to words to obtain a first text;
s120, carrying out segmentation based on silence suppression detection on the dialogue voice with the participation of a plurality of users to obtain a plurality of dialogue sentence audios;
s130, cutting each dialogue statement audio frequency into a plurality of sections of sub-audio frequencies according to a preset duration as a unit, and cutting the first text based on the time sequence of the dialogue statement audio frequency to obtain a second text corresponding to each sub-audio frequency;
s140, extracting audio features from the sub-audio, inputting the audio features into a deep learning network, and obtaining voiceprint feature information of the sub-audio;
s150, clustering the voiceprint characteristic information of each section of sub-audio to obtain sub-audio sets belonging to different users;
s160, summarizing the second texts corresponding to the sub-audios in each sub-audio set to obtain a third text, and inputting the third text into an identity recognition neural network to obtain a preset identity corresponding to the sub-audio set.
2. The multi-person call based identity recognition method of claim 1, wherein in step S130, sub-audio whose duration does not reach the preset duration is locally copied so that all sub-audio segments satisfy the preset duration.
3. The multi-person call based identity recognition method of claim 1, wherein in step S150, the sub-audio sets belonging to different users are obtained through a UIS-RNN network.
4. The multi-person call based identity recognition method according to any one of claims 1 to 3, further comprising the following step after step S160:
and S170, obtaining the dialogue voice corresponding to each preset identity and the corresponding third text.
5. The multi-person call based identity recognition method according to claim 4, further comprising the following step after step S170:
s180, when the preset identity comprises a customer and a supplier, extracting at least one keyword with the largest occurrence frequency from a third text corresponding to the conversation voice of the customer, and searching a product containing the keyword in a product library preset by the supplier.
6. The multi-person call based identity recognition method of claim 5, wherein the original dialogue voice involves at least one user and a plurality of suppliers, the suppliers have respective product libraries, and at least one keyword with the highest frequency of occurrence is extracted from the third text corresponding to the customer's dialogue voice.
7. An identity recognition system based on multi-person calls, for implementing the multi-person call based identity recognition method of claim 1, comprising:
the transcription module is used for converting the original dialogue voice involving a plurality of users into text to obtain a first text;
the segmentation module is used for segmenting the dialogue voice involving a plurality of users based on silence suppression detection to obtain a plurality of dialogue sentence audios;
the cutting module is used for cutting each dialogue sentence audio into multiple sub-audio segments in units of a preset duration, and cutting the first text based on the time sequence of the dialogue sentence audios to obtain a second text corresponding to each sub-audio;
the feature module extracts audio features from the sub-audio and inputs the audio features into the deep learning network to obtain voiceprint feature information of the sub-audio;
the clustering module is used for clustering the voiceprint feature information of each sub-audio segment to obtain sub-audio sets belonging to different users;
and the identity module is used for summarizing the second texts corresponding to the sub-audios in each sub-audio set to obtain a third text, and inputting the third text into an identity recognition neural network to obtain a preset identity corresponding to the sub-audio set.
8. An identification device based on a multi-person call, comprising:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the multi-person call based identification method of any one of claims 1 to 6 via execution of the executable instructions.
9. A computer-readable storage medium storing a program which, when executed, performs the steps of the multi-person call based identification method of any one of claims 1 to 6.
CN202011394092.8A 2020-12-02 2020-12-02 Identity recognition method, system, equipment and storage medium based on multi-person call Pending CN112562682A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011394092.8A CN112562682A (en) 2020-12-02 2020-12-02 Identity recognition method, system, equipment and storage medium based on multi-person call


Publications (1)

Publication Number Publication Date
CN112562682A true CN112562682A (en) 2021-03-26

Family

ID=75047505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011394092.8A Pending CN112562682A (en) 2020-12-02 2020-12-02 Identity recognition method, system, equipment and storage medium based on multi-person call

Country Status (1)

Country Link
CN (1) CN112562682A (en)


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8543402B1 (en) * 2010-04-30 2013-09-24 The Intellisis Corporation Speaker segmentation in noisy conversational speech
CN106782507A (en) * 2016-12-19 2017-05-31 平安科技(深圳)有限公司 The method and device of voice segmentation
CN107886955A (en) * 2016-09-29 2018-04-06 百度在线网络技术(北京)有限公司 A kind of personal identification method, device and the equipment of voice conversation sample
CN109767757A (en) * 2019-01-16 2019-05-17 平安科技(深圳)有限公司 A kind of minutes generation method and device
CN110335612A (en) * 2019-07-11 2019-10-15 招商局金融科技有限公司 Minutes generation method, device and storage medium based on speech recognition
CN111063341A (en) * 2019-12-31 2020-04-24 苏州思必驰信息科技有限公司 Method and system for segmenting and clustering multi-person voice in complex environment
CN111091835A (en) * 2019-12-10 2020-05-01 携程计算机技术(上海)有限公司 Model training method, voiceprint recognition method, system, device and medium
CN111105801A (en) * 2019-12-03 2020-05-05 云知声智能科技股份有限公司 Role voice separation method and device
CN111312256A (en) * 2019-10-31 2020-06-19 平安科技(深圳)有限公司 Voice identity recognition method and device and computer equipment


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113299279A (en) * 2021-05-18 2021-08-24 上海明略人工智能(集团)有限公司 Method, apparatus, electronic device and readable storage medium for associating voice data and retrieving voice data
CN113571085A (en) * 2021-07-24 2021-10-29 平安科技(深圳)有限公司 Voice separation method, system, device and storage medium
CN113571085B (en) * 2021-07-24 2023-09-22 平安科技(深圳)有限公司 Voice separation method, system, device and storage medium
CN113380271A (en) * 2021-08-12 2021-09-10 明品云(北京)数据科技有限公司 Emotion recognition method, system, device and medium
CN114299957A (en) * 2021-11-29 2022-04-08 北京百度网讯科技有限公司 Voiceprint separation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US10692500B2 (en) Diarization using linguistic labeling to create and apply a linguistic model
CN112562682A (en) Identity recognition method, system, equipment and storage medium based on multi-person call
CN111128223B (en) Text information-based auxiliary speaker separation method and related device
CN107945805B (en) A kind of across language voice identification method for transformation of intelligence
CN111933129A (en) Audio processing method, language model training method and device and computer equipment
US10062385B2 (en) Automatic speech-to-text engine selection
CN110689877A (en) Voice end point detection method and device
JP2019211749A (en) Method and apparatus for detecting starting point and finishing point of speech, computer facility, and program
CN111968679A (en) Emotion recognition method and device, electronic equipment and storage medium
CN107562760A (en) A kind of voice data processing method and device
CN113327609A (en) Method and apparatus for speech recognition
CN112669842A (en) Man-machine conversation control method, device, computer equipment and storage medium
US20220238118A1 (en) Apparatus for processing an audio signal for the generation of a multimedia file with speech transcription
CN111489743A (en) Operation management analysis system based on intelligent voice technology
CN108364655B (en) Voice processing method, medium, device and computing equipment
CN114330371A (en) Session intention identification method and device based on prompt learning and electronic equipment
CN114818649A (en) Service consultation processing method and device based on intelligent voice interaction technology
EP4352630A1 (en) Reducing biases of generative language models
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
CN112087726B (en) Method and system for identifying polyphonic ringtone, electronic equipment and storage medium
CN111949778A (en) Intelligent voice conversation method and device based on user emotion and electronic equipment
CN115691500A (en) Power customer service voice recognition method and device based on time delay neural network
CN114974294A (en) Multi-mode voice call information extraction method and system
CN108364654B (en) Voice processing method, medium, device and computing equipment
CN113393845A (en) Method and device for speaker recognition, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination