WO2021212998A1 - Method, apparatus, computer device, and storage medium for speech-script detection based on multi-level logic (基于多层次逻辑的语音话术检测方法、装置、计算机设备及存储介质) - Google Patents

Method, apparatus, computer device, and storage medium for speech-script detection based on multi-level logic

Info

Publication number
WO2021212998A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
detection
video data
detection model
model
Prior art date
Application number
PCT/CN2021/077577
Other languages
English (en)
French (fr)
Inventor
王家桢
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2021212998A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/26 Speech to text systems
    • G10L2015/0631 Creating reference templates; Clustering

Definitions

  • This application relates to the field of speech recognition technology, and in particular to a method, apparatus, computer device, and storage medium for speech-script detection based on multi-level logic.
  • Embodiments of this application provide a method, apparatus, computer device, and storage medium for speech-script detection based on multi-level logic, intended to solve the problem in prior-art speech quality-inspection systems that, after the audio in a recording or video is converted into text, detection relies only on synonyms, degree adverbs, negation words, commendatory/derogatory words, and sentiment words together with simple keyword matching, resulting in poor detection performance.
  • In a first aspect, embodiments of this application provide a speech-script detection method based on multi-level logic, which includes:
  • receiving to-be-detected video data and user account information uploaded by a user terminal; obtaining current scene identification information according to the user account information, and determining whether a script detection model corresponding to the current scene identification information is stored among the stored script detection models, where the script detection model includes several detection conditions, each detection condition includes several detection operators, and the script detection model is used to quality-inspect the text content corresponding to the to-be-detected video data;
  • if a script detection model corresponding to the current scene identification information is stored among the stored script detection models, performing speech-to-text content recognition on the to-be-detected video data according to the script detection model to obtain a recognition result; and
  • if no script detection model corresponding to the current scene identification information is stored among the stored script detection models, sending prompt information indicating that no applicable detection model is stored to the monitoring terminal.
  • In a second aspect, an embodiment of this application provides a speech-script detection apparatus based on multi-level logic, which includes:
  • a data receiving unit, configured to receive to-be-detected video data and user account information uploaded by a user terminal;
  • a model judging unit, configured to obtain current scene identification information according to the user account information and determine whether a script detection model corresponding to the current scene identification information is stored among the stored script detection models, where the script detection model includes several detection conditions, each detection condition includes several detection operators, and the script detection model is used to quality-inspect the text content corresponding to the to-be-detected video data;
  • a text recognition unit, configured to, if a script detection model corresponding to the current scene identification information is stored among the stored script detection models, perform speech-to-text content recognition on the to-be-detected video data according to the script detection model to obtain a recognition result; and
  • a model-addition prompting unit, configured to, if no script detection model corresponding to the current scene identification information is stored among the stored script detection models, send prompt information indicating that no applicable detection model is stored to the monitoring terminal.
  • In a third aspect, an embodiment of this application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
  • receiving to-be-detected video data and user account information uploaded by a user terminal; obtaining current scene identification information according to the user account information, and determining whether a script detection model corresponding to the current scene identification information is stored among the stored script detection models, where the script detection model includes several detection conditions, each detection condition includes several detection operators, and the script detection model is used to quality-inspect the text content corresponding to the to-be-detected video data;
  • if a script detection model corresponding to the current scene identification information is stored among the stored script detection models, performing speech-to-text content recognition on the to-be-detected video data according to the script detection model to obtain a recognition result; and
  • if no script detection model corresponding to the current scene identification information is stored among the stored script detection models, sending prompt information indicating that no applicable detection model is stored to the monitoring terminal.
  • In a fourth aspect, embodiments of this application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following operations:
  • receiving to-be-detected video data and user account information uploaded by a user terminal; obtaining current scene identification information according to the user account information, and determining whether a script detection model corresponding to the current scene identification information is stored among the stored script detection models, where the script detection model includes several detection conditions, each detection condition includes several detection operators, and the script detection model is used to quality-inspect the text content corresponding to the to-be-detected video data;
  • if a script detection model corresponding to the current scene identification information is stored among the stored script detection models, performing speech-to-text content recognition on the to-be-detected video data according to the script detection model to obtain a recognition result; and
  • if no script detection model corresponding to the current scene identification information is stored among the stored script detection models, sending prompt information indicating that no applicable detection model is stored to the monitoring terminal.
  • Embodiments of this application provide a method, apparatus, computer device, and storage medium for speech-script detection based on multi-level logic, including: receiving to-be-detected video data and user account information uploaded by a user terminal; obtaining current scene identification information according to the user account information, and determining whether a script detection model corresponding to the current scene identification information is stored among the stored script detection models, where the script detection model includes several detection conditions, each detection condition includes several detection operators, and the script detection model is used to quality-inspect the text content corresponding to the to-be-detected video data; if a corresponding script detection model is stored, performing speech-to-text content recognition on the to-be-detected video data according to the script detection model to obtain a recognition result; and if no corresponding script detection model is stored, sending prompt information indicating that no applicable detection model is stored to the monitoring terminal.
  • This method performs speech-script detection through a script detection model with multi-level logic, and detection accuracy is improved.
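  • To make the dispatch just summarized concrete, the following is a minimal sketch only; the registry, field names, and function names are illustrative assumptions, not structures prescribed by the patent:

```python
# Hypothetical in-memory registry mapping scene identifiers to stored script detection models.
MODEL_REGISTRY = {
    "bank_dual_recording": "script_model_1",
    "insurance_dual_recording": "script_model_2",
}

def recognize_and_inspect(model, video_data):
    """Placeholder for S130: speech-to-text recognition followed by condition checking."""
    ...

def handle_upload(video_data, account_info):
    """Route an uploaded recording to the model for its scene (steps S110-S140)."""
    scene_id = account_info["scene_id"]        # current scene identification information
    model = MODEL_REGISTRY.get(scene_id)
    if model is None:                          # no applicable model: prompt the monitoring terminal
        return {"prompt_monitor": "no applicable detection model stored", "scene_id": scene_id}
    return {"result": recognize_and_inspect(model, video_data)}
```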
  • FIG. 1 is a schematic diagram of an application scenario of the speech-script detection method based on multi-level logic provided by an embodiment of this application;
  • FIG. 2 is a schematic flowchart of the speech-script detection method based on multi-level logic provided by an embodiment of this application;
  • FIG. 3 is a schematic diagram of a sub-flow of the speech-script detection method based on multi-level logic provided by an embodiment of this application;
  • FIG. 4 is a schematic block diagram of the speech-script detection apparatus based on multi-level logic provided by an embodiment of this application;
  • FIG. 5 is a schematic block diagram of sub-units of the speech-script detection apparatus based on multi-level logic provided by an embodiment of this application;
  • FIG. 6 is a schematic block diagram of a computer device provided by an embodiment of this application.
  • Referring to FIG. 1 and FIG. 2, the speech-script detection method based on multi-level logic is applied in a server, and the method is executed by application software installed in the server.
  • As shown in FIG. 2, the method includes steps S110 to S140.
  • S110: Receive to-be-detected video data and user account information uploaded by the user terminal.
  • In this embodiment, script detection models for a large number of scenes are configured and stored in the server; these models can perform text sentiment analysis on the script of an agent (for example, an agent selling life insurance) for quality inspection.
  • When the agent uses the user terminal (such as a tablet computer) to log in to the agent's system account and then communicates with a user, the video communication between the agent and the user is recorded in full by the user terminal (the recorded video can be regarded as the to-be-detected video data); the user terminal uploads the automatically collected to-be-detected video data to the server and at the same time uploads the user account information to the server.
  • At least the scene identification information corresponding to the agent is set in the user account information, such as scene identifiers for auto-insurance sales or life-insurance sales, so that once the server knows the agent's scene identification information it can proceed with script quality inspection.
  • Among the script detection models pre-stored in the server, each script detection model is associated with the detection scene to which the model applies (corresponding to the agent's scene identification information); for example, the scene identifier corresponding to script detection model 1 is bank dual recording, the scene identifier of script detection model 2 is insurance dual recording (more specifically, it can be set to auto-insurance sales as in the example above), the scene identifier of script detection model 3 is financial product promotion, the scene identifier of script detection model 4 is remote loan review, and so on.
  • After the current scene identification information corresponding to the to-be-detected video data has been obtained, the server must determine whether a script detection model corresponding to the current scene identification information is stored, so that the to-be-detected video data can be quality-inspected according to the corresponding script detection model.
  • In one embodiment, step S130 includes sub-steps such as S133: segmenting the text recognition result through a segmentation model based on probability statistics to obtain a corresponding segmentation result; if the segmentation result does not satisfy the detection conditions, the to-be-detected video data is sent to the preset failed-quality-inspection data storage area for storage, and the to-be-detected video data and corresponding user account information are sent to the monitoring terminal.
  • The to-be-detected video data obtained by the server includes video data such as a video sub-stream encoded with H264 (the highly compressed digital video codec standard proposed by the Joint Video Team formed by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group) at a resolution of 1080P, and may also include audio data such as an audio sub-stream encoded with EAC3 (Enhanced AC-3 bit streams, an audio encoding format) at a sampling rate of 48 kHz.
  • A parser pre-stored in the server is called to separate the to-be-detected video data into audio and video data, thereby obtaining the audio data and the video data.
  • The audio data can then be recognized through the N-gram model to obtain a text recognition result; the N-gram model is a pre-trained N-gram language model.
  • The N-gram model is a language model (Language Model, LM). A language model is a probability-based discriminative model: its input is a sentence (an ordered sequence of words) and its output is the probability of that sentence, i.e. the joint probability of those words. The N-gram model can also be used for speech-to-text recognition.
  • Suppose sentence T consists of the word sequence $w_1, w_2, \ldots, w_n$; the N-gram language model is expressed as $P(T) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1 w_2) \cdots p(w_n \mid w_1 w_2 \cdots w_{n-1})$.
  • The conditional probability of each word in sentence T can be obtained by counting in a corpus; the n-gram model is then $p(w_i \mid w_{i-n+1} \cdots w_{i-1}) = C(w_{i-n+1}, \ldots, w_i) / C(w_{i-n+1}, \ldots, w_{i-1})$.
  • Here $C(w_{i-n+1}, \ldots, w_i)$ denotes the number of times the string $w_{i-n+1}, \ldots, w_i$ occurs in the corpus.
  • In one embodiment, before step S131 the method further includes:
  • receiving a training-set corpus, and inputting the training-set corpus into an initial N-gram model for training to obtain the N-gram model.
  • In this embodiment, the training-set corpus is a general-purpose corpus; by inputting the training-set corpus into the initial N-gram model for training, the N-gram model used for speech recognition is obtained.
  • After the segmentation result is obtained, the detection conditions included in the script detection model can be obtained, and whether the segmentation result satisfies the detection conditions is determined.
  • In this application, the script detection model contains multiple "conditions"; conditions can be related to one another using "and, or, not", for example condition 1 && condition 2, and a condition may be nested in another, for example condition 2 being "when condition 1 is satisfied".
  • In one embodiment, the method further includes:
  • obtaining the condition relationships between the detection conditions in the script detection model, where the condition relationships include the AND relationship, the OR relationship, and the NOT relationship;
  • obtaining the detection operators included in each detection condition, where a detection operator is a keyword or a regular expression;
  • obtaining the operator relationships between the detection operators included in each detection condition, where the operator relationships include the AND relationship, the OR relationship, and the NOT relationship.
  • That is, some target words in the segmentation result are selected as operators (only a segmented word that is identical to a preset operator, or that satisfies the regular expression corresponding to an operator, can serve as a target word and be treated as an operator) and participate in the evaluation of the detection conditions.
  • If the segmentation result does not satisfy the detection conditions, the recognition result of the to-be-detected video data is set to "failed script quality inspection"; the to-be-detected video data is sent to the preset failed-quality-inspection data storage area for storage, and the to-be-detected video data and corresponding user account information are sent to the monitoring terminal. That is, in order to effectively separate video data that passed quality inspection from video data that failed it, a passed-quality-inspection data storage area and a failed-quality-inspection data storage area can be preset in the server; video data that passes quality inspection is then sent to the passed area for storage, and video data that fails is sent to the failed area for storage.
  • So that to-be-detected video data that failed quality inspection can undergo periodic retrospective investigation of historical data, the to-be-detected video data and corresponding user account information can be sent to the monitoring terminal, which watches the video and then investigates in detail.
  • In one embodiment, after step S135 the method further includes:
  • if the segmentation result satisfies the detection conditions, the to-be-detected video data has passed quality inspection.
  • The recognition result of the to-be-detected video data can then be set to "passed script quality inspection".
  • The to-be-detected video data is sent to the passed-quality-inspection data storage area preset in the server for storage.
  • When a script detection model is added and configured on the monitoring terminal, multiple conditions and operators need to be configured; for each detection condition, the following information can be configured:
  • 1) the name of the script detection model, for example "customer-consent solicitation model";
  • 2) the logical relationship between conditions, for example condition 1 && condition 2 && condition 3;
  • when configuring condition 1, configure the condition content, the condition range, and the logical relationship between the operators, and finally configure the type of each operator in condition 1 (keyword or regular expression) and the specific configuration content of each operator.
  • Condition 2, condition 3, and even more conditions are configured with reference to the configuration of condition 1.
  • When configuring the condition content of condition 1, "none" or another condition may be filled in.
  • When configuring the condition range, "full text" or another range may be filled in.
  • When configuring the logical relationships between the operators, they are connected through AND, OR, and NOT.
  • The specific content of an operator is determined by its type (for example, if the operator type is keyword, the specific content of the operator is a keyword set including several keywords).
  • In one embodiment, after step S140 the method further includes:
  • after the configuration of the newly added script detection model is complete, the newly added script detection model is uploaded to the server.
  • The server receives the newly added script detection model uploaded by the monitoring terminal and performs speech-to-text content recognition on the to-be-detected video data according to the newly added script detection model to obtain the current recognition result; for the complete process of this recognition, refer to steps S131-S137.
  • This method performs speech-script detection through a script detection model with multi-level logic, so detection accuracy is improved; moreover, when no suitable script detection model exists in the server, the monitoring terminal can be promptly prompted to configure a suitable model in time, so that script quality inspection is achieved.
  • An embodiment of this application also provides a speech-script detection apparatus based on multi-level logic, which is used to execute any embodiment of the aforementioned speech-script detection method based on multi-level logic.
  • FIG. 4 is a schematic block diagram of the speech-script detection apparatus based on multi-level logic provided by an embodiment of this application.
  • The speech-script detection apparatus 100 based on multi-level logic can be configured in a server.
  • The speech-script detection apparatus 100 based on multi-level logic includes a data receiving unit 110, a model judging unit 120, a text recognition unit 130, and a model-addition prompting unit 140.
  • The data receiving unit 110 is configured to receive to-be-detected video data and user account information uploaded by the user terminal.
  • In this embodiment, script detection models for a large number of scenes are configured and stored in the server; these models can perform text sentiment analysis on the script of an agent (for example, an agent selling life insurance) for quality inspection.
  • When the agent uses the user terminal (such as a tablet computer) to log in to the agent's system account and then communicates with a user, the video communication between the agent and the user is recorded in full by the user terminal (the recorded video can be regarded as the to-be-detected video data); the user terminal uploads the automatically collected to-be-detected video data to the server and at the same time uploads the user account information to the server.
  • At least the scene identification information corresponding to the agent is set in the user account information, such as scene identifiers for auto-insurance sales or life-insurance sales, so that once the server knows the agent's scene identification information it can proceed with script quality inspection.
  • The model judging unit 120 is configured to obtain current scene identification information according to the user account information and determine whether a script detection model corresponding to the current scene identification information is stored among the stored script detection models, where the script detection model includes several detection conditions, each detection condition includes several detection operators, and the script detection model is used to quality-inspect the text content corresponding to the to-be-detected video data.
  • Among the script detection models pre-stored in the server, each script detection model is associated with the detection scene to which the model applies (corresponding to the agent's scene identification information); for example, the scene identifier corresponding to script detection model 1 is bank dual recording, the scene identifier of script detection model 2 is insurance dual recording (more specifically, it can be set to auto-insurance sales as in the example above), the scene identifier of script detection model 3 is financial product promotion, the scene identifier of script detection model 4 is remote loan review, and so on.
  • After the current scene identification information corresponding to the to-be-detected video data has been obtained, the server must determine whether a script detection model corresponding to the current scene identification information is stored, so that the to-be-detected video data can be quality-inspected according to the corresponding script detection model.
  • The text recognition unit 130 is configured to, if a script detection model corresponding to the current scene identification information is stored among the stored script detection models, perform speech-to-text content recognition on the to-be-detected video data according to the script detection model to obtain a recognition result.
  • Before speech-to-text content recognition can be performed on the to-be-detected video data according to the script detection model, the audio data and video data in the to-be-detected video data must first be separated; the audio data of the to-be-detected video data is then quality-inspected according to the script detection model.
  • In one embodiment, the text recognition unit 130 includes:
  • an audio/video data separation unit 131, configured to separate the to-be-detected video data into audio and video data to obtain the audio data and video data corresponding to the to-be-detected video data;
  • an audio text extraction unit 132, configured to recognize the audio data through an N-gram model to obtain a text recognition result, where the N-gram model is a pre-trained N-gram language model;
  • a text segmentation unit 133, configured to segment the text recognition result through a segmentation model based on probability statistics to obtain a corresponding segmentation result;
  • a detection condition obtaining unit 134, configured to obtain the detection conditions included in the script detection model, where each detection condition includes at least one detection operator;
  • a detection condition judging unit 135, configured to determine whether the segmentation result satisfies the detection conditions;
  • a first storage unit 136, configured to, if the segmentation result does not satisfy the detection conditions, set the recognition result of the to-be-detected video data to "failed script quality inspection", send the to-be-detected video data to the preset failed-quality-inspection data storage area for storage, and send the to-be-detected video data and the corresponding user account information to the monitoring terminal.
  • The to-be-detected video data obtained by the server includes video data such as a video sub-stream encoded with H264 (the highly compressed digital video codec standard proposed by the Joint Video Team formed by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group) at a resolution of 1080P, and may also include audio data such as an audio sub-stream encoded with EAC3 (Enhanced AC-3 bit streams, an audio encoding format) at a sampling rate of 48 kHz.
  • A parser pre-stored in the server is called to separate the to-be-detected video data into audio and video data, thereby obtaining the audio data and the video data.
  • The audio data can then be recognized through the N-gram model to obtain a text recognition result; the N-gram model is a pre-trained N-gram language model.
  • The N-gram model is a language model (Language Model, LM). A language model is a probability-based discriminative model: its input is a sentence (an ordered sequence of words) and its output is the probability of that sentence, i.e. the joint probability of those words. The N-gram model can also be used for speech-to-text recognition.
  • In one embodiment, the text recognition unit 130 further includes:
  • an N-gram model training unit, configured to receive a training-set corpus and input the training-set corpus into an initial N-gram model for training to obtain the N-gram model.
  • In this embodiment, the training-set corpus is a general-purpose corpus; by inputting the training-set corpus into the initial N-gram model for training, the N-gram model used for speech recognition is obtained.
  • After the segmentation result is obtained, the detection conditions included in the script detection model can be obtained, and whether the segmentation result satisfies the detection conditions is determined.
  • In one embodiment, the text recognition unit 130 further includes:
  • a condition relationship obtaining unit, configured to obtain the condition relationships between the detection conditions in the script detection model, where the condition relationships include the AND relationship, the OR relationship, and the NOT relationship;
  • a detection operator obtaining unit, configured to obtain the detection operators included in each detection condition of the script detection model, where a detection operator is a keyword or a regular expression;
  • an operator relationship obtaining unit, configured to obtain the operator relationships between the detection operators included in each detection condition, where the operator relationships include the AND relationship, the OR relationship, and the NOT relationship.
  • That is, some target words in the segmentation result are selected as operators (only a segmented word that is identical to a preset operator, or that satisfies the regular expression corresponding to an operator, can serve as a target word and be treated as an operator) and participate in the evaluation of the detection conditions.
  • If the segmentation result does not satisfy the detection conditions, the recognition result of the to-be-detected video data is set to "failed script quality inspection"; the to-be-detected video data is sent to the preset failed-quality-inspection data storage area for storage, and the to-be-detected video data and corresponding user account information are sent to the monitoring terminal. That is, in order to effectively separate video data that passed quality inspection from video data that failed it, a passed-quality-inspection data storage area and a failed-quality-inspection data storage area can be preset in the server; video data that passes quality inspection is then sent to the passed area for storage, and video data that fails is sent to the failed area for storage.
  • So that to-be-detected video data that failed quality inspection can undergo periodic retrospective investigation of historical data, the to-be-detected video data and corresponding user account information can be sent to the monitoring terminal, which watches the video and then investigates in detail.
  • In one embodiment, the text recognition unit 130 further includes:
  • a second storage unit 137, configured to, if the segmentation result satisfies the detection conditions, set the recognition result of the to-be-detected video data to "passed script quality inspection" and send the to-be-detected video data to the preset passed-quality-inspection data storage area for storage.
  • When the segmentation result satisfies the detection conditions, the to-be-detected video data has passed quality inspection.
  • The recognition result of the to-be-detected video data can then be set to "passed script quality inspection".
  • The to-be-detected video data is sent to the passed-quality-inspection data storage area preset in the server for storage.
  • The model-addition prompting unit 140 is configured to, if no script detection model corresponding to the current scene identification information is stored among the stored script detection models, send prompt information indicating that no applicable detection model is stored to the monitoring terminal.
  • When a script detection model is added and configured on the monitoring terminal, multiple conditions and operators need to be configured; for each detection condition, the following information can be configured:
  • 1) the name of the script detection model, for example "customer-consent solicitation model";
  • 2) the logical relationship between conditions, for example condition 1 && condition 2 && condition 3;
  • when configuring condition 1, configure the condition content, the condition range, and the logical relationship between the operators, and finally configure the type of each operator in condition 1 (keyword or regular expression) and the specific configuration content of each operator.
  • Condition 2, condition 3, and even more conditions are configured with reference to the configuration of condition 1.
  • When configuring the condition content of condition 1, "none" or another condition may be filled in.
  • When configuring the condition range, "full text" or another range may be filled in.
  • When configuring the logical relationships between the operators, they are connected through AND, OR, and NOT.
  • The specific content of an operator is determined by its type (for example, if the operator type is keyword, the specific content of the operator is a keyword set including several keywords).
  • In one embodiment, the speech-script detection apparatus 100 based on multi-level logic further includes:
  • a new-model text recognition unit 150, configured to receive the newly added script detection model uploaded by the monitoring terminal, perform speech-to-text content recognition on the to-be-detected video data according to the newly added script detection model to obtain the current recognition result, and send the current recognition result to the monitoring terminal for storage.
  • After the configuration of the newly added script detection model is complete, the newly added script detection model is uploaded to the server.
  • The server receives the newly added script detection model uploaded by the monitoring terminal and performs speech-to-text content recognition on the to-be-detected video data according to the newly added script detection model to obtain the current recognition result.
  • For the complete process of performing speech-to-text content recognition on the to-be-detected video data according to the newly added script detection model and obtaining the recognition result, refer to the data processing of the text recognition unit 130.
  • This apparatus performs speech-script detection through a script detection model with multi-level logic, so detection accuracy is improved; moreover, when no suitable script detection model exists in the server, the monitoring terminal can be promptly prompted to configure a suitable model in time, so that script quality inspection is achieved.
  • The speech-script detection apparatus based on multi-level logic described above can be implemented in the form of a computer program, and the computer program can run on a computer device as shown in FIG. 6.
  • FIG. 6 is a schematic block diagram of a computer device according to an embodiment of this application.
  • The computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
  • The computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
  • The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032.
  • When the computer program 5032 is executed, it can cause the processor 502 to perform the speech-script detection method based on multi-level logic.
  • The processor 502 provides computing and control capabilities and supports the operation of the entire computer device 500.
  • The internal memory 504 provides an environment for running the computer program 5032 in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, it can cause the processor 502 to perform the speech-script detection method based on multi-level logic.
  • The network interface 505 is used for network communication, such as providing transmission of data information.
  • The structure shown in FIG. 6 is only a block diagram of part of the structure related to the solution of this application and does not limit the computer device 500 to which the solution of this application is applied.
  • A specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • The processor 502 is configured to run the computer program 5032 stored in the memory to implement the speech-script detection method based on multi-level logic disclosed in the embodiments of this application.
  • The embodiment of the computer device shown in FIG. 6 does not limit the specific configuration of the computer device.
  • In other embodiments, the computer device may include more or fewer components than shown in the figure, combine certain components, or arrange components differently.
  • For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 6 and are not repeated here.
  • The processor 502 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
  • In another embodiment of this application, a computer-readable storage medium is provided.
  • The computer-readable storage medium may be non-volatile or volatile.
  • The computer-readable storage medium stores a computer program, where the computer program, when executed by a processor, implements the speech-script detection method based on multi-level logic disclosed in the embodiments of this application.
  • In the several embodiments provided in this application, it should be understood that the disclosed equipment, apparatus, and method may be implemented in other ways.
  • The apparatus embodiments described above are merely illustrative.
  • The division into units is only a division by logical function; in actual implementation there may be other division methods, or units with the same function may be combined into one unit; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • The mutual coupling, direct coupling, or communication connections shown or discussed may be indirect coupling or communication connections through some interfaces, apparatuses, or units, and may also be electrical, mechanical, or other forms of connection.
  • The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of this application.
  • The functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium.
  • Based on this understanding, the technical solution of this application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or some of the steps of the methods described in the embodiments of this application.
  • The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

This application discloses a method, apparatus, computer device, and storage medium for speech-script detection based on multi-level logic, relating to the field of artificial intelligence. The method includes: receiving to-be-detected video data and user account information uploaded by a user terminal; obtaining current scene identification information according to the user account information, and determining whether a script detection model corresponding to the current scene identification information is stored among the stored script detection models, where the script detection model includes several detection conditions and each detection condition includes several detection operators; if a corresponding script detection model is stored, performing speech-to-text content recognition on the to-be-detected video data according to the script detection model to obtain a recognition result; and if no corresponding script detection model is stored, sending prompt information indicating that no applicable detection model is stored to the monitoring terminal. The method performs speech-script detection through a script detection model with multi-level logic, and detection accuracy is improved.

Description

Method, apparatus, computer device, and storage medium for speech-script detection based on multi-level logic
This application claims priority to the Chinese patent application filed with the China Patent Office on April 24, 2020, with application number 202010332843.7 and invention title "Method, apparatus and computer device for speech-script detection based on multi-level logic", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of speech recognition, and in particular to a method, apparatus, computer device, and storage medium for speech-script detection based on multi-level logic.
Background
At present, later-stage script quality inspection in traditional speech quality-inspection systems relies on humans listening to recordings for review, which is very time-consuming and labor-intensive.
After speech-to-text technology was applied, audio in recordings or videos began to be converted into text, and traditional NLP keyword matching was used for script quality inspection, for example by first spending considerable manpower configuring keywords and their synonyms and collecting degree adverbs, negation words, commendatory words, derogatory words, positive sentiment words, and negative sentiment words. However, the inventor realized that, owing to the diversity of the Chinese language, this method still has many drawbacks: relying only on synonyms, degree adverbs, negation words, commendatory/derogatory words, and sentiment words cannot truly understand people's natural language, let alone give script quality inspection a good result.
In addition, the design of traditional script quality-inspection systems is too flat; that is, the collected words have no good hierarchical logical relationship between upper and lower levels, and only simple keyword matching is used, so the detection results are not ideal.
Summary
Embodiments of this application provide a method, apparatus, computer device, and storage medium for speech-script detection based on multi-level logic, intended to solve the problem in the prior art that, after audio in recordings or videos is converted into text in a speech quality-inspection system, detection relies only on synonyms, degree adverbs, negation words, commendatory/derogatory words, and sentiment words, together with simple keyword matching, resulting in poor detection performance.
In a first aspect, an embodiment of this application provides a speech-script detection method based on multi-level logic, which includes:
receiving to-be-detected video data and user account information uploaded by a user terminal;
obtaining current scene identification information according to the user account information, and determining whether a script detection model corresponding to the current scene identification information is stored among the stored script detection models, where the script detection model includes several detection conditions, each detection condition includes several detection operators, and the script detection model is used to quality-inspect the text content corresponding to the to-be-detected video data;
if a script detection model corresponding to the current scene identification information is stored among the stored script detection models, performing speech-to-text content recognition on the to-be-detected video data according to the script detection model to obtain a recognition result; and
if no script detection model corresponding to the current scene identification information is stored among the stored script detection models, sending prompt information indicating that no applicable detection model is stored to the monitoring terminal.
In a second aspect, an embodiment of this application provides a speech-script detection apparatus based on multi-level logic, which includes:
a data receiving unit, configured to receive to-be-detected video data and user account information uploaded by a user terminal;
a model judging unit, configured to obtain current scene identification information according to the user account information and determine whether a script detection model corresponding to the current scene identification information is stored among the stored script detection models, where the script detection model includes several detection conditions, each detection condition includes several detection operators, and the script detection model is used to quality-inspect the text content corresponding to the to-be-detected video data;
a text recognition unit, configured to, if a script detection model corresponding to the current scene identification information is stored among the stored script detection models, perform speech-to-text content recognition on the to-be-detected video data according to the script detection model to obtain a recognition result; and
a model-addition prompting unit, configured to, if no script detection model corresponding to the current scene identification information is stored among the stored script detection models, send prompt information indicating that no applicable detection model is stored to the monitoring terminal.
In a third aspect, an embodiment of this application further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
receiving to-be-detected video data and user account information uploaded by a user terminal;
obtaining current scene identification information according to the user account information, and determining whether a script detection model corresponding to the current scene identification information is stored among the stored script detection models, where the script detection model includes several detection conditions, each detection condition includes several detection operators, and the script detection model is used to quality-inspect the text content corresponding to the to-be-detected video data;
if a script detection model corresponding to the current scene identification information is stored among the stored script detection models, performing speech-to-text content recognition on the to-be-detected video data according to the script detection model to obtain a recognition result; and
if no script detection model corresponding to the current scene identification information is stored among the stored script detection models, sending prompt information indicating that no applicable detection model is stored to the monitoring terminal.
In a fourth aspect, an embodiment of this application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following operations:
receiving to-be-detected video data and user account information uploaded by a user terminal;
obtaining current scene identification information according to the user account information, and determining whether a script detection model corresponding to the current scene identification information is stored among the stored script detection models, where the script detection model includes several detection conditions, each detection condition includes several detection operators, and the script detection model is used to quality-inspect the text content corresponding to the to-be-detected video data;
if a script detection model corresponding to the current scene identification information is stored among the stored script detection models, performing speech-to-text content recognition on the to-be-detected video data according to the script detection model to obtain a recognition result; and
if no script detection model corresponding to the current scene identification information is stored among the stored script detection models, sending prompt information indicating that no applicable detection model is stored to the monitoring terminal.
Embodiments of this application provide a method, apparatus, computer device, and storage medium for speech-script detection based on multi-level logic, including: receiving to-be-detected video data and user account information uploaded by a user terminal; obtaining current scene identification information according to the user account information, and determining whether a script detection model corresponding to the current scene identification information is stored among the stored script detection models, where the script detection model includes several detection conditions, each detection condition includes several detection operators, and the script detection model is used to quality-inspect the text content corresponding to the to-be-detected video data; if a corresponding script detection model is stored, performing speech-to-text content recognition on the to-be-detected video data according to the script detection model to obtain a recognition result; and if no corresponding script detection model is stored, sending prompt information indicating that no applicable detection model is stored to the monitoring terminal. The method performs speech-script detection through a script detection model with multi-level logic, and detection accuracy is improved.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are some embodiments of this application, and those of ordinary skill in the art may obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application scenario of the speech-script detection method based on multi-level logic provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of the speech-script detection method based on multi-level logic provided by an embodiment of this application;
FIG. 3 is a schematic diagram of a sub-flow of the speech-script detection method based on multi-level logic provided by an embodiment of this application;
FIG. 4 is a schematic block diagram of the speech-script detection apparatus based on multi-level logic provided by an embodiment of this application;
FIG. 5 is a schematic block diagram of sub-units of the speech-script detection apparatus based on multi-level logic provided by an embodiment of this application;
FIG. 6 is a schematic block diagram of a computer device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are some rather than all of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
It should be understood that, when used in this specification and the appended claims, the terms "comprise" and "include" indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to any combination, and all possible combinations, of one or more of the associated listed items, and includes these combinations.
Referring to FIG. 1 and FIG. 2, FIG. 1 is a schematic diagram of an application scenario of the speech-script detection method based on multi-level logic provided by an embodiment of this application, and FIG. 2 is a schematic flowchart of that method. The speech-script detection method based on multi-level logic is applied in a server and is executed by application software installed in the server.
As shown in FIG. 2, the method includes steps S110 to S140.
S110: Receive to-be-detected video data and user account information uploaded by a user terminal.
In this embodiment, script detection models for a large number of scenes are configured and stored in the server; these models can perform text sentiment analysis on the script of an agent (for example, an agent selling life insurance) for quality inspection.
When the agent uses the user terminal (such as a tablet computer) to log in to the agent's system account and then communicates with a user, the video communication between the agent and the user is recorded in full by the user terminal (the recorded video can be regarded as the to-be-detected video data); the user terminal uploads the automatically collected to-be-detected video data to the server and at the same time uploads the user account information to the server.
At least the scene identification information corresponding to the agent (indicating the agent's service scene) is set in the user account information, for example scene identifiers such as auto-insurance sales or life-insurance sales, so that once the server knows the agent's scene identification information it can proceed further with script quality inspection.
S120: Obtain current scene identification information according to the user account information, and determine whether a script detection model corresponding to the current scene identification information is stored among the stored script detection models; the script detection model includes several detection conditions, each detection condition includes several detection operators, and the script detection model is used to quality-inspect the text content corresponding to the to-be-detected video data.
In this embodiment, among the multiple script detection models pre-stored in the server, each script detection model is associated with the detection scene to which the model applies (corresponding to the agent's scene identification information); for example, the scene identifier corresponding to script detection model 1 is bank dual recording, the scene identifier of script detection model 2 is insurance dual recording (more specifically, it can be set to auto-insurance sales as in the example above), the scene identifier of script detection model 3 is financial product promotion, the scene identifier of script detection model 4 is remote loan review, and so on.
After the current scene identification information corresponding to the to-be-detected video data has been obtained, the server must determine whether a script detection model corresponding to the current scene identification information is stored among the stored script detection models, so that the to-be-detected video data can be quality-inspected according to the corresponding script detection model.
S130: If a script detection model corresponding to the current scene identification information is stored among the stored script detection models, perform speech-to-text content recognition on the to-be-detected video data according to the script detection model to obtain a recognition result.
In this embodiment, before speech-to-text content recognition can be performed on the to-be-detected video data according to the script detection model, the audio and video data in the to-be-detected video data must first be separated; the audio data of the to-be-detected video data is then quality-inspected according to the script detection model.
In one embodiment, as shown in FIG. 3, step S130 includes:
S131: Separate the to-be-detected video data into audio and video data to obtain the audio data and video data corresponding to the to-be-detected video data;
S132: Recognize the audio data through an N-gram model to obtain a text recognition result, where the N-gram model is a pre-trained N-gram language model;
S133: Segment the text recognition result through a segmentation model based on probability statistics to obtain a corresponding segmentation result;
S134: Obtain the detection conditions included in the script detection model, where each detection condition includes at least one detection operator;
S135: Determine whether the segmentation result satisfies the detection conditions;
S136: If the segmentation result does not satisfy the detection conditions, set the recognition result of the to-be-detected video data to "failed script quality inspection", send the to-be-detected video data to a preset failed-quality-inspection data storage area for storage, and send the to-be-detected video data and the corresponding user account information to the monitoring terminal.
In this embodiment, the to-be-detected video data obtained by the server includes video data such as a video sub-stream encoded with H264 (the highly compressed digital video codec standard proposed by the Joint Video Team formed by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group) at a resolution of 1080P, and may also include audio data such as an audio sub-stream encoded with EAC3 (Enhanced AC-3 bit streams, an audio encoding format) at a sampling rate of 48 kHz. A parser pre-stored in the server is then called to separate the to-be-detected video data into audio and video data, thereby obtaining the audio data and the video data.
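As an illustration only (the patent does not name a specific parser), this kind of audio/video separation can be sketched by invoking the ffmpeg command-line tool from Python; the file names here are hypothetical:

```python
import subprocess

def demux(video_path: str, audio_out: str, video_out: str) -> None:
    """Split a dual-recording file into its audio and video sub-streams."""
    # -vn drops the video track; -acodec copy keeps the audio codec (e.g. EAC3) unchanged.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", "-acodec", "copy", audio_out], check=True)
    # -an drops the audio track; -vcodec copy keeps the video codec (e.g. H264) unchanged.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-an", "-vcodec", "copy", video_out], check=True)

demux("to_be_detected.mp4", "audio.eac3", "video.h264")
```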
Since a trained N-gram model is pre-stored in the server, the audio data can now be recognized through the N-gram model to obtain a text recognition result; the N-gram model is a pre-trained N-gram language model.
The N-gram model is a language model (Language Model, LM). A language model is a probability-based discriminative model: its input is a sentence (an ordered sequence of words) and its output is the probability of that sentence, i.e. the joint probability of those words. The N-gram model can also be used for speech-to-text recognition.
Suppose sentence $T$ consists of the word sequence $w_1, w_2, w_3, \ldots, w_n$. The N-gram language model is expressed as:
$P(T) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1 w_2)\cdots p(w_n \mid w_1 w_2 \cdots w_{n-1})$
The commonly used N-gram models are the bi-gram and the tri-gram, expressed respectively as:
Bi-gram: $P(T) = p(w_1 \mid \text{begin})\,p(w_2 \mid w_1)\,p(w_3 \mid w_2)\cdots p(w_n \mid w_{n-1})$
Tri-gram: $P(T) = p(w_1 \mid \text{begin}_1, \text{begin}_2)\,p(w_2 \mid w_1, \text{begin}_1)\,p(w_3 \mid w_2 w_1)\cdots p(w_n \mid w_{n-1}, w_{n-2})$
As can be seen, the conditional probability of each word in sentence $T$ can be obtained by counting in a corpus. The n-gram model is then:
$p(w_i \mid w_{i-n+1} \cdots w_{i-1}) = C(w_{i-n+1}, \ldots, w_i) / C(w_{i-n+1}, \ldots, w_{i-1})$
where $C(w_{i-n+1}, \ldots, w_i)$ denotes the number of times the string $w_{i-n+1}, \ldots, w_i$ occurs in the corpus.
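The counting formulas above can be sketched in a few lines; this is a minimal illustration with an invented two-sentence corpus, not the patent's training procedure:

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate p(w_i | w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1}) by counting in the corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<begin>"] + sentence
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return lambda prev, word: bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_probability(p, sentence):
    """P(T) = p(w1 | begin) * p(w2 | w1) * ... * p(wn | w_{n-1})."""
    prob, prev = 1.0, "<begin>"
    for word in sentence:
        prob *= p(prev, word)
        prev = word
    return prob

p = train_bigram([["好的", "我", "同意"], ["我", "同意", "购买"]])
print(sentence_probability(p, ["我", "同意"]))  # 0.5: p(我|begin)=1/2, p(同意|我)=2/2
```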
In one embodiment, before step S131 the method further includes:
receiving a training-set corpus, and inputting the training-set corpus into an initial N-gram model for training to obtain the N-gram model.
In this embodiment, the training-set corpus is a general-purpose corpus; by inputting the training-set corpus into the initial N-gram model for training, the N-gram model used for speech recognition is obtained.
The text recognition result is then segmented by the segmentation model based on probability statistics, as follows:
For example, let C = C1C2...Cm, where C is the Chinese character string to be segmented; let W = W1W2...Wn, where W is a segmentation result; and let Wa, Wb, ..., Wk be all possible segmentation schemes of C. The segmentation model based on probability statistics is the one that finds the target word string W such that W satisfies P(W|C) = MAX(P(Wa|C), P(Wb|C), ..., P(Wk|C)); the word string W obtained by this segmentation model is the word string whose estimated probability is maximal. That is:
For a substring S to be segmented, take out all candidate words w1, w2, ..., wi, ..., wn in left-to-right order; look up the probability value P(wi) of each candidate word in the dictionary and record all left-neighbor words of each candidate word; compute the cumulative probability of each candidate word and, by comparison, obtain the best left-neighbor word of each candidate word; if the current word wn is the tail word of the string S and its cumulative probability P(wn) is maximal, then wn is the terminal word of S; starting from wn, in right-to-left order, output the best left-neighbor word of each word in turn; this is the segmentation result of S.
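A compact sketch of this maximum-probability procedure follows; the dictionary probabilities are invented for illustration (a real system estimates them from corpus statistics):

```python
def max_prob_segment(s, word_probs):
    """best[i] is the probability of the best segmentation of s[:i];
    back[i] records where the best word ending at position i starts."""
    n = len(s)
    best = [1.0] + [0.0] * n
    back = [None] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            word = s[j:i]
            if word in word_probs and best[j] * word_probs[word] > best[i]:
                best[i] = best[j] * word_probs[word]
                back[i] = j
    if back[n] is None:
        return None  # no segmentation found under this dictionary
    # Walk back from the tail word, outputting each best left neighbour, then reverse.
    words, i = [], n
    while i > 0:
        words.append(s[back[i]:i])
        i = back[i]
    return list(reversed(words))

probs = {"结婚": 0.02, "的": 0.1, "和": 0.05, "尚未": 0.01, "和尚": 0.008, "未": 0.02}
print(max_prob_segment("结婚的和尚未", probs))  # ['结婚', '的', '和', '尚未']
```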
After the segmentation result corresponding to the audio data has been obtained, the detection conditions included in the script detection model can be obtained, and whether the segmentation result satisfies the detection conditions is determined.
In this application, the script detection model contains multiple "conditions", and each "condition" has the following properties (a code sketch follows the list):
① conditions can be related to one another using "and, or, not", for example: condition 1 && (condition 2 || !condition 3);
② if a condition is satisfied, the operators under that condition are executed;
③ the optional condition types are as follows: i. unconditional, i.e. full-text search; ii. "when some condition is satisfied", which can be regarded as condition nesting, for example: condition 2 is "when condition 1 is satisfied";
④ when a condition is satisfied, the execution range of the operators under it can be selected: i. at which hit of the condition, and the range before and after the hit position;
⑤ each condition contains multiple "operators", which can also be understood as detection operators;
⑥ operators can be related to one another using "and, or, not", for example: operator 1 && (operator 2 || !operator 3);
⑦ the optional operator types are as follows: i. a keyword or regular expression, for example, operator 1: 好的 || 是的 || 同意 ("okay || yes || agree"); ii. whether an interrogative sentence is contained.
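A minimal sketch of this two-level and/or/not evaluation over keyword and regular-expression operators follows; the condition encoding and all names are illustrative assumptions rather than the patent's storage format:

```python
import re

def eval_operator(op, text):
    """An operator is a keyword set or a regular expression matched against the transcript."""
    if op["type"] == "keyword":
        return any(k in text for k in op["keywords"])
    return re.search(op["pattern"], text) is not None

def eval_expr(expr, operators, text):
    """Evaluate a nested ('and' | 'or' | 'not', ...) expression over named operators."""
    if isinstance(expr, str):
        return eval_operator(operators[expr], text)
    kind, rest = expr[0], expr[1:]
    if kind == "not":
        return not eval_expr(rest[0], operators, text)
    results = (eval_expr(e, operators, text) for e in rest)
    return all(results) if kind == "and" else any(results)

# Condition: operator 1 && (operator 2 || !operator 3)
operators = {
    "op1": {"type": "keyword", "keywords": ["好的", "是的", "同意"]},
    "op2": {"type": "regex", "pattern": "已.{0,4}阅读"},
    "op3": {"type": "keyword", "keywords": ["拒绝"]},
}
condition1 = ("and", "op1", ("or", "op2", ("not", "op3")))
print(eval_expr(condition1, operators, "好的，我已仔细阅读相关条款"))  # True
```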
In one embodiment, after step S134 the method further includes:
obtaining the condition relationships between the detection conditions in the script detection model, where the condition relationships include the AND relationship, the OR relationship, and the NOT relationship;
obtaining the detection operators included in each detection condition of the script detection model, where a detection operator is a keyword or a regular expression;
obtaining the operator relationships between the detection operators included in each detection condition, where the operator relationships include the AND relationship, the OR relationship, and the NOT relationship.
That is, some target words in the segmentation result are selected as operators (only a segmented word that is identical to a preset operator, or that satisfies the regular expression corresponding to an operator, can serve as a target word and be treated as an operator) and participate in the evaluation of the detection conditions. When all operators substituted into the detection conditions satisfy them, the segmentation result satisfies the detection conditions; when some operator substituted into the detection conditions does not satisfy them, the segmentation result does not satisfy the detection conditions.
If the segmentation result does not satisfy the detection conditions, the recognition result of the to-be-detected video data is set to "failed script quality inspection"; the to-be-detected video data is sent to the preset failed-quality-inspection data storage area for storage, and the to-be-detected video data and the corresponding user account information are sent to the monitoring terminal. That is, in order to effectively separate video data that passed quality inspection from video data that failed it in the server, a passed-quality-inspection data storage area and a failed-quality-inspection data storage area can be preset in the server; to-be-detected video data that passes quality inspection is then sent to the passed-quality-inspection data storage area for storage, and to-be-detected video data that fails is sent to the failed-quality-inspection data storage area for storage.
So that to-be-detected video data that failed quality inspection can undergo periodic retrospective investigation of historical data, the to-be-detected video data and the corresponding user account information can at this point be sent to the monitoring terminal, which watches the to-be-detected video data and then investigates in detail.
In one embodiment, as shown in FIG. 3, after step S135 the method further includes:
S137: If the segmentation result satisfies the detection conditions, set the recognition result of the to-be-detected video data to "passed script quality inspection", and send the to-be-detected video data to the preset passed-quality-inspection data storage area for storage.
In this embodiment, when the segmentation result satisfies the detection conditions, the to-be-detected video data has passed quality inspection; its recognition result can then be set to "passed script quality inspection", and the to-be-detected video data is sent to the passed-quality-inspection data storage area preset in the server for storage.
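Schematically, the pass/fail routing of S136 and S137 might look like the following sketch; the storage paths and notification structure are invented placeholders:

```python
PASSED_AREA, FAILED_AREA = "/data/qc_passed", "/data/qc_failed"  # hypothetical storage areas

def route_result(video_id: str, passed: bool, account_info: dict) -> dict:
    """Label the recording and route it to the preset storage area (S136/S137)."""
    record = {
        "video_id": video_id,
        "result": "passed script quality inspection" if passed else "failed script quality inspection",
        "storage_area": PASSED_AREA if passed else FAILED_AREA,
    }
    if not passed:
        # Failed recordings are also sent to the monitoring terminal for detailed review.
        record["notify_monitor"] = {"video_id": video_id, "account": account_info}
    return record
```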
S140: If no script detection model corresponding to the current scene identification information is stored among the stored script detection models, send prompt information indicating that no applicable detection model is stored to the monitoring terminal.
In this embodiment, when the current scene identification information corresponding to the to-be-detected video data uploaded by the user terminal has no corresponding script detection model in the server, the monitoring terminal must be promptly prompted to upload one in time, so that a script detection model is newly added to the server.
When a script detection model is newly added and configured on the monitoring terminal, multiple conditions and operators need to be configured to compose the newly added script detection model, which is used to quality-inspect the to-be-detected video data. When each detection condition is configured, the following information can be configured:
1) the name of the script detection model, for example "customer-consent solicitation model";
2) the logical relationship between conditions, for example condition 1 && condition 2 && condition 3;
For example, when configuring condition 1, configure the condition content, the condition range, and the logical relationship between the operators, and finally configure the type of each operator in condition 1 (keyword or regular expression) and the specific configuration content of each operator; condition 2, condition 3, and even more conditions are configured with reference to the configuration of condition 1. The condition content of condition 1 may be filled in as "none" or as another condition; the condition range may be filled in as "full text" or as another range; the logical relationships between operators are connected through AND, OR, and NOT; and the specific content of an operator is determined by its type (for example, if the operator type is keyword, the specific content of the operator is a keyword set containing several keywords).
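Purely as an illustration of such a configuration (the field names are invented; the patent does not prescribe a serialization format), a "customer-consent solicitation" model might be written down as:

```python
# Hypothetical configuration for a "customer-consent solicitation" script detection model.
consent_model = {
    "model_name": "征询客户同意模型",
    "condition_logic": "condition1 && condition2 && condition3",
    "conditions": {
        "condition1": {
            "content": "none",        # not nested under another condition
            "range": "full_text",     # operators search the whole transcript
            "operator_logic": "op1 && (op2 || !op3)",
            "operators": {
                "op1": {"type": "keyword", "keywords": ["好的", "是的", "同意"]},
                "op2": {"type": "regex", "pattern": "已.{0,4}阅读"},
                "op3": {"type": "keyword", "keywords": ["拒绝"]},
            },
        },
        # condition2 and condition3 follow the same shape as condition1.
    },
}
```

A configuration like this could then be evaluated against a segmented transcript with the eval_expr sketch shown earlier.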
In one embodiment, as shown in FIG. 2, after step S140 the method further includes:
S150: Receive the newly added script detection model uploaded by the monitoring terminal, perform speech-to-text content recognition on the to-be-detected video data according to the newly added script detection model to obtain a current recognition result, and send the current recognition result to the monitoring terminal for storage.
In this embodiment, after the monitoring terminal receives the prompt information sent by the server indicating that no applicable detection model is stored, and once the configuration of the newly added script detection model is complete, the newly added script detection model is uploaded to the server. The server receives the newly added script detection model uploaded by the monitoring terminal and performs speech-to-text content recognition on the to-be-detected video data according to it, obtaining the current recognition result. For the complete process of performing speech-to-text content recognition on the to-be-detected video data according to the newly added script detection model and obtaining the recognition result, refer to steps S131-S137.
This method performs speech-script detection through a script detection model with multi-level logic, so detection accuracy is improved; moreover, when no suitable script detection model exists in the server, the monitoring terminal can be promptly prompted to configure a suitable model in time, so that script quality inspection is achieved.
An embodiment of this application also provides a speech-script detection apparatus based on multi-level logic, which is used to execute any embodiment of the aforementioned speech-script detection method based on multi-level logic. Specifically, referring to FIG. 4, FIG. 4 is a schematic block diagram of the speech-script detection apparatus based on multi-level logic provided by an embodiment of this application. The speech-script detection apparatus 100 based on multi-level logic can be configured in a server.
As shown in FIG. 4, the speech-script detection apparatus 100 based on multi-level logic includes a data receiving unit 110, a model judging unit 120, a text recognition unit 130, and a model-addition prompting unit 140.
The data receiving unit 110 is configured to receive to-be-detected video data and user account information uploaded by a user terminal.
In this embodiment, script detection models for a large number of scenes are configured and stored in the server; these models can perform text sentiment analysis on the script of an agent (for example, an agent selling life insurance) for quality inspection.
When the agent uses the user terminal (such as a tablet computer) to log in to the agent's system account and then communicates with a user, the video communication between the agent and the user is recorded in full by the user terminal (the recorded video can be regarded as the to-be-detected video data); the user terminal uploads the automatically collected to-be-detected video data to the server and at the same time uploads the user account information to the server.
At least the scene identification information corresponding to the agent (indicating the agent's service scene) is set in the user account information, for example scene identifiers such as auto-insurance sales or life-insurance sales, so that once the server knows the agent's scene identification information it can proceed further with script quality inspection.
The model judging unit 120 is configured to obtain current scene identification information according to the user account information and determine whether a script detection model corresponding to the current scene identification information is stored among the stored script detection models, where the script detection model includes several detection conditions, each detection condition includes several detection operators, and the script detection model is used to quality-inspect the text content corresponding to the to-be-detected video data.
In this embodiment, among the multiple script detection models pre-stored in the server, each script detection model is associated with the detection scene to which the model applies (corresponding to the agent's scene identification information); for example, the scene identifier corresponding to script detection model 1 is bank dual recording, the scene identifier of script detection model 2 is insurance dual recording (more specifically, it can be set to auto-insurance sales as in the example above), the scene identifier of script detection model 3 is financial product promotion, the scene identifier of script detection model 4 is remote loan review, and so on.
After the current scene identification information corresponding to the to-be-detected video data has been obtained, the server must determine whether a script detection model corresponding to the current scene identification information is stored among the stored script detection models, so that the to-be-detected video data can be quality-inspected according to the corresponding script detection model.
The text recognition unit 130 is configured to, if a script detection model corresponding to the current scene identification information is stored among the stored script detection models, perform speech-to-text content recognition on the to-be-detected video data according to the script detection model to obtain a recognition result.
In this embodiment, before speech-to-text content recognition can be performed on the to-be-detected video data according to the script detection model, the audio data and video data in the to-be-detected video data must first be separated; the audio data of the to-be-detected video data is then quality-inspected according to the script detection model.
In one embodiment, as shown in FIG. 5, the text recognition unit 130 includes:
an audio/video data separation unit 131, configured to separate the to-be-detected video data into audio and video data to obtain the audio data and video data corresponding to the to-be-detected video data;
an audio text extraction unit 132, configured to recognize the audio data through an N-gram model to obtain a text recognition result, where the N-gram model is a pre-trained N-gram language model;
a text segmentation unit 133, configured to segment the text recognition result through a segmentation model based on probability statistics to obtain a corresponding segmentation result;
a detection condition obtaining unit 134, configured to obtain the detection conditions included in the script detection model, where each detection condition includes at least one detection operator;
a detection condition judging unit 135, configured to determine whether the segmentation result satisfies the detection conditions;
a first storage unit 136, configured to, if the segmentation result does not satisfy the detection conditions, set the recognition result of the to-be-detected video data to "failed script quality inspection", send the to-be-detected video data to a preset failed-quality-inspection data storage area for storage, and send the to-be-detected video data and the corresponding user account information to the monitoring terminal.
In this embodiment, the to-be-detected video data obtained by the server includes video data such as a video sub-stream encoded with H264 (the highly compressed digital video codec standard proposed by the Joint Video Team formed by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group) at a resolution of 1080P, and may also include audio data such as an audio sub-stream encoded with EAC3 (Enhanced AC-3 bit streams, an audio encoding format) at a sampling rate of 48 kHz. A parser pre-stored in the server is then called to separate the to-be-detected video data into audio and video data, thereby obtaining the audio data and the video data.
Since a trained N-gram model is pre-stored in the server, the audio data can now be recognized through the N-gram model to obtain a text recognition result; the N-gram model is a pre-trained N-gram language model.
The N-gram model is a language model (Language Model, LM). A language model is a probability-based discriminative model: its input is a sentence (an ordered sequence of words) and its output is the probability of that sentence, i.e. the joint probability of those words. The N-gram model can also be used for speech-to-text recognition.
In one embodiment, the text recognition unit 130 further includes:
an N-gram model training unit, configured to receive a training-set corpus and input the training-set corpus into an initial N-gram model for training to obtain the N-gram model.
In this embodiment, the training-set corpus is a general-purpose corpus; by inputting the training-set corpus into the initial N-gram model for training, the N-gram model used for speech recognition is obtained.
The text recognition result is then segmented by the segmentation model based on probability statistics, as described for the method embodiment above: for a substring S to be segmented, all candidate words are taken out in left-to-right order, the probability value of each candidate word is looked up in the dictionary and its left-neighbor words are recorded, the cumulative probability and best left-neighbor word of each candidate word are computed, the tail word with maximal cumulative probability is taken as the terminal word of S, and the best left-neighbor words are output in right-to-left order to give the segmentation result of S.
After the segmentation result corresponding to the audio data has been obtained, the detection conditions included in the script detection model can be obtained, and whether the segmentation result satisfies the detection conditions is determined.
In one embodiment, the text recognition unit 130 further includes:
a condition relationship obtaining unit, configured to obtain the condition relationships between the detection conditions in the script detection model, where the condition relationships include the AND relationship, the OR relationship, and the NOT relationship;
a detection operator obtaining unit, configured to obtain the detection operators included in each detection condition of the script detection model, where a detection operator is a keyword or a regular expression;
an operator relationship obtaining unit, configured to obtain the operator relationships between the detection operators included in each detection condition, where the operator relationships include the AND relationship, the OR relationship, and the NOT relationship.
That is, some target words in the segmentation result are selected as operators (only a segmented word that is identical to a preset operator, or that satisfies the regular expression corresponding to an operator, can serve as a target word and be treated as an operator) and participate in the evaluation of the detection conditions. When all operators substituted into the detection conditions satisfy them, the segmentation result satisfies the detection conditions; when some operator substituted into the detection conditions does not satisfy them, the segmentation result does not satisfy the detection conditions.
If the segmentation result does not satisfy the detection conditions, the recognition result of the to-be-detected video data is set to "failed script quality inspection"; the to-be-detected video data is sent to the preset failed-quality-inspection data storage area for storage, and the to-be-detected video data and the corresponding user account information are sent to the monitoring terminal. That is, in order to effectively separate video data that passed quality inspection from video data that failed it in the server, a passed-quality-inspection data storage area and a failed-quality-inspection data storage area can be preset in the server; to-be-detected video data that passes quality inspection is then sent to the passed-quality-inspection data storage area for storage, and to-be-detected video data that fails is sent to the failed-quality-inspection data storage area for storage.
So that to-be-detected video data that failed quality inspection can undergo periodic retrospective investigation of historical data, the to-be-detected video data and the corresponding user account information can at this point be sent to the monitoring terminal, which watches the to-be-detected video data and then investigates in detail.
In one embodiment, as shown in FIG. 5, the text recognition unit 130 further includes:
a second storage unit 137, configured to, if the segmentation result satisfies the detection conditions, set the recognition result of the to-be-detected video data to "passed script quality inspection" and send the to-be-detected video data to the preset passed-quality-inspection data storage area for storage.
In this embodiment, when the segmentation result satisfies the detection conditions, the to-be-detected video data has passed quality inspection; its recognition result can then be set to "passed script quality inspection", and the to-be-detected video data is sent to the passed-quality-inspection data storage area preset in the server for storage.
The model-addition prompting unit 140 is configured to, if no script detection model corresponding to the current scene identification information is stored among the stored script detection models, send prompt information indicating that no applicable detection model is stored to the monitoring terminal.
In this embodiment, when the current scene identification information corresponding to the to-be-detected video data uploaded by the user terminal has no corresponding script detection model in the server, the monitoring terminal must be promptly prompted to upload one in time, so that a script detection model is newly added to the server.
When a script detection model is newly added and configured on the monitoring terminal, multiple conditions and operators need to be configured to compose the newly added script detection model, which is used to quality-inspect the to-be-detected video data. When each detection condition is configured, the following information can be configured:
1) the name of the script detection model, for example "customer-consent solicitation model";
2) the logical relationship between conditions, for example condition 1 && condition 2 && condition 3;
For example, when configuring condition 1, configure the condition content, the condition range, and the logical relationship between the operators, and finally configure the type of each operator in condition 1 (keyword or regular expression) and the specific configuration content of each operator; condition 2, condition 3, and even more conditions are configured with reference to the configuration of condition 1. The condition content of condition 1 may be filled in as "none" or as another condition; the condition range may be filled in as "full text" or as another range; the logical relationships between operators are connected through AND, OR, and NOT; and the specific content of an operator is determined by its type (for example, if the operator type is keyword, the specific content of the operator is a keyword set containing several keywords).
In one embodiment, as shown in FIG. 4, the speech-script detection apparatus 100 based on multi-level logic further includes:
a new-model text recognition unit 150, configured to receive the newly added script detection model uploaded by the monitoring terminal, perform speech-to-text content recognition on the to-be-detected video data according to the newly added script detection model to obtain a current recognition result, and send the current recognition result to the monitoring terminal for storage.
In this embodiment, after the monitoring terminal receives the prompt information sent by the server indicating that no applicable detection model is stored, and once the configuration of the newly added script detection model is complete, the newly added script detection model is uploaded to the server. The server receives the newly added script detection model uploaded by the monitoring terminal and performs speech-to-text content recognition on the to-be-detected video data according to it, obtaining the current recognition result. For the complete process, refer to the data processing of the text recognition unit 130.
This apparatus performs speech-script detection through a script detection model with multi-level logic, so detection accuracy is improved; moreover, when no suitable script detection model exists in the server, the monitoring terminal can be promptly prompted to configure a suitable model in time, so that script quality inspection is achieved.
The speech-script detection apparatus based on multi-level logic described above can be implemented in the form of a computer program, and the computer program can run on a computer device as shown in FIG. 6.
Referring to FIG. 6, FIG. 6 is a schematic block diagram of a computer device provided by an embodiment of this application. The computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
Referring to FIG. 6, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When the computer program 5032 is executed, it can cause the processor 502 to perform the speech-script detection method based on multi-level logic.
The processor 502 provides computing and control capabilities and supports the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, it can cause the processor 502 to perform the speech-script detection method based on multi-level logic.
The network interface 505 is used for network communication, such as providing transmission of data information. Those skilled in the art can understand that the structure shown in FIG. 6 is only a block diagram of part of the structure related to the solution of this application and does not limit the computer device 500 to which the solution of this application is applied; a specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the speech-script detection method based on multi-level logic disclosed in the embodiments of this application.
Those skilled in the art can understand that the embodiment of the computer device shown in FIG. 6 does not limit the specific configuration of the computer device; in other embodiments, the computer device may include more or fewer components than shown, combine certain components, or arrange components differently. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 6 and are not repeated here.
It should be understood that, in the embodiments of this application, the processor 502 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like; the general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
In another embodiment of this application, a computer-readable storage medium is provided. The computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium stores a computer program, where the computer program, when executed by a processor, implements the speech-script detection method based on multi-level logic disclosed in the embodiments of this application.
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的设备、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
在本申请所提供的几个实施例中,应该理解到,所揭露的设备、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为逻辑功能划分,实际实现时可以有另外的划分方式,也可以将具有相同功能的单元集合成一个单元,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口、装置或单元的间接耦合或通信连接,也可以是电的,机械的或其它的形式连接。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本申请实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以是两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分,或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。

Claims (20)

  1. A multi-level-logic-based speech script detection method, comprising:
    receiving video data to be detected and user account information uploaded by a user terminal;
    acquiring current scene identification information according to the user account information, and judging whether a script detection model corresponding to the current scene identification information is stored among stored script detection models, wherein the script detection model includes several detection conditions, each detection condition includes several detection operators, and the script detection model is used to perform quality inspection on the text content corresponding to the video data to be detected;
    if a script detection model corresponding to the current scene identification information is stored among the stored script detection models, performing speech text content recognition on the video data to be detected according to the script detection model to obtain a recognition result; and
    if no script detection model corresponding to the current scene identification information is stored among the stored script detection models, sending prompt information indicating that no applicable detection model is stored to a monitoring terminal.
  2. The multi-level-logic-based speech script detection method according to claim 1, wherein the performing speech text content recognition on the video data to be detected according to the script detection model to obtain a recognition result comprises:
    performing audio/video data separation on the video data to be detected to obtain audio data and video data corresponding to the video data to be detected;
    recognizing the audio data by an N-gram model to obtain a text recognition result, wherein the N-gram model is a pre-trained N-gram model;
    segmenting the text recognition result by a probability-statistics-based word segmentation model to obtain a corresponding word segmentation result;
    acquiring the detection conditions included in the script detection model, wherein each detection condition includes at least one detection operator;
    judging whether the word segmentation result satisfies the detection conditions; and
    if the word segmentation result does not satisfy the detection conditions, setting the recognition result of the video data to be detected as failing the script quality inspection, sending the video data to be detected to a preset failed-quality-inspection data storage area for storage, and sending the video data to be detected and the corresponding user account information to the monitoring terminal.
  3. The multi-level-logic-based speech script detection method according to claim 2, wherein after the judging whether the word segmentation result satisfies the detection conditions, the method further comprises:
    if the word segmentation result satisfies the detection conditions, setting the recognition result of the video data to be detected as passing the script quality inspection, and sending the video data to be detected to a preset passed-quality-inspection data storage area for storage.
  4. The multi-level-logic-based speech script detection method according to claim 2, further comprising:
    receiving a training corpus, and inputting the training corpus into an initial N-gram model for training to obtain the N-gram model.
  5. The multi-level-logic-based speech script detection method according to claim 2, wherein after the acquiring the detection conditions included in the script detection model, the method further comprises:
    acquiring the condition relationships among the detection conditions in the script detection model, wherein the condition relationships include AND, OR, and NOT relationships;
    acquiring the detection operators included in each detection condition of the script detection model, wherein a detection operator is a keyword or a regular expression; and
    acquiring the operator relationships among the detection operators included in each detection condition, wherein the operator relationships include AND, OR, and NOT relationships.
  6. The multi-level-logic-based speech script detection method according to claim 1, wherein after the sending prompt information indicating that no applicable detection model is stored to the monitoring terminal if no script detection model corresponding to the current scene identification information is stored among the stored script detection models, the method further comprises:
    receiving a newly added script detection model uploaded by the monitoring terminal, performing speech text content recognition on the video data to be detected according to the newly added script detection model to obtain a current recognition result, and sending the current recognition result to the monitoring terminal for storage.
  7. The multi-level-logic-based speech script detection method according to claim 1, wherein, in the acquiring current scene identification information according to the user account information and judging whether a script detection model corresponding to the current scene identification information is stored among the stored script detection models, each of the multiple stored script detection models is correspondingly provided with a detection scene to which that model applies.
  8. The multi-level-logic-based speech script detection method according to claim 1, wherein the detection conditions in the script detection model include an empty detection condition and a nested detection condition.
  9. A multi-level-logic-based speech script detection apparatus, comprising:
    a data receiving unit, configured to receive video data to be detected and user account information uploaded by a user terminal;
    a model judgment unit, configured to acquire current scene identification information according to the user account information, and judge whether a script detection model corresponding to the current scene identification information is stored among stored script detection models, wherein the script detection model includes several detection conditions, each detection condition includes several detection operators, and the script detection model is used to perform quality inspection on the text content corresponding to the video data to be detected;
    a text recognition unit, configured to, if a script detection model corresponding to the current scene identification information is stored among the stored script detection models, perform speech text content recognition on the video data to be detected according to the script detection model to obtain a recognition result; and
    a model addition prompting unit, configured to, if no script detection model corresponding to the current scene identification information is stored among the stored script detection models, send prompt information indicating that no applicable detection model is stored to a monitoring terminal.
  10. A computer device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the following steps:
    receiving video data to be detected and user account information uploaded by a user terminal;
    acquiring current scene identification information according to the user account information, and judging whether a script detection model corresponding to the current scene identification information is stored among stored script detection models, wherein the script detection model includes several detection conditions, each detection condition includes several detection operators, and the script detection model is used to perform quality inspection on the text content corresponding to the video data to be detected;
    if a script detection model corresponding to the current scene identification information is stored among the stored script detection models, performing speech text content recognition on the video data to be detected according to the script detection model to obtain a recognition result; and
    if no script detection model corresponding to the current scene identification information is stored among the stored script detection models, sending prompt information indicating that no applicable detection model is stored to a monitoring terminal.
  11. The computer device according to claim 10, wherein the performing speech text content recognition on the video data to be detected according to the script detection model to obtain a recognition result comprises:
    performing audio/video data separation on the video data to be detected to obtain audio data and video data corresponding to the video data to be detected;
    recognizing the audio data by an N-gram model to obtain a text recognition result, wherein the N-gram model is a pre-trained N-gram model;
    segmenting the text recognition result by a probability-statistics-based word segmentation model to obtain a corresponding word segmentation result;
    acquiring the detection conditions included in the script detection model, wherein each detection condition includes at least one detection operator;
    judging whether the word segmentation result satisfies the detection conditions; and
    if the word segmentation result does not satisfy the detection conditions, setting the recognition result of the video data to be detected as failing the script quality inspection, sending the video data to be detected to a preset failed-quality-inspection data storage area for storage, and sending the video data to be detected and the corresponding user account information to the monitoring terminal.
  12. The computer device according to claim 11, wherein after the judging whether the word segmentation result satisfies the detection conditions, the steps further comprise:
    if the word segmentation result satisfies the detection conditions, setting the recognition result of the video data to be detected as passing the script quality inspection, and sending the video data to be detected to a preset passed-quality-inspection data storage area for storage.
  13. The computer device according to claim 11, wherein the steps further comprise:
    receiving a training corpus, and inputting the training corpus into an initial N-gram model for training to obtain the N-gram model.
  14. The computer device according to claim 11, wherein after the acquiring the detection conditions included in the script detection model, the steps further comprise:
    acquiring the condition relationships among the detection conditions in the script detection model, wherein the condition relationships include AND, OR, and NOT relationships;
    acquiring the detection operators included in each detection condition of the script detection model, wherein a detection operator is a keyword or a regular expression; and
    acquiring the operator relationships among the detection operators included in each detection condition, wherein the operator relationships include AND, OR, and NOT relationships.
  15. The computer device according to claim 10, wherein after the sending prompt information indicating that no applicable detection model is stored to the monitoring terminal if no script detection model corresponding to the current scene identification information is stored among the stored script detection models, the steps further comprise:
    receiving a newly added script detection model uploaded by the monitoring terminal, performing speech text content recognition on the video data to be detected according to the newly added script detection model to obtain a current recognition result, and sending the current recognition result to the monitoring terminal for storage.
  16. The computer device according to claim 10, wherein, in the acquiring current scene identification information according to the user account information and judging whether a script detection model corresponding to the current scene identification information is stored among the stored script detection models, each of the multiple stored script detection models is correspondingly provided with a detection scene to which that model applies.
  17. The computer device according to claim 10, wherein the detection conditions in the script detection model include an empty detection condition and a nested detection condition.
  18. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following operations:
    receiving video data to be detected and user account information uploaded by a user terminal;
    acquiring current scene identification information according to the user account information, and judging whether a script detection model corresponding to the current scene identification information is stored among stored script detection models, wherein the script detection model includes several detection conditions, each detection condition includes several detection operators, and the script detection model is used to perform quality inspection on the text content corresponding to the video data to be detected;
    if a script detection model corresponding to the current scene identification information is stored among the stored script detection models, performing speech text content recognition on the video data to be detected according to the script detection model to obtain a recognition result; and
    if no script detection model corresponding to the current scene identification information is stored among the stored script detection models, sending prompt information indicating that no applicable detection model is stored to a monitoring terminal.
  19. The computer-readable storage medium according to claim 18, wherein the performing speech text content recognition on the video data to be detected according to the script detection model to obtain a recognition result comprises:
    performing audio/video data separation on the video data to be detected to obtain audio data and video data corresponding to the video data to be detected;
    recognizing the audio data by an N-gram model to obtain a text recognition result, wherein the N-gram model is a pre-trained N-gram model;
    segmenting the text recognition result by a probability-statistics-based word segmentation model to obtain a corresponding word segmentation result;
    acquiring the detection conditions included in the script detection model, wherein each detection condition includes at least one detection operator;
    judging whether the word segmentation result satisfies the detection conditions; and
    if the word segmentation result does not satisfy the detection conditions, setting the recognition result of the video data to be detected as failing the script quality inspection, sending the video data to be detected to a preset failed-quality-inspection data storage area for storage, and sending the video data to be detected and the corresponding user account information to the monitoring terminal.
  20. The computer-readable storage medium according to claim 19, wherein after the judging whether the word segmentation result satisfies the detection conditions, the operations further comprise:
    if the word segmentation result satisfies the detection conditions, setting the recognition result of the video data to be detected as passing the script quality inspection, and sending the video data to be detected to a preset passed-quality-inspection data storage area for storage.
PCT/CN2021/077577 2020-04-24 2021-02-24 Multi-level-logic-based speech script detection method and apparatus, computer device, and storage medium WO2021212998A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010332843.7 2020-04-24
CN202010332843.7A CN111639529A (zh) 2020-04-24 2020-04-24 Multi-level-logic-based speech script detection method and apparatus, and computer device

Publications (1)

Publication Number Publication Date
WO2021212998A1 true WO2021212998A1 (zh) 2021-10-28

Family

ID=72332800

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/077577 WO2021212998A1 (zh) 2021-02-24 2020-04-24 Multi-level-logic-based speech script detection method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN111639529A (zh)
WO (1) WO2021212998A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639529A (zh) 2020-04-24 2020-09-08 深圳壹账通智能科技有限公司 Multi-level-logic-based speech script detection method and apparatus, and computer device
CN113641795A (zh) * 2021-08-20 2021-11-12 上海明略人工智能(集团)有限公司 Method and apparatus for script statistics, electronic device, and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10031910B1 * 2013-03-12 2018-07-24 Guangsheng Zhang System and methods for rule-based sentiment analysis
CN108962282A * 2018-06-19 2018-12-07 京北方信息技术股份有限公司 Speech detection and analysis method and apparatus, computer device, and storage medium
CN109448730A * 2018-11-27 2019-03-08 广州广电运通金融电子股份有限公司 Automatic speech quality inspection method, system, apparatus, and storage medium
CN109729383A * 2019-01-04 2019-05-07 深圳壹账通智能科技有限公司 Dual-recording video quality detection method and apparatus, computer device, and storage medium
CN110197672A * 2018-02-27 2019-09-03 招商信诺人寿保险有限公司 Voice call quality detection method, server, and storage medium
CN110364183A * 2019-07-09 2019-10-22 深圳壹账通智能科技有限公司 Speech quality inspection method and apparatus, computer device, and storage medium
CN111639529A (zh) * 2020-04-24 2020-09-08 深圳壹账通智能科技有限公司 Multi-level-logic-based speech script detection method and apparatus, and computer device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116050383A * 2023-03-29 2023-05-02 珠海金智维信息科技有限公司 Method and system for detecting off-channel-sales scripts in the wealth management product sales process
CN117271773A * 2023-09-26 2023-12-22 人民法院信息技术服务中心 Dialogue scene recognition method, apparatus, device, and storage medium for court trials
CN117271773B * 2023-09-26 2024-03-12 人民法院信息技术服务中心 Dialogue scene recognition method, apparatus, device, and storage medium for court trials

Also Published As

Publication number Publication date
CN111639529A (zh) 2020-09-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21793645

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM1205A DATED 14.02.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21793645

Country of ref document: EP

Kind code of ref document: A1