WO2020233440A1 - Order processing method and apparatus, electronic device, and storage medium - Google Patents

Order processing method and apparatus, electronic device, and storage medium

Info

Publication number
WO2020233440A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
feature vector
speech
paragraph
information
Prior art date
Application number
PCT/CN2020/089669
Other languages
English (en)
French (fr)
Inventor
葛檬
张睿雄
Original Assignee
北京嘀嘀无限科技发展有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京嘀嘀无限科技发展有限公司 filed Critical 北京嘀嘀无限科技发展有限公司
Publication of WO2020233440A1 publication Critical patent/WO2020233440A1/zh

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/26: Speech to text systems
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use

Definitions

  • This application relates to the field of computer technology, and specifically to an order processing method and apparatus, an electronic device, and a storage medium.
  • A service requester can obtain a corresponding ride-hailing service by sending a request through a ride-hailing APP.
  • When the ride-hailing platform receives a travel request initiated by the service requester, it matches the user with a service provider and provides the corresponding travel service.
  • During order dispatch, the service provider has the right to accept or reject the order.
  • Studies have found that the probability of an accident is relatively high when a service requester travels alone while intoxicated: for example, a drunk requester may disturb the service provider's normal driving, or may even threaten the service provider's personal safety.
  • In existing schemes, the service provider can only judge whether the requester is drunk after the requester gets in the vehicle; the requester's state cannot be predicted beforehand, so risk control for the safety of the riding environment cannot be carried out in advance.
  • The purpose of this application is to provide an order processing method, apparatus, electronic device, and storage medium that can perform risk control on the safety of the riding environment in advance, thereby improving the safety of the overall riding environment.
  • an order processing device including:
  • the obtaining module is used to obtain the voice information sent by the service requesting end after the service provider receives the service request and triggers the voice obtaining request, and transmits the voice information to the determining module;
  • the determining module is used to extract the voice feature vector and the speaking rate feature vector of the voice information, determine the current status information of the requester using the service requester based on the voice feature vector and the speaking rate feature vector, and transmit the status information to the prompt module;
  • the status information includes information indicating whether the requester is currently in a drunk state;
  • the prompt module is configured to prompt the service provider to confirm whether to accept the order based on the status information.
  • In a possible implementation, the apparatus further includes a processing module which, after the acquiring module acquires the voice information sent by the service requester and before the determining module extracts the voice feature vector and the speech rate feature vector of the voice information, is used for: performing voice endpoint detection on the voice information and deleting the silent paragraphs in the voice information.
  • the determining module is specifically configured to extract the feature vector of the voice information according to the following steps:
  • Framing processing is performed on each speech paragraph in the speech information to obtain a speech frame corresponding to each speech paragraph;
  • For each voice paragraph, extract the voice frame feature of each voice frame in the voice paragraph and the voice frame feature difference between the voice frame and its neighboring voice frames, and determine the first paragraph speech feature vector of the speech paragraph based on the voice frame features, the voice frame feature differences, and a preset speech paragraph feature function;
  • the determining module is specifically configured to extract the voice feature of the voice information based on the voice feature vector of the first paragraph corresponding to each voice paragraph of the voice information according to the following steps:
  • For each speech paragraph, determine the differential speech feature vector of the speech paragraph based on the first paragraph speech feature vector of the speech paragraph and the pre-stored awake state speech feature vector;
  • the speech feature vectors of each second paragraph are combined to obtain the speech feature vector of the speech information.
  • the determining module is specifically configured to extract the speech rate feature vector of the voice information according to the following steps:
  • the speech rate feature vector of the voice information is extracted.
  • the determining module is specifically configured to determine the current status information of the requester using the service requester based on the speech feature vector and the speech rate feature vector according to the following steps:
  • Based on the voice feature vector, a first score feature vector indicating the drunk state and a second score feature vector indicating the non-drunk state are determined for the voice information; the first score feature vector includes, for each voice paragraph in the voice information, the probability value of indicating the drunk state, and the second score feature vector includes, for each voice paragraph in the voice information, the probability value of indicating the non-drunk state;
  • Based on the first score feature vector, the second score feature vector, and the speech rate feature vector, the current state information of the requester is determined.
  • the determining module is specifically configured to determine, based on the voice feature vector, a first score feature vector indicating an intoxication state and a second score feature vector indicating a non-drunken state in the voice information according to the following steps:
  • After the score feature vector and the speaking rate feature vector are merged, they are input into the state-level classifier in the voice recognition model to determine the current state information of the requester.
  • it further includes a model training module, which is used to train the voice recognition model according to the following steps:
  • the training sample library including a plurality of training voice feature vectors corresponding to training voice information, training speech rate feature vectors, and state information corresponding to each training voice information;
  • The training voice feature vector of each piece of training voice information is input into the segment-level classifier in turn to obtain the first training score feature vector and the second training score feature vector corresponding to that training voice information; the first training score feature vector, the second training score feature vector, and the training speech rate feature vector are used as input variables of the state-level classifier, the state information corresponding to the training voice information is used as the output variable of the voice recognition model, and the model parameter information of the voice recognition model is obtained by training.
  • it further includes a model testing module, which is used to test the voice recognition model according to the following steps:
  • test sample library including a plurality of test speech feature vectors corresponding to test speech information, test speech rate feature vectors, and real state information corresponding to each test speech information;
  • Each test speech feature vector in the test sample library is input into the segment-level classifier of the voice recognition model in turn to obtain the first test score feature vector and the second test score feature vector corresponding to each test voice in the test sample library;
  • the first test score feature vector, the second test score feature vector, and the test speech rate feature vector corresponding to each test speech in the test sample library are input into the state level classifier of the voice recognition model to obtain the test Test status information corresponding to each test voice in the sample library;
  • an order processing method including:
  • the voice information sent by the service requester is obtained;
  • an embodiment of the present application provides an electronic device, including: a processor, a storage medium, and a bus.
  • the storage medium stores machine-readable instructions executable by the processor.
  • the processor and the storage medium communicate via a bus, and the processor executes the machine-readable instructions to execute the steps of the order processing method described in the second aspect.
  • An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; the computer program performs the steps of the order processing method described in the first aspect when run by a processor.
  • the embodiments of the application provide an order processing method, device, server, and computer-readable storage medium.
  • After the service provider receives the service request and triggers the voice acquisition request, the voice information sent by the service requester is obtained; the voice feature vector and the speech rate feature vector of the voice information are then extracted, and the current status information of the requester at the service requester is determined based on them, that is, whether the requester is in a drunk state; based on this information, the service provider is prompted to confirm whether to accept the order. By determining in advance whether the requester is drunk and prompting the service provider accordingly, risk control can be performed on the safety of the riding environment in advance, thereby improving the safety of the overall riding environment.
  • FIG. 1 shows a schematic structural diagram of an order processing system provided by an embodiment of the present application
  • FIG. 2 shows a flowchart of an order processing method provided by an embodiment of the present application
  • FIG. 3 shows a flowchart of the first method for extracting voice feature vectors of voice information provided by an embodiment of the present application
  • FIG. 4 shows a flowchart of a second method for extracting voice feature vectors of voice information provided by an embodiment of the present application
  • FIG. 5 shows a flowchart of a method for extracting a speech rate feature vector of speech information provided by an embodiment of the present application
  • FIG. 6 shows a flowchart of a method for determining the current status information of a requester using a service requester based on a voice feature vector and a speech rate feature vector provided by an embodiment of the present application
  • FIG. 7 shows a schematic diagram of a precision-recall rate curve of a voice recognition model provided by an embodiment of the present application
  • FIG. 8 shows a schematic structural diagram of an order processing device provided by an embodiment of the present application.
  • FIG. 9 shows a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • The term "passenger" in this application is used interchangeably to refer to an individual, entity, or tool that can request or order a service.
  • The term "driver" is used interchangeably to refer to an individual, entity, or tool that can provide services.
  • The term "service provider" is used interchangeably to refer to an individual, entity, or tool that can provide services.
  • The term "user" in this application may refer to an individual, entity, or tool that requests, subscribes to, provides, or facilitates the provision of services. For example, the user may be a passenger, driver, operator, etc., or any combination thereof.
  • "Passenger" and "passenger terminal" can be used interchangeably, and "driver" and "driver terminal" can be used interchangeably.
  • "Service request" and "order" in this application can be used interchangeably to refer to a request initiated by a passenger, service requester, driver, service provider, or supplier, or any combination thereof.
  • The party that accepts the "service request" or "order" may be a passenger, a service requester, a driver, a service provider, or a supplier, etc., or any combination thereof. Service requests can be paid or free.
  • The positioning technology used in this application can be based on the Global Positioning System (GPS), the Global Navigation Satellite System (GLONASS), the Compass Navigation System (COMPASS), the Galileo Positioning System, the Quasi-Zenith Satellite System (QZSS), Wireless Fidelity (WiFi) positioning technology, etc., or any combination thereof.
  • One aspect of this application relates to an order processing system.
  • By processing the voice information sent by the service requester terminal, the system can determine whether the requester using the service requester terminal is currently in a drunk state, and prompt the service provider, according to the determined state, to confirm whether to accept the order sent by the service requester.
  • Before this application, the service provider could only judge whether the requester was drunk after the requester got in the vehicle and could not prejudge the requester's state beforehand, so risk control for the safety of the riding environment could not be carried out in advance.
  • The order processing system provided by this application can determine whether the service requester is drunk before the requester gets in the vehicle and then prompt the service provider in advance. Therefore, by prompting the service provider in advance, the order processing system of the present application can perform risk control in advance for the safety of the riding environment.
  • FIG. 1 is a schematic structural diagram of an order processing system 100 provided by an embodiment of the present application.
  • the order processing system 100 may be an online transportation service platform for transportation services such as taxis, driving services, express cars, carpooling, bus services, driver leasing, or shuttle services, or any combination thereof.
  • the order processing system 100 may include one or more of a server 101, a network 102, a service requester 103, a service provider 104, and a database 105.
  • the server 101 may include a processor.
  • the processor may process information and/or data related to the service request to perform one or more functions described in this application.
  • The processor may include one or more processing cores (e.g., a single-core or multi-core processor).
  • The processor may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), an application-specific instruction-set processor (ASIP), a graphics processing unit (GPU), a physics processing unit (PPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a controller, a microcontroller unit, a reduced instruction set computer (RISC), or a microprocessor, etc., or any combination thereof.
  • The device types corresponding to the service request terminal 103 and the service provider terminal 104 may be mobile devices, such as smart home devices, wearable devices, smart mobile devices, virtual reality devices, or augmented reality devices, or may be a tablet computer, a laptop computer, or a built-in device in a motor vehicle.
  • the database 105 may be connected to the network 102 to communicate with one or more components in the order processing system 100 (eg, the server 101, the service requester 103, the service provider 104, etc.). One or more components in the order processing system 100 can access data or instructions stored in the database 105 via the network 102.
  • The database 105 may be directly connected to one or more components in the order processing system 100, or the database 105 may also be a part of the server 101.
  • FIG. 2 is a schematic flowchart of an order processing method provided by this embodiment of the application.
  • the method may be executed by a server or a service provider in the order processing system 100.
  • the specific execution process includes the following steps S201 to S203:
  • the service provider here can be a mobile device on the driver’s side, and the service request side can be a mobile device on the passenger’s side.
  • The mobile devices here can include, for example, smart home devices, wearable devices, smart mobile devices, virtual reality devices, or augmented reality devices, or may be a tablet computer, a laptop computer, or a built-in device in a motor vehicle.
  • the service request here can be understood as a car request or an order request.
  • a passenger sends a car request through a car application (APP) on a mobile terminal.
  • The ride request can carry the passenger's current location information and a communication address, for example, the number of a mobile terminal.
  • When the driver terminal receives the service request sent by the passenger terminal, it can trigger the voice acquisition request through the communication address in the service request.
  • When the driver terminal establishes a voice call connection with the passenger terminal, the server or the driver terminal can acquire the voice information sent by the service request terminal, that is, the passenger terminal.
  • The voice information here mainly refers to the passenger's voice, that is, the passenger's voice obtained after the voice call connection between the driver terminal and the passenger terminal has been established, and includes, for example, information such as the passenger's current position and intended destination as described by the passenger.
  • The voice information may contain silent paragraphs that contain no speech. Removing these silent paragraphs improves the subsequent recognition of the passenger's state from the voice information and also prevents the intrusion of dirty data. Therefore, after obtaining the voice information of the passenger terminal, the order processing method of the embodiment of the present application further includes voice endpoint detection.
  • Voice endpoint detection can remove silent paragraphs in the voice information, for example blank paragraphs caused by the passenger pausing to listen, think, or rest, as well as noise paragraphs that clearly do not belong to the passenger's voice, such as car horns, wind, and rain.
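  • As an illustration, a minimal sketch of energy-based endpoint detection follows; the patent does not name a specific VAD algorithm, so the sampling rate, frame length, and energy threshold below are all assumptions.

```python
import numpy as np

def remove_silence(signal, sr=16000, frame_ms=10, energy_ratio=0.1):
    """Drop frames whose short-term energy falls below a fraction of the
    mean energy: a crude stand-in for voice endpoint detection, under
    which silent paragraphs (pauses, thinking) are removed."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)            # short-term energy per frame
    keep = energy > energy_ratio * energy.mean()  # keep only active frames
    return frames[keep].reshape(-1)
```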
  • S202 Extract the voice feature vector and the speech rate feature vector of the voice information, and based on the voice feature vector and the speech rate feature vector, determine the current status information of the requester using the service requester; the status information includes indicating whether the requester is currently drunk Status information.
  • The voice information here can refer to the voice information after the silent paragraphs have been removed; the voice feature vector expressing the human acoustic characteristics of the voice information is then extracted, together with the speech rate feature vector formed from the speech rate of the passenger's narration.
  • The information indicating whether the requester is currently in a drunk state can be represented by a set number, for example 1001 for the drunk state and 1002 for the non-drunk state, or by text, that is, expressed directly as "drunk" or "not drunk".
  • S203 Based on the status information, prompt the service provider to confirm whether to accept the order.
  • the service provider can be prompted by voice whether to accept the order, or the service provider can be controlled to display a trigger button for accepting the order, so that the driver can make a choice independently.
  • For example, the driver can be given a voice prompt of "The passenger is in a drunk state, please confirm whether to take the order", or the driver's mobile terminal can display "This passenger is in a drunk state, please decide whether to take the order", so that the driver learns the passenger's current state in advance and can take corresponding measures to control the safety of the riding environment in advance.
  • the voice feature vector of the voice information can be extracted according to the following process, which specifically includes steps S301 to S303:
  • S301 Perform framing processing on each voice paragraph in the voice information to obtain a voice frame corresponding to each voice paragraph.
  • Each voice paragraph in the voice information is divided into frames; for example, at an interval of 10 ms, each voice paragraph is divided into multiple voice frames.
  • The voice frame features here can include acoustic features such as the fundamental frequency, Mel-frequency cepstral coefficients, zero-crossing rate, harmonic-to-noise ratio, and energy of the voice frame. In the embodiments of the present application, the voice frame feature difference between a voice frame and its neighboring frames may refer to the differences in these acoustic features between the frame and the previous frame: the fundamental frequency difference, the Mel-frequency cepstral coefficient differences, the zero-crossing rate difference, the harmonic-to-noise ratio difference, and the energy difference.
  • The fundamental frequency (F0) is the vibration frequency of the fundamental tone, which determines the pitch of the voice. In practical applications, the highest pitch is often used to represent the fundamental frequency; here, extracting the fundamental frequency feature of a speech frame means extracting the pitch of that speech frame.
  • The Mel-frequency cepstral coefficients (MFCC) in the embodiment of the present application number 12, that is, the voice frame feature contains 12 Mel-frequency cepstral coefficients.
  • Zero crossing rate represents the number of times the signal passes through the zero point in a unit sampling point.
  • The calculation formula is:

    $$ZCR_n = \frac{1}{2N}\sum_{m=1}^{N-1}\left|\operatorname{sgn}\big(x_n(m)\big)-\operatorname{sgn}\big(x_n(m-1)\big)\right|$$

    where N represents the frame length of the speech frame (or the number of sampling points in the speech frame; for example, if speech frames are collected at an interval of 10 ms with one sampling point per 1 ms, the frame length here is 10), and x_n(m) represents the m-th signal sampling point of the n-th frame.
  • The zero-crossing rate is often used to distinguish between unvoiced and voiced sounds: the former has a higher zero-crossing rate and the latter a lower one.
  • The harmonic-to-noise ratio (HNR) measures the ratio between the harmonic (periodic) component and the noise component of a speech frame and can be computed from the autocorrelation coefficient (ACF), for example as

    $$HNR = 10\lg\frac{ACF(T_0)}{ACF(0)-ACF(T_0)}$$

    where T_0 represents the fundamental period, measured in units of the interval between two adjacent sampling points in the same speech frame.
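  • For illustration, a textbook autocorrelation-based estimate of the fundamental frequency and HNR of a single voiced frame might look as follows; the 75-400 Hz search band and the 16 kHz sampling rate are assumptions, not values from the patent.

```python
import numpy as np

def f0_and_hnr(frame, sr=16000, fmin=75, fmax=400):
    """Estimate pitch (F0) and harmonic-to-noise ratio from the frame's
    autocorrelation: T0 is the lag of the ACF peak inside the pitch band,
    and HNR = 10*log10(ACF(T0) / (ACF(0) - ACF(T0)))."""
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    t0 = lo + int(np.argmax(acf[lo:hi]))  # lag of the fundamental period
    f0 = sr / t0
    # a production implementation would guard against ACF(0) <= ACF(T0)
    hnr = 10.0 * np.log10(acf[t0] / (acf[0] - acf[t0]))
    return f0, hnr
```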
  • The short-term energy of the speech signal is calculated as the sum of the squares of the signal values of each point in a frame, namely:

    $$E_n = \sum_{m=1}^{N} x_n(m)^2$$
  • The energy reflects the amplitude of the audio signal. Through energy, one can roughly determine how much information each frame contains; energy can also be used to distinguish voiced segments from unvoiced segments, and to find the boundary between voiced and silent segments.
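  • The zero-crossing rate and short-term energy formulas above translate directly into code; this sketch assumes a frame is a one-dimensional NumPy array of samples.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Number of sign changes per sampling point within one frame."""
    return np.abs(np.diff(np.sign(frame))).sum() / (2.0 * len(frame))

def short_term_energy(frame):
    """Sum of the squared signal values of each point in the frame."""
    return float(np.sum(frame ** 2))
```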
  • the voice characteristics of the requester's voice information can be expressed, so as to serve as a reference for subsequent determination of the requester's status information.
  • The obtained speech frame feature vector of any speech frame is the feature vector composed of the aforementioned speech frame features and speech frame feature differences. Taking the n-th speech frame as an example, the obtained speech frame feature vector can be represented as:

    $$Y_n = \big(pitch_n,\ MFCC_{n,1},\dots,MFCC_{n,12},\ ZCR_n,\ HNR_n,\ E_n,\ \Delta pitch_n,\ \Delta MFCC_{n,1},\dots,\Delta MFCC_{n,12},\ \Delta ZCR_n,\ \Delta HNR_n,\ \Delta E_n\big)^T$$
  • The speech frame feature vector of the n-th speech frame is thus a 32-dimensional feature vector, where n can refer to any speech frame in any speech paragraph. If a speech paragraph includes 10 speech frames, then after obtaining the 10 speech frame feature vectors (Y_1 to Y_10) corresponding to those frames, the speech frame features (or feature differences) of the same dimension across the 10 vectors are processed according to the preset speech paragraph feature functions, and the first paragraph speech feature vector of that paragraph is determined.
  • "The same dimension" refers to the same element position in the voice frame feature vectors: for example, each vector includes 32 elements, that is, it is a 32-dimensional vector, and elements belonging to the same dimension are the elements at the same element position in each vector.
  • For example, all the pitch features in the 10 speech frame feature vectors are processed through the average function, standard deviation function, kurtosis function, skewness function, maximum function, minimum function, relative maximum position function, relative minimum position function, range function, and linear regression function, to obtain twelve function values for the first-dimension pitch feature across the 10 vectors: the average, standard deviation, kurtosis, skewness, maximum, minimum, relative maximum position, relative minimum position, range, and the offset term, slope, and minimum mean square error of the linear regression function. That is, for the speech frame features (or feature differences) belonging to the same dimension across the frame feature vectors of a speech paragraph, the 12 function values corresponding to each dimension are calculated according to the above functions.
  • In this way, the function values corresponding to all the speech frame features and feature differences of a speech paragraph are obtained and arranged in a set order of features and function values to form the first paragraph speech feature vector of that paragraph. If each speech frame feature vector has 32 dimensions and each dimension yields 12 function values, the first paragraph speech feature vector is a 32 x 12 = 384-dimensional feature vector.
  • Within a speech paragraph, the speech frame features (or feature differences) in the same dimension across all speech frame feature vectors form one sample group. In the example above, the paragraph includes 10 speech frame feature vectors, each of 32 dimensions, so the pitch sample group contains 10 samples in total and the 32 dimensions give 32 sample groups.
  • The meanings of the average, standard deviation, kurtosis, skewness, maximum, minimum, relative maximum position, relative minimum position, range, offset term, slope, and minimum mean square error are introduced below:
  • The average value means the average of all samples in a sample group across all speech frame feature vectors in the speech paragraph; for example, the average for the pitch dimension is the average of all the pitch values in the paragraph.
  • the standard deviation means the standard deviation of each group of samples in all speech frame feature vectors in any speech paragraph.
  • Kurtosis is a statistic that describes how steep the sample distribution is compared to the normal distribution.
  • The kurtosis function can be expressed by the following (standard) formula:

    $$Kurt = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^4}{D^2}-3$$

    where D is the sample variance of the sample group across all speech frame feature vectors in the speech paragraph, and \bar{x} represents the sample mean of the sample group.
  • Skewness is similar to kurtosis. It is also a statistical value that describes the distribution of samples. It describes the symmetry of the sample distribution of a certain population.
  • The calculation formula for skewness is, correspondingly:

    $$Skew = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{D^{3/2}}$$
  • the maximum value and the minimum value respectively refer to the maximum and minimum values belonging to the same sample group.
  • The relative maximum position refers to the position, within the speech paragraph, of the speech frame to which the maximum value in the sample group belongs;
  • The relative minimum position refers to the position, within the speech paragraph, of the speech frame to which the minimum value in the sample group belongs;
  • The offset term, slope, and minimum mean square error refer to the intercept, slope, and minimum mean square error of the linear regression function fitted to the sample group.
  • the first paragraph speech feature vector of each speech paragraph in the speech information can be determined. If the speech information includes 10 speech paragraphs, 10 first paragraph speech feature vectors of 384 dimensions can be obtained.
  • the speech frame feature vector is converted into the first paragraph speech feature vector, that is, the feature vector of each speech paragraph in the speech information is obtained, so that the acoustic characteristics of the speech paragraph of the requester can be fully expressed, and the status information of the requester can be determined later.
  • The preset speech paragraph feature functions are not limited to the types given above; the several paragraph feature functions listed are only one specific embodiment.
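  • As a sketch, the mapping from one paragraph's frame feature vectors to its 384-dimensional first paragraph speech feature vector could be implemented as below. Treating "relative maximum/minimum position" as argmax/argmin normalized by the frame count, and using SciPy's sample kurtosis and skewness, are assumptions consistent with, but not dictated by, the text.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def first_paragraph_vector(frame_vectors):
    """frame_vectors: (n_frames, 32) array of frame feature vectors.
    Returns 32 dimensions x 12 functionals = 384 values."""
    X = np.asarray(frame_vectors, dtype=float)
    n = X.shape[0]
    t = np.arange(n)
    feats = []
    for d in range(X.shape[1]):                      # one sample group per dimension
        s = X[:, d]
        slope, offset = np.polyfit(t, s, 1)          # linear regression over time
        mmse = float(np.mean((np.polyval([slope, offset], t) - s) ** 2))
        feats += [s.mean(), s.std(), kurtosis(s), skew(s),
                  s.max(), s.min(),
                  np.argmax(s) / n, np.argmin(s) / n,  # relative positions
                  s.max() - s.min(),                   # range
                  offset, slope, mmse]
    return np.asarray(feats)                          # 384-dimensional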
  • S303 Extract a voice feature vector of the voice information based on the voice feature vector of the first paragraph corresponding to each voice paragraph of the voice information.
  • As shown in FIG. 4, step S303, extracting the voice feature vector of the voice information based on the first paragraph voice feature vector corresponding to each voice paragraph of the voice information, includes the following steps S401 to S403:
  • S401 For each speech paragraph, determine the differential speech feature vector of each speech paragraph based on the first paragraph speech feature vector of the speech paragraph and the pre-stored awake state speech feature vector.
  • The awake-state voice feature vector here is obtained from pre-collected voice information of a large number of passengers in the awake state: the voice information of each such passenger is processed in the above manner to obtain that passenger's first paragraph voice feature vectors, all of these first paragraph voice feature vectors are then averaged to obtain an average first paragraph voice feature vector, and this average vector is stored as the awake-state voice feature vector.
  • Each differential voice feature in the differential voice feature vector can be determined by the following formula:

    $$\Delta x_i = x_i - \tilde{x}_i,\qquad i = 1,\dots,K$$

    where x_i is the i-th element of the first paragraph speech feature vector, \tilde{x}_i is the i-th element of the awake-state speech feature vector, and K represents the (common) dimension of the differential feature vector, the first paragraph speech feature vector, and the awake-state speech feature vector. As i takes values from 1 to K in turn, the value at each element position of the differential feature vector, that is, the differential feature of each element position, is obtained.
  • the differential voice feature vector here is also a 384-dimensional feature vector, that is, the dimension of the differential voice feature vector is the same as the dimension of the voice feature vector of the first paragraph.
  • S402 Determine the second paragraph voice feature vector of each voice paragraph based on the first paragraph voice feature vector and the differential voice feature vector of each voice paragraph.
  • the differential speech feature vector of each speech paragraph is spliced with the first paragraph speech feature vector of the speech paragraph to obtain the second paragraph speech feature vector of each speech paragraph
  • the voice feature vector of the first paragraph is a 384-dimensional feature vector
  • the differential voice feature vector is also a 384-dimensional feature vector
  • the second paragraph's voice feature vector is recorded as a 768-dimensional feature vector.
  • S403: The second paragraph voice feature vectors are combined. If the voice information in the embodiment of the present application includes 10 voice paragraphs, 10 768-dimensional second paragraph voice feature vectors are obtained, and together these 10 vectors constitute the voice feature vector of the voice information.
  • The second paragraph voice feature vector here captures how the requester's voice paragraphs differ from other passengers' speech and is therefore discriminative; using the second paragraph voice feature vectors, the status information of the requester can be determined more accurately.
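  • Steps S401 and S402 reduce to a few array operations; this sketch assumes both inputs are 384-dimensional NumPy arrays.

```python
import numpy as np

def second_paragraph_vector(first_vec, awake_vec):
    """Splice the 384-dim first paragraph vector with its element-wise
    difference from the stored awake-state vector, giving 768 dims."""
    diff = first_vec - awake_vec   # differential voice feature vector
    return np.concatenate([first_vec, diff])
```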
  • the speech rate feature vector of the voice information can be extracted according to the following process, which specifically includes the following steps S501 to S504:
  • S501 Convert each voice paragraph in the voice information into text paragraphs, where each text paragraph includes multiple characters.
  • S502 Determine the speech rate of each text paragraph based on the number of characters corresponding to each text paragraph and the duration of the voice paragraph corresponding to the text paragraph.
  • In this way, the text paragraph corresponding to each voice paragraph is obtained.
  • S503 Determine the maximum speech rate, minimum speech rate, and average speech rate of the voice information based on the speech rate corresponding to each text paragraph.
  • S504 Extract a speech rate feature vector of the speech information based on the maximum speech rate, minimum speech rate and average speech rate of the speech information.
  • In this way, the speech rate of each text paragraph in the voice information can be calculated; the maximum speech rate, minimum speech rate, and average speech rate are then determined from these speech rates, and combining the maximum speech rate, minimum speech rate, and average speech rate yields a 3-dimensional speech rate feature vector.
  • The speech rate feature vector is used as a reference quantity for determining the status information of the requester in the embodiment of the application, which improves the accuracy of identifying the requester's status information.
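  • A sketch of steps S501 to S504 follows, assuming each paragraph has already been transcribed and that speech rate is measured in characters per second (the patent does not fix the unit).

```python
import numpy as np

def speech_rate_vector(paragraphs):
    """paragraphs: list of (n_chars, duration_seconds) pairs, one per
    transcribed text paragraph. Returns the 3-dim speech rate feature
    vector [max rate, min rate, average rate]."""
    rates = np.array([chars / dur for chars, dur in paragraphs])
    return np.array([rates.max(), rates.min(), rates.mean()])
```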
  • In the embodiment of the present application, the current status information of the requester using the service requester can be determined based on the voice feature vector and the speech rate feature vector as shown in FIG. 6, including the following steps S601 to S602:
  • S601 Determine, based on the voice feature vector, a first score feature vector indicating a drunk state and a second score feature vector indicating a non-drunk state in the voice information.
  • the first score feature vector includes the probability value of each voice paragraph in the voice information indicating the drunken state
  • the second score feature vector includes the probability value of each voice paragraph in the voice information indicating the non-drunk state.
  • The voice feature vector here includes multiple second paragraph voice feature vectors; each score feature in the first score feature vector represents the probability value that the corresponding second paragraph voice feature vector in the voice feature vector is in the drunk state, and each score feature in the second score feature vector represents the probability value that the corresponding second paragraph voice feature vector is in the non-drunk state.
  • the determination of the first score feature vector indicating the intoxication state and the second score feature vector indicating the non-drunken state of the voice information based on the voice feature vector includes:
  • the voice feature vector is input into the segment-level classifier in the pre-trained voice recognition model to obtain the first score feature vector indicating the intoxication state and the second score feature vector indicating the non-drunken state in the voice information.
  • In this way, a first score feature vector composed of the probability values of each second paragraph voice feature vector in the voice feature vector belonging to the drunk state, and a second score feature vector composed of the probability values of each second paragraph voice feature vector belonging to the non-drunk state, can be obtained.
  • S602 Determine current status information of the requestor based on the first score feature vector, the second score feature vector, and the speech rate feature vector.
  • the current state information of the requester can be determined by combining the speech rate feature vector mentioned above. This process specifically includes:
  • Specifically, the score feature vector representing the voice information can be obtained through the preset voice score feature functions, namely the maximum function, the minimum function, the average function, and the quantile function: for each of the first score feature vector and the second score feature vector, the maximum value, minimum value, average value, and nine deciles of its score features are calculated, giving 12 function values related to the first score feature vector and 12 function values related to the second score feature vector; splicing them yields a 24-dimensional score feature vector of the voice information.
  • The nine deciles here are the score features corresponding to the first through ninth decile points of all the score features in the score feature vector, that is, 9 quantile features.
  • After the score feature vector is merged with the speech rate feature vector, a 27-dimensional feature vector is obtained, consisting of the 24-dimensional score feature vector and the 3-dimensional speech rate feature vector.
  • the score feature vector and the speech rate feature vector are input into the state level classifier in the voice recognition model to determine the current state information of the requester.
  • Alternatively, the first score feature vector, the second score feature vector, and the speech rate feature vector can be input directly into the state-level classifier in the voice recognition model; the state-level classifier then determines the score feature vector of the voice information based on the first score feature vector, the second score feature vector, and the preset voice score feature functions, and combines the score feature vector with the speech rate feature vector to determine the current status information of the requester.
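  • A sketch of assembling the 27-dimensional state-level classifier input from the two paragraph-score vectors and the speech rate vector follows; using NumPy percentiles for the nine deciles is an assumption.

```python
import numpy as np

def state_classifier_input(score_drunk, score_sober, speech_rate_vec):
    """12 functionals (max, min, mean, deciles 10%..90%) of each score
    vector give 24 dims; spliced with the 3-dim speech rate vector: 27."""
    def twelve(scores):
        scores = np.asarray(scores, dtype=float)
        deciles = np.percentile(scores, np.arange(10, 100, 10))  # 9 values
        return np.concatenate([[scores.max(), scores.min(), scores.mean()],
                               deciles])
    return np.concatenate([twelve(score_drunk), twelve(score_sober),
                           np.asarray(speech_rate_vec)])
```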
  • In this way, the current state information of the requester is determined jointly by the voice feature vector and the speech rate feature vector: on the one hand the requester's current acoustic characteristics are considered, and on the other hand the requester's current speech rate characteristics are considered, so whether the requester is currently in a drunk state can be determined more accurately.
  • the training sample database here can include the training voice information of a large number of passengers.
  • The status information of these passengers is known; for example, the library can include 1000 passengers in the drunk state and 1000 passengers in the non-drunk state, so that the training voice feature vectors and training speech rate feature vectors corresponding to 2000 pieces of training voice information are obtained.
  • Each training speech feature vector here may consist of at least one second training paragraph speech feature vector.
  • For example, each training voice feature vector consists of 10 second training paragraph voice feature vectors; the process of determining a second training paragraph voice feature vector is similar to the process of determining the second paragraph voice feature vector described above and is not repeated here.
  • The process of determining the training speech rate feature vector of each piece of training voice information is likewise similar to the process of determining the speech rate feature vector of the voice information described above and is not repeated here.
  • the voice recognition model in the embodiment of the present application may include two classifiers.
  • The first is a segment-level classifier: its input can be a training voice feature vector, and its output is the first training score feature vector, composed of the probability values of each second training paragraph voice feature vector in the training voice feature vector belonging to the drunk state, and the second training score feature vector, composed of the probability values of each second training paragraph voice feature vector belonging to the non-drunk state.
  • The first training score feature vector, the second training score feature vector, and the training speech rate feature vector can be used as the input variables of the state-level classifier. Specifically, the training score feature vector of the training voice information can be determined based on the first training score feature vector, the second training score feature vector, and the preset voice score feature functions (this process is similar to the process of determining the score feature vector of the voice information described above and is not repeated here), and the training score feature vector and the training speech rate feature vector are then merged to form the input variable of the state-level classifier.
  • The input variables here all carry the coding information of the passenger to which they belong; the coding information indicates which passenger a merged training score feature vector and training speech rate feature vector come from. The state information corresponding to the training voice information is then used as the output variable of the voice recognition model to train the model parameter information of the voice recognition model.
  • the segment-level classifier and state-level classifier in the voice recognition model can be trained separately.
  • The model parameter information in the segment-level classifier can be trained first: the training voice feature vector of each piece of voice information is input into the segment-level classifier to obtain the first training score feature vector and the second training score feature vector, these are input into a preset loss function, and the parameters of the segment-level classifier are adjusted until the loss function converges, yielding the model parameter information of the segment-level classifier.
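  • The two-stage training procedure could be sketched as follows; logistic regression is a stand-in, since the patent does not name the classifier families, and state_classifier_input is the helper from the earlier sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_voice_model(seg_X, seg_y, utterances, utt_y):
    """seg_X: (n_paragraphs, 768) second paragraph vectors pooled over all
    training voices; seg_y: 1 if the paragraph comes from a drunk speaker.
    utterances: list of (paragraph_vecs, speech_rate_vec), one per voice;
    utt_y: per-voice drunk / non-drunk labels."""
    seg_clf = LogisticRegression(max_iter=1000).fit(seg_X, seg_y)
    feats = []
    for paragraph_vecs, rate_vec in utterances:
        proba = seg_clf.predict_proba(paragraph_vecs)   # [:, 1] = P(drunk)
        feats.append(state_classifier_input(proba[:, 1], proba[:, 0], rate_vec))
    state_clf = LogisticRegression(max_iter=1000).fit(np.stack(feats), utt_y)
    return seg_clf, state_clf
```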
  • test sample library which includes a plurality of test voice feature vectors corresponding to test voice information, test speech rate feature vectors, and real state information corresponding to each test voice information.
  • The test sample library here can include test voice information of a large number of passengers whose status information is known; the processes of determining the test voice feature vectors and test speech rate feature vectors are similar to the processes of determining the voice feature vector and speech rate feature vector described above and are not repeated here.
  • Each test voice feature vector is input into the segment-level classifier of the voice recognition model to obtain the first test score feature vector and the second test score feature vector corresponding to each test voice in the test sample library; each test voice feature vector carries the coding information of its passenger.
  • The test score feature vector of the test voice information can likewise be determined based on the first test score feature vector, the second test score feature vector, and the preset voice score feature functions; the test score feature vector and the test speech rate feature vector are then merged and input into the state-level classifier of the voice recognition model to obtain the test state information corresponding to each test voice in the test sample library.
  • this application can use the accuracy rate and recall rate as the test performance evaluation index of the voice recognition model.
  • To evaluate the test results, the embodiments of this application introduce four classification situations: True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN); their specific meanings are shown in the following table:

                              Predicted drunk    Predicted non-drunk
        Actually drunk              TP                   FN
        Actually non-drunk          FP                   TN
  • The precision rate (Precision) calculates the proportion of correctly classified samples among all samples that are classified as this class. The formula is:

    $$Precision = \frac{TP}{TP + FP}$$
  • The recall rate (Recall) calculates the proportion of correctly classified samples (TP) among the actual number of samples in this category. The formula is:

    $$Recall = \frac{TP}{TP + FN}$$
  • Using the precision rate formula and the recall rate formula, the precision rate and recall rate of the voice recognition model can be obtained.
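  • Equivalently, with per-test-voice labels in hand (1 = drunk, 0 = non-drunk), scikit-learn's metrics implement the two formulas directly:

```python
from sklearn.metrics import precision_score, recall_score

def evaluate(y_true, y_pred):
    """Precision and recall of the drunk (positive) class on the test set."""
    return precision_score(y_true, y_pred), recall_score(y_true, y_pred)
```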
  • The setting conditions here include: (1) the precision rate is not less than the set precision rate and the recall rate is not less than the set recall rate; (2) the precision rate is not less than the set precision rate, and the recall rate is not limited; (3) the precision rate is not limited, and the recall rate is not less than the set recall rate; (4) the robustness related to the precision rate and the recall rate meets the set robustness condition.
  • the Precision-Recall curve of the voice recognition model can also be obtained through the accuracy rate and the recall rate.
  • the AUC index here is 0.819 and the AP index is 0.122.
  • the AUC index and AP index here can indicate whether the voice recognition model meets the set robustness condition.
  • AUC is the abbreviation of Area Under Curve, that is, the area under the curve plotted over the FPR axis; the full name of AP is Average Precision, which refers to the average precision and can specifically be taken as the area enclosed by the Precision-Recall curve and the x- and y-axes.
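  • Both indices can be computed from the state-level classifier's drunk-class scores on the test set, for example with scikit-learn (the AUC = 0.819 and AP = 0.122 figures above are the values reported in this application, not outputs of this sketch):

```python
from sklearn.metrics import roc_auc_score, average_precision_score

def robustness_indices(y_true, y_score):
    """AUC of the ROC curve and AP (the area under the Precision-Recall
    curve) for the drunk class."""
    return (roc_auc_score(y_true, y_score),
            average_precision_score(y_true, y_score))
```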
  • If the precision rate or recall rate does not meet the set conditions, the model training parameters in the voice recognition model or the training samples in the training sample library can be updated and the voice recognition model retrained, or the model training parameters and the training samples can be updated at the same time and the voice recognition model retrained, until the precision rate and recall rate meet the set conditions; training then stops, and the trained voice recognition model is obtained.
  • The embodiment of the application also provides an order processing device corresponding to the order processing method. Since the principle by which the device solves the problem is similar to that of the order processing method described in the embodiments of this application, the implementation of the device can refer to the implementation of the method, and repeated details are not repeated here.
  • the order processing apparatus 800 includes: an acquisition module 801, a determination module 802, and a prompt module 803; wherein,
  • the obtaining module 801 is configured to obtain the voice information sent by the service requesting end after the service provider receives the service request and triggers the voice obtaining request, and transmits the voice information to the determining module 802;
  • the determining module 802 is used to extract the voice feature vector and the speaking rate feature vector of the voice information, and based on the voice feature vector and the speaking rate feature vector, determine the current status information of the requester using the service requester, and transmit the status information to the prompt Module 803; the status information includes information indicating whether the requester is currently in a drunk state;
  • the prompt module 803 is configured to prompt the service provider to confirm whether to accept the order based on the status information.
  • In a possible implementation, the order processing apparatus further includes a processing module 804 which, after the acquiring module 801 acquires the voice information sent by the service requester and before the determining module 802 extracts the voice feature vector and the speech rate feature vector of the voice information, is used for: performing voice endpoint detection on the voice information and deleting the silent paragraphs in the voice information.
  • the determining module 802 is specifically configured to extract the feature vector of the voice information according to the following steps:
  • Framing processing is performed on each voice paragraph in the voice information to obtain a voice frame corresponding to each voice paragraph;
  • the voice feature vector of the voice information is extracted.
  • the determining module 802 is specifically configured to extract the voice feature vector of the voice information based on the voice feature vector of the first paragraph corresponding to each voice paragraph of the voice information according to the following steps:
  • For each voice paragraph, based on the first paragraph voice feature vector of the voice paragraph and the pre-stored awake state voice feature vector, determine the differential voice feature vector of each voice paragraph;
  • the voice feature vectors of each second paragraph are combined to obtain the voice feature vector of the voice information.
  • the determining module 802 is specifically configured to extract the speech rate feature vector of the speech information according to the following steps:
  • the speech rate feature vector of the speech information is extracted.
  • the determining module 802 is specifically configured to determine the current status information of the requester using the service requester based on the voice feature vector and the speech rate feature vector according to the following steps:
  • a first score feature vector indicating a drunken state of the voice information and a second score feature vector indicating a non-drunk state are determined based on the voice feature vector.
  • the first score feature vector includes the probability value of each voice segment in the voice information indicating the drunk state
  • the second score feature vector includes the probability value of each voice paragraph in the voice information indicating the non-drunk state;
  • the current status information of the requester is determined.
  • the determining module 802 is specifically configured to determine, based on the voice feature vector, a first score feature vector indicating an intoxication state and a second score feature vector indicating a non-drunken state in the voice information according to the following steps:
  • After the score feature vector and the speech rate feature vector are merged, they are input into the state-level classifier in the voice recognition model to determine the current state information of the requester.
  • a model training module 805 is further included, and the model training module 805 is configured to train a voice recognition model according to the following steps:
  • the training sample library includes multiple training voice feature vectors corresponding to training voice information, training speech rate feature vectors, and status information corresponding to each training voice information;
  • the training voice feature vector of each training voice information is input into the segment-level classifier in turn to obtain the first training score feature vector and the second training score feature vector corresponding to the training voice information; the first training score feature vector, The second training score feature vector and the speech rate feature vector are used as input variables of the state-level classifier, and the state information corresponding to the training voice information is used as the output variable of the voice recognition model, and the model parameter information of the voice recognition model is obtained by training.
  • the order processing device further includes a model testing module 806, and the model testing module 806 is configured to test the voice recognition model according to the following steps:
  • A test sample library is constructed, including a plurality of test speech feature vectors corresponding to test speech information, test speech rate feature vectors, and real state information corresponding to each piece of test speech information;
  • the accuracy and recall rate of the voice recognition model are determined.
  • At least one of the model training parameters and the training sample library in the voice recognition model can be updated, so that the model training module 805 retrains voice recognition Model, until the accuracy rate of the voice recognition model is not less than the set accuracy rate, and the recall rate is not less than the set recall rate.
  • An embodiment of the present application also provides an electronic device 900.
  • Fig. 9 shows a schematic structural diagram of the electronic device 900 provided in this embodiment of the present application; the device includes a processor 901, a storage medium 902, and a bus 903.
  • the storage medium 902 stores machine-readable instructions executable by the processor 901 (for example, the acquisition module 801, the determination module 802, the prompt module 803, etc.).
  • when the electronic device 900 runs, the processor 901 and the storage medium 902 communicate through the bus 903, and when the machine-readable instructions are executed by the processor 901, the following processing is performed:
  • after the service provider terminal receives a service request and triggers a voice acquisition request, the voice information sent by the service requester terminal is obtained;
  • the voice feature vector and the speech rate feature vector of the voice information are extracted, and the current status information of the requester using the service requester terminal is determined based on them; the status information includes information indicating whether the requester is currently drunk;
  • based on the status information, the service provider terminal is prompted to confirm whether to accept the order.
  • in a possible implementation, after obtaining the voice information sent by the service requester terminal and before extracting its voice feature vector and speech rate feature vector, the instructions executed by the processor 901 further include:
  • performing voice endpoint detection on the voice information to obtain at least one voice paragraph and silent paragraph;
  • deleting the silent paragraphs from the voice information.
  • the instructions executed by the processor 901 include:
  • performing framing processing on each voice paragraph in the voice information to obtain the voice frames corresponding to each voice paragraph;
  • for each voice paragraph, extracting the voice frame features of each voice frame in the paragraph and the voice frame feature differences between each frame and its adjacent frame, and determining the first paragraph voice feature vector of the paragraph based on the voice frame features, the feature differences, and the preset voice paragraph feature function;
  • extracting the voice feature vector of the voice information based on the first paragraph voice feature vectors corresponding to the voice paragraphs.
  • the instructions executed by the processor 901 include:
  • for each voice paragraph, determining the differential voice feature vector of that paragraph based on its first paragraph voice feature vector and the pre-stored sober-state voice feature vector;
  • determining the second paragraph voice feature vector of each voice paragraph based on its first paragraph voice feature vector and differential voice feature vector;
  • combining the second paragraph voice feature vectors to obtain the voice feature vector of the voice information.
  • the instructions executed by the processor 901 include:
  • converting each voice paragraph in the voice information into a text paragraph, each text paragraph including a plurality of characters;
  • determining the speech rate of each text paragraph based on its number of characters and the duration of the corresponding voice paragraph, determining the maximum, minimum, and average speech rates of the voice information, and extracting the speech rate feature vector of the voice information from them.
  • the instructions executed by the processor 901 include:
  • determining, based on the voice feature vector, a first score feature vector indicating a drunken state and a second score feature vector indicating a non-drunken state of the voice information, where the first score feature vector includes the probability value of each voice paragraph indicating the drunken state and the second score feature vector includes the probability value of each voice paragraph indicating the non-drunken state;
  • determining the current status information of the requester based on the first score feature vector, the second score feature vector, and the speech rate feature vector.
  • the instructions executed by the processor 901 include:
  • inputting the voice feature vector into the segment-level classifier in the pre-trained voice recognition model to obtain the first and second score feature vectors;
  • determining the score feature vector of the voice information based on the first score feature vector, the second score feature vector, and the preset voice score feature function;
  • after combining the score feature vector and the speech rate feature vector, inputting them into the state-level classifier in the voice recognition model to determine the current status information of the requester.
  • the instructions executed by the processor 901 further include:
  • constructing the segment-level classifier and the state-level classifier of the voice recognition model;
  • obtaining a pre-constructed training sample library, which includes training voice feature vectors and training speech rate feature vectors corresponding to multiple pieces of training voice information, as well as the status information corresponding to each piece of training voice information;
  • inputting the training voice feature vector of each piece of training voice information into the segment-level classifier in turn to obtain the corresponding first and second training score feature vectors; taking the first training score feature vector, the second training score feature vector, and the training speech rate feature vector as input variables of the state-level classifier, taking the corresponding status information as the output variable of the voice recognition model, and training to obtain the model parameter information of the voice recognition model.
  • the instructions executed by the processor 901 further include:
  • obtaining a pre-constructed test sample library, which includes test voice feature vectors and test speech rate feature vectors corresponding to multiple pieces of test voice information, as well as the real status information corresponding to each piece of test voice information;
  • inputting each test voice feature vector into the segment-level classifier of the voice recognition model in turn to obtain the corresponding first and second test score feature vectors, and inputting these together with the test speech rate feature vector into the state-level classifier to obtain the test status information corresponding to each test voice;
  • determining the precision and recall of the voice recognition model based on the real status information and the test status information, and, if they do not satisfy the set conditions, updating the model training parameters of the voice recognition model and/or the training sample library and retraining the model until they do.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when run by a processor, the computer program executes the steps of the above order processing method.
  • Specifically, the storage medium can be a general storage medium, such as a removable disk or a hard disk.
  • When the computer program on the storage medium runs, it can execute the above order processing method, thereby solving the problem that the safety of the riding environment cannot be risk-controlled in advance; risk control can thus be applied to the riding environment ahead of time, improving the safety of the overall riding environment.
  • the modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of this application may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of this application.
  • The aforementioned storage media include: USB flash drives, removable hard disks, ROM, RAM, magnetic disks, optical disks, and other media that can store program code.

Abstract

An order processing method includes: after a service provider terminal receives a service request and triggers a voice acquisition request, obtaining the voice information sent by the service requester terminal (S201); extracting a voice feature vector and a speech rate feature vector of the voice information, and determining, based on the voice feature vector and the speech rate feature vector, the current status information of the requester using the service requester terminal, where the status information includes information indicating whether the requester is currently drunk (S202); and, based on the status information, prompting the service provider terminal to confirm whether to accept the order (S203). The method can improve the safety of the riding environment. An order processing device, an electronic device, and a storage medium are also disclosed.

Description

Order processing method and device, electronic device, and storage medium

Technical Field

This application relates to the field of computer technology, and in particular to an order processing method and device, an electronic device, and a storage medium.

Background

With the rapid development of mobile Internet communication technology and smart devices, various service applications have emerged, such as ride-hailing applications (Applications, APPs). A service requester can request a ride service through a ride-hailing APP; when the ride-hailing platform receives a travel request initiated by the service requester, it matches the user with a service provider and provides the corresponding travel service.

During order dispatch, the service provider has the right to accept or reject an order. Research has found that accidents are relatively likely when a service requester requests a travel service alone while drunk; for example, a drunk requester may disturb the service provider's normal driving, or situations may arise that threaten the service provider's personal safety. At present, the service provider can only judge manually whether the requester is drunk after the requester boards the vehicle; no prediction can be made before boarding, so no advance risk control can be applied to the safety of the riding environment.

Summary

In view of this, the purpose of this application is to provide an order processing method and device, an electronic device, and a storage medium, which can apply risk control to the safety of the riding environment in advance and thereby improve the overall safety of the riding environment.
In a first aspect, an embodiment of this application provides an order processing device, including:

an acquisition module, configured to obtain the voice information sent by the service requester terminal after the service provider terminal receives a service request and triggers a voice acquisition request, and to transmit the voice information to a determination module;

a determination module, configured to extract a voice feature vector and a speech rate feature vector of the voice information, determine, based on the voice feature vector and the speech rate feature vector, the current status information of the requester using the service requester terminal, and transmit the status information to a prompt module, where the status information includes information indicating whether the requester is currently drunk;

a prompt module, configured to prompt the service provider terminal, based on the status information, to confirm whether to accept the order.

In some implementations, the device further includes a processing module, configured, after the acquisition module obtains the voice information sent by the service requester terminal and before the determination module extracts the voice feature vector and the speech rate feature vector, to:

perform voice endpoint detection on the voice information to obtain at least one voice paragraph and silent paragraph;

delete the silent paragraphs from the voice information.

In some implementations, the determination module is specifically configured to extract the voice feature vector of the voice information according to the following steps:

performing framing processing on each voice paragraph in the voice information to obtain the voice frames corresponding to each voice paragraph;

for each voice paragraph, extracting the voice frame features of each voice frame in the paragraph and the voice frame feature differences between each voice frame and its adjacent voice frame, and determining the first paragraph voice feature vector of the paragraph based on the voice frame features, the voice frame feature differences, and a preset voice paragraph feature function;

extracting the voice feature vector of the voice information based on the first paragraph voice feature vectors corresponding to the voice paragraphs.

In some implementations, the determination module is specifically configured to extract the voice feature vector of the voice information from the first paragraph voice feature vectors corresponding to the voice paragraphs according to the following steps:

for each voice paragraph, determining the differential voice feature vector of the paragraph based on its first paragraph voice feature vector and a pre-stored sober-state voice feature vector;

determining the second paragraph voice feature vector of each voice paragraph based on its first paragraph voice feature vector and differential voice feature vector;

combining the second paragraph voice feature vectors to obtain the voice feature vector of the voice information.

In some implementations, the determination module is specifically configured to extract the speech rate feature vector of the voice information according to the following steps:

converting each voice paragraph in the voice information into a text paragraph, where each text paragraph includes a plurality of characters;

determining the speech rate of each text paragraph based on the number of characters in the paragraph and the duration of the corresponding voice paragraph;

determining the maximum, minimum, and average speech rates of the voice information based on the speech rate of each text paragraph;

extracting the speech rate feature vector of the voice information based on the maximum, minimum, and average speech rates.

In some implementations, the determination module is specifically configured to determine the current status information of the requester using the service requester terminal, based on the voice feature vector and the speech rate feature vector, according to the following steps:

determining, based on the voice feature vector, a first score feature vector indicating a drunken state and a second score feature vector indicating a non-drunken state of the voice information, where the first score feature vector includes the probability value of each voice paragraph in the voice information indicating the drunken state, and the second score feature vector includes the probability value of each voice paragraph indicating the non-drunken state;

determining the requester's current status information based on the first score feature vector, the second score feature vector, and the speech rate feature vector.

In some implementations, the determination module is specifically configured to determine the first and second score feature vectors from the voice feature vector according to the following steps:

inputting the voice feature vector into a segment-level classifier in a pre-trained voice recognition model to obtain the first score feature vector indicating the drunken state and the second score feature vector indicating the non-drunken state;

determining the score feature vector of the voice information based on the first score feature vector, the second score feature vector, and a preset voice score feature function;

after merging the score feature vector and the speech rate feature vector, inputting them into a state-level classifier in the voice recognition model to determine the requester's current status information.

In some implementations, the device further includes a model training module, configured to train the voice recognition model according to the following steps:

constructing the segment-level classifier and the state-level classifier of the voice recognition model;

obtaining a pre-constructed training sample library, which includes training voice feature vectors and training speech rate feature vectors corresponding to multiple pieces of training voice information, as well as the status information corresponding to each piece of training voice information;

inputting the training voice feature vector of each piece of training voice information into the segment-level classifier in turn to obtain the first and second training score feature vectors corresponding to that training voice information; taking the first training score feature vector, the second training score feature vector, and the training speech rate feature vector as input variables of the state-level classifier, taking the corresponding status information as the output variable of the voice recognition model, and training to obtain the model parameter information of the voice recognition model.

In some implementations, the device further includes a model testing module, configured to test the voice recognition model according to the following steps:

obtaining a pre-constructed test sample library, which includes test voice feature vectors and test speech rate feature vectors corresponding to multiple pieces of test voice information, as well as the real status information corresponding to each piece of test voice information;

inputting each test voice feature vector in the test sample library into the segment-level classifier of the voice recognition model in turn to obtain the first and second test score feature vectors corresponding to each test voice;

inputting the first test score feature vector, the second test score feature vector, and the test speech rate feature vector corresponding to each test voice into the state-level classifier of the voice recognition model to obtain the test status information corresponding to each test voice in the test sample library;

determining the precision and recall of the voice recognition model based on the real status information and the test status information;

if the precision and recall do not satisfy set conditions, updating the model training parameters of the voice recognition model and/or the training sample library, and retraining the voice recognition model until the precision and recall satisfy the set conditions.

In a second aspect, an embodiment of this application provides an order processing method, including:

after the service provider terminal receives a service request and triggers a voice acquisition request, obtaining the voice information sent by the service requester terminal;

extracting a voice feature vector and a speech rate feature vector of the voice information, and determining, based on them, the current status information of the requester using the service requester terminal, where the status information includes information indicating whether the requester is currently drunk;

based on the status information, prompting the service provider terminal to confirm whether to accept the order.

In a third aspect, an embodiment of this application provides an electronic device, including a processor, a storage medium, and a bus. The storage medium stores machine-readable instructions executable by the processor. When the electronic device runs, the processor and the storage medium communicate through the bus, and the processor executes the machine-readable instructions to perform the steps of the order processing method described in the second aspect.

In a fourth aspect, an embodiment of this application provides a computer-readable storage medium on which a computer program is stored; when run by a processor, the computer program performs the steps of the order processing method described in the second aspect.

The embodiments of this application provide an order processing method, device, server, and computer-readable storage medium. After the service provider terminal receives a service request and triggers a voice acquisition request, the voice information sent by the service requester terminal is obtained; the voice feature vector and speech rate feature vector of that voice information are then extracted and used to determine the current status information of the requester, i.e., whether the requester is drunk; based on this information, the service provider terminal is prompted to confirm whether to accept the order. By determining in advance whether the requester is drunk and prompting the service provider terminal accordingly, risk control is applied to the safety of the riding environment in advance, improving the overall safety of the riding environment.

Other features and advantages of the embodiments of this application will be set forth in the following description; some can be inferred from or determined without doubt by the description, or learned by implementing the above techniques of the embodiments.

To make the above objectives, features, and advantages of this application clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief Description of the Drawings

To explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the embodiments are briefly introduced below. It should be understood that the following drawings show only certain embodiments of this application and should not be regarded as limiting its scope; those of ordinary skill in the art can obtain other related drawings from them without creative effort.

Fig. 1 is a schematic architectural diagram of an order processing system provided by an embodiment of this application;

Fig. 2 is a flowchart of an order processing method provided by an embodiment of this application;

Fig. 3 is a flowchart of a first method for extracting the voice feature vector of voice information provided by an embodiment of this application;

Fig. 4 is a flowchart of a second method for extracting the voice feature vector of voice information provided by an embodiment of this application;

Fig. 5 is a flowchart of a method for extracting the speech rate feature vector of voice information provided by an embodiment of this application;

Fig. 6 is a flowchart of a method for determining the current status information of the requester using the service requester terminal, based on the voice feature vector and the speech rate feature vector, provided by an embodiment of this application;

Fig. 7 is a schematic precision-recall curve of a voice recognition model provided by an embodiment of this application;

Fig. 8 is a schematic structural diagram of an order processing device provided by an embodiment of this application;

Fig. 9 is a schematic structural diagram of an electronic device provided by an embodiment of this application.
Detailed Description

To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions are described clearly and completely below with reference to the drawings of the embodiments. It should be understood that the drawings in this application serve only the purpose of illustration and description and are not used to limit the scope of protection; the schematic drawings are not drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments; their operations may be implemented out of order, and steps without logical context may be reversed in order or performed simultaneously. Moreover, under the guidance of this application, those skilled in the art may add one or more other operations to a flowchart or remove one or more operations from it.

In addition, the described embodiments are only some, not all, of the embodiments of this application. The components of the embodiments described and shown in the drawings here can generally be arranged and designed in various configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the claimed scope of this application but merely represents selected embodiments. All other embodiments obtained by those skilled in the art without creative effort based on the embodiments of this application fall within the scope of protection of this application.

To enable those skilled in the art to use the content of this application, the following implementations are given in combination with the specific application scenario of ride-hailing order processing. For those skilled in the art, the general principles defined here can be applied to other embodiments and application scenarios without departing from the spirit and scope of this application. Although this application mainly describes ride-hailing orders, it should be understood that this is only an exemplary embodiment.

It should be noted that the term "include" is used in the embodiments of this application to indicate the presence of the features stated thereafter, without excluding the addition of other features.

The terms "passenger", "requesting party", "requester", "service requester", and "customer" are used interchangeably in this application to refer to an individual, entity, or tool that can request or order a service. The terms "driver", "providing party", "provider", "service provider", and "supplier" are used interchangeably to refer to an individual, entity, or tool that can provide a service. The term "user" may refer to an individual, entity, or tool that requests, orders, provides, or facilitates the provision of a service; for example, a user may be a passenger, a driver, an operator, etc., or any combination thereof. In this application, "passenger" and "passenger terminal" are used interchangeably, as are "driver" and "driver terminal".

The terms "service request" and "order" are used interchangeably in this application to refer to a request initiated by a passenger, a service requester, a driver, a service provider, a supplier, etc., or any combination thereof. The "service request" or "order" may be accepted by a passenger, a service requester, a driver, a service provider, a supplier, etc., or any combination thereof. The service request may be paid or free.

The positioning technology used in this application may be based on the Global Positioning System (GPS), the Global Navigation Satellite System (GLONASS), the COMPASS navigation system, the Galileo positioning system, the Quasi-Zenith Satellite System (QZSS), Wireless Fidelity (WiFi) positioning technology, etc., or any combination thereof. One or more of the above positioning systems may be used interchangeably in this application.

One aspect of this application relates to an order processing system. The system can process the voice information sent by the service requester terminal to determine whether the requester using the terminal is currently drunk and, according to the determined current state, prompt the service provider terminal to confirm whether to accept the order sent by the service requester terminal.

It is worth noting that before this application was filed, the service provider could only judge manually whether the service requester was drunk after the requester boarded the vehicle; no prediction could be made before boarding, so no advance risk control could be applied to the safety of the riding environment. The order processing system provided by this application, however, can judge whether the service requester is drunk before boarding and prompt the service provider terminal in advance, thereby applying advance risk control to the safety of the riding environment.
Fig. 1 is a schematic architectural diagram of an order processing system 100 provided by an embodiment of this application. For example, the order processing system 100 may be an online transportation service platform for transportation services such as taxis, designated driving, express rides, carpooling, bus services, driver rental, or shuttle services, or any combination thereof. The order processing system 100 may include one or more of a server 101, a network 102, a service requester terminal 103, a service provider terminal 104, and a database 105.

In some embodiments, the server 101 may include a processor, which may process information and/or data related to the service request to perform one or more of the functions described in this application. In some embodiments, the processor may include one or more processing cores (for example, a single-core or multi-core processor). Merely by way of example, the processor may include a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), an Application Specific Instruction-set Processor (ASIP), a Graphics Processing Unit (GPU), a Physics Processing Unit (PPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a controller, a microcontroller unit, a Reduced Instruction Set Computer (RISC), a microprocessor, etc., or any combination thereof.

In some embodiments, the device types corresponding to the service requester terminal 103 and the service provider terminal 104 may be mobile devices, for example smart home devices, wearable devices, smart mobile devices, virtual reality devices, or augmented reality devices, and may also be tablet computers, laptop computers, built-in devices in motor vehicles, etc.

In some embodiments, the database 105 may be connected to the network 102 to communicate with one or more components of the order processing system 100 (for example, the server 101, the service requester terminal 103, the service provider terminal 104, etc.). One or more components of the order processing system 100 may access the data or instructions stored in the database 105 via the network 102. In some embodiments, the database 105 may be directly connected to one or more components of the order processing system 100, or it may be part of the server 101.

The order processing method provided by the embodiments of this application is described in detail below in combination with the content described in the order processing system 100 shown in Fig. 1.

Fig. 2 is a schematic flowchart of an order processing method provided by an embodiment of this application. The method may be executed by the server or the service provider terminal in the order processing system 100, and the specific execution process includes the following steps S201 to S203:
S201: After the service provider terminal receives a service request and triggers a voice acquisition request, obtain the voice information sent by the service requester terminal.

In the travel field, the service provider terminal here may be a mobile device on the driver's side, and the service requester terminal may be a mobile device on the passenger's side; such mobile devices may include smart home devices, wearable devices, smart mobile devices, virtual reality devices, or augmented reality devices, and may also be tablet computers, laptop computers, built-in devices in motor vehicles, etc.

The service request here can be understood as a ride request or an order request. For example, a passenger sends a ride request through a ride-hailing application (Application, APP) on a mobile terminal; the ride request may include the passenger's current location information and a contact address, such as the mobile terminal's number.

When the driver terminal receives the service request sent by the passenger terminal, it can trigger a voice acquisition request through the contact address in the service request. When a voice call connection is established between the driver terminal and the passenger terminal, the server or the driver terminal can obtain the voice information sent by the service requester terminal, i.e., the passenger terminal.

In the embodiments of this application, the voice information here mainly refers to the passenger's voice information obtained after the voice call connection is established between the driver terminal and the passenger terminal, for example the passenger's narration of their current location and intended destination.

Since the voice information may contain silent paragraphs without speech while the passenger narrates their location and destination, removing these silent paragraphs improves the subsequent recognition of the passenger's status from the voice information and also prevents the intrusion of dirty data. Therefore, after obtaining the passenger terminal's voice information, the order processing method of the embodiments further includes:

(1) performing voice endpoint detection on the voice information to obtain at least one voice paragraph and silent paragraph;

(2) deleting the silent paragraphs from the voice information.

Voice Activity Detection (VAD) can remove the silent paragraphs in the voice information, for example blank paragraphs caused by pauses while the passenger is listening, thinking, or resting, or noise paragraphs that clearly do not belong to the passenger's speech, such as car horns, wind, or rain.

When the voice information is relatively long, multiple sentence segments can be obtained after the silent paragraphs are deleted, and the sentence segments can then be divided by a set length into multiple voice paragraphs. When the voice information is relatively short, it is also possible to keep a single voice paragraph without segmentation after deleting the silent paragraphs.
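As a rough illustration of this endpoint-detection step, the following Python sketch removes low-energy stretches with a simple energy threshold. It is a minimal stand-in under stated assumptions (the function name, the 10 ms frames, and the threshold ratio are illustrative), not the VAD algorithm the embodiment mandates; a production system would more likely use a trained VAD.

```python
import numpy as np

def remove_silence(samples, sr, frame_ms=10, energy_ratio=0.05):
    """Drop low-energy (silent) stretches from a mono waveform array."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)        # short-time energy per frame
    threshold = energy_ratio * energy.max()   # crude adaptive threshold
    voiced = energy > threshold               # True where speech is likely
    return frames[voiced].reshape(-1)         # concatenate the voiced frames
```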
S202: Extract the voice feature vector and speech rate feature vector of the voice information, and determine, based on them, the current status information of the requester using the service requester terminal; the status information includes information indicating whether the requester is currently drunk.

The voice information here refers to the voice information after the silent segments have been removed. From it, a voice feature vector composed of human acoustic features and a speech rate feature vector composed of the passenger's speaking-rate features are extracted.

The information indicating whether the requester is currently drunk can be represented, for example, by set numbers, such as 1001 for drunk and 1002 for not drunk, or by text, i.e., directly as "drunk" or "not drunk".

S203: Based on the status information, prompt the service provider terminal to confirm whether to accept the order.

After the requester's current status information is determined, the service provider terminal can be prompted by sound whether to accept the order, or the service provider terminal can be controlled to display a trigger button for accepting or declining the order, so that the driver can choose autonomously.

For example, if it is determined that the passenger is currently drunk, a voice prompt such as "This passenger is drunk; please confirm whether to accept the order" can be played for the driver, or the same message can be displayed on the driver's mobile terminal. In this way, the driver can learn the passenger's current state in advance and take corresponding measures, applying advance risk control to the safety of the riding environment.

Specifically, after the passenger's voice information is obtained, the voice feature vector of the voice information can be extracted according to the following process, as shown in Fig. 3, including steps S301 to S303:

S301: Perform framing processing on each voice paragraph in the voice information to obtain the voice frames corresponding to each voice paragraph.

After the voice information is obtained, each voice paragraph in it is framed, for example at 10 ms intervals, dividing each voice paragraph into multiple voice frames.
S302: For each voice paragraph, extract the voice frame features of each voice frame in the paragraph and the voice frame feature differences between each voice frame and its adjacent voice frame, and determine the first paragraph voice feature vector of the paragraph based on the voice frame features, the voice frame feature differences, and a preset voice paragraph feature function.

The voice frame features here may include acoustic features of the voice frame such as the fundamental frequency, Mel-frequency cepstral coefficients, zero crossing rate, harmonics-to-noise ratio, and energy. The voice frame feature difference between a voice frame and its adjacent frame may, in the embodiments of this application, refer to the differences in these acoustic features (fundamental frequency, Mel-frequency cepstral coefficients, zero crossing rate, harmonics-to-noise ratio, and energy) between the frame and its preceding frame.

Specifically, the fundamental frequency, Mel-frequency cepstral coefficients, zero crossing rate, harmonics-to-noise ratio, and energy are extracted as follows:

(1) The fundamental frequency (F0) is the vibration frequency of the fundamental tone and determines the pitch of the voice; in practice, the pitch value is commonly used to represent the fundamental frequency, so extracting the fundamental frequency feature of a voice frame means extracting its pitch.

(2) Mel-frequency cepstral coefficients (MFCC) are features designed to simulate the auditory perception mechanism of the human ear. They are extracted as follows: first, a short-term Fourier transform (STFT) is applied to the voice frame to obtain the energy distribution on the spectrogram; the spectrum is then passed through a bank of triangular filters evenly distributed on the Mel scale, with half overlap between adjacent filters, giving the energy level of each frame in each Mel filter. Finally, the logarithm of the filter-bank outputs is taken, and a discrete cosine transform (DCT) is applied to obtain the cepstrum, which also decorrelates the features. Usually only the first 12 DCT coefficients are kept, because discarding the high-cepstral-domain DCT coefficients acts like a low-pass filter, smoothing the signal and improving the performance of speech signal processing.

In the embodiments of this application there are 12 Mel-frequency cepstral coefficients, so the voice frame features include 12 MFCCs.

(3) The zero crossing rate (ZCR) represents the number of times the signal crosses zero per unit of sampling points. It is calculated as:

$ZCR_n = \frac{1}{2}\sum_{m=1}^{N-1}\left|\operatorname{sgn}[x_n(m)] - \operatorname{sgn}[x_n(m-1)]\right|$

where

$\operatorname{sgn}[x] = \begin{cases} 1, & x \geq 0 \\ -1, & x < 0 \end{cases}$

Here N is the frame length of the voice frame (or the number of sampling points in the frame; for example, if frames are collected at 10 ms intervals with one sampling point per 1 ms, the frame length is 10), and $x_n(m)$ is the m-th signal sampling point of the n-th frame. The zero crossing rate is often used to distinguish unvoiced from voiced sounds: the former has a higher zero crossing rate, the latter a lower one.

(4) The harmonics-to-noise ratio (HNR) is calculated from the autocorrelation coefficient function (ACF) and can reflect the probability of voicing. It is calculated as:

$HNR = 10\log_{10}\frac{ACF(T_0)}{ACF(0) - ACF(T_0)}$

where

$ACF(\tau) = \sum_{t} x(t)\,x(t+\tau)$

Here $T_0$ is the fundamental period, and $\tau$ is the lag, i.e., the interval between two sampling points within the same voice frame.

(5) Energy: the short-time energy of the speech signal is calculated as the sum of squares of the signal values at each point within a frame:

$E_n = \sum_{m=0}^{N-1} x_n^2(m)$

Energy reflects the amplitude of the audio signal. It gives a rough indication of how much information each frame contains and can also be used to distinguish voiced from unvoiced segments and to find the boundary between sounded and silent segments.

These acoustic features express the sound characteristics in the requester's voice information and serve as reference quantities for subsequently determining the requester's status information.
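To make the per-frame feature layout concrete, the sketch below assembles the 32-dimensional frame vectors described above. The embodiment does not name an extraction library; librosa is used here as one possible choice, and the pitch search range, the 25 ms frame length, and the crude autocorrelation-based HNR estimator are illustrative assumptions.

```python
import numpy as np
import librosa  # one possible feature library; the embodiment does not name one

def frame_features(y, sr, frame_ms=25, hop_ms=10):
    """Build 32-dim per-frame vectors: pitch, 12 MFCCs, ZCR, HNR, energy,
    plus frame-to-frame differences of all 16 features."""
    frame_len, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr,
                     frame_length=frame_len, hop_length=hop)      # pitch
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                                n_fft=frame_len, hop_length=hop)  # 12 MFCCs
    zcr = librosa.feature.zero_crossing_rate(
        y, frame_length=frame_len, hop_length=hop)[0]
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
    energy = (frames ** 2).sum(axis=0)                            # short-time energy

    def hnr(frame):
        # crude HNR = 10*log10(ACF(T0) / (ACF(0) - ACF(T0)))
        acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        t0 = 1 + np.argmax(acf[1:])               # rough fundamental period
        r = np.clip(acf[t0] / (acf[0] + 1e-12), 1e-6, 1 - 1e-6)
        return 10 * np.log10(r / (1 - r))

    # padding conventions differ slightly between calls; truncate to common length
    n = min(len(f0), mfcc.shape[1], len(zcr), len(energy))
    feats = np.vstack([f0[:n], mfcc[:, :n], zcr[:n],
                       np.apply_along_axis(hnr, 0, frames[:, :n]),
                       energy[:n]]).T
    deltas = np.diff(feats, axis=0, prepend=feats[:1])  # differences vs. previous frame
    return np.hstack([feats, deltas])                   # shape: (n_frames, 32)
```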
For any voice paragraph in the voice information, after the voice frame features of each voice frame and the feature differences between each frame and its adjacent frame are extracted, a voice frame feature vector is obtained for each voice frame in a set order. Taking the n-th voice frame as an example, its voice frame feature vector can be written as:

$Y_n = (\mathrm{pitch}_n,\ \mathrm{MFCC}_{n,1},\ \ldots,\ \mathrm{MFCC}_{n,12},\ \mathrm{ZCR}_n,\ \mathrm{HNR}_n,\ E_n,\ \Delta\mathrm{pitch}_n,\ \Delta\mathrm{MFCC}_{n,1},\ \ldots,\ \Delta\mathrm{MFCC}_{n,12},\ \Delta\mathrm{ZCR}_n,\ \Delta\mathrm{HNR}_n,\ \Delta E_n)^T$

The voice frame feature vector of the n-th frame is thus a 32-dimensional feature vector, where n can be any voice frame in the paragraph. If the paragraph includes 10 voice frames, then after obtaining the 10 voice frame feature vectors ($Y_1$ to $Y_{10}$), the voice frame features or feature differences belonging to the same dimension across these vectors are processed by the preset voice paragraph feature function to determine the first paragraph voice feature vector of the paragraph. "The same dimension" refers to the same element position in each vector; for example, a vector with 32 elements is a 32-dimensional vector, and elements at the same position in each vector belong to the same dimension.

Specifically, take the first-dimension voice frame feature across all the voice frame feature vectors of the paragraph, i.e., the pitch feature. All pitch features of the 10 vectors are processed by the mean function, standard deviation function, kurtosis function, skewness function, maximum function, minimum function, relative maximum position function, relative minimum position function, range function, and linear regression function. This yields 12 function values for the first-dimension pitch features: the mean, standard deviation, kurtosis, skewness, maximum, minimum, relative maximum position, relative minimum position, range, and the offset term, slope, and minimum mean squared error of the linear regression. The same 12 function values are computed for each dimension of voice frame features or feature differences in the paragraph.

In this way, the function values corresponding to all voice frame features and feature differences of the paragraph are obtained; arranged in the set feature order and function-value order, they form the first paragraph voice feature vector of the paragraph. With 32-dimensional frame feature vectors and 12 function values per dimension, the resulting first paragraph voice feature vector is a 32 x 12 = 384-dimensional feature vector.

For ease of description, the voice frame features or feature differences belonging to the same dimension across all voice frame feature vectors of a paragraph are treated as one sample group. For example, if a paragraph includes 10 voice frame feature vectors of 32 dimensions each, the pitch sample group of the first dimension contains 10 samples, and the 32 dimensions give 32 sample groups. The meanings of the mean, standard deviation, kurtosis, skewness, maximum, minimum, relative maximum position, relative minimum position, range, offset term, slope, and minimum mean squared error are introduced below.

The mean is the average of all samples in each sample group; for the pitch sample group, it is the average pitch across all voice frame feature vectors of the paragraph.

The standard deviation is the standard deviation of each sample group.

Kurtosis is a statistic describing how steep or flat the sample distribution is compared with a normal distribution. In the common excess-kurtosis form, the kurtosis function is:

$K = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{D^2} - 3$

where D is the sample variance of the sample group and $\bar{x}$ is the sample mean of the group.

Skewness, similar to kurtosis, is a statistic describing the shape of the sample distribution; it describes the symmetry of the distribution and is calculated as:

$S = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{D^{3/2}}$

The maximum and minimum are the largest and smallest values within a sample group.

The relative maximum position is the position, within the paragraph, of the voice frame to which the maximum value of the sample group belongs; the relative minimum position is defined analogously. For example, in the pitch sample group, if the largest pitch comes from the voice frame at the 3rd position of the paragraph, the relative maximum position is 3; if the smallest pitch comes from the frame at the 7th position, the relative minimum position is 7.

The range is the difference between the maximum and minimum values within a sample group: Range = Max - Min.

The offset term, slope, and minimum mean squared error are, respectively, the intercept, slope, and minimum mean squared error of the linear regression function fitted to a sample group.

Following the above method, the first paragraph voice feature vector of each voice paragraph in the voice information can be determined; if the voice information includes 10 voice paragraphs, 10 first paragraph voice feature vectors of 384 dimensions each are obtained.

Converting the voice frame feature vectors into first paragraph voice feature vectors yields the feature vectors of the individual voice paragraphs, which fully express the acoustic features of the requester's voice paragraphs and facilitate the later determination of the requester's status information.

Of course, the preset voice paragraph feature functions are not limited to the ones given above, which are merely one specific embodiment.
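A minimal sketch of this statistical-functional step, assuming the 12 functionals listed above, might look like the following (SciPy's kurtosis and skew default to the excess-kurtosis and standard skewness conventions used in the formulas above):

```python
import numpy as np
from scipy import stats

def paragraph_vector(frame_feats):
    """Collapse (n_frames, 32) frame features into one 32 x 12 = 384-dim
    first paragraph voice feature vector using the 12 functionals above."""
    out = []
    t = np.arange(frame_feats.shape[0])
    for col in frame_feats.T:                      # one sample group per dimension
        reg = stats.linregress(t, col)             # linear regression over frames
        resid = col - (reg.intercept + reg.slope * t)
        out += [
            col.mean(), col.std(),                 # mean, standard deviation
            stats.kurtosis(col), stats.skew(col),  # kurtosis, skewness
            col.max(), col.min(),                  # maximum, minimum
            int(col.argmax()), int(col.argmin()),  # relative max/min positions
            col.max() - col.min(),                 # range
            reg.intercept, reg.slope,              # offset term, slope
            (resid ** 2).mean(),                   # minimum mean squared error
        ]
    return np.asarray(out)                         # shape: (384,)
```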
S303: Extract the voice feature vector of the voice information based on the first paragraph voice feature vectors corresponding to the voice paragraphs.

Considering that each person's drunken state is mostly different, i.e., there is individual variability, step S303 extracts the voice feature vector of the voice information from the first paragraph voice feature vectors as shown in Fig. 4, including the following steps S401 to S403:

S401: For each voice paragraph, determine the differential voice feature vector of the paragraph based on its first paragraph voice feature vector and a pre-stored sober-state voice feature vector.

The sober-state voice feature vector is obtained by collecting in advance the voice information of a large number of sober passengers, processing each sober passenger's voice information in the manner above to obtain a first paragraph voice feature vector per passenger, averaging all these first paragraph voice feature vectors, and storing the average as the sober-state voice feature vector.

For the first paragraph voice feature vector of each voice paragraph, the differential voice feature vector is obtained by taking the absolute difference between the first paragraph voice feature vector and the sober-state voice feature vector. If $C_i$ denotes the differential feature at the i-th element position of the differential voice feature vector, $D_i$ the first paragraph voice feature at the i-th position, and $Q_i$ the sober-state voice feature at the i-th position, each differential feature is determined by:

$C_i = |D_i - Q_i|$

where $i \in [1, K]$ and K is the dimensionality of the differential feature vector, the first paragraph voice feature vector, and the sober-state voice feature vector; taking i from 1 to K in turn gives the value at each element position of the differential feature vector, i.e., each differential feature.

If the first paragraph voice feature vector is the 384-dimensional vector mentioned above, the differential voice feature vector is also 384-dimensional, i.e., it has the same dimensionality as the first paragraph voice feature vector.

S402: Determine the second paragraph voice feature vector of each voice paragraph based on its first paragraph voice feature vector and differential voice feature vector.

After the differential voice feature vector of each voice paragraph is obtained, it is concatenated with the paragraph's first paragraph voice feature vector to give the second paragraph voice feature vector; for example, if both the first paragraph voice feature vector and the differential voice feature vector are 384-dimensional, the second paragraph voice feature vector is a 768-dimensional feature vector.

S403: Combine the second paragraph voice feature vectors to obtain the voice feature vector of the voice information.

After the second paragraph voice feature vectors of the voice paragraphs are obtained, they are combined. For example, if the voice information in this embodiment includes 10 voice paragraphs, 10 second paragraph voice feature vectors of 768 dimensions are obtained, and together these constitute the voice feature vector of the voice information.

The second paragraph voice feature vector expresses how the requester's paragraph-level voice features differ from those of other passengers; this variability allows the requester's status information to be determined more accurately.
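As a short sketch of steps S401 to S403, assuming the 384-dimensional first paragraph vectors and the pre-stored sober-state average are already available:

```python
import numpy as np

def second_paragraph_vectors(first_vecs, sober_vec):
    """first_vecs: (n_paragraphs, 384); sober_vec: (384,) pre-stored average.
    Implements C_i = |D_i - Q_i| and the concatenation into 768-dim vectors."""
    diff = np.abs(first_vecs - sober_vec)  # differential voice feature vectors
    return np.hstack([first_vecs, diff])   # shape: (n_paragraphs, 768)
```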
In addition, after the passenger's voice information is obtained, the speech rate feature vector can be extracted according to the following process, as shown in Fig. 5, including steps S501 to S504:

S501: Convert each voice paragraph in the voice information into a text paragraph, where each text paragraph includes a plurality of characters.

S502: Determine the speech rate of each text paragraph based on the number of characters in the paragraph and the duration of the corresponding voice paragraph.

For example, after any voice paragraph is converted into a text paragraph, the resulting text paragraph may include M characters; if the duration of the voice paragraph is N, the speech rate V of the text paragraph can be determined by V = M / N.

S503: Determine the maximum, minimum, and average speech rates of the voice information based on the speech rate of each text paragraph.

S504: Extract the speech rate feature vector of the voice information based on the maximum, minimum, and average speech rates.

Following the above method, the speech rate of each text paragraph in the voice information is calculated, the maximum, minimum, and average speech rates are determined among them, and these three values are combined into a 3-dimensional speech rate feature vector.

Because drunkenness slows a passenger's speech and makes it halting, the embodiments of this application use the speech rate feature vector as a reference quantity for determining the requester's status information, improving the accuracy of recognizing that status.
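A minimal sketch of steps S501 to S504, assuming the speech-to-text step has already produced (transcript, duration) pairs for each voice paragraph:

```python
def speech_rate_vector(paragraphs):
    """paragraphs: list of (transcript_text, duration_seconds) pairs.
    Returns the 3-dim [max, min, average] speech-rate feature vector."""
    rates = [len(text) / dur for text, dur in paragraphs if dur > 0]
    return [max(rates), min(rates), sum(rates) / len(rates)]
```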
After the voice feature vector and the speech rate feature vector are extracted as above, the current status information of the requester using the service requester terminal can be determined based on them, as shown in Fig. 6, including the following steps S601 to S602:

S601: Determine, based on the voice feature vector, a first score feature vector indicating a drunken state and a second score feature vector indicating a non-drunken state of the voice information.

Here, the first score feature vector includes the probability value of each voice paragraph in the voice information indicating the drunken state, and the second score feature vector includes the probability value of each voice paragraph indicating the non-drunken state.

Specifically, the voice feature vector includes multiple second paragraph voice feature vectors; each score feature in the first score feature vector is the probability that the corresponding second paragraph voice feature vector indicates the drunken state, and each score feature in the second score feature vector is the probability that the corresponding second paragraph voice feature vector indicates the non-drunken state.

Specifically, determining the first and second score feature vectors from the voice feature vector includes: inputting the voice feature vector into a segment-level classifier in a pre-trained voice recognition model to obtain the first score feature vector indicating the drunken state and the second score feature vector indicating the non-drunken state.

The voice recognition model and its segment-level classifier are introduced later. After the voice feature vector is input into the segment-level classifier, the first score feature vector, composed of the probability that each second paragraph voice feature vector belongs to the drunken state, and the second score feature vector, composed of the probability that each belongs to the non-drunken state, are obtained.

S602: Determine the requester's current status information based on the first score feature vector, the second score feature vector, and the speech rate feature vector.

After the first and second score feature vectors are obtained, they can be combined with the speech rate feature vector mentioned above to determine the requester's current status information. The process specifically includes:

(1) determining the score feature vector of the voice information based on the first score feature vector, the second score feature vector, and a preset voice score feature function;

(2) after merging the score feature vector and the speech rate feature vector, inputting them into the state-level classifier in the voice recognition model to determine the requester's current status information.

After the first and second score feature vectors are obtained, a score feature vector representing the voice information can be derived through the preset voice score feature function. For example, the maximum function, minimum function, mean function, and quantile function are used to compute the maximum, minimum, mean, and 9 deciles of the score features in each of the first and second score feature vectors, giving 12 function values related to the first score feature vector and 12 related to the second; concatenating these 24 function values yields the 24-dimensional score feature vector of the voice information.

For the first score feature vector, for example, the 9 deciles are the score features at the 1st through 9th decile points of all its score features, i.e., 9 score features.

The score feature vector of the voice information is then merged with the speech rate feature vector to obtain a 27-dimensional feature vector, comprising the 24-dimensional score feature vector and the 3-dimensional speech rate feature vector; the merged vector is input into the state-level classifier of the voice recognition model, which determines the requester's current status information.

Alternatively, the first score feature vector, the second score feature vector, and the speech rate feature vector can be input directly into the state-level classifier, which then determines the score feature vector of the voice information from the first and second score feature vectors and the preset voice score feature function, merges it with the speech rate feature vector, and determines the requester's current status information.

In the embodiments of this application, the voice feature vector and the speech rate feature vector jointly determine the requester's current status information: the former captures the requester's current acoustic features and the latter the current speech rate features, and together they determine more accurately whether the requester is currently drunk.
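The two-stage inference can be sketched as follows, assuming trained scikit-learn-style classifiers with predict_proba, and assuming that column 1 of the segment-level output is the drunk-state probability; the classifier family itself is not fixed by the embodiment:

```python
import numpy as np

def state_inference(seg_clf, state_clf, second_vecs, rate_vec):
    """Two-stage inference: segment-level scores -> 24 score functionals ->
    merge with the 3-dim speech-rate vector -> state-level classifier."""
    probs = seg_clf.predict_proba(second_vecs)    # (n_paragraphs, 2)
    s_drunk, s_sober = probs[:, 1], probs[:, 0]   # assumed class order

    def score_stats(s):                           # max, min, mean, 9 deciles
        return np.concatenate([[s.max(), s.min(), s.mean()],
                               np.percentile(s, np.arange(10, 100, 10))])

    score_vec = np.concatenate([score_stats(s_drunk), score_stats(s_sober)])
    x = np.concatenate([score_vec, rate_vec])     # 24 + 3 = 27 dimensions
    return state_clf.predict(x.reshape(1, -1))[0]  # drunk / not-drunk label
```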
The voice recognition model mentioned above is introduced below; it is trained as follows:

(1) Construct the segment-level classifier and the state-level classifier of the voice recognition model.

(2) Obtain a pre-constructed training sample library, which includes training voice feature vectors and training speech rate feature vectors corresponding to multiple pieces of training voice information, as well as the status information corresponding to each piece of training voice information.

The training sample library may include the training voice information of a large number of passengers whose status information is known, for example 1000 drunk passengers and 1000 non-drunk passengers, giving the training voice feature vectors and training speech rate feature vectors of 2000 pieces of training voice information.

Each training voice feature vector may consist of at least one second training paragraph voice feature vector, for example 10 of them; the determination of the second training paragraph voice feature vectors is similar to that of the second paragraph voice feature vectors above and is not repeated here.

Likewise, the training speech rate feature vector of each piece of training voice information is determined similarly to the speech rate feature vector of the voice information above and is not repeated here.

(3) Input the training voice feature vector of each piece of training voice information into the segment-level classifier in turn to obtain the first and second training score feature vectors corresponding to that training voice information; take the first training score feature vector, the second training score feature vector, and the training speech rate feature vector as input variables of the state-level classifier, take the corresponding status information as the output variable of the voice recognition model, and train to obtain the model parameter information of the voice recognition model.

The voice recognition model in the embodiments of this application may include two classifiers. The first is the segment-level classifier, whose input may be a training voice feature vector and whose output is the first training score feature vector, composed of the probability that each second training paragraph voice feature vector belongs to the drunken state, and the second training score feature vector, composed of the probability that each belongs to the non-drunken state.

The first training score feature vector, the second training score feature vector, and the speech rate feature vector can then be used as input variables of the state-level classifier. Specifically, a training score feature vector of the training voice information can first be determined from the first and second training score feature vectors and the preset voice score feature function (this is similar to determining the score feature vector of the voice information above and is not repeated here); the training score feature vector and training speech rate feature vector are then merged and used as input variables of the state-level classifier. Each input variable carries the identifying code of the passenger it belongs to, which indicates which passenger the merged training score feature vector and training speech rate feature vector come from. The status information corresponding to the training voice information is then used as the output variable of the voice recognition model, and the model parameter information of the voice recognition model is obtained by training.

In the specific training process, the segment-level and state-level classifiers of the voice recognition model can be trained separately. For example, the model parameter information of the segment-level classifier can be trained first: the training voice feature vectors of the voice information are input into the segment-level classifier to obtain the first and second training score feature vectors, which are input into a preset loss function; the model parameter information of the segment-level classifier is adjusted until the loss function converges, yielding the segment-level classifier's parameters.

The resulting first and second training score feature vectors and the training speech rate feature vectors are then used as input variables of the state-level classifier, the status information corresponding to the training voice information is used as the output variable of the voice recognition model, and the model parameter information of the state-level classifier is trained.
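A compact sketch of this two-stage training procedure follows, with logistic regression standing in for both classifiers (the embodiment leaves the classifier family open) and each paragraph inheriting its utterance's label for stage one:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_model(train_vecs, train_rates, labels):
    """train_vecs: list of (n_paragraphs_i, 768) arrays; labels: 0/1 per sample."""
    # Stage 1: segment-level classifier; each paragraph inherits its utterance label.
    seg_X = np.vstack(train_vecs)
    seg_y = np.concatenate([np.full(len(v), y) for v, y in zip(train_vecs, labels)])
    seg_clf = LogisticRegression(max_iter=1000).fit(seg_X, seg_y)

    # Stage 2: state-level classifier on score functionals plus speech rate.
    def utter_feats(vecs, rate):
        p = seg_clf.predict_proba(vecs)
        f = lambda s: np.concatenate([[s.max(), s.min(), s.mean()],
                                      np.percentile(s, np.arange(10, 100, 10))])
        return np.concatenate([f(p[:, 1]), f(p[:, 0]), rate])

    X = np.vstack([utter_feats(v, r) for v, r in zip(train_vecs, train_rates)])
    state_clf = LogisticRegression(max_iter=1000).fit(X, labels)
    return seg_clf, state_clf
```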
After the model parameter information is obtained, to verify the test performance of the voice recognition model, the model generally needs to be tested with test samples. The specific process is as follows:

(1) Obtain a pre-constructed test sample library, which includes test voice feature vectors and test speech rate feature vectors corresponding to multiple pieces of test voice information, as well as the real status information corresponding to each piece of test voice information.

The test sample library may include the test voice information of a large number of passengers with known status information; the test voice feature vectors and test speech rate feature vectors are determined similarly to the voice feature vectors and speech rate feature vectors above and are not repeated here.

(2) Input each test voice feature vector in the test sample library into the segment-level classifier of the voice recognition model in turn to obtain the first and second test score feature vectors corresponding to each test voice.

Inputting each test voice feature vector into the segment-level classifier yields the first and second test score feature vectors corresponding to each test voice in the test sample library; the test voice feature vectors carry the passengers' identifying codes.

(3) Input the first test score feature vector, the second test score feature vector, and the test speech rate feature vector corresponding to each test voice into the state-level classifier of the voice recognition model to obtain the test status information corresponding to each test voice in the test sample library.

Here, similarly, a test score feature vector of the test voice information can first be determined from the first and second test score feature vectors and the preset voice score feature function; the test score feature vector and the test speech rate feature vector are then merged and input into the state-level classifier to obtain the test status information corresponding to each test voice in the test sample library.

(4) Determine the precision and recall of the voice recognition model based on the real status information and the test status information.

Specifically, precision and recall can serve as the evaluation metrics of the voice recognition model's test performance. To explain their meanings, the embodiments of this application introduce four classification outcomes: True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN), with the following meanings:

|                   | Predicted positive   | Predicted negative   |
|-------------------|----------------------|----------------------|
| Actually positive | True Positives (TP)  | False Negatives (FN) |
| Actually negative | False Positives (FP) | True Negatives (TN)  |

Precision is the proportion of correctly classified samples among all samples actually classified into the class:

$\mathrm{Precision} = \frac{TP}{TP + FP}$

Recall is the proportion of correctly classified samples (TP) among the actual number of samples of the class:

$\mathrm{Recall} = \frac{TP}{TP + FN}$

Following these formulas, the precision and recall of the voice recognition model are obtained.

(5) If the precision and recall do not satisfy the set conditions, update the model training parameters of the voice recognition model and/or the training sample library, and retrain the voice recognition model until the precision and recall satisfy the set conditions.

The set conditions here include: (1) the precision is not less than a set precision and the recall is not less than a set recall; (2) the precision is not less than a set precision, with the recall unrestricted; (3) the precision is unrestricted and the recall is not less than a set recall; (4) the robustness related to precision and recall meets a set robustness condition.
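A small sketch of the evaluation-and-retrain check is given below, with illustrative threshold values (the set precision and set recall are left open by the embodiment):

```python
from sklearn.metrics import precision_score, recall_score

def evaluate(true_states, test_states, p_min=0.9, r_min=0.8):
    """Compare real vs. predicted states; p_min and r_min are illustrative."""
    precision = precision_score(true_states, test_states)
    recall = recall_score(true_states, test_states)
    needs_retraining = precision < p_min or recall < r_min
    return precision, recall, needs_retraining
```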
In addition, the embodiments of this application can also plot the Precision-Recall curve of the voice recognition model from the precision and recall values, as shown in Fig. 7; there, the AUC metric is 0.819 and the AP metric is 0.122, and these two metrics can indicate whether the voice recognition model meets the set robustness condition.

AUC stands for Area Under Curve, the area between the ROC curve and the x-axis (the FPR axis); AP stands for Average Precision, specifically the area enclosed by the Precision-Recall curve and the x- and y-axes.

Once the set conditions are fixed, if testing on the test samples shows that the precision and recall of the voice recognition model do not satisfy them, the model training parameters or the training samples in the training sample library (or both) can be updated and the voice recognition model retrained, until the precision and recall satisfy the set conditions; training then stops, yielding the trained voice recognition model.
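For completeness, the Precision-Recall curve and AP of Fig. 7 can be reproduced for a trained state-level classifier as follows; the quoted AUC and AP values come from the authors' data, not from this sketch:

```python
from sklearn.metrics import precision_recall_curve, average_precision_score

def pr_curve_metrics(state_clf, X_test, y_test):
    """Precision-Recall curve points and Average Precision for the trained
    state-level classifier (X_test, y_test are held-out features/labels)."""
    probs = state_clf.predict_proba(X_test)[:, 1]   # drunk-class probability
    precision, recall, thresholds = precision_recall_curve(y_test, probs)
    ap = average_precision_score(y_test, probs)
    return precision, recall, ap
```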
Based on the same inventive concept, an embodiment of this application also provides an order processing device corresponding to the order processing method. Since the principle by which the device solves the problem is similar to the order processing method above, the implementation of the device can refer to the implementation of the method, and repeated parts are not described again.

Fig. 8 is a schematic diagram of an order processing device 800 provided by this application, which includes an acquisition module 801, a determination module 802, and a prompt module 803, where:

the acquisition module 801 is configured to obtain the voice information sent by the service requester terminal after the service provider terminal receives a service request and triggers a voice acquisition request, and to transmit the voice information to the determination module 802;

the determination module 802 is configured to extract the voice feature vector and speech rate feature vector of the voice information, determine, based on them, the current status information of the requester using the service requester terminal, and transmit the status information to the prompt module 803, where the status information includes information indicating whether the requester is currently drunk;

the prompt module 803 is configured to prompt the service provider terminal, based on the status information, to confirm whether to accept the order.

In one implementation, the order processing device further includes a processing module 804, which is configured, after the acquisition module 801 obtains the voice information sent by the service requester terminal and before the determination module 802 extracts the voice feature vector and speech rate feature vector, to:

perform voice endpoint detection on the voice information to obtain at least one voice paragraph and silent paragraph;

delete the silent paragraphs from the voice information.

In one implementation, the determination module 802 is specifically configured to extract the voice feature vector of the voice information according to the following steps:

performing framing processing on each voice paragraph in the voice information to obtain the voice frames corresponding to each voice paragraph;

for each voice paragraph, extracting the voice frame features of each voice frame in the paragraph and the voice frame feature differences between each frame and its adjacent frame, and determining the first paragraph voice feature vector of the paragraph based on the voice frame features, the feature differences, and the preset voice paragraph feature function;

extracting the voice feature vector of the voice information based on the first paragraph voice feature vectors corresponding to the voice paragraphs.

In one implementation, the determination module 802 is specifically configured to extract the voice feature vector of the voice information from the first paragraph voice feature vectors according to the following steps:

for each voice paragraph, determining the differential voice feature vector of the paragraph based on its first paragraph voice feature vector and the pre-stored sober-state voice feature vector;

determining the second paragraph voice feature vector of each voice paragraph based on its first paragraph voice feature vector and differential voice feature vector;

combining the second paragraph voice feature vectors to obtain the voice feature vector of the voice information.

In one implementation, the determination module 802 is specifically configured to extract the speech rate feature vector of the voice information according to the following steps:

converting each voice paragraph in the voice information into a text paragraph, where each text paragraph includes a plurality of characters;

determining the speech rate of each text paragraph based on the number of characters in the paragraph and the duration of the corresponding voice paragraph;

determining the maximum, minimum, and average speech rates of the voice information based on the speech rate of each text paragraph;

extracting the speech rate feature vector of the voice information based on the maximum, minimum, and average speech rates.

In one implementation, the determination module 802 is specifically configured to determine the requester's current status information from the voice feature vector and the speech rate feature vector according to the following steps:

determining, based on the voice feature vector, a first score feature vector indicating a drunken state and a second score feature vector indicating a non-drunken state of the voice information, where the first score feature vector includes the probability value of each voice paragraph indicating the drunken state and the second includes the probability value of each voice paragraph indicating the non-drunken state;

determining the requester's current status information based on the first score feature vector, the second score feature vector, and the speech rate feature vector.

In one implementation, the determination module 802 is specifically configured to determine the first and second score feature vectors from the voice feature vector according to the following steps:

inputting the voice feature vector into the segment-level classifier in the pre-trained voice recognition model to obtain the first score feature vector indicating the drunken state and the second score feature vector indicating the non-drunken state;

determining the score feature vector of the voice information based on the first score feature vector, the second score feature vector, and the preset voice score feature function;

after merging the score feature vector and the speech rate feature vector, inputting them into the state-level classifier in the voice recognition model to determine the requester's current status information.

In one implementation, the device further includes a model training module 805, configured to train the voice recognition model according to the following steps:

constructing the segment-level classifier and the state-level classifier of the voice recognition model;

obtaining the pre-constructed training sample library, which includes training voice feature vectors and training speech rate feature vectors corresponding to multiple pieces of training voice information, as well as the status information corresponding to each piece of training voice information;

inputting the training voice feature vector of each piece of training voice information into the segment-level classifier in turn to obtain the corresponding first and second training score feature vectors; taking the first training score feature vector, the second training score feature vector, and the speech rate feature vector as input variables of the state-level classifier, taking the corresponding status information as the output variable of the voice recognition model, and training to obtain the model parameter information of the voice recognition model.

In one implementation, the order processing device further includes a model testing module 806, configured to test the voice recognition model according to the following steps:

obtaining the pre-constructed test sample library, which includes test voice feature vectors and test speech rate feature vectors corresponding to multiple pieces of test voice information, as well as the real status information corresponding to each piece of test voice information;

inputting each test voice feature vector in the test sample library into the segment-level classifier of the voice recognition model in turn to obtain the first and second test score feature vectors corresponding to each test voice;

inputting the first test score feature vector, the second test score feature vector, and the test speech rate feature vector corresponding to each test voice into the state-level classifier of the voice recognition model to obtain the test status information corresponding to each test voice in the test sample library;

determining the precision and recall of the voice recognition model based on the real status information and the test status information.

If the precision is less than the set precision and/or the recall is less than the set recall, at least one of the model training parameters of the voice recognition model and the training sample library can be updated, so that the model training module 805 retrains the voice recognition model until the precision of the model is not less than the set precision and the recall is not less than the set recall.
An embodiment of this application also provides an electronic device 900. Fig. 9 is a schematic structural diagram of the electronic device 900 provided by this embodiment, which includes a processor 901, a storage medium 902, and a bus 903. The storage medium 902 stores machine-readable instructions executable by the processor 901 (such as the acquisition module 801, the determination module 802, and the prompt module 803). When the electronic device 900 runs, the processor 901 and the storage medium 902 communicate through the bus 903, and when the machine-readable instructions are executed by the processor 901, the following processing is performed:

after the service provider terminal receives a service request and triggers a voice acquisition request, obtaining the voice information sent by the service requester terminal;

extracting the voice feature vector and speech rate feature vector of the voice information, and determining, based on them, the current status information of the requester using the service requester terminal, where the status information includes information indicating whether the requester is currently drunk;

based on the status information, prompting the service provider terminal to confirm whether to accept the order.
In a possible implementation, after obtaining the voice information sent by the service requester terminal and before extracting its voice feature vector and speech rate feature vector, the instructions executed by the processor 901 further include:

performing voice endpoint detection on the voice information to obtain at least one voice paragraph and silent paragraph;

deleting the silent paragraphs from the voice information.

In a possible implementation, the instructions executed by the processor 901 include:

performing framing processing on each voice paragraph in the voice information to obtain the voice frames corresponding to each voice paragraph;

for each voice paragraph, extracting the voice frame features of each voice frame in the paragraph and the voice frame feature differences between each frame and its adjacent frame, and determining the first paragraph voice feature vector of the paragraph based on the voice frame features, the feature differences, and the preset voice paragraph feature function;

extracting the voice feature vector of the voice information based on the first paragraph voice feature vectors corresponding to the voice paragraphs.

In a possible implementation, the instructions executed by the processor 901 include:

for each voice paragraph, determining the differential voice feature vector of the paragraph based on its first paragraph voice feature vector and the pre-stored sober-state voice feature vector;

determining the second paragraph voice feature vector of each voice paragraph based on its first paragraph voice feature vector and differential voice feature vector;

combining the second paragraph voice feature vectors to obtain the voice feature vector of the voice information.

In a possible implementation, the instructions executed by the processor 901 include:

converting each voice paragraph in the voice information into a text paragraph, where each text paragraph includes a plurality of characters;

determining the speech rate of each text paragraph based on the number of characters in the paragraph and the duration of the corresponding voice paragraph;

determining the maximum, minimum, and average speech rates of the voice information based on the speech rate of each text paragraph;

extracting the speech rate feature vector of the voice information based on the maximum, minimum, and average speech rates.

In a possible implementation, the instructions executed by the processor 901 include:

determining, based on the voice feature vector, a first score feature vector indicating a drunken state and a second score feature vector indicating a non-drunken state of the voice information, where the first score feature vector includes the probability value of each voice paragraph indicating the drunken state and the second includes the probability value of each voice paragraph indicating the non-drunken state;

determining the requester's current status information based on the first score feature vector, the second score feature vector, and the speech rate feature vector.

In a possible implementation, the instructions executed by the processor 901 include:

inputting the voice feature vector into the segment-level classifier in the pre-trained voice recognition model to obtain the first score feature vector indicating the drunken state and the second score feature vector indicating the non-drunken state;

determining the score feature vector of the voice information based on the first score feature vector, the second score feature vector, and the preset voice score feature function;

after merging the score feature vector and the speech rate feature vector, inputting them into the state-level classifier in the voice recognition model to determine the requester's current status information.

In a possible implementation, the instructions executed by the processor 901 further include:

constructing the segment-level classifier and the state-level classifier of the voice recognition model;

obtaining the pre-constructed training sample library, which includes training voice feature vectors and training speech rate feature vectors corresponding to multiple pieces of training voice information, as well as the status information corresponding to each piece of training voice information;

inputting the training voice feature vector of each piece of training voice information into the segment-level classifier in turn to obtain the corresponding first and second training score feature vectors; taking the first training score feature vector, the second training score feature vector, and the training speech rate feature vector as input variables of the state-level classifier, taking the corresponding status information as the output variable of the voice recognition model, and training to obtain the model parameter information of the voice recognition model.

In a possible implementation, the instructions executed by the processor 901 further include:

obtaining the pre-constructed test sample library, which includes test voice feature vectors and test speech rate feature vectors corresponding to multiple pieces of test voice information, as well as the real status information corresponding to each piece of test voice information;

inputting each test voice feature vector in the test sample library into the segment-level classifier of the voice recognition model in turn to obtain the first and second test score feature vectors corresponding to each test voice;

inputting the first test score feature vector, the second test score feature vector, and the test speech rate feature vector corresponding to each test voice into the state-level classifier of the voice recognition model to obtain the test status information corresponding to each test voice in the test sample library;

determining the precision and recall of the voice recognition model based on the real status information and the test status information;

if the precision and recall do not satisfy the set conditions, updating the model training parameters of the voice recognition model and/or the training sample library, and retraining the voice recognition model until the precision and recall satisfy the set conditions.
An embodiment of this application also provides a computer-readable storage medium on which a computer program is stored; when run by a processor, the computer program performs the steps of the above order processing method.

Specifically, the storage medium can be a general storage medium, such as a removable disk or a hard disk. When the computer program on the storage medium runs, it can execute the above order processing method, thereby solving the problem that the safety of the riding environment cannot be risk-controlled in advance; advance risk control can thus be applied to the safety of the riding environment, improving the overall safety of the riding environment.

Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the system and device described above may refer to the corresponding processes in the method embodiments and are not repeated in this application. In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of the modules is only a division of logical functions, and there may be other division methods in actual implementation. As another example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, devices, or modules, and may be electrical, mechanical, or in other forms.

The modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part contributing to the prior art, or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage media include: USB flash drives, removable hard disks, ROM, RAM, magnetic disks, optical disks, and other media that can store program code.

The above are only specific implementations of this application, but the scope of protection of this application is not limited thereto. Any changes or substitutions that a person skilled in the art can readily conceive within the technical scope disclosed by this application shall be covered by the scope of protection of this application. Therefore, the scope of protection of this application shall be subject to the scope of protection of the claims.

Claims (12)

  1. An order processing device, comprising:
    an acquisition module, configured to obtain the voice information sent by the service requester terminal after the service provider terminal receives a service request and triggers a voice acquisition request, and to transmit the voice information to a determination module;
    a determination module, configured to extract a voice feature vector and a speech rate feature vector of the voice information, determine, based on the voice feature vector and the speech rate feature vector, the current status information of the requester using the service requester terminal, and transmit the status information to a prompt module, wherein the status information includes information indicating whether the requester is currently drunk;
    a prompt module, configured to prompt the service provider terminal, based on the status information, to confirm whether to accept the order.
  2. The order processing device according to claim 1, further comprising a processing module, configured, after the acquisition module obtains the voice information sent by the service requester terminal and before the determination module extracts the voice feature vector and the speech rate feature vector, to:
    perform voice endpoint detection on the voice information to obtain at least one voice paragraph and silent paragraph;
    delete the silent paragraphs from the voice information.
  3. The order processing device according to claim 2, wherein the determination module is specifically configured to extract the voice feature vector of the voice information according to the following steps:
    performing framing processing on each voice paragraph in the voice information to obtain the voice frames corresponding to each voice paragraph;
    for each voice paragraph, extracting the voice frame features of each voice frame in the paragraph and the voice frame feature differences between each voice frame and its adjacent voice frame, and determining the first paragraph voice feature vector of the paragraph based on the voice frame features, the voice frame feature differences, and a preset voice paragraph feature function;
    extracting the voice feature vector of the voice information based on the first paragraph voice feature vectors corresponding to the voice paragraphs.
  4. The order processing device according to claim 3, wherein the determination module is specifically configured to extract the voice feature vector of the voice information from the first paragraph voice feature vectors corresponding to the voice paragraphs according to the following steps:
    for each voice paragraph, determining the differential voice feature vector of the paragraph based on its first paragraph voice feature vector and a pre-stored sober-state voice feature vector;
    determining the second paragraph voice feature vector of each voice paragraph based on its first paragraph voice feature vector and differential voice feature vector;
    combining the second paragraph voice feature vectors to obtain the voice feature vector of the voice information.
  5. The order processing device according to claim 2, wherein the determination module is specifically configured to extract the speech rate feature vector of the voice information according to the following steps:
    converting each voice paragraph in the voice information into a text paragraph, wherein each text paragraph includes a plurality of characters;
    determining the speech rate of each text paragraph based on the number of characters in the paragraph and the duration of the corresponding voice paragraph;
    determining the maximum, minimum, and average speech rates of the voice information based on the speech rate of each text paragraph;
    extracting the speech rate feature vector of the voice information based on the maximum, minimum, and average speech rates.
  6. The order processing device according to claim 2, wherein the determination module is specifically configured to determine the current status information of the requester using the service requester terminal, based on the voice feature vector and the speech rate feature vector, according to the following steps:
    determining, based on the voice feature vector, a first score feature vector indicating a drunken state and a second score feature vector indicating a non-drunken state of the voice information, wherein the first score feature vector includes the probability value of each voice paragraph in the voice information indicating the drunken state, and the second score feature vector includes the probability value of each voice paragraph indicating the non-drunken state;
    determining the requester's current status information based on the first score feature vector, the second score feature vector, and the speech rate feature vector.
  7. The order processing device according to claim 6, wherein the determination module is specifically configured to determine the first and second score feature vectors from the voice feature vector according to the following steps:
    inputting the voice feature vector into a segment-level classifier in a pre-trained voice recognition model to obtain the first score feature vector indicating the drunken state and the second score feature vector indicating the non-drunken state;
    determining the score feature vector of the voice information based on the first score feature vector, the second score feature vector, and a preset voice score feature function;
    after merging the score feature vector and the speech rate feature vector, inputting them into a state-level classifier in the voice recognition model to determine the requester's current status information.
  8. The order processing device according to claim 7, further comprising a model training module, configured to train the voice recognition model according to the following steps:
    constructing the segment-level classifier and the state-level classifier of the voice recognition model;
    obtaining a pre-constructed training sample library, wherein the training sample library includes training voice feature vectors and training speech rate feature vectors corresponding to multiple pieces of training voice information, as well as the status information corresponding to each piece of training voice information;
    inputting the training voice feature vector of each piece of training voice information into the segment-level classifier in turn to obtain the first and second training score feature vectors corresponding to that training voice information; taking the first training score feature vector, the second training score feature vector, and the training speech rate feature vector as input variables of the state-level classifier, taking the corresponding status information as the output variable of the voice recognition model, and training to obtain the model parameter information of the voice recognition model.
  9. The order processing device according to claim 8, further comprising a model testing module, configured to test the voice recognition model according to the following steps:
    obtaining a pre-constructed test sample library, wherein the test sample library includes test voice feature vectors and test speech rate feature vectors corresponding to multiple pieces of test voice information, as well as the real status information corresponding to each piece of test voice information;
    inputting each test voice feature vector in the test sample library into the segment-level classifier of the voice recognition model in turn to obtain the first and second test score feature vectors corresponding to each test voice;
    inputting the first test score feature vector, the second test score feature vector, and the test speech rate feature vector corresponding to each test voice into the state-level classifier of the voice recognition model to obtain the test status information corresponding to each test voice in the test sample library;
    determining the precision and recall of the voice recognition model based on the real status information and the test status information.
  10. An order processing method, comprising:
    after the service provider terminal receives a service request and triggers a voice acquisition request, obtaining the voice information sent by the service requester terminal;
    extracting a voice feature vector and a speech rate feature vector of the voice information, and determining, based on the voice feature vector and the speech rate feature vector, the current status information of the requester using the service requester terminal, wherein the status information includes information indicating whether the requester is currently drunk;
    based on the status information, prompting the service provider terminal to confirm whether to accept the order.
  11. An electronic device, comprising a processor, a storage medium, and a bus, wherein the storage medium stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the storage medium communicate through the bus, and the processor executes the machine-readable instructions to perform the steps of the order processing method according to claim 10.
  12. A computer-readable storage medium, on which a computer program is stored, wherein, when run by a processor, the computer program performs the steps of the order processing method according to claim 10.
PCT/CN2020/089669 2019-05-17 2020-05-11 一种订单处理方法、装置、电子设备及存储介质 WO2020233440A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910414644.8A CN111862946B (zh) 2019-05-17 2019-05-17 一种订单处理方法、装置、电子设备及存储介质
CN201910414644.8 2019-05-17

Publications (1)

Publication Number Publication Date
WO2020233440A1 true WO2020233440A1 (zh) 2020-11-26

Family

ID=72965990

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/089669 WO2020233440A1 (zh) 2019-05-17 2020-05-11 一种订单处理方法、装置、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN111862946B (zh)
WO (1) WO2020233440A1 (zh)

Cited By (1)

Publication number Priority date Publication date Assignee Title
CN112786054A (zh) * 2021-02-25 2021-05-11 深圳壹账通智能科技有限公司 基于语音的智能面试评估方法、装置、设备及存储介质

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN112951229A (zh) * 2021-02-07 2021-06-11 深圳市今视通数码科技有限公司 理疗机器人的语音唤醒方法、系统和存储介质

Citations (6)

Publication number Priority date Publication date Assignee Title
US20160275638A1 (en) * 2015-03-20 2016-09-22 David M. Korpi Vehicle service request system having enhanced safety features
CN107358484A (zh) * 2017-05-27 2017-11-17 上海与德科技有限公司 一种网约车监控方法及系统
CN108140202A (zh) * 2016-01-07 2018-06-08 谷歌有限责任公司 乘车共享平台中的信誉系统
CN108877146A (zh) * 2018-09-03 2018-11-23 深圳市尼欧科技有限公司 一种基于智能语音识别的乘驾安全自动报警装置及其方法
CN109102825A (zh) * 2018-07-27 2018-12-28 科大讯飞股份有限公司 一种饮酒状态检测方法及装置
CN109636257A (zh) * 2019-01-31 2019-04-16 长安大学 一种网约车出行前的风险评价方法

Family Cites Families (10)

Publication number Priority date Publication date Assignee Title
JP2006061632A (ja) * 2004-08-30 2006-03-09 Ishisaki:Kk 感情データ提供装置、心理解析装置、および電話ユーザ心理解析方法
KR101029786B1 (ko) * 2006-09-13 2011-04-19 니뽄 덴신 덴와 가부시키가이샤 감정 검출 방법, 감정 검출 장치, 그 방법을 실장한 감정 검출 프로그램 및 그 프로그램을 기록한 기록 매체
US8676586B2 (en) * 2008-09-16 2014-03-18 Nice Systems Ltd Method and apparatus for interaction or discourse analytics
CN105912667A (zh) * 2016-04-12 2016-08-31 玉环看知信息科技有限公司 一种信息推荐方法、装置及移动终端
CN107181864B (zh) * 2017-05-19 2020-01-31 维沃移动通信有限公司 一种信息提示方法及移动终端
CN108986801B (zh) * 2017-06-02 2020-06-05 腾讯科技(深圳)有限公司 一种人机交互方法、装置及人机交互终端
CN107680602A (zh) * 2017-08-24 2018-02-09 平安科技(深圳)有限公司 语音欺诈识别方法、装置、终端设备及存储介质
CN107481718B (zh) * 2017-09-20 2019-07-05 Oppo广东移动通信有限公司 语音识别方法、装置、存储介质及电子设备
CN108182524B (zh) * 2017-12-26 2021-07-06 北京三快在线科技有限公司 一种订单分配方法及装置、电子设备
CN109243490A (zh) * 2018-10-11 2019-01-18 平安科技(深圳)有限公司 司机情绪识别方法及终端设备

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
US20160275638A1 (en) * 2015-03-20 2016-09-22 David M. Korpi Vehicle service request system having enhanced safety features
CN108140202A (zh) * 2016-01-07 2018-06-08 谷歌有限责任公司 乘车共享平台中的信誉系统
CN107358484A (zh) * 2017-05-27 2017-11-17 上海与德科技有限公司 一种网约车监控方法及系统
CN109102825A (zh) * 2018-07-27 2018-12-28 科大讯飞股份有限公司 一种饮酒状态检测方法及装置
CN108877146A (zh) * 2018-09-03 2018-11-23 深圳市尼欧科技有限公司 一种基于智能语音识别的乘驾安全自动报警装置及其方法
CN109636257A (zh) * 2019-01-31 2019-04-16 长安大学 一种网约车出行前的风险评价方法

Cited By (1)

Publication number Priority date Publication date Assignee Title
CN112786054A (zh) * 2021-02-25 2021-05-11 深圳壹账通智能科技有限公司 基于语音的智能面试评估方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN111862946B (zh) 2024-04-19
CN111862946A (zh) 2020-10-30

Similar Documents

Publication Publication Date Title
US11373641B2 (en) Intelligent interactive method and apparatus, computer device and computer readable storage medium
CN106683680B (zh) 说话人识别方法及装置、计算机设备及计算机可读介质
EP3628098B1 (en) System and method for key phrase spotting
US20200381130A1 (en) Systems and Methods for Machine Learning of Voice Attributes
CN110085211B (zh) 语音识别交互方法、装置、计算机设备和存储介质
US20190120649A1 (en) Dialogue system, vehicle including the dialogue system, and accident information processing method
WO2020233440A1 (zh) 一种订单处理方法、装置、电子设备及存储介质
Yousaf et al. A novel technique for speech recognition and visualization based mobile application to support two-way communication between deaf-mute and normal peoples
EP3667660A1 (en) Information processing device and information processing method
Bořil et al. Towards multimodal driver’s stress detection
KR101834008B1 (ko) 음성 데이터 기반 신용평가 장치, 방법 및 컴퓨터 프로그램
US20180357269A1 (en) Address Book Management Apparatus Using Speech Recognition, Vehicle, System and Method Thereof
US20220108680A1 (en) Text-to-speech using duration prediction
US20220406304A1 (en) Intent driven voice interface
CN112382266A (zh) 一种语音合成方法、装置、电子设备及存储介质
CN114141251A (zh) 声音识别方法、声音识别装置及电子设备
CN111681670B (zh) 信息识别方法、装置、电子设备及存储介质
JP2008233782A (ja) パタンマッチング装置、パタンマッチングプログラム、およびパタンマッチング方法
Lashkov et al. Dangerous state detection in vehicle cabin based on audiovisual analysis with smartphone sensors
CN111243607A (zh) 用于生成说话人信息的方法、装置、电子设备和介质
CN111199750A (zh) 一种发音评测方法、装置、电子设备及存储介质
WO2021139737A1 (zh) 一种人机交互的方法和系统
He et al. Automatic initial and final segmentation in cleft palate speech of Mandarin speakers
KR20200095636A (ko) 대화 시스템이 구비된 차량 및 그 제어 방법
Wijnker et al. Hear-and-avoid for unmanned air vehicles using convolutional neural networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20810533

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20810533

Country of ref document: EP

Kind code of ref document: A1