CN112820295B - Voice processing device and system, cloud server and vehicle - Google Patents


Info

Publication number
CN112820295B
CN112820295B
Authority
CN
China
Prior art keywords
cloud
result
module
target
asr
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011600283.5A
Other languages
Chinese (zh)
Other versions
CN112820295A (en
Inventor
丁磊
王超
蒋瑞
李梦龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Human Horizons Shanghai Internet Technology Co Ltd
Original Assignee
Human Horizons Shanghai Internet Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Human Horizons Shanghai Internet Technology Co Ltd filed Critical Human Horizons Shanghai Internet Technology Co Ltd
Priority to CN202011600283.5A priority Critical patent/CN112820295B/en
Publication of CN112820295A publication Critical patent/CN112820295A/en
Application granted granted Critical
Publication of CN112820295B publication Critical patent/CN112820295B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/34 Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing

Abstract

The application provides a voice processing device and system, a cloud server, and a vehicle. The cloud-side voice processing device includes: a voice gateway, configured to receive target audio from the vehicle end; a plurality of voice processing capability modules, configured to perform voice processing on the target audio to generate candidate cloud voice processing results; and a cloud arbitration module, configured to determine, according to a preset cloud arbitration policy, a target cloud voice processing result for the target audio from the plurality of candidate cloud voice processing results. The voice gateway is further configured to return the target cloud voice processing result to the vehicle end. The technical solution of the embodiments of the application can provide accurate and optimized human-machine interaction services for the user.

Description

Voice processing device and system, cloud server and vehicle
Technical Field
The application relates to Internet of Vehicles technology, and in particular to a voice processing device and system, a cloud server, and a vehicle.
Background
The vehicle end refers, for short, to the in-vehicle infotainment product installed in a vehicle, which functionally enables information exchange between occupants and the vehicle and between the vehicle and the outside world (for example, between vehicles). Voice assistants on traditional vehicle ends suffer from unstable networks, poor local recognition, limited vehicle hardware performance, and similar problems, which degrade the user's human-machine dialogue experience.
Disclosure of Invention
The embodiments of the application provide a voice processing device and system, a cloud server, and a vehicle to solve the problems existing in the related art. The technical solution includes the following:
In a first aspect, an embodiment of the present application provides a voice processing device applied to the cloud, the voice processing device including:
a voice gateway, configured to receive target audio from the vehicle end;
a plurality of voice processing capability modules, configured to perform voice processing on the target audio to generate candidate cloud voice processing results;
a cloud arbitration module, configured to determine, according to a preset cloud arbitration policy, a target cloud voice processing result for the target audio from the plurality of candidate cloud voice processing results;
the voice gateway being further configured to return the target cloud voice processing result to the vehicle end.
In one embodiment, the plurality of voice processing capability modules include at least one cloud ASR module, and the target cloud voice processing result includes a target cloud ASR result for the target audio;
the voice processing device further includes a cloud data routing module, configured to route the target audio to the cloud ASR module(s) so that ASR processing is performed on the target audio and candidate cloud ASR results are generated;
when there is one cloud ASR module, the cloud arbitration module is configured to determine the single candidate cloud ASR result as the target cloud ASR result;
the cloud arbitration policy includes a cloud ASR arbitration policy, and when there are multiple cloud ASR modules, the cloud arbitration module is configured to determine the target cloud ASR result from the multiple candidate cloud ASR results according to the cloud ASR arbitration policy.
In one embodiment, the cloud ASR arbitration policy includes arbitrating based on the timeliness and confidence of each candidate cloud ASR result, with timeliness having higher priority than confidence.
In one embodiment, the plurality of voice processing capability modules include at least one cloud NLU module, and the cloud data routing module is further configured to route the target cloud ASR result to the cloud NLU module for NLU processing, generating a candidate cloud NLU result.
In one embodiment, the voice processing device further includes a normalization call module, a scene service module, and a cloud normalization engine module. The cloud data routing module is further configured to route the candidate cloud NLU result to the normalization call module, so that the normalization call module calls the cloud normalization engine module through the scene service module; the cloud normalization engine module is configured to normalize the candidate cloud NLU result to generate a target cloud NLU result.
In one embodiment, there are multiple scene service modules and multiple cloud normalization engine modules in correspondence. The voice processing device further includes a distributed module, and the normalization call module is further configured to send the candidate cloud NLU result to the distributed module, so that the distributed module matches the corresponding scene service module and cloud normalization engine module according to the vehicle-end identifier associated with the candidate cloud NLU result.
In one embodiment, the plurality of voice processing capability modules include a plurality of personal voice assistants, the target cloud voice processing result includes a cloud dialogue result for the target cloud NLU result, and the cloud arbitration policy includes a cloud dialogue arbitration policy;
the cloud data routing module is further configured to route the target cloud NLU result to the cloud arbitration module, so that the cloud arbitration module determines, according to the cloud dialogue arbitration policy, a target personal voice assistant matching the target cloud NLU result from the plurality of personal voice assistants; the target personal voice assistant is configured to generate the cloud dialogue result from the target cloud NLU result.
In one embodiment, the cloud dialogue arbitration policy includes arbitrating based on the domain, dialogue intent, and keywords corresponding to the target cloud NLU result.
In one embodiment, the target personal voice assistant includes the car service provider's personal voice assistant, the voice processing device further includes a cloud scene engine module, and the car service provider's personal voice assistant calls the cloud scene engine module through the scene service module to generate the cloud dialogue result for the target cloud NLU result.
In one embodiment, there are multiple scene service modules and multiple cloud scene engine modules in correspondence. The voice processing device further includes a distributed module, and the car service provider's personal voice assistant is further configured to send the target cloud NLU result to the distributed module, so that the distributed module matches the corresponding scene service module and cloud scene engine module according to the vehicle-end identifier associated with the target cloud NLU result.
In one embodiment, the cloud scene engine module is further configured to obtain the previous round's dialogue result and to generate the current cloud dialogue result from the previous round's dialogue result together with the target cloud NLU result.
In a second aspect, an embodiment of the present application provides a cloud server, including any one of the foregoing voice processing apparatuses.
In a third aspect, an embodiment of the present application provides a voice processing device applied to the vehicle end, the voice processing device including:
a vehicle-end voice processing module, configured to perform voice processing on the target audio to generate a vehicle-end voice processing result;
a vehicle-end arbitration module, configured to determine, according to a preset vehicle-end arbitration policy, a target voice processing result from the vehicle-end voice processing result and the target cloud voice processing result, where the target cloud voice processing result is generated from the target audio by the cloud-side voice processing device.
In a fourth aspect, an embodiment of the present application provides a vehicle, including a vehicle-end speech processing device.
In a fifth aspect, an embodiment of the present application provides a voice processing system including the above cloud server and vehicle.
The advantages or beneficial effects of the technical solution of the embodiments of the application include at least the following: audio from the vehicle end can be voice-processed through the voice gateway provided by the cloud, thereby providing accurate human-machine interaction services for the user. Furthermore, the cloud voice gateway can connect to multiple voice processing capability modules for voice processing, and the arbitration service of the cloud arbitration module selects the optimal voice processing result, providing better human-machine interaction services for the vehicle end.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present application will be readily apparent by reference to the drawings and following detailed description.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
FIG. 1 is a schematic diagram of a voice processing system according to an implementation of an embodiment of the present application;
FIG. 2 is a schematic diagram of a cloud-side voice processing device according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the operation of a normalization engine module according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a cloud-side voice processing device according to another embodiment of the present application;
FIG. 5 is a schematic diagram of a cloud-side voice processing device according to yet another embodiment of the present application;
FIG. 6 is a schematic diagram of a consistent hashing algorithm according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a vehicle-end voice processing device according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a local agent of a vehicle-end voice processing device according to an embodiment of the present application.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art would realize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
As shown in FIG. 1, an embodiment of the present application provides a voice processing system that includes a terminal and a cloud, and that can implement voice processing through communication between the terminal and the cloud, thereby providing a basis for the terminal's human-machine dialogue. For example, the terminal may be a vehicle-mounted terminal, such as a voice assistant application (APP) on the vehicle-mounted terminal; it may also be a voice assistant APP on a smart device such as a mobile phone, tablet computer, or personal computer; or it may be a third-party development platform. Specifically, voice processing is performed jointly by a voice processing device in the cloud and a voice processing device at the terminal, providing a basis for the terminal's human-machine dialogue. The following takes the vehicle-end terminal as an example.
In one embodiment, as shown in FIG. 2, the cloud-side voice processing device includes a voice gateway, a plurality of voice processing capability modules, and a cloud arbitration module. A voice processing capability module may be a third-party voice processing capability module or the car service provider's voice processing capability module. The car service provider can provide voice services, navigation services, system upgrade services, and the like for the vehicle end.
The voice gateway receives the target audio from the vehicle end and forwards it to the voice processing capability modules. Each voice processing capability module performs voice processing on the target audio, generates a candidate cloud voice processing result, and sends it to the cloud arbitration module. The cloud arbitration module determines, according to a preset cloud arbitration policy, the target cloud voice processing result for the target audio from the multiple candidate cloud voice processing results, and sends it to the voice gateway. The voice gateway then returns the target cloud voice processing result to the vehicle end.
According to the technical solution of the embodiments of the application, audio from the vehicle end can be voice-processed through the voice gateway provided by the cloud, providing accurate human-machine interaction services for the user. Furthermore, the cloud voice gateway can connect to multiple voice processing capability modules for voice processing. A voice processing capability module may be provided by a leading third-party voice Service Provider (SP), or developed in-house by the car service provider. The optimal voice processing result is selected through the arbitration service of the cloud arbitration module, providing better human-machine interaction services for the vehicle end.
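The fan-out-then-arbitrate pattern described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the capability-module callables, the result fields (text, confidence), and the confidence-only arbitration rule are all assumptions made for the sketch.

```python
# Sketch: the gateway fans one utterance out to several capability modules,
# then a single arbitration step keeps one winner. All names are illustrative.
from concurrent.futures import ThreadPoolExecutor

def process(audio, capability_modules, arbitrate):
    """Fan the target audio out to every capability module, then arbitrate."""
    with ThreadPoolExecutor(max_workers=len(capability_modules)) as ex:
        candidates = list(ex.map(lambda m: m(audio), capability_modules))
    return arbitrate(candidates)

# Toy capability modules standing in for third-party / in-house SPs.
sp_a = lambda audio: {"text": "turn on the AC", "confidence": 0.7}
sp_b = lambda audio: {"text": "turn on the AC", "confidence": 0.9}

best = process(b"<pcm frames>", [sp_a, sp_b],
               arbitrate=lambda cs: max(cs, key=lambda c: c["confidence"]))
print(best["confidence"])  # 0.9
```

In a real deployment the arbitration callable would implement the cloud arbitration policy described below rather than a bare confidence comparison.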
In one embodiment, the plurality of voice processing capability modules include at least one cloud ASR (Automatic Speech Recognition) module, which has ASR processing capability and generates a candidate cloud ASR result for the target audio. Further, as shown in FIG. 2, the cloud-side voice processing device may also include a cloud data routing module, configured to route the target audio to the cloud ASR module(s).
For example, among the multiple third-party voice processing capability modules, some may include one or more cloud ASR modules and some may include none; this can be configured according to the actual situation, and the embodiments of the present application are not limited in this respect.
Further, when there is one cloud ASR module, the cloud arbitration module determines the single candidate cloud ASR result as the target cloud ASR result. The cloud arbitration policy includes a cloud ASR arbitration policy, and when there are multiple cloud ASR modules, the cloud arbitration module determines the target cloud ASR result from the multiple candidate cloud ASR results according to the cloud ASR arbitration policy.
The cloud data routing module routes the target cloud ASR result to the voice gateway, and the voice gateway sends it to the vehicle end, so that the vehicle end obtains the cloud's ASR processing result for the target audio; that is, the target cloud voice processing result includes the target cloud ASR result.
Illustratively, the cloud-side voice processing device includes a third-party provider client (SP client), and the cloud data routing module calls the cloud ASR module in the third-party voice processing capability module of the corresponding SP through the SP client. The cloud-side voice processing device may further include a cloud NLP (Natural Language Processing) module, through which the cloud data routing module calls the cloud ASR module in the car service provider's voice processing capability module.
In one embodiment, the cloud ASR arbitration policy includes arbitrating based on the timeliness and confidence of each candidate cloud ASR result, with timeliness having higher priority than confidence.
Illustratively, suppose the two candidate cloud ASR results are ASR1 and ASR2. Arbitration is first performed on timeliness: if ASR1 arrives but ASR2 has not arrived within a first preset time, the timeliness of ASR1 is higher than that of ASR2, and ASR1 is taken as the target cloud ASR result; otherwise ASR2 is taken as the target cloud ASR result. If both ASR1 and ASR2 arrive within the first preset time, the decision is made on confidence: if the confidence of ASR1 is higher than that of ASR2, ASR1 is taken as the target cloud ASR result; otherwise ASR2 is. The confidence may be computed from preset evaluation parameters combined with weights, or obtained by recognition with a trained model.
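The timeliness-then-confidence rule just described can be sketched as follows, assuming each cloud ASR module delivers its result as a future carrying hypothetical text and confidence fields; the field names and the timeout value are illustrative assumptions, not from the patent.

```python
# Sketch of "timeliness over confidence" ASR arbitration; all names assumed.
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def arbitrate_asr(futures, first_preset_time=1.0):
    """Pick the target cloud ASR result from in-flight candidates.

    Timeliness has priority: candidates that miss the first preset time are
    dropped if at least one candidate arrived in time. Among the timely
    candidates, the highest confidence wins.
    """
    done, _ = wait(futures, timeout=first_preset_time)
    if not done:
        # No candidate was timely: fall back to the first one to finish.
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
    return max((f.result() for f in done), key=lambda r: r["confidence"])

if __name__ == "__main__":
    ex = ThreadPoolExecutor(max_workers=2)
    asr1 = ex.submit(lambda: {"text": "navigate home", "confidence": 0.82})
    asr2 = ex.submit(lambda: (time.sleep(2), {"text": "navigate Rome", "confidence": 0.99})[1])
    # ASR2 misses the 0.5 s window, so the lower-confidence ASR1 wins.
    print(arbitrate_asr([asr1, asr2], first_preset_time=0.5)["text"])
```

Note how a later, higher-confidence result loses once it misses the preset window, which matches the stated priority of timeliness over confidence.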
Based on the ASR processing, a speech recognition result for the target audio, i.e., text derived from the audio, can be obtained. However, this result cannot yet be understood by the machine. To complete a human-machine dialogue, the ASR result must be converted into semantics that the machine can understand, that is, NLU (Natural Language Understanding) processing must be performed.
In one embodiment, the plurality of voice processing capability modules include at least one cloud NLU module with NLU processing capability. For example, among the multiple third-party voice processing capability modules, some may include one or more cloud NLU modules and some may include none; this can be configured according to the actual situation, and the embodiments of the present application are not limited in this respect.
Further, after the target cloud ASR result is returned to the voice gateway, it may be sent to the cloud data routing module. The cloud data routing module routes the target cloud ASR result to the cloud NLU module for NLU processing, generating a candidate cloud NLU result.
It should be noted that the target cloud ASR result may be generated by the cloud, or may be sent by the vehicle end through the voice gateway, for example, the text shown in FIG. 4.
Illustratively, the cloud data routing module calls the cloud NLU module in the third-party voice processing capability module of the corresponding SP through the third-party provider (SP) client, and calls the cloud NLU module in the car service provider's voice processing capability module through the cloud NLP module.
In one embodiment, the candidate cloud NLU result may be normalized to obtain the target cloud NLU result. Illustratively, as shown in FIG. 2, the voice processing device further includes a normalization call module, a scene service module, and a cloud normalization engine module. The cloud data routing module is further configured to route the candidate cloud NLU result to the normalization call module, so that the normalization call module calls the cloud normalization engine module through the scene service module; the cloud normalization engine module normalizes the candidate cloud NLU result to generate the target cloud NLU result.
In the normalization processing, the original semantic result (for example, the candidate cloud NLU result) is mapped according to preset rules and the provider information (SP role) of the NLU result, and the mapped result is normalized to obtain the normalized semantic result (for example, the target cloud NLU result).
Illustratively, as shown in FIG. 3, the mapping framework of the cloud normalization engine module fetches a mapping rule script from the file system of a cloud database (DB) through a script interpreter, and loads it through a script loader. The original parameters include the ASR result (e.g., the target cloud ASR result) and the original semantic result (e.g., the candidate cloud NLU result), together with the corresponding original domain, original intent, and original word slots. The original parameters also carry the SP role, i.e., the identity of the provider whose voice processing capability module produced the ASR result and the original semantic result. A word slot can also be understood as a keyword.
The original parameters, the SP role, and the mapping rule script are fed into a normalization mapping factory, and normalization mapping is performed by calling a task iteration executor to obtain a new domain (e.g., weather search), a new intent (e.g., open the vehicle door), and new word slots (e.g., tomorrow). The normalized semantic result (e.g., the target cloud NLU result) is then assembled from the ASR result (e.g., the target cloud ASR result) and the new domain, intent, and word slots.
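The per-SP mapping step can be sketched as follows. The patent loads the actual mapping rules as scripts from a cloud database; here a plain dictionary of per-SP rules stands in for those scripts, and every field name (asr, domain, intent, slots) and SP role is an illustrative assumption.

```python
# Sketch of per-SP-role normalization mapping onto a unified semantic schema.
# The rule tables below stand in for the patent's mapping rule scripts.
RULES = {
    "sp_a": {"domain": {"tianqi": "weather"}, "intent": {"chaxun": "query"}},
    "sp_b": {"domain": {"wx": "weather"}, "intent": {"lookup": "query"}},
}

def normalize(raw, sp_role):
    """Map an SP-specific semantic result onto the unified schema."""
    rule = RULES.get(sp_role, {})
    return {
        "asr": raw["asr"],
        "domain": rule.get("domain", {}).get(raw["domain"], raw["domain"]),
        "intent": rule.get("intent", {}).get(raw["intent"], raw["intent"]),
        "slots": raw["slots"],  # word slots (keywords) pass through unchanged here
    }

raw = {"asr": "what's the weather tomorrow", "domain": "tianqi",
       "intent": "chaxun", "slots": {"date": "tomorrow"}}
print(normalize(raw, "sp_a"))  # domain/intent rewritten to the unified names
```

The point of the design is that every downstream consumer (arbitration, PAs, scene engines) sees one schema regardless of which SP produced the original semantics.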
In one embodiment, the plurality of voice processing capability modules include a plurality of personal voice assistants (Personal Assistant, PA).
For example, as shown in FIG. 2, the car service provider's voice processing capability module (i.e., the car service provider's personal voice assistant) can call a cloud scene engine (Leg B) through the scene service module to generate a dialogue result. A personal voice assistant may also reside in a third-party voice processing capability module: some of the multiple third-party modules may include one or more personal voice assistants and some may include none; this can be configured according to the actual situation, and the embodiments of the present application are not limited in this respect.
Further, the cloud arbitration policy includes a cloud dialogue arbitration policy. The cloud data routing module is further configured to route the target cloud NLU result to the cloud arbitration module, so that the cloud arbitration module determines, according to the cloud dialogue arbitration policy, a target personal voice assistant matching the target cloud NLU result from the plurality of personal voice assistants; the target personal voice assistant generates the cloud dialogue result from the target cloud NLU result.
In one embodiment, the cloud dialogue arbitration policy includes arbitrating based on the domain, dialogue intent, and keywords corresponding to the target cloud NLU result.
Illustratively, the corresponding domain, dialogue intent, and keywords can be extracted from the target cloud NLU result, and the optimal personal voice assistant is determined from the plurality of personal voice assistants as the target personal voice assistant according to the priority order of domain, dialogue intent, and keyword.
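One plausible reading of this "domain over dialogue intent over keyword" priority order is lexicographic scoring, sketched below; the PA registry and its fields are illustrative assumptions rather than the patent's actual data model.

```python
# Sketch: each PA declares what it handles; matching is scored lexicographically
# with domain outranking intent, which outranks keywords. All names assumed.
PAS = [
    {"name": "weather_pa", "domains": {"weather"}, "intents": {"query"},
     "keywords": {"rain"}},
    {"name": "music_pa", "domains": {"music"}, "intents": {"play"},
     "keywords": set()},
]

def pick_pa(nlu):
    def score(pa):
        # Tuple comparison gives the priority order: domain > intent > keyword.
        return (nlu["domain"] in pa["domains"],
                nlu["intent"] in pa["intents"],
                bool(set(nlu["keywords"]) & pa["keywords"]))
    return max(PAS, key=score)["name"]

print(pick_pa({"domain": "weather", "intent": "query",
               "keywords": ["tomorrow"]}))  # weather_pa
```

A domain match alone thus beats any combination of intent and keyword matches in another domain, which is what a strict priority order implies.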
In one embodiment, the target personal voice assistant may be the car service provider's PA. The voice processing device further includes a cloud scene engine module (Leg B), and the car service provider's personal voice assistant calls the cloud scene engine module through the scene service module to generate the cloud dialogue result for the target cloud NLU result.
The cloud data routing module routes the cloud dialogue result to the voice gateway, and the voice gateway sends it to the vehicle end; that is, the target cloud voice processing result includes the cloud dialogue result. Illustratively, the cloud data routing module calls a personal voice assistant in the third-party voice processing capability module of the corresponding SP through the SP client.
An application example of the cloud speech processing apparatus according to the embodiment of the present application is described below with reference to fig. 4.
Through the voice gateway, messages from the vehicle end can be voice-processed by the cloud-side voice processing device to generate the target cloud voice processing result, which is returned to the vehicle end. A vehicle-end message is input to the cloud data routing module via message stacking and message decoding; the cloud data routing module routes the message to the corresponding module according to its type, and the output then reaches the SP client or the voice gateway via message encoding and message unstacking.
The message type may be target audio, a candidate cloud ASR result, a target cloud ASR result, a candidate cloud NLU result, a target cloud NLU result, or a cloud dialogue result.
A vehicle-end message includes the target audio. The target audio is pushed onto the message stack through the vehicle-end input interface, the message is decoded, and the decoded target audio is sent to the cloud data routing module. Since the target audio needs to be ASR-processed by the cloud, the cloud data routing module routes the decoded target audio to message encoding; after the message is popped, the encoded target audio is passed to the SP client, which calls the cloud ASR module in the third-party voice processing module of the corresponding SP to perform ASR processing and generate a third-party ASR result (which belongs to the candidate cloud ASR results).
The third-party ASR result (a candidate cloud ASR result), as a message from the third-party voice processing capability module, is pushed onto the message stack through the cloud input interface, decoded, and sent to the cloud data routing module. A vehicle-end message may also include text (which likewise belongs to the candidate cloud ASR results); as a vehicle-end message, the text is pushed onto the message stack through the vehicle-end input interface, decoded, and sent to the cloud data routing module.
When there are multiple candidate cloud ASR results, the cloud data routing module routes them to the cloud arbitration module, which determines the target cloud ASR result from the multiple candidates according to the cloud ASR arbitration policy.
When the target cloud ASR result needs to be returned to the vehicle end, the cloud data routing module routes it to message encoding; after the message is popped, it is sent through the vehicle-end output interface to the voice gateway, which forwards it to the vehicle end.
When the target cloud ASR result needs NLU processing, the cloud data routing module routes it to message encoding; after the message is popped, the SP client calls the cloud NLU module in the third-party voice processing module of the corresponding SP to perform NLU processing and generate a third-party NLU result (i.e., a candidate cloud NLU result).
The candidate cloud NLU result, as a message from the third-party voice processing capability module, is pushed onto the message stack through the cloud input interface, decoded, and sent to the cloud data routing module. Because the candidate cloud NLU result needs normalization, the cloud data routing module routes the decoded result to the normalization call module. The normalization call module calls the cloud normalization engine module through the scene service module, normalizing the candidate cloud NLU result into the target cloud NLU result.
The cloud data routing module routes the target cloud NLU result to the cloud arbitration module, which determines the target PA from the PAs (including the car service provider's PA and one or more third-party PAs) according to the cloud dialogue arbitration policy.
When the target PA is the car service provider's PA, the cloud data routing module routes the target cloud NLU result to it, and the cloud scene engine module is called through the scene service module to generate the cloud dialogue result for the target cloud NLU result.
When the target PA is a third-party PA, the cloud data routing module routes the target cloud NLU result to the SP client corresponding to that third-party PA, and the cloud dialogue result (i.e., the third-party dialogue result) of the target cloud NLU result is generated through the third-party PA. The third-party dialogue result then undergoes message stacking and message decoding through the cloud input interface, and the decoded third-party dialogue result is sent to the cloud data routing module.
The cloud data routing module routes the cloud dialogue result to the message encoding module for encoding, and the encoded, unstacked message is sent to the voice gateway through the vehicle-end output interface, which then forwards it to the vehicle end.
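The recurring pattern in this flow — the cloud data routing module dispatching each result to whichever module handles it next — can be sketched as a simple dispatch table. The message types and handlers below are illustrative assumptions, not names from the source:

```python
class CloudDataRouter:
    """Minimal sketch of the cloud data-routing step: each result type is
    dispatched to the handler registered for it (encode-and-send, NLU call,
    normalization, arbitration, and so on)."""

    def __init__(self):
        self._routes = {}  # result type -> handler callable

    def register(self, result_type, handler):
        self._routes[result_type] = handler

    def route(self, result_type, payload):
        # Dispatch the payload to the handler for this result type.
        return self._routes[result_type](payload)

router = CloudDataRouter()
# Hypothetical handler: encode a dialogue result for the vehicle-end output interface.
router.register("cloud_dialogue_result", lambda p: f"encoded:{p}")
sent = router.route("cloud_dialogue_result", "turn on the AC")
```

Routing tables like this keep the capability modules decoupled: adding a new third-party module only means registering another handler.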
In one embodiment, for voice processing requests of different vehicle ends, the cloud end can be matched with different scene service modules for voice processing, so that voice service can be provided more accurately.
As shown in fig. 2 and fig. 5, the cloud speech processing apparatus further includes a Distributed module. The voice processing devices 1, 2, …, N of different vehicle ends correspond to different vehicle-end identifiers, and messages from different vehicle ends are processed through different voice gateways. The distributed module matches the corresponding scene service module from a plurality of scene service modules 1, 2, …, N according to the vehicle-end identifier. Each scene service module has a corresponding cloud scene engine module (LegB1, LegB2, …, LegBN) and cloud normalization module (1, 2, …, N).
Exemplarily, the normalization calling module is further configured to send the to-be-selected cloud NLU result to the distributed module, so that the distributed module matches the corresponding scene service module and the cloud normalization engine module according to the vehicle end identifier corresponding to the to-be-selected cloud NLU result.
Illustratively, the car service provider PA is further configured to send the target cloud NLU result to the distributed module, so that the distributed module matches the corresponding scene service module and the cloud scene engine module according to the car-end identifier corresponding to the target cloud NLU result.
In one example, as shown in fig. 6, a consistent Hash algorithm may be used for matching between the scene service modules and the cloud normalization engine modules (or the cloud scene engine modules). For example, the physical nodes n1, n2, n3 and n4 of the scene service modules may each be allocated a plurality of virtual nodes. A consistent hash is then computed over the virtual nodes and the vehicle-end identifier to obtain the virtual node closest to the vehicle-end identifier, and the corresponding scene service module is obtained through the physical node to which that virtual node belongs.
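A minimal Python sketch of this matching scheme. The choice of MD5 for ring positions and 100 virtual nodes per physical node are illustrative assumptions, not values from the source:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable position on a 32-bit hash ring.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    """Maps a vehicle-end identifier to a scene-service physical node
    via virtual nodes, as in the scheme described above."""

    def __init__(self, physical_nodes, replicas=100):
        self._ring = []     # sorted virtual-node positions
        self._node_at = {}  # position -> physical node
        for node in physical_nodes:
            for i in range(replicas):
                pos = _hash(f"{node}#vn{i}")  # one virtual node per replica
                self._node_at[pos] = node
                bisect.insort(self._ring, pos)

    def get_node(self, vehicle_id: str) -> str:
        # Walk clockwise to the nearest virtual node, wrapping around the ring.
        pos = _hash(vehicle_id)
        idx = bisect.bisect(self._ring, pos) % len(self._ring)
        return self._node_at[self._ring[idx]]

ring = ConsistentHashRing(["n1", "n2", "n3", "n4"])
node = ring.get_node("vehicle-0001")  # the same id always maps to the same node
```

The virtual nodes smooth the distribution of vehicle ends across physical nodes, and adding or removing one physical node only remaps the keys adjacent to its virtual nodes.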
In an embodiment, the scene engine module is further configured to obtain a previous round of dialog result, and generate a current cloud-end dialog result according to the previous round of dialog result and the target cloud-end NLU result.
Exemplarily, a vehicle-end scene engine module in the vehicle-end voice processing device and a scene service module in the cloud voice processing device can perform semantic synchronization, thereby synchronizing the semantic context, so that the cloud scene engine module can obtain the previous round's dialogue result based on the semantic context.
The embodiment of the application further provides a cloud server, which comprises the cloud voice processing device in any one of the above embodiments.
An embodiment of the present application further provides a speech processing device at a vehicle end, including: the vehicle-end voice processing module is used for carrying out voice processing on the target audio to generate a vehicle-end voice processing result; and the vehicle end arbitration module is used for determining a target voice processing result from the vehicle end voice processing result and the target cloud end voice processing result according to a preset vehicle end arbitration strategy.
In one embodiment, as shown in fig. 7, the speech processing device at the vehicle end includes a vehicle-end ASR module and a vehicle-end arbitration module, the vehicle-end arbitration policy includes a vehicle-end ASR arbitration policy, and the target speech processing result includes a target ASR result. The vehicle-end ASR module is used for carrying out ASR processing on the acquired target audio to obtain a vehicle-end ASR result, the vehicle-end receives a target cloud-end ASR result sent by the cloud end through the voice gateway, and the vehicle-end arbitration module determines the target ASR result from the vehicle-end ASR result and the target cloud-end ASR result according to a vehicle-end ASR arbitration strategy.
The vehicle-end ASR arbitration policy may include arbitration based on network connectivity, timeliness and confidence of the results. Illustratively, when the network connectivity between the cloud end and the vehicle end is normal, the timeliness of the vehicle-end ASR result and the target cloud-end ASR result is judged; for example, if the target cloud-end ASR result has not arrived within a second preset time length, the vehicle-end ASR result is determined as the target ASR result. If both results arrive within the second preset time length, their confidence levels are compared, and the result with the higher confidence is taken as the target ASR result.
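The arbitration order just described — connectivity first, then timeliness, then confidence — can be sketched as follows. The result dictionaries, the `confidence` field, and the 2-second value standing in for the "second preset time length" are assumptions for illustration:

```python
def arbitrate_asr(local_result, cloud_result, network_ok, cloud_elapsed_s,
                  timeout_s=2.0):
    """Vehicle-end ASR arbitration sketch: fall back to the vehicle-end
    result unless the cloud result arrived in time over a working network;
    when both are in time, the higher-confidence result wins."""
    if not network_ok or cloud_result is None or cloud_elapsed_s > timeout_s:
        # Cloud unreachable or late: use the vehicle-end result.
        return local_result
    if cloud_result["confidence"] >= local_result["confidence"]:
        return cloud_result
    return local_result

local = {"text": "open the window", "confidence": 0.72}
cloud = {"text": "open the windows", "confidence": 0.91}
target = arbitrate_asr(local, cloud, network_ok=True, cloud_elapsed_s=0.8)
```

Here `target` is the cloud result, since it arrived within the window and carries the higher confidence; a late or disconnected cloud would yield the vehicle-end result instead.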
In an embodiment, as shown in fig. 7, the speech processing device at the vehicle end includes a vehicle-end NLU module and a vehicle-end normalization engine module. The vehicle-end NLU module is configured to perform NLU processing on the vehicle-end ASR result to generate a to-be-selected vehicle-end NLU result, and the vehicle-end normalization engine module is configured to normalize the to-be-selected vehicle-end NLU result to obtain the target vehicle-end NLU result.
The vehicle-end normalization engine module can adopt a structure and working principle similar to those of the cloud normalization engine module, and details are not repeated here.
In one embodiment, as shown in fig. 7, the speech processing device at the vehicle end includes a vehicle end scene engine module (Leg a) for generating a vehicle end dialogue result according to the target vehicle end NLU result.
In one embodiment, the vehicle-side arbitration policy comprises a vehicle-side dialogue arbitration policy, and the target speech processing result comprises a target dialogue result. The vehicle end receives a cloud end conversation result sent by the cloud end through the voice gateway, and the vehicle end arbitration module determines a target conversation result from the vehicle end conversation result and the cloud end conversation result according to a vehicle end conversation arbitration strategy.
The vehicle-end dialogue arbitration policy may include arbitration based on the dialogue domain, network connectivity, and timeliness of the results, where the dialogue domain takes priority over network connectivity, and network connectivity takes priority over timeliness.
Illustratively, the dialogue domain can be determined according to the NLU result (the vehicle-end NLU result or the cloud-end NLU result), the target ASR result, or the application scenario. If the dialogue domain is the local dialogue domain, the vehicle-end dialogue result is used preferentially and taken as the target dialogue result; if the dialogue domain is the cloud dialogue domain, the cloud dialogue result is used preferentially and taken as the target dialogue result; if the dialogue domain is a mixed domain, the network connectivity between the cloud end and the vehicle end is judged, and timeliness is judged when connectivity is normal. If the cloud dialogue result does not arrive within a third preset time length, the vehicle-end dialogue result is taken as the target dialogue result; if both dialogue results arrive within the third preset time length, whichever arrives first is taken as the target dialogue result.
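This priority chain — dialogue domain over connectivity over timeliness — can be sketched in the same style. The domain labels, field names, and the 2-second stand-in for the "third preset time length" are illustrative assumptions:

```python
def arbitrate_dialogue(domain, local, cloud, network_ok,
                       local_elapsed_s, cloud_elapsed_s, timeout_s=2.0):
    """Vehicle-end dialogue arbitration sketch: the dialogue domain decides
    first; in the mixed domain, connectivity and then timeliness decide."""
    if domain == "local":
        return local          # local domain: vehicle-end result preferred
    if domain == "cloud":
        return cloud          # cloud domain: cloud result preferred
    # Mixed domain: fall back to the vehicle end if the cloud is
    # unreachable or its result misses the time window.
    if not network_ok or cloud is None or cloud_elapsed_s > timeout_s:
        return local
    # Both arrived in time: whichever arrived first wins.
    return cloud if cloud_elapsed_s <= local_elapsed_s else local

local_d = {"reply": "Opening the sunroof."}
cloud_d = {"reply": "Opening the sunroof now."}
chosen = arbitrate_dialogue("mixed", local_d, cloud_d, network_ok=True,
                            local_elapsed_s=1.2, cloud_elapsed_s=0.6)
```

In this mixed-domain example the cloud dialogue result arrived first within the window, so it is chosen; a network outage or a late cloud result would flip the choice to the vehicle end.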
An application example of the vehicle-end voice processing apparatus according to the embodiment of the present application is described below with reference to fig. 7 and 8.
The vehicle-end microphone collects audio, and soft noise reduction and echo suppression (SSE) are performed to generate the target audio. The target audio is forwarded to the message transmission channel through the local gateway; that is, after only message stacking and decoding, it is routed to the message encoding module by the vehicle-end data routing module, and after encoding and unstacking it is sent to the cloud voice gateway through the vehicle-end client. The cloud dialogue result and the target cloud ASR result sent by the voice gateway are both forwarded to the message stacking module through the vehicle-end client and, after decoding, forwarded to the vehicle-end arbitration module through the vehicle-end data routing module.
The vehicle-end arbitration module arbitrates between the vehicle-end ASR result and the target cloud ASR result and outputs the target ASR result to the vehicle-end data routing module, which routes it to a TTS (Text-To-Speech) module for broadcast to the user through a loudspeaker at the vehicle end. The vehicle-end arbitration module can likewise arbitrate between the vehicle-end dialogue result and the cloud dialogue result, then output the target dialogue result to the vehicle-end data routing module, which routes it to the TTS module for broadcast to the user through the loudspeaker.
Illustratively, the target ASR result and the target dialog result may also be routed by the car-end data routing module to a voice UI (User Interface) module for presentation.
An embodiment of the present application further provides a vehicle, including the vehicle-end voice processing apparatus according to any one of the above embodiments.
An embodiment of the present application further provides a system, including the above cloud server and vehicle.
According to the technical solution of the embodiments of the present application, multiple voice processing capabilities (ASR, NLU, DM, TTS, arbitration, etc.) are accessed through the cloud end, so that powerful dialogue services can be provided to the vehicle end, while under no-network or poor-network conditions the dialogue service can still be completed using the vehicle end's own voice processing capability. Furthermore, the cloud end provides a voice gateway, and the vehicle end connects directly to the service provider's voice gateway, allowing the service provider to offer better voice service to the vehicle end. The vehicle end and the cloud end act as multiple engines for each other, and the optimal dialogue result is selected through an arbitration policy, providing the user with a better voice human-machine dialogue experience. In this scheme, the communication between the voice gateway and the vehicle end, and between the voice gateway and the cloud voice processing capability modules, uses asynchronous duplex long connections (ELB); a consistent hash algorithm is introduced, and with an appropriate session management policy, voice interaction achieves better timeliness than directly connecting to the provider. Furthermore, the NLU results output by different voice processing capability modules differ; the different semantic results can be normalized through visually edited mapping scripts, and the normalized NLU result serves as the input of the dialogue, which greatly simplifies the design of the human-machine dialogue flow.
In the description of the present specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or to implicitly indicate the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of the feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more (two or more) executable instructions for implementing specific logical functions or steps in the process. And the scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. All or part of the steps of the method of the above embodiments may be implemented by hardware that is configured to be instructed to perform the relevant steps by a program, which may be stored in a computer-readable storage medium, and which, when executed, includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present application, and these should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A speech processing apparatus, applied to a cloud end, the speech processing apparatus comprising:
a plurality of voice gateways, configured to receive target audio of different vehicle ends;
a plurality of voice processing capability modules, configured to perform voice processing on the target audio to generate a to-be-selected cloud voice processing result; the voice processing apparatus further comprises a cloud data routing module, configured to route the target audio to a cloud ASR module, so that the cloud ASR module performs ASR processing on the target audio to generate a to-be-selected cloud ASR result;
the cloud arbitration module is used for determining a target cloud ASR result of the target audio from a plurality of to-be-selected cloud ASR results according to a preset cloud arbitration strategy; the cloud speech processing result to be selected further comprises a cloud NLU result to be selected, the plurality of speech processing capability modules further comprise at least one cloud NLU module, and the cloud data routing module is further used for routing the target cloud ASR result to the cloud NLU module so that the cloud NLU module performs NLU processing on the target cloud ASR result to generate the cloud NLU result to be selected;
the system comprises a plurality of corresponding scene service modules and a cloud end normalization engine module;
the distributed module is used for matching the corresponding scene service module and the cloud normalization engine module according to the identification of the target vehicle end corresponding to the to-be-selected cloud NLU result;
the matched scene service module is used for calling a corresponding cloud end normalization engine module to normalize the to-be-selected cloud end NLU result, and generate a target cloud end NLU result of the target vehicle end, so that a voice gateway corresponding to the target vehicle end sends the target cloud end NLU result to the target vehicle end.
2. The speech processing device according to claim 1, wherein in a case that there is one cloud ASR module, the cloud arbitration module is configured to determine the candidate cloud ASR result as the target cloud ASR result;
the cloud end arbitration strategy comprises a cloud end ASR arbitration strategy, and under the condition that the number of the cloud end ASR modules is multiple, the cloud end arbitration module is used for determining the target cloud end ASR result from multiple to-be-selected cloud end ASR results according to the cloud end ASR arbitration strategy.
3. The speech processing device according to claim 2, wherein the cloud ASR arbitration policy comprises arbitration based on timeliness and confidence of each candidate cloud ASR result, and timeliness is prioritized over confidence.
4. The speech processing device according to claim 1, further comprising a normalization call module, wherein the cloud data routing module is further configured to route the to-be-selected cloud NLU result to the normalization call module, so that the normalization call module calls the corresponding cloud normalization engine module through the matched scene service module.
5. The speech processing device according to claim 4, wherein the normalization call module is further configured to send the to-be-selected cloud NLU result to the distributed module, so that the distributed module matches the corresponding scene service module and cloud normalization engine module according to the vehicle end identifier corresponding to the to-be-selected cloud NLU result.
6. The speech processing device according to claim 4, wherein the plurality of speech processing capability modules comprise a plurality of personal speech assistants, the target cloud speech processing result comprises a cloud dialogue result of the target cloud NLU result, and the cloud arbitration policy comprises a cloud dialogue arbitration policy;
the cloud data routing module is further configured to route the target cloud NLU result to the cloud arbitration module, so that the cloud arbitration module determines a target personal voice assistant matching the target cloud NLU result from the multiple personal voice assistants according to the cloud dialogue arbitration policy, and the target personal voice assistant is configured to generate the cloud dialogue result according to the target cloud NLU result.
7. The speech processing apparatus of claim 6, wherein the cloud dialogue arbitration policy comprises arbitration based on the domain, dialogue intent, and keywords corresponding to the target cloud NLU result.
8. The speech processing apparatus of claim 6, wherein the target personal speech assistant comprises a car service provider personal speech assistant, the speech processing apparatus further comprises a cloud scene engine module, and the car service provider personal speech assistant invokes the cloud scene engine module through the scene service module to generate a cloud dialogue result of the target cloud NLU result.
9. The speech processing device according to claim 8, wherein the plurality of scenario service modules and the plurality of cloud-end scenario engine modules correspond to each other, the speech processing device further comprises a distributed module, and the car-service provider personal speech assistant is further configured to send the target cloud-end NLU result to the distributed module, so that the distributed module matches the corresponding scenario service module and the corresponding cloud-end scenario engine module according to the car-end identifier corresponding to the target cloud-end NLU result.
10. The speech processing device of claim 8, wherein the scene engine module is further configured to obtain a previous round of dialog results, and generate a current cloud-end dialog result according to the previous round of dialog results and the target cloud-end NLU result.
11. A cloud server, comprising the speech processing apparatus according to any one of claims 1 to 10.
12. A speech processing apparatus, applied to a vehicle side, the speech processing apparatus comprising:
the vehicle-end voice processing module is used for carrying out voice processing on the target audio to generate a vehicle-end voice processing result;
a vehicle-end arbitration module, configured to determine a target speech processing result from the vehicle-end speech processing result and the target cloud-end speech processing result according to a preset vehicle-end arbitration policy, where the target cloud-end speech processing result includes a target cloud-end ASR result and a target cloud-end NLU result, and the target cloud-end speech processing result is generated by the speech processing apparatus according to any one of claims 1 to 10 according to the target audio.
13. A vehicle characterized by comprising the speech processing apparatus of claim 12.
14. A speech processing system comprising the cloud server of claim 11 and the vehicle of claim 13.
CN202011600283.5A 2020-12-29 2020-12-29 Voice processing device and system, cloud server and vehicle Active CN112820295B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011600283.5A CN112820295B (en) 2020-12-29 2020-12-29 Voice processing device and system, cloud server and vehicle


Publications (2)

Publication Number Publication Date
CN112820295A CN112820295A (en) 2021-05-18
CN112820295B true CN112820295B (en) 2022-12-23

Family

ID=75855323



Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114063956A (en) * 2021-11-11 2022-02-18 上汽通用五菱汽车股份有限公司 Vehicle-mounted device and mobile terminal program interaction method, vehicle-mounted device and readable storage medium
CN115146615A (en) * 2022-09-02 2022-10-04 深圳联友科技有限公司 Natural language processing method, system, equipment and readable storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
US8645122B1 (en) * 2002-12-19 2014-02-04 At&T Intellectual Property Ii, L.P. Method of handling frequently asked questions in a natural language dialog service
CN110491383A (en) * 2019-09-25 2019-11-22 北京声智科技有限公司 A kind of voice interactive method, device, system, storage medium and processor
CN111464977A (en) * 2020-06-18 2020-07-28 华人运通(上海)新能源驱动技术有限公司 Voice scene updating method, device, terminal, server and system
CN111916070A (en) * 2019-05-10 2020-11-10 罗伯特·博世有限公司 Speech recognition using natural language understanding related knowledge via deep feedforward neural networks

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
US10817672B2 (en) * 2014-10-01 2020-10-27 Nuance Communications, Inc. Natural language understanding (NLU) processing based on user-specified interests
CN105551494A (en) * 2015-12-11 2016-05-04 奇瑞汽车股份有限公司 Mobile phone interconnection-based vehicle-mounted speech recognition system and recognition method
CN105843797A (en) * 2016-03-30 2016-08-10 乐视控股(北京)有限公司 Normalization method and device
US10984788B2 (en) * 2017-08-18 2021-04-20 Blackberry Limited User-guided arbitration of speech processing results
CN109949816A (en) * 2019-02-14 2019-06-28 安徽云之迹信息技术有限公司 Robot voice processing method and processing device, cloud server
US11462216B2 (en) * 2019-03-28 2022-10-04 Cerence Operating Company Hybrid arbitration system
CN110245221B (en) * 2019-05-13 2023-05-23 华为技术有限公司 Method and computer device for training dialogue state tracking classifier
CN110322882A (en) * 2019-05-13 2019-10-11 厦门亿联网络技术股份有限公司 A kind of method and system generating mixing voice data
CN110909543A (en) * 2019-11-15 2020-03-24 广州洪荒智能科技有限公司 Intention recognition method, device, equipment and medium


Non-Patent Citations (1)

Title
Improved Model and Tuning Methods for Natural Language Understanding in BERT-based Task-oriented Dialogue Systems; Zhou Qi'an et al.; Journal of Chinese Information Processing; 2020-05-15 (No. 05); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant