CN112820295B - Voice processing device and system, cloud server and vehicle - Google Patents


Info

Publication number
CN112820295B
CN112820295B
Authority
CN
China
Prior art keywords
cloud
result
module
target
asr
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011600283.5A
Other languages
Chinese (zh)
Other versions
CN112820295A (en
Inventor
丁磊
王超
蒋瑞
李梦龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Human Horizons Shanghai Internet Technology Co Ltd
Original Assignee
Human Horizons Shanghai Internet Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Human Horizons Shanghai Internet Technology Co Ltd filed Critical Human Horizons Shanghai Internet Technology Co Ltd
Priority to CN202011600283.5A priority Critical patent/CN112820295B/en
Publication of CN112820295A publication Critical patent/CN112820295A/en
Application granted granted Critical
Publication of CN112820295B publication Critical patent/CN112820295B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/34 Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing

Abstract

The application provides a voice processing device and system, a cloud server, and a vehicle. The cloud-side voice processing device includes: a voice gateway, configured to receive target audio from the vehicle end; a plurality of voice processing capability modules, configured to perform voice processing on the target audio to generate candidate cloud voice processing results; and a cloud arbitration module, configured to determine, according to a preset cloud arbitration policy, a target cloud voice processing result for the target audio from the plurality of candidate cloud voice processing results. The voice gateway is further configured to return the target cloud voice processing result to the vehicle end. The technical solution of the embodiments of the application can provide accurate and optimized human-machine interaction services for the user.

Description

Voice processing device and system, cloud server and vehicle
Technical Field
The application relates to Internet of Vehicles technology, and in particular to a voice processing device and system, a cloud server, and a vehicle.
Background
The vehicle end refers, for short, to the in-vehicle infotainment product installed in a vehicle, which functionally enables information exchange between occupants and the vehicle and between the vehicle and the outside world (for example, between vehicles). Voice assistants on traditional vehicle ends suffer from unstable networks, poor local recognition, limited vehicle hardware performance, and similar problems, which degrade the user's human-machine dialogue experience.
Disclosure of Invention
The embodiments of the application provide a voice processing device and system, a cloud server, and a vehicle to solve the problems existing in the related art. The technical solution includes the following:
In a first aspect, an embodiment of the present application provides a voice processing device applied to the cloud, the voice processing device including:
a voice gateway, configured to receive target audio from the vehicle end;
a plurality of voice processing capability modules, configured to perform voice processing on the target audio to generate candidate cloud voice processing results;
a cloud arbitration module, configured to determine, according to a preset cloud arbitration policy, a target cloud voice processing result for the target audio from the plurality of candidate cloud voice processing results;
the voice gateway being further configured to return the target cloud voice processing result to the vehicle end.
In one embodiment, the plurality of voice processing capability modules include at least one cloud ASR module, and the target cloud voice processing result includes a target cloud ASR result for the target audio;
the voice processing device further includes a cloud data routing module, configured to route the target audio to the cloud ASR module(s) so that ASR processing is performed on the target audio and candidate cloud ASR results are generated;
when there is one cloud ASR module, the cloud arbitration module is configured to determine the single candidate cloud ASR result as the target cloud ASR result;
the cloud arbitration policy includes a cloud ASR arbitration policy, and when there are multiple cloud ASR modules, the cloud arbitration module is configured to determine the target cloud ASR result from the multiple candidate cloud ASR results according to the cloud ASR arbitration policy.
In one embodiment, the cloud ASR arbitration policy includes arbitrating based on the timeliness and confidence of each candidate cloud ASR result, with timeliness having higher priority than confidence.
In one embodiment, the plurality of voice processing capability modules include at least one cloud NLU module, and the cloud data routing module is further configured to route the target cloud ASR result to the cloud NLU module for NLU processing, generating a candidate cloud NLU result.
In one embodiment, the voice processing device further includes a normalization call module, a scene service module, and a cloud normalization engine module. The cloud data routing module is further configured to route the candidate cloud NLU result to the normalization call module, so that the normalization call module calls the cloud normalization engine module through the scene service module; the cloud normalization engine module is configured to normalize the candidate cloud NLU result to generate a target cloud NLU result.
In one embodiment, there are multiple scene service modules and multiple cloud normalization engine modules in correspondence. The voice processing device further includes a distributed module, and the normalization call module is further configured to send the candidate cloud NLU result to the distributed module, so that the distributed module matches the corresponding scene service module and cloud normalization engine module according to the vehicle-end identifier associated with the candidate cloud NLU result.
In one embodiment, the plurality of voice processing capability modules include a plurality of personal voice assistants, the target cloud voice processing result includes a cloud dialogue result for the target cloud NLU result, and the cloud arbitration policy includes a cloud dialogue arbitration policy;
the cloud data routing module is further configured to route the target cloud NLU result to the cloud arbitration module, so that the cloud arbitration module determines, according to the cloud dialogue arbitration policy, a target personal voice assistant matching the target cloud NLU result from the plurality of personal voice assistants; the target personal voice assistant is configured to generate the cloud dialogue result from the target cloud NLU result.
In one embodiment, the cloud dialogue arbitration policy includes arbitrating based on the domain, dialogue intent, and keywords corresponding to the target cloud NLU result.
In one embodiment, the target personal voice assistant includes the car service provider's personal voice assistant, the voice processing device further includes a cloud scene engine module, and the car service provider's personal voice assistant calls the cloud scene engine module through the scene service module to generate the cloud dialogue result for the target cloud NLU result.
In one embodiment, there are multiple scene service modules and multiple cloud scene engine modules in correspondence. The voice processing device further includes a distributed module, and the car service provider's personal voice assistant is further configured to send the target cloud NLU result to the distributed module, so that the distributed module matches the corresponding scene service module and cloud scene engine module according to the vehicle-end identifier associated with the target cloud NLU result.
In one embodiment, the cloud scene engine module is further configured to obtain the previous round's dialogue result and to generate the current cloud dialogue result from the previous round's dialogue result together with the target cloud NLU result.
In a second aspect, an embodiment of the present application provides a cloud server, including any one of the foregoing voice processing apparatuses.
In a third aspect, an embodiment of the present application provides a voice processing device applied to the vehicle end, the voice processing device including:
a vehicle-end voice processing module, configured to perform voice processing on the target audio to generate a vehicle-end voice processing result;
a vehicle-end arbitration module, configured to determine, according to a preset vehicle-end arbitration policy, a target voice processing result from the vehicle-end voice processing result and the target cloud voice processing result, where the target cloud voice processing result is generated from the target audio by the cloud-side voice processing device.
In a fourth aspect, an embodiment of the present application provides a vehicle, including a vehicle-end speech processing device.
In a fifth aspect, an embodiment of the present application provides a voice processing system including the above cloud server and vehicle.
The advantages or beneficial effects of the technical solution of the embodiments of the application include at least the following: audio from the vehicle end can be voice-processed through the voice gateway provided by the cloud, thereby providing accurate human-machine interaction services for the user. Furthermore, the cloud voice gateway can connect to multiple voice processing capability modules for voice processing, and the arbitration service of the cloud arbitration module selects the optimal voice processing result, providing better human-machine interaction services for the vehicle end.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present application will be readily apparent by reference to the drawings and following detailed description.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
FIG. 1 is a schematic diagram of a voice processing system according to an implementation of an embodiment of the present application;
FIG. 2 is a schematic diagram of a cloud-side voice processing device according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the operation of a normalization engine module according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a cloud-side voice processing device according to another embodiment of the present application;
FIG. 5 is a schematic diagram of a cloud-side voice processing device according to yet another embodiment of the present application;
FIG. 6 is a schematic diagram of a consistent hashing algorithm according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a vehicle-end voice processing device according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a local agent of a vehicle-end voice processing device according to an embodiment of the present application.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art would realize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
As shown in FIG. 1, an embodiment of the present application provides a voice processing system that includes a terminal and a cloud, and that can implement voice processing through communication between the terminal and the cloud, thereby providing a basis for the terminal's human-machine dialogue. For example, the terminal may be a vehicle-mounted terminal, such as a voice assistant application (APP) on the vehicle-mounted terminal; it may also be a voice assistant APP on a smart device such as a mobile phone, tablet computer, or personal computer; or it may be a third-party development platform. Specifically, voice processing is performed jointly by a voice processing device in the cloud and a voice processing device at the terminal, providing a basis for the terminal's human-machine dialogue. The following takes the vehicle-end terminal as an example.
In one embodiment, as shown in FIG. 2, the cloud-side voice processing device includes a voice gateway, a plurality of voice processing capability modules, and a cloud arbitration module. A voice processing capability module may be a third-party voice processing capability module or the car service provider's voice processing capability module. The car service provider can provide voice services, navigation services, system upgrade services, and the like for the vehicle end.
The voice gateway receives the target audio from the vehicle end and forwards it to the voice processing capability modules. Each voice processing capability module performs voice processing on the target audio, generates a candidate cloud voice processing result, and sends it to the cloud arbitration module. The cloud arbitration module determines, according to a preset cloud arbitration policy, the target cloud voice processing result for the target audio from the multiple candidate cloud voice processing results, and sends it to the voice gateway. The voice gateway then returns the target cloud voice processing result to the vehicle end.
According to the technical solution of the embodiments of the application, audio from the vehicle end can be voice-processed through the voice gateway provided by the cloud, providing accurate human-machine interaction services for the user. Furthermore, the cloud voice gateway can connect to multiple voice processing capability modules for voice processing. A voice processing capability module may be provided by a leading third-party voice Service Provider (SP), or developed in-house by the car service provider. The optimal voice processing result is selected through the arbitration service of the cloud arbitration module, providing better human-machine interaction services for the vehicle end.
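The fan-out-then-arbitrate pattern described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the capability-module callables, the result fields (text, confidence), and the confidence-only arbitration rule are all assumptions made for the sketch.

```python
# Sketch: the gateway fans one utterance out to several capability modules,
# then a single arbitration step keeps one winner. All names are illustrative.
from concurrent.futures import ThreadPoolExecutor

def process(audio, capability_modules, arbitrate):
    """Fan the target audio out to every capability module, then arbitrate."""
    with ThreadPoolExecutor(max_workers=len(capability_modules)) as ex:
        candidates = list(ex.map(lambda m: m(audio), capability_modules))
    return arbitrate(candidates)

# Toy capability modules standing in for third-party / in-house SPs.
sp_a = lambda audio: {"text": "turn on the AC", "confidence": 0.7}
sp_b = lambda audio: {"text": "turn on the AC", "confidence": 0.9}

best = process(b"<pcm frames>", [sp_a, sp_b],
               arbitrate=lambda cs: max(cs, key=lambda c: c["confidence"]))
print(best["confidence"])  # 0.9
```

In a real deployment the arbitration callable would implement the cloud arbitration policy described below rather than a bare confidence comparison.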
In one embodiment, the plurality of voice processing capability modules include at least one cloud ASR (Automatic Speech Recognition) module, which has ASR processing capability and generates a candidate cloud ASR result for the target audio. Further, as shown in FIG. 2, the cloud-side voice processing device may also include a cloud data routing module, configured to route the target audio to the cloud ASR module(s).
For example, among the multiple third-party voice processing capability modules, some may include one or more cloud ASR modules and some may include none; this can be configured according to the actual situation, and the embodiments of the present application are not limited in this respect.
Further, when there is one cloud ASR module, the cloud arbitration module determines the single candidate cloud ASR result as the target cloud ASR result. The cloud arbitration policy includes a cloud ASR arbitration policy, and when there are multiple cloud ASR modules, the cloud arbitration module determines the target cloud ASR result from the multiple candidate cloud ASR results according to the cloud ASR arbitration policy.
The cloud data routing module routes the target cloud ASR result to the voice gateway, and the voice gateway sends it to the vehicle end, so that the vehicle end obtains the cloud's ASR processing result for the target audio; that is, the target cloud voice processing result includes the target cloud ASR result.
Illustratively, the cloud-side voice processing device includes a third-party provider client (SP client), and the cloud data routing module calls the cloud ASR module in the third-party voice processing capability module of the corresponding SP through the SP client. The cloud-side voice processing device may further include a cloud NLP (Natural Language Processing) module, through which the cloud data routing module calls the cloud ASR module in the car service provider's voice processing capability module.
In one embodiment, the cloud ASR arbitration policy includes arbitrating based on the timeliness and confidence of each candidate cloud ASR result, with timeliness having higher priority than confidence.
Illustratively, suppose the two candidate cloud ASR results are ASR1 and ASR2. Arbitration is first performed on timeliness: if ASR1 arrives but ASR2 has not arrived within a first preset time, the timeliness of ASR1 is higher than that of ASR2, and ASR1 is taken as the target cloud ASR result; otherwise ASR2 is taken as the target cloud ASR result. If both ASR1 and ASR2 arrive within the first preset time, the decision is made on confidence: if the confidence of ASR1 is higher than that of ASR2, ASR1 is taken as the target cloud ASR result; otherwise ASR2 is. The confidence may be computed from preset evaluation parameters combined with weights, or obtained by recognition with a trained model.
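The timeliness-then-confidence rule just described can be sketched as follows, assuming each cloud ASR module delivers its result as a future carrying hypothetical text and confidence fields; the field names and the timeout value are illustrative assumptions, not from the patent.

```python
# Sketch of "timeliness over confidence" ASR arbitration; all names assumed.
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def arbitrate_asr(futures, first_preset_time=1.0):
    """Pick the target cloud ASR result from in-flight candidates.

    Timeliness has priority: candidates that miss the first preset time are
    dropped if at least one candidate arrived in time. Among the timely
    candidates, the highest confidence wins.
    """
    done, _ = wait(futures, timeout=first_preset_time)
    if not done:
        # No candidate was timely: fall back to the first one to finish.
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
    return max((f.result() for f in done), key=lambda r: r["confidence"])

if __name__ == "__main__":
    ex = ThreadPoolExecutor(max_workers=2)
    asr1 = ex.submit(lambda: {"text": "navigate home", "confidence": 0.82})
    asr2 = ex.submit(lambda: (time.sleep(2), {"text": "navigate Rome", "confidence": 0.99})[1])
    # ASR2 misses the 0.5 s window, so the lower-confidence ASR1 wins.
    print(arbitrate_asr([asr1, asr2], first_preset_time=0.5)["text"])
```

Note how a later, higher-confidence result loses once it misses the preset window, which matches the stated priority of timeliness over confidence.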
Based on the ASR processing, a speech recognition result for the target audio, i.e., text derived from the audio, can be obtained. However, this result cannot yet be understood by the machine. To complete a human-machine dialogue, the ASR result must be converted into semantics that the machine can understand, that is, NLU (Natural Language Understanding) processing must be performed.
In one embodiment, the plurality of voice processing capability modules include at least one cloud NLU module with NLU processing capability. For example, among the multiple third-party voice processing capability modules, some may include one or more cloud NLU modules and some may include none; this can be configured according to the actual situation, and the embodiments of the present application are not limited in this respect.
Further, after the target cloud ASR result is returned to the voice gateway, it may be sent to the cloud data routing module. The cloud data routing module routes the target cloud ASR result to the cloud NLU module for NLU processing, generating a candidate cloud NLU result.
It should be noted that the target cloud ASR result may be generated by the cloud, or may be sent by the vehicle end through the voice gateway, for example, the text shown in FIG. 4.
Illustratively, the cloud data routing module calls the cloud NLU module in the third-party voice processing capability module of the corresponding SP through the third-party provider (SP) client, and calls the cloud NLU module in the car service provider's voice processing capability module through the cloud NLP module.
In one embodiment, the candidate cloud NLU result may be normalized to obtain the target cloud NLU result. Illustratively, as shown in FIG. 2, the voice processing device further includes a normalization call module, a scene service module, and a cloud normalization engine module. The cloud data routing module is further configured to route the candidate cloud NLU result to the normalization call module, so that the normalization call module calls the cloud normalization engine module through the scene service module; the cloud normalization engine module normalizes the candidate cloud NLU result to generate the target cloud NLU result.
In the normalization processing, the original semantic result (for example, the candidate cloud NLU result) is mapped according to preset rules and the provider information (SP role) of the NLU result, and the mapped result is normalized to obtain the normalized semantic result (for example, the target cloud NLU result).
Illustratively, as shown in FIG. 3, the mapping framework of the cloud normalization engine module fetches a mapping rule script from the file system of a cloud database (DB) through a script interpreter, and loads it through a script loader. The original parameters include the ASR result (e.g., the target cloud ASR result) and the original semantic result (e.g., the candidate cloud NLU result), together with the corresponding original domain, original intent, and original word slots. The original parameters also carry the SP role, i.e., the identity of the provider whose voice processing capability module produced the ASR result and the original semantic result. A word slot can also be understood as a keyword.
The original parameters, the SP role, and the mapping rule script are fed into a normalization mapping factory, and normalization mapping is performed by calling a task iteration executor to obtain a new domain (e.g., weather search), a new intent (e.g., open the vehicle door), and new word slots (e.g., tomorrow). The normalized semantic result (e.g., the target cloud NLU result) is then assembled from the ASR result (e.g., the target cloud ASR result) and the new domain, intent, and word slots.
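The per-SP mapping step can be sketched as follows. The patent loads the actual mapping rules as scripts from a cloud database; here a plain dictionary of per-SP rules stands in for those scripts, and every field name (asr, domain, intent, slots) and SP role is an illustrative assumption.

```python
# Sketch of per-SP-role normalization mapping onto a unified semantic schema.
# The rule tables below stand in for the patent's mapping rule scripts.
RULES = {
    "sp_a": {"domain": {"tianqi": "weather"}, "intent": {"chaxun": "query"}},
    "sp_b": {"domain": {"wx": "weather"}, "intent": {"lookup": "query"}},
}

def normalize(raw, sp_role):
    """Map an SP-specific semantic result onto the unified schema."""
    rule = RULES.get(sp_role, {})
    return {
        "asr": raw["asr"],
        "domain": rule.get("domain", {}).get(raw["domain"], raw["domain"]),
        "intent": rule.get("intent", {}).get(raw["intent"], raw["intent"]),
        "slots": raw["slots"],  # word slots (keywords) pass through unchanged here
    }

raw = {"asr": "what's the weather tomorrow", "domain": "tianqi",
       "intent": "chaxun", "slots": {"date": "tomorrow"}}
print(normalize(raw, "sp_a"))  # domain/intent rewritten to the unified names
```

The point of the design is that every downstream consumer (arbitration, PAs, scene engines) sees one schema regardless of which SP produced the original semantics.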
In one embodiment, the plurality of voice processing capability modules include a plurality of personal voice assistants (Personal Assistant, PA).
For example, as shown in FIG. 2, the car service provider's voice processing capability module (i.e., the car service provider's personal voice assistant) can call a cloud scene engine (Leg B) through the scene service module to generate a dialogue result. A personal voice assistant may also reside in a third-party voice processing capability module: some of the multiple third-party modules may include one or more personal voice assistants and some may include none; this can be configured according to the actual situation, and the embodiments of the present application are not limited in this respect.
Further, the cloud arbitration policy includes a cloud dialogue arbitration policy. The cloud data routing module is further configured to route the target cloud NLU result to the cloud arbitration module, so that the cloud arbitration module determines, according to the cloud dialogue arbitration policy, a target personal voice assistant matching the target cloud NLU result from the plurality of personal voice assistants; the target personal voice assistant generates the cloud dialogue result from the target cloud NLU result.
In one embodiment, the cloud dialogue arbitration policy includes arbitrating based on the domain, dialogue intent, and keywords corresponding to the target cloud NLU result.
Illustratively, the corresponding domain, dialogue intent, and keywords can be extracted from the target cloud NLU result, and the optimal personal voice assistant is determined from the plurality of personal voice assistants as the target personal voice assistant according to the priority order of domain, dialogue intent, and keyword.
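One plausible reading of this "domain over dialogue intent over keyword" priority order is lexicographic scoring, sketched below; the PA registry and its fields are illustrative assumptions rather than the patent's actual data model.

```python
# Sketch: each PA declares what it handles; matching is scored lexicographically
# with domain outranking intent, which outranks keywords. All names assumed.
PAS = [
    {"name": "weather_pa", "domains": {"weather"}, "intents": {"query"},
     "keywords": {"rain"}},
    {"name": "music_pa", "domains": {"music"}, "intents": {"play"},
     "keywords": set()},
]

def pick_pa(nlu):
    def score(pa):
        # Tuple comparison gives the priority order: domain > intent > keyword.
        return (nlu["domain"] in pa["domains"],
                nlu["intent"] in pa["intents"],
                bool(set(nlu["keywords"]) & pa["keywords"]))
    return max(PAS, key=score)["name"]

print(pick_pa({"domain": "weather", "intent": "query",
               "keywords": ["tomorrow"]}))  # weather_pa
```

A domain match alone thus beats any combination of intent and keyword matches in another domain, which is what a strict priority order implies.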
In one embodiment, the target personal voice assistant may be the car service provider's PA. The voice processing device further includes a cloud scene engine module (Leg B), and the car service provider's personal voice assistant calls the cloud scene engine module through the scene service module to generate the cloud dialogue result for the target cloud NLU result.
The cloud data routing module routes the cloud dialogue result to the voice gateway, and the voice gateway sends it to the vehicle end; that is, the target cloud voice processing result includes the cloud dialogue result. Illustratively, the cloud data routing module calls a personal voice assistant in the third-party voice processing capability module of the corresponding SP through the SP client.
An application example of the cloud speech processing apparatus according to the embodiment of the present application is described below with reference to fig. 4.
Through the voice gateway, messages from the vehicle end can be voice-processed by the cloud-side voice processing device to generate the target cloud voice processing result, which is returned to the vehicle end. A vehicle-end message is input to the cloud data routing module via message stacking and message decoding; the cloud data routing module routes the message to the corresponding module according to its type, and the output then reaches the SP client or the voice gateway via message encoding and message unstacking.
The message type may be target audio, a candidate cloud ASR result, a target cloud ASR result, a candidate cloud NLU result, a target cloud NLU result, or a cloud dialogue result.
A vehicle-end message includes the target audio. The target audio is pushed onto the message stack through the vehicle-end input interface, the message is decoded, and the decoded target audio is sent to the cloud data routing module. Since the target audio needs to be ASR-processed by the cloud, the cloud data routing module routes the decoded target audio to message encoding; after the message is popped, the encoded target audio is passed to the SP client, which calls the cloud ASR module in the third-party voice processing module of the corresponding SP to perform ASR processing and generate a third-party ASR result (which belongs to the candidate cloud ASR results).
The third-party ASR result (a candidate cloud ASR result), as a message from the third-party voice processing capability module, is pushed onto the message stack through the cloud input interface, decoded, and sent to the cloud data routing module. A vehicle-end message may also include text (which likewise belongs to the candidate cloud ASR results); as a vehicle-end message, the text is pushed onto the message stack through the vehicle-end input interface, decoded, and sent to the cloud data routing module.
When there are multiple candidate cloud ASR results, the cloud data routing module routes them to the cloud arbitration module, which determines the target cloud ASR result from the multiple candidates according to the cloud ASR arbitration policy.
When the target cloud ASR result needs to be returned to the vehicle end, the cloud data routing module routes it to message encoding; after the message is popped, it is sent through the vehicle-end output interface to the voice gateway, which forwards it to the vehicle end.
When the target cloud ASR result needs NLU processing, the cloud data routing module routes it to message encoding; after the message is popped, the SP client calls the cloud NLU module in the third-party voice processing module of the corresponding SP to perform NLU processing and generate a third-party NLU result (i.e., a candidate cloud NLU result).
The candidate cloud NLU result, as a message from the third-party voice processing capability module, is pushed onto the message stack through the cloud input interface, decoded, and sent to the cloud data routing module. Because the candidate cloud NLU result needs normalization, the cloud data routing module routes the decoded result to the normalization call module. The normalization call module calls the cloud normalization engine module through the scene service module, normalizing the candidate cloud NLU result into the target cloud NLU result.
The cloud data routing module routes the target cloud NLU result to the cloud arbitration module, which determines the target PA from the PAs (including the car service provider's PA and one or more third-party PAs) according to the cloud dialogue arbitration policy.
When the target PA is the car service provider's PA, the cloud data routing module routes the target cloud NLU result to it, and the cloud scene engine module is called through the scene service module to generate the cloud dialogue result for the target cloud NLU result.
When the target PA is a third-party PA, the cloud data routing module routes the target cloud NLU result to the SP client corresponding to that third-party PA, and the cloud dialogue result (i.e., the third-party dialogue result) of the target cloud NLU result is generated through the third-party PA. The third-party dialogue result then undergoes message stacking and message decoding through the cloud input interface, and the decoded third-party dialogue result is sent to the cloud data routing module.
The cloud data routing module routes the cloud dialogue result to the message encoding module for encoding, and the encoded, unstacked message is sent to the voice gateway through the vehicle-end output interface, which then forwards it to the vehicle end.
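The recurring pattern in this flow — the cloud data routing module dispatching each result to whichever module handles it next — can be sketched as a simple dispatch table. The message types and handlers below are illustrative assumptions, not names from the source:

```python
class CloudDataRouter:
    """Minimal sketch of the cloud data-routing step: each result type is
    dispatched to the handler registered for it (encode-and-send, NLU call,
    normalization, arbitration, and so on)."""

    def __init__(self):
        self._routes = {}  # result type -> handler callable

    def register(self, result_type, handler):
        self._routes[result_type] = handler

    def route(self, result_type, payload):
        # Dispatch the payload to the handler for this result type.
        return self._routes[result_type](payload)

router = CloudDataRouter()
# Hypothetical handler: encode a dialogue result for the vehicle-end output interface.
router.register("cloud_dialogue_result", lambda p: f"encoded:{p}")
sent = router.route("cloud_dialogue_result", "turn on the AC")
```

Routing tables like this keep the capability modules decoupled: adding a new third-party module only means registering another handler.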
In one embodiment, for voice processing requests of different vehicle ends, the cloud end can be matched with different scene service modules for voice processing, so that voice service can be provided more accurately.
As shown in fig. 2 and fig. 5, the cloud speech processing apparatus further includes a Distributed module. The voice processing devices 1, 2, …, N of different vehicle ends correspond to different vehicle-end identifiers, and messages from different vehicle ends are processed through different voice gateways. The distributed module matches the corresponding scene service module from a plurality of scene service modules 1, 2, …, N according to the vehicle-end identifier. Each scene service module has a corresponding cloud scene engine module (LegB1, LegB2, …, LegBN) and cloud normalization module (1, 2, …, N).
Exemplarily, the normalization calling module is further configured to send the to-be-selected cloud NLU result to the distributed module, so that the distributed module matches the corresponding scene service module and the cloud normalization engine module according to the vehicle end identifier corresponding to the to-be-selected cloud NLU result.
Illustratively, the car service provider PA is further configured to send the target cloud NLU result to the distributed module, so that the distributed module matches the corresponding scene service module and the cloud scene engine module according to the car-end identifier corresponding to the target cloud NLU result.
In one example, as shown in fig. 6, a consistent Hash algorithm may be used for matching between the scene service modules and the cloud normalization engine modules (or the cloud scene engine modules). For example, the physical nodes n1, n2, n3 and n4 of the scene service modules may each be allocated a plurality of virtual nodes. A consistent hash is then computed over the virtual nodes and the vehicle-end identifier to obtain the virtual node closest to the vehicle-end identifier, and the corresponding scene service module is obtained through the physical node to which that virtual node belongs.
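A minimal Python sketch of this matching scheme. The choice of MD5 for ring positions and 100 virtual nodes per physical node are illustrative assumptions, not values from the source:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable position on a 32-bit hash ring.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    """Maps a vehicle-end identifier to a scene-service physical node
    via virtual nodes, as in the scheme described above."""

    def __init__(self, physical_nodes, replicas=100):
        self._ring = []     # sorted virtual-node positions
        self._node_at = {}  # position -> physical node
        for node in physical_nodes:
            for i in range(replicas):
                pos = _hash(f"{node}#vn{i}")  # one virtual node per replica
                self._node_at[pos] = node
                bisect.insort(self._ring, pos)

    def get_node(self, vehicle_id: str) -> str:
        # Walk clockwise to the nearest virtual node, wrapping around the ring.
        pos = _hash(vehicle_id)
        idx = bisect.bisect(self._ring, pos) % len(self._ring)
        return self._node_at[self._ring[idx]]

ring = ConsistentHashRing(["n1", "n2", "n3", "n4"])
node = ring.get_node("vehicle-0001")  # the same id always maps to the same node
```

The virtual nodes smooth the distribution of vehicle ends across physical nodes, and adding or removing one physical node only remaps the keys adjacent to its virtual nodes.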
In an embodiment, the scene engine module is further configured to obtain a previous round of dialog result, and generate a current cloud-end dialog result according to the previous round of dialog result and the target cloud-end NLU result.
Exemplarily, a vehicle-end scene engine module in the vehicle-end voice processing device and a scene service module in the cloud voice processing device can perform semantic synchronization, thereby synchronizing the semantic context, so that the cloud scene engine module can obtain the previous round's dialogue result based on the semantic context.
The embodiment of the application further provides a cloud server, which comprises the cloud voice processing device in any one of the above embodiments.
An embodiment of the present application further provides a speech processing device at a vehicle end, including: the vehicle-end voice processing module is used for carrying out voice processing on the target audio to generate a vehicle-end voice processing result; and the vehicle end arbitration module is used for determining a target voice processing result from the vehicle end voice processing result and the target cloud end voice processing result according to a preset vehicle end arbitration strategy.
In one embodiment, as shown in fig. 7, the speech processing device at the vehicle end includes a vehicle-end ASR module and a vehicle-end arbitration module, the vehicle-end arbitration policy includes a vehicle-end ASR arbitration policy, and the target speech processing result includes a target ASR result. The vehicle-end ASR module is used for carrying out ASR processing on the acquired target audio to obtain a vehicle-end ASR result, the vehicle-end receives a target cloud-end ASR result sent by the cloud end through the voice gateway, and the vehicle-end arbitration module determines the target ASR result from the vehicle-end ASR result and the target cloud-end ASR result according to a vehicle-end ASR arbitration strategy.
The vehicle-end ASR arbitration policy may include arbitration based on network connectivity, timeliness and confidence of the results. Illustratively, when the network connectivity between the cloud end and the vehicle end is normal, the timeliness of the vehicle-end ASR result and the target cloud-end ASR result is judged; for example, if the target cloud-end ASR result has not arrived within a second preset time length, the vehicle-end ASR result is determined as the target ASR result. If both results arrive within the second preset time length, their confidence levels are compared, and the result with the higher confidence is taken as the target ASR result.
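The arbitration order just described — connectivity first, then timeliness, then confidence — can be sketched as follows. The result dictionaries, the `confidence` field, and the 2-second value standing in for the "second preset time length" are assumptions for illustration:

```python
def arbitrate_asr(local_result, cloud_result, network_ok, cloud_elapsed_s,
                  timeout_s=2.0):
    """Vehicle-end ASR arbitration sketch: fall back to the vehicle-end
    result unless the cloud result arrived in time over a working network;
    when both are in time, the higher-confidence result wins."""
    if not network_ok or cloud_result is None or cloud_elapsed_s > timeout_s:
        # Cloud unreachable or late: use the vehicle-end result.
        return local_result
    if cloud_result["confidence"] >= local_result["confidence"]:
        return cloud_result
    return local_result

local = {"text": "open the window", "confidence": 0.72}
cloud = {"text": "open the windows", "confidence": 0.91}
target = arbitrate_asr(local, cloud, network_ok=True, cloud_elapsed_s=0.8)
```

Here `target` is the cloud result, since it arrived within the window and carries the higher confidence; a late or disconnected cloud would yield the vehicle-end result instead.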
In an embodiment, as shown in fig. 7, the speech processing device at the vehicle end includes a vehicle-end NLU module and a vehicle-end normalization engine module. The vehicle-end NLU module is configured to perform NLU processing on the vehicle-end ASR result to generate a to-be-selected vehicle-end NLU result, and the vehicle-end normalization engine module is configured to normalize the to-be-selected vehicle-end NLU result to obtain the target vehicle-end NLU result.
The vehicle-end normalization engine module can adopt a structure and working principle similar to those of the cloud normalization engine module, and details are not repeated here.
In one embodiment, as shown in fig. 7, the speech processing device at the vehicle end includes a vehicle end scene engine module (Leg a) for generating a vehicle end dialogue result according to the target vehicle end NLU result.
In one embodiment, the vehicle-side arbitration policy comprises a vehicle-side dialogue arbitration policy, and the target speech processing result comprises a target dialogue result. The vehicle end receives a cloud end conversation result sent by the cloud end through the voice gateway, and the vehicle end arbitration module determines a target conversation result from the vehicle end conversation result and the cloud end conversation result according to a vehicle end conversation arbitration strategy.
The vehicle-end dialogue arbitration policy may include arbitration based on the dialogue domain, network connectivity, and timeliness of the results, where the dialogue domain takes priority over network connectivity, and network connectivity takes priority over timeliness.
Illustratively, the dialogue domain can be determined according to the NLU result (the vehicle-end NLU result or the cloud-end NLU result), the target ASR result, or the application scenario. If the dialogue domain is the local dialogue domain, the vehicle-end dialogue result is used preferentially and taken as the target dialogue result; if the dialogue domain is the cloud dialogue domain, the cloud dialogue result is used preferentially and taken as the target dialogue result; if the dialogue domain is a mixed domain, the network connectivity between the cloud end and the vehicle end is judged, and timeliness is judged when connectivity is normal. If the cloud dialogue result does not arrive within a third preset time length, the vehicle-end dialogue result is taken as the target dialogue result; if both dialogue results arrive within the third preset time length, whichever arrives first is taken as the target dialogue result.
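This priority chain — dialogue domain over connectivity over timeliness — can be sketched in the same style. The domain labels, field names, and the 2-second stand-in for the "third preset time length" are illustrative assumptions:

```python
def arbitrate_dialogue(domain, local, cloud, network_ok,
                       local_elapsed_s, cloud_elapsed_s, timeout_s=2.0):
    """Vehicle-end dialogue arbitration sketch: the dialogue domain decides
    first; in the mixed domain, connectivity and then timeliness decide."""
    if domain == "local":
        return local          # local domain: vehicle-end result preferred
    if domain == "cloud":
        return cloud          # cloud domain: cloud result preferred
    # Mixed domain: fall back to the vehicle end if the cloud is
    # unreachable or its result misses the time window.
    if not network_ok or cloud is None or cloud_elapsed_s > timeout_s:
        return local
    # Both arrived in time: whichever arrived first wins.
    return cloud if cloud_elapsed_s <= local_elapsed_s else local

local_d = {"reply": "Opening the sunroof."}
cloud_d = {"reply": "Opening the sunroof now."}
chosen = arbitrate_dialogue("mixed", local_d, cloud_d, network_ok=True,
                            local_elapsed_s=1.2, cloud_elapsed_s=0.6)
```

In this mixed-domain example the cloud dialogue result arrived first within the window, so it is chosen; a network outage or a late cloud result would flip the choice to the vehicle end.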
An application example of the vehicle-end voice processing apparatus according to the embodiment of the present application is described below with reference to fig. 7 and 8.
The vehicle-end microphone collects audio, and soft noise reduction and echo suppression (SSE) are performed to generate the target audio. The target audio is forwarded to the message transmission channel through the local gateway; that is, after only message stacking and decoding, it is routed to the message encoding module by the vehicle-end data routing module, and after encoding and unstacking it is sent to the cloud voice gateway through the vehicle-end client. The cloud dialogue result and the target cloud ASR result sent by the voice gateway are both forwarded to the message stacking module through the vehicle-end client and, after decoding, forwarded to the vehicle-end arbitration module through the vehicle-end data routing module.
The vehicle-end arbitration module arbitrates between the vehicle-end ASR result and the target cloud ASR result and outputs the target ASR result to the vehicle-end data routing module, which routes it to a TTS (Text-To-Speech) module for broadcast to the user through a loudspeaker at the vehicle end. The vehicle-end arbitration module can likewise arbitrate between the vehicle-end dialogue result and the cloud dialogue result, then output the target dialogue result to the vehicle-end data routing module, which routes it to the TTS module for broadcast to the user through the loudspeaker.
Illustratively, the target ASR result and the target dialog result may also be routed by the car-end data routing module to a voice UI (User Interface) module for presentation.
An embodiment of the present application further provides a vehicle, including the vehicle-end voice processing apparatus according to any one of the above embodiments.
An embodiment of the present application further provides a system, including the above cloud server and vehicle.
According to the technical solution of the embodiments of the present application, multiple voice processing capabilities (ASR, NLU, DM, TTS, arbitration, etc.) are accessed through the cloud end, so that powerful dialogue services can be provided to the vehicle end, while under no-network or poor-network conditions the dialogue service can still be completed using the vehicle end's own voice processing capability. Furthermore, the cloud end provides a voice gateway, and the vehicle end connects directly to the service provider's voice gateway, allowing the service provider to offer better voice service to the vehicle end. The vehicle end and the cloud end act as multiple engines for each other, and the optimal dialogue result is selected through an arbitration policy, providing the user with a better voice human-machine dialogue experience. In this scheme, the communication between the voice gateway and the vehicle end, and between the voice gateway and the cloud voice processing capability modules, uses asynchronous duplex long connections (ELB); a consistent hash algorithm is introduced, and with an appropriate session management policy, voice interaction achieves better timeliness than directly connecting to the provider. Furthermore, the NLU results output by different voice processing capability modules differ; the different semantic results can be normalized through visually edited mapping scripts, and the normalized NLU result serves as the input of the dialogue, which greatly simplifies the design of the human-machine dialogue flow.
In the description of the present specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or to implicitly indicate the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of the feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more (two or more) executable instructions for implementing specific logical functions or steps in the process. And the scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. All or part of the steps of the method of the above embodiments may be implemented by hardware that is configured to be instructed to perform the relevant steps by a program, which may be stored in a computer-readable storage medium, and which, when executed, includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present application, and these should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A speech processing apparatus, applied to a cloud end, the speech processing apparatus comprising:
a plurality of voice gateways, configured to receive target audio of different vehicle ends;
a plurality of voice processing capability modules, configured to perform voice processing on the target audio to generate a to-be-selected cloud voice processing result; the voice processing apparatus further comprises a cloud data routing module, configured to route the target audio to a cloud ASR module, so that the cloud ASR module performs ASR processing on the target audio to generate a to-be-selected cloud ASR result;
the cloud arbitration module is used for determining a target cloud ASR result of the target audio from a plurality of to-be-selected cloud ASR results according to a preset cloud arbitration strategy; the cloud speech processing result to be selected further comprises a cloud NLU result to be selected, the plurality of speech processing capability modules further comprise at least one cloud NLU module, and the cloud data routing module is further used for routing the target cloud ASR result to the cloud NLU module so that the cloud NLU module performs NLU processing on the target cloud ASR result to generate the cloud NLU result to be selected;
the system comprises a plurality of corresponding scene service modules and a cloud end normalization engine module;
the distributed module is used for matching the corresponding scene service module and the cloud normalization engine module according to the identification of the target vehicle end corresponding to the to-be-selected cloud NLU result;
the matched scene service module is used for calling a corresponding cloud end normalization engine module to normalize the to-be-selected cloud end NLU result, and generate a target cloud end NLU result of the target vehicle end, so that a voice gateway corresponding to the target vehicle end sends the target cloud end NLU result to the target vehicle end.
2. The speech processing device according to claim 1, wherein in a case that there is one cloud ASR module, the cloud arbitration module is configured to determine the candidate cloud ASR result as the target cloud ASR result;
the cloud end arbitration strategy comprises a cloud end ASR arbitration strategy, and under the condition that the number of the cloud end ASR modules is multiple, the cloud end arbitration module is used for determining the target cloud end ASR result from multiple to-be-selected cloud end ASR results according to the cloud end ASR arbitration strategy.
3. The speech processing device according to claim 2, wherein the cloud ASR arbitration policy comprises arbitration based on timeliness and confidence of each candidate cloud ASR result, and timeliness is prioritized over confidence.
4. The speech processing device according to claim 1, further comprising a normalization call module, wherein the cloud data routing module is further configured to route the to-be-selected cloud NLU result to the normalization call module, so that the normalization call module calls the corresponding cloud normalization engine module through the matched scene service module.
5. The speech processing device according to claim 4, wherein the normalization call module is further configured to send the to-be-selected cloud NLU result to the distributed module, so that the distributed module matches the corresponding scene service module and cloud normalization engine module according to the vehicle end identifier corresponding to the to-be-selected cloud NLU result.
6. The speech processing device according to claim 4, wherein the plurality of speech processing capability modules comprise a plurality of personal speech assistants, the target cloud speech processing result comprises a cloud dialogue result of the target cloud NLU result, and the cloud arbitration policy comprises a cloud dialogue arbitration policy;
the cloud data routing module is further configured to route the target cloud NLU result to the cloud arbitration module, so that the cloud arbitration module determines a target personal voice assistant matching the target cloud NLU result from the multiple personal voice assistants according to the cloud dialogue arbitration policy, and the target personal voice assistant is configured to generate the cloud dialogue result according to the target cloud NLU result.
7. The speech processing apparatus of claim 6, wherein the cloud dialogue arbitration policy comprises arbitration based on the domain, dialogue intent, and keywords corresponding to the target cloud NLU result.
8. The speech processing apparatus of claim 6, wherein the target personal speech assistant comprises a car service provider personal speech assistant, the speech processing apparatus further comprises a cloud scene engine module, and the car service provider personal speech assistant invokes the cloud scene engine module through the scene service module to generate a cloud dialogue result of the target cloud NLU result.
9. The speech processing device according to claim 8, wherein the plurality of scenario service modules and the plurality of cloud-end scenario engine modules correspond to each other, the speech processing device further comprises a distributed module, and the car-service provider personal speech assistant is further configured to send the target cloud-end NLU result to the distributed module, so that the distributed module matches the corresponding scenario service module and the corresponding cloud-end scenario engine module according to the car-end identifier corresponding to the target cloud-end NLU result.
10. The speech processing device of claim 8, wherein the scene engine module is further configured to obtain a previous round of dialog results, and generate a current cloud-end dialog result according to the previous round of dialog results and the target cloud-end NLU result.
11. A cloud server, comprising the speech processing apparatus according to any one of claims 1 to 10.
12. A speech processing apparatus, applied to a vehicle side, the speech processing apparatus comprising:
the vehicle-end voice processing module is used for carrying out voice processing on the target audio to generate a vehicle-end voice processing result;
a vehicle-end arbitration module, configured to determine a target speech processing result from the vehicle-end speech processing result and the target cloud-end speech processing result according to a preset vehicle-end arbitration policy, where the target cloud-end speech processing result includes a target cloud-end ASR result and a target cloud-end NLU result, and the target cloud-end speech processing result is generated by the speech processing apparatus according to any one of claims 1 to 10 according to the target audio.
13. A vehicle characterized by comprising the speech processing apparatus of claim 12.
14. A speech processing system comprising the cloud server of claim 11 and the vehicle of claim 13.
CN202011600283.5A 2020-12-29 2020-12-29 Voice processing device and system, cloud server and vehicle Active CN112820295B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011600283.5A CN112820295B (en) 2020-12-29 2020-12-29 Voice processing device and system, cloud server and vehicle


Publications (2)

Publication Number Publication Date
CN112820295A CN112820295A (en) 2021-05-18
CN112820295B true CN112820295B (en) 2022-12-23

Family

ID=75855323



Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114063956A (en) * 2021-11-11 2022-02-18 上汽通用五菱汽车股份有限公司 Vehicle-mounted device and mobile terminal program interaction method, vehicle-mounted device and readable storage medium
CN115146615A (en) * 2022-09-02 2022-10-04 深圳联友科技有限公司 Natural language processing method, system, equipment and readable storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
US8645122B1 (en) * 2002-12-19 2014-02-04 At&T Intellectual Property Ii, L.P. Method of handling frequently asked questions in a natural language dialog service
CN110491383A (en) * 2019-09-25 2019-11-22 北京声智科技有限公司 A kind of voice interactive method, device, system, storage medium and processor
CN111464977A (en) * 2020-06-18 2020-07-28 华人运通(上海)新能源驱动技术有限公司 Voice scene updating method, device, terminal, server and system
CN111916070A (en) * 2019-05-10 2020-11-10 罗伯特·博世有限公司 Speech recognition using natural language understanding related knowledge via deep feedforward neural networks

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
US10817672B2 (en) * 2014-10-01 2020-10-27 Nuance Communications, Inc. Natural language understanding (NLU) processing based on user-specified interests
CN105551494A (en) * 2015-12-11 2016-05-04 奇瑞汽车股份有限公司 Mobile phone interconnection-based vehicle-mounted speech recognition system and recognition method
CN105843797A (en) * 2016-03-30 2016-08-10 乐视控股(北京)有限公司 Normalization method and device
US10984788B2 (en) * 2017-08-18 2021-04-20 Blackberry Limited User-guided arbitration of speech processing results
CN109949816A (en) * 2019-02-14 2019-06-28 安徽云之迹信息技术有限公司 Robot voice processing method and processing device, cloud server
US11462216B2 (en) * 2019-03-28 2022-10-04 Cerence Operating Company Hybrid arbitration system
CN110245221B (en) * 2019-05-13 2023-05-23 华为技术有限公司 Method and computer device for training dialogue state tracking classifier
CN110322882A (en) * 2019-05-13 2019-10-11 厦门亿联网络技术股份有限公司 A kind of method and system generating mixing voice data
CN110909543A (en) * 2019-11-15 2020-03-24 广州洪荒智能科技有限公司 Intention recognition method, device, equipment and medium


Non-Patent Citations (1)

Title
Improved Model and Tuning Methods for Natural Language Understanding in BERT-based Task-oriented Dialogue Systems; Zhou Qi'an et al.; Journal of Chinese Information Processing; 2020-05-15 (No. 05); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant