CN111768756A - Information processing method, information processing apparatus, vehicle, and computer storage medium - Google Patents

Information processing method, information processing apparatus, vehicle, and computer storage medium

Info

Publication number
CN111768756A
Authority
CN
China
Prior art keywords
information
target
style
converted
vehicle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010589864.7A
Other languages
Chinese (zh)
Other versions
CN111768756B (en)
Inventor
丁磊
郭刘飞
黄骏
周宏波
郭昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Human Horizons Shanghai Internet Technology Co Ltd
Original Assignee
Human Horizons Shanghai Internet Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Human Horizons Shanghai Internet Technology Co Ltd filed Critical Human Horizons Shanghai Internet Technology Co Ltd
Priority to CN202010589864.7A
Publication of CN111768756A
Application granted
Publication of CN111768756B
Legal status: Active (granted)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/0335: Pitch control
    • G10L 13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047: Architecture of speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Fittings On The Vehicle Exterior For Carrying Loads, And Devices For Holding Or Mounting Articles (AREA)

Abstract

The application discloses an information processing method applied to a vehicle in which a vehicle-mounted application and a speech synthesis (TTS) engine are installed. The method comprises the following steps: the vehicle-mounted application acquires information to be converted and converts it into information carrying a target voice style; the vehicle-mounted application then sends the information carrying the target voice style to the TTS engine, so that the TTS engine performs audio synthesis on that information and outputs the synthesized audio information.

Description

Information processing method, information processing apparatus, vehicle, and computer storage medium
Technical Field
The present application relates to the field of audio processing, and in particular, to an information processing method, apparatus, vehicle, and computer storage medium.
Background
With the development of intelligence, vehicle-mounted applications that raise the degree of intelligence have been added to vehicles, including intelligent scenarios in which a vehicle-mounted application controls sound production. However, how to make the sound-production effect more personalized through the control of the vehicle-mounted application, so that audio playing scenarios become richer, remains a problem to be solved.
Disclosure of Invention
In order to solve at least one of the above problems in the prior art, embodiments of the present application provide an information processing method, apparatus, vehicle, and computer storage medium.
In a first aspect, an embodiment of the present application provides an information processing method applied to a vehicle in which a vehicle-mounted application is installed and a speech synthesis (TTS) engine is provided, the method comprising:
the method comprises the steps that a vehicle-mounted application obtains information to be converted and converts the information to be converted into information carrying a target voice style;
and the vehicle-mounted application sends the information carrying the target voice style to a TTS engine so as to carry out audio synthesis on the information carrying the target voice style through the TTS engine and output the synthesized audio information.
In a second aspect, an embodiment of the present application provides an information processing apparatus, including:
the conversion module is used for acquiring information to be converted and converting the information to be converted into information carrying a target voice style;
and the TTS calling module is used for sending the information carrying the target voice style to a TTS engine so as to carry out audio synthesis on the information carrying the target voice style through the TTS engine and output the synthesized audio information.
In a third aspect, an embodiment of the present application provides a vehicle, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a method provided by any one of the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method provided by any one of the embodiments of the present application.
One embodiment in the above application has the following advantages or beneficial effects: the text information to be converted can be converted by the vehicle-mounted application to obtain information carrying a target voice style, and a TTS engine is then called to synthesize the audio information. The vehicle-mounted application thereby gains richer audio playing styles, more personalized sound-style information can be output during audio playback, personalized requirements are met, and the user's listening experience is improved.
Other effects of the above-described alternatives will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic flow chart diagram of an information processing method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of generating information carrying a target sound style according to the present application;
FIG. 3 is a schematic diagram of a processing scenario according to the information processing method of the present application;
FIG. 4 is a schematic view of another scenario processing according to the information processing method of the present application;
FIG. 5 is a first schematic structural diagram of an information processing apparatus according to an embodiment of the present application;
FIG. 6 is a second schematic structural diagram of an information processing apparatus according to an embodiment of the present application;
FIG. 7 is a third schematic structural diagram of an information processing apparatus according to an embodiment of the present application;
FIG. 8 is a fourth schematic structural diagram of an information processing apparatus according to an embodiment of the present application;
FIG. 9 is a fifth schematic structural diagram of an information processing apparatus according to an embodiment of the present application;
FIG. 10 is a sixth schematic structural diagram of an information processing apparatus according to an embodiment of the present application;
FIG. 11 is a seventh schematic structural diagram of an information processing apparatus according to another embodiment of the present application;
fig. 12 is a block diagram of a vehicle for implementing the information processing method of the embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
An embodiment of the present application provides an information processing method applied to a vehicle in which a vehicle-mounted application is installed and a speech synthesis (TTS) engine is provided. As shown in fig. 1, the method comprises the following steps:
s101: the method comprises the steps that a vehicle-mounted application obtains information to be converted and converts the information to be converted into information carrying a target voice style;
s102: and the vehicle-mounted application sends the information carrying the target voice style to a TTS engine so as to carry out audio synthesis on the information carrying the target voice style through the TTS engine and output the synthesized audio information.
In S101, the in-vehicle application may be one of a plurality of applications installed in a vehicle. For example, the in-vehicle application may be a news application, an encyclopedia application, or the like.
The information to be converted acquired by the vehicle-mounted application may be text information, audio information, or grammar information to be converted.
The information to be converted may be acquired from manually input text or from a detected voice signal; it may also be acquired from the cloud.
In S101, the information to be converted is converted into information carrying a target voice style; it may be converted into information carrying a single target voice style, or into information carrying two or more different target voice styles.
In S101, converting the information to be converted into information carrying the target voice style may include:
In the case that the information to be converted is text information, the vehicle-mounted application determines a target sound style corresponding to the text information to be converted, labels the audio-related attributes of the text according to the target sound style to obtain a target SSML (Speech Synthesis Markup Language) document, and takes the target SSML as the information carrying the target sound style; that is, the information carrying the target sound style may be the target SSML.
The target sound style may be determined according to user settings. For example, the vehicle-mounted application may provide a menu interface for selecting a sound style, and the style the user selects on this interface is taken as the target sound style for the current conversion. In the embodiments of the present application, the target sound style may be one of multiple preset styles, for example six preset styles, from which the user selects one. Specifically, if the user selects the imitation-show module on the menu interface, the target sound style is the sound style corresponding to the imitation-show mode; if the user selects the repeater module, the target sound style is the sound style corresponding to the repeater mode; and if the user selects the strongest-drama-editing module, the target sound style is the sound style corresponding to the strongest-drama-editing mode.
The target sound style may also be determined according to the context of the information to be converted. For example, where the information to be converted is text information, a manually set framework for the text may be used: the framework of the text corresponds to a certain context, and that framework in turn corresponds to a certain target sound style.
In the case that the information to be converted is text information, the target sound style may also be generated according to modification information from the cloud: the cloud automatically detects the content of the text to be converted and adjusts the target sound style according to that content.
In addition, the vehicle-mounted application may have a default sound style. Accordingly, the target sound style may be determined as follows: if the user does not select a target sound style, the default sound style is adopted directly as the target sound style (this default may be related to the type of the vehicle-mounted application); if the user does select a sound style for the current processing, the selected style is taken as the target sound style. For example, if the default sound style is the style corresponding to the imitation-show mode or the repeater mode, that style is adopted directly whenever the user makes no selection.
The target sound style may also be determined from the content of the information to be converted. For example, if the content includes a multi-person conversation, the information may be identified as a drama and the target sound style determined to be the strongest-drama style. As another example, if the content includes celebrity names, the information may be recognized as an imitation-show text and the target sound style determined to be the imitation-show style. If the information to be converted contains much repeated content or is very short, it may be identified as information to be repeated, and the target sound style is correspondingly determined to be the repeater style.
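As a concrete illustration of these content-based rules, the following is a minimal Java sketch; the style names, the celebrity dictionary, the dialogue pattern, and the thresholds are assumptions for illustration and are not specified by the present application.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Rule-based style selection sketch, mirroring the examples above. */
public class StyleClassifier {

    public enum SoundStyle { STRONGEST_DRAMA, IMITATION_SHOW, REPEATER, DEFAULT }

    // Hypothetical celebrity dictionary; a production system would load a real one.
    private static final List<String> CELEBRITY_NAMES = List.of("Alice Star", "Bob Famous");

    // Lines like "A: hello" suggest a multi-person conversation, i.e. a drama.
    private static final Pattern DIALOGUE_LINE = Pattern.compile("(?m)^\\s*\\S+\\s*[::].+$");

    public static SoundStyle classify(String text) {
        long dialogueLines = DIALOGUE_LINE.matcher(text).results().count();
        if (dialogueLines >= 2) {
            return SoundStyle.STRONGEST_DRAMA;   // multi-person conversation detected
        }
        if (CELEBRITY_NAMES.stream().anyMatch(text::contains)) {
            return SoundStyle.IMITATION_SHOW;    // celebrity name detected
        }
        if (text.length() < 10 || hasRepeatedContent(text)) {
            return SoundStyle.REPEATER;          // too short or repetitive (assumed threshold)
        }
        return SoundStyle.DEFAULT;
    }

    private static boolean hasRepeatedContent(String text) {
        // Simplistic repetition check: the first two sentences are identical.
        String[] sentences = text.split("[。.!?!?]+");
        return sentences.length >= 2
                && sentences[0].strip().equals(sentences[1].strip());
    }
}
```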
In one example, the process of converting the information to be converted into the information carrying the target voice style may include the following steps as shown in fig. 2:
step S201: SSML is created.
Creating the information carrying the target sound style may include adding information such as the version, the language, and a URI (Uniform Resource Identifier), and adding the output voice. For example, the specification version used to interpret the document markup, the language of the root document, and the URI of the document that defines the markup vocabulary of the document carrying the information of the target sound style may be specified.
Step S202: adjusting the selected voice according to the target sound style, and selecting the service information.
For example, the voice corresponding to the information to be converted is recorded, and the recorded voice is adjusted according to the target sound style so that its style becomes the target sound style.
A "service" may be understood as the audio-attribute-related information of the present application; that is, the target sound style ultimately corresponds to audio-attribute-related information.
It should be understood that different styles may correspond to different services. A service may include at least one of speech rate, intonation, pitch, pauses, and the like.
For example, the service information may include at least one of: adding or deleting interruptions/pauses in the speech; specifying paragraphs and sentences in the speech; using phonemes to improve pronunciation; using a user-defined dictionary to improve pronunciation; adjusting prosody; changing the speech rate; changing the volume; changing the pitch; changing the pitch contour; adding recorded audio; adding background audio; and so on.
Step S203: generating the target SSML. The information carrying the target sound style may be generated based on the foregoing selections and adjustments. In the embodiments of the present application, the information carrying the target sound style may include labels for certain audio-related attributes, and these audio-related attributes may be understood as the service information, such as speech rate, intonation, pitch, background music, and the like. Each style fixes certain settings of the service information; for example, a serious style and an entertainment style may differ in at least one of speech rate, intonation, pitch, background music, and so on.
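To make steps S201 to S203 concrete, the following is a minimal Java sketch of building a target SSML document; the StyleProfile type and its field values are assumptions, while the <speak>, <prosody>, and <audio> elements are standard SSML 1.0 markup.

```java
/**
 * Minimal sketch of S201-S203: build a target SSML string from the text to
 * be converted and a style's service settings. StyleProfile is an assumed
 * carrier for the per-style settings (rate, pitch, volume, background audio).
 */
public class SsmlBuilder {

    public record StyleProfile(String rate, String pitch, String volume,
                               String backgroundAudioUrl) {}

    public static String buildTargetSsml(String text, StyleProfile style) {
        StringBuilder ssml = new StringBuilder();
        // S201: create the document with version, language, and markup vocabulary URI.
        ssml.append("<speak version=\"1.0\" ")
            .append("xmlns=\"http://www.w3.org/2001/10/synthesis\" ")
            .append("xml:lang=\"zh-CN\">");
        // S202: apply the service settings (rate, pitch, volume) for the style.
        // (In a real implementation the text should be XML-escaped first.)
        ssml.append("<prosody rate=\"").append(style.rate())
            .append("\" pitch=\"").append(style.pitch())
            .append("\" volume=\"").append(style.volume()).append("\">")
            .append(text)
            .append("</prosody>");
        // Optional background audio, if the style defines one.
        if (style.backgroundAudioUrl() != null) {
            ssml.append("<audio src=\"").append(style.backgroundAudioUrl()).append("\"/>");
        }
        // S203: close the document; the result is the "target SSML".
        ssml.append("</speak>");
        return ssml.toString();
    }
}
```

For a serious style, the profile might use a slow rate and a lowered pitch, while an entertainment style might raise both; as stated above, only at least one service setting needs to differ between styles.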
Based on the above description of the process of generating the target SSML, in an embodiment of the present application, the acquiring information to be converted includes:
acquiring audio information to be converted;
correspondingly, the converting the information to be converted into the information carrying the target voice style includes:
determining a target sound style according to the audio special effect label of the audio information to be converted;
and generating a first target SSML according to the labels of the audio-related attributes corresponding to the target sound style, and taking the first target SSML as first information carrying the target sound style.
The audio information to be converted may be acquired through an in-vehicle sound collection device, according to the user's selection of audio stored in a storage medium, or according to the user's selection of audio stored in the cloud. For example, the user selects the audio of a recited poem from the cloud as the audio to be converted, the user reads a passage of a script aloud and records it, or a conversation between the user and others is recorded to obtain the audio to be converted.
The audio special effect tag may be an audio special effect tag added according to a selection of a user.
The audio special effect tag may be the specific information of the audio special effect, or may be audio-special-effect code information. There may be a many-to-one correspondence between audio special effect tags and sound styles. For example, audio special effect tags A, B, and C correspond to a first sound style, and tags D, E, and F correspond to a second sound style; if the audio special effect tag of the audio to be converted is A, the target sound style is determined to be the first sound style. Of course, the tags and the sound styles may also correspond one to one, which is not repeated here.
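The many-to-one correspondence described above can be held in a simple lookup table; a minimal sketch follows, in which the tag and style names are the illustrative A-F labels from the example rather than values defined by the present application.

```java
import java.util.Map;

/** Many-to-one mapping from audio special effect tags to sound styles. */
public class EffectTagStyleMapper {

    // Tags A-C map to the first style, D-F to the second, as in the example above.
    private static final Map<String, String> TAG_TO_STYLE = Map.of(
            "A", "style-1", "B", "style-1", "C", "style-1",
            "D", "style-2", "E", "style-2", "F", "style-2");

    public static String resolveTargetStyle(String effectTag) {
        String style = TAG_TO_STYLE.get(effectTag);
        if (style == null) {
            throw new IllegalArgumentException("Unknown effect tag: " + effectTag);
        }
        return style;
    }
}
```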
In one embodiment, the obtaining information to be converted includes:
the vehicle-mounted application acquires collected voice information;
the vehicle-mounted application converts the collected voice information to obtain text information corresponding to the voice information; and taking the text information as information to be converted.
The collected voice information may be voice information recorded by a voice recording device. For example, a user uses a recording device to record his or her conversation with another person as voice information.
The collected voice information may also be voice information acquired from other applications. For example, the in-vehicle application acquires the broadcasted voice information through a broadcast application. For another example, the in-vehicle application downloads voice information on the internet through a web browser.
The collected voice information may also be received voice information. For example, the vehicle-mounted application receives, over the Internet, voice information sent by other users of the same vehicle-mounted application.
And the vehicle-mounted application converts the collected voice information to obtain text information corresponding to the voice information.
Specifically, the voice information is converted into text information recognizable by a computer through automatic speech recognition (ASR). In actual processing, other manners may also be adopted to convert the voice information; any manner capable of converting the audio information into text information falls within the protection scope of the embodiments of the present application.
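A minimal sketch of this speech-to-text step follows; since no specific ASR engine is named, the SpeechRecognizer interface is a hypothetical stand-in for whatever recognizer the vehicle provides.

```java
/** Sketch of converting collected voice into the information to be converted. */
public class VoiceToTextConverter {

    /** Hypothetical ASR engine abstraction; not an API named by the application. */
    public interface SpeechRecognizer {
        String recognize(byte[] pcmAudio);   // returns the recognized text
    }

    private final SpeechRecognizer recognizer;

    public VoiceToTextConverter(SpeechRecognizer recognizer) {
        this.recognizer = recognizer;
    }

    /** The ASR result text becomes the information to be converted. */
    public String toInformationToBeConverted(byte[] collectedVoice) {
        return recognizer.recognize(collectedVoice);
    }
}
```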
In one embodiment, the obtaining information to be converted includes:
the vehicle-mounted application acquires collected voice information;
the vehicle-mounted application converts the collected voice information to obtain text information corresponding to the voice information; taking the text information as information to be converted;
meanwhile, the converting the information to be converted into the information carrying the target voice style comprises the following steps:
determining a target sound style according to the audio special effect label of the information to be converted;
and labeling the audio-related attributes of the information to be converted according to the target sound style to generate a second target SSML, and taking the second target SSML as second information carrying the target sound style. The process of generating the second target SSML is the same as the process of fig. 2 described above and is not repeated here.
In one embodiment, the obtaining information to be converted includes:
the vehicle-mounted application acquires a target script frame;
the vehicle-mounted application acquires a target text and takes the target text as information to be converted.
The target script framework may be selected by the user from the predetermined script frameworks provided by the vehicle-mounted application. For example, the user selects the strongest-drama-editing module in the vehicle-mounted application and, after entering it, selects one of the predetermined script frameworks as the target script framework. After the target script framework is determined, the user may manually input a target text based on the prompts of the script, and the vehicle-mounted application determines the target style and the information to be converted based on the target text and the target script framework.
The target text may be text generated in conjunction with the target scenario framework.
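The combination of target script framework and target text can be pictured as a simple template fill; in the sketch below the ScriptFramework record, its template placeholder, and the style field are assumptions used for illustration, since the application states only that a framework corresponds to a context and hence to a sound style.

```java
/** Sketch of combining a script framework with user text. */
public class ScriptComposer {

    // Assumed shape: a framework carries a name, the style it implies,
    // and a template with a placeholder for the user's text.
    public record ScriptFramework(String name, String soundStyle, String template) {}

    public record ConversionInput(String informationToBeConverted, String targetStyle) {}

    public static ConversionInput compose(ScriptFramework framework, String userText) {
        // The framework's template provides the context; the user text fills it in.
        String finalCopy = framework.template().replace("{USER_TEXT}", userText);
        // The framework determines the target sound style.
        return new ConversionInput(finalCopy, framework.soundStyle());
    }
}
```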
In one embodiment, the obtaining information to be converted includes:
the vehicle-mounted application acquires a target script frame;
the vehicle-mounted application acquires a target text, and the target text is used as information to be converted; meanwhile, the converting the information to be converted into the information carrying the target voice style includes:
determining a target sound style according to the target script frame;
and labeling the audio-related attributes of the information to be converted according to the target sound style to generate a third target SSML, and taking the third target SSML as third information carrying the target sound style. The manner in which the third target SSML is generated is the same as that of the target SSML in fig. 2 described above and is not repeated here.
Based on the above processing, further, in an embodiment, the method further includes:
the vehicle-mounted application detects a playing instruction and sends a calling request to a TTS engine;
and the vehicle-mounted application determines whether to send the information carrying the target voice style to the TTS engine based on the information fed back by the TTS engine.
Specifically, the vehicle-mounted application determines to send the information carrying the target voice style to the TTS engine based on the information fed back by the TTS engine.
Here, the information carrying the target sound style may be the foregoing: the first information carrying the target sound style, the second information carrying the target sound style, and the third information carrying the target sound style.
The playing instruction may be triggered, once the user has confirmed the information to be converted, by pressing a virtual play-control key in the control-key area of the display interface of the vehicle-mounted application, or by pressing a designated physical key in the vehicle.
The playing instruction may also be issued as voice information once the user has confirmed the information to be converted. After the user speaks, the voice information is collected by a sound collection unit and voice recognition is performed to obtain voice instruction information; if the voice instruction information indicates that the input text information should be played, it is understood as a playing instruction.
In a specific implementation of the foregoing embodiment, the target style may be serious, humorous, entertaining, and so on. After the vehicle-mounted application sends the call request to the TTS engine, it determines, based on the information the TTS engine feeds back for the call request, whether to send the information carrying the target voice style to the TTS engine, so that the TTS engine can perform audio synthesis according to the information carrying the target voice style and output the synthesized audio information.
For example, a voice imitation-show module (or function, or option) exists in the vehicle-mounted application. After the user clicks to enter the imitation-show module, an information selection interface is displayed, on which the user can further select the imitation-show mode or the repeater mode. When the user selects the imitation-show mode, the target sound style may be the sound style corresponding to the imitation-show mode; when the user selects the repeater mode, the target sound style may be the sound style corresponding to the repeater mode. When the user confirms the selected mode and clicks to issue the selection instruction, the information to be converted is acquired and converted into information carrying the target sound style corresponding to the selected mode. The user may also press the virtual play key to send a playing instruction. After receiving the playing instruction, the vehicle-mounted application sends a call request to the TTS engine and then determines, based on the feedback for the call request, whether to send the information carrying the target voice style to the TTS engine, so that the TTS engine can perform audio synthesis and output the synthesized audio information.
In one embodiment, the sending the invocation request to the TTS engine further comprises:
the vehicle-mounted application determines a voice output channel and sends a calling request aiming at the voice output channel to the TTS engine.
Further, when sending the call request to the TTS engine, the vehicle-mounted application may select a corresponding audio output channel based on the currently determined target voice style and then send a call request for that audio output channel to the TTS engine; when the audio output channel can output audio, the vehicle-mounted application sends the information carrying the target voice style to the TTS engine.
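The call-request handshake described above might look as follows; the TtsEngine interface and the channel naming are assumptions, since the application only specifies the behavior (request a channel, and send the styled information once the engine feeds back that the channel can output audio).

```java
/** Sketch of the call-request handshake with the TTS engine. */
public class TtsInvoker {

    /** Hypothetical engine abstraction reflecting the behavior described above. */
    public interface TtsEngine {
        /** Request an audio output channel; returns true if it can output audio. */
        boolean requestChannel(String channelId);
        /** Synthesize and play the information carrying the target voice style. */
        void synthesize(String targetSsml);
    }

    private final TtsEngine engine;

    public TtsInvoker(TtsEngine engine) {
        this.engine = engine;
    }

    public boolean play(String targetSsml, String voiceStyle) {
        // Choose a channel for the current target voice style (assumed naming scheme).
        String channelId = "channel-" + voiceStyle;
        // Send the call request; only send the SSML if the engine feeds back success.
        if (!engine.requestChannel(channelId)) {
            return false;   // call failed; do not send the styled information
        }
        engine.synthesize(targetSsml);
        return true;
    }
}
```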
Furthermore, in the processing of the TTS engine, the target audio information corresponding to the information carrying the target voice style may be added to the output queue of the corresponding audio output channel, and the target audio information corresponding to the target sound style in the output queue is then output through that audio output channel.
It should further be noted that the processing of the TTS engine may include performing speech synthesis in combination with a speech model corresponding to the target voice style. The speech model may be preset locally in the vehicle, or may reside in the cloud.
For example, the TTS engine of the vehicle may send the information carrying the target voice style to the cloud; the cloud selects the corresponding speech model to synthesize it and feeds the synthesized audio back to the vehicle in sequence, where it is output through the audio output channel corresponding to the vehicle's TTS engine. This mode may be adopted when the vehicle can connect to the cloud, when the communication quality between the vehicle and the cloud is good, or when the user has configured the vehicle to use the cloud whenever it is connectable and the communication quality exceeds a preset threshold.
For another example, the TTS engine of the vehicle may perform speech synthesis on the information carrying the target voice style directly using a local speech model. In this case, the local speech model may be a speech model that is updated whenever the cloud is connectable, and/or a speech model preset locally.
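The choice between the two modes can be expressed as a small routing rule; in the sketch below the connectivity probe, the quality scale, and the threshold value are assumptions, reflecting only the conditions named above.

```java
/** Sketch of routing synthesis to the cloud or the local speech model. */
public class SynthesisRouter {

    public enum Route { CLOUD, LOCAL }

    private static final int QUALITY_THRESHOLD = 70;  // assumed 0-100 quality scale

    public static Route choose(boolean cloudReachable, int linkQuality,
                               boolean userAllowsCloud) {
        if (cloudReachable && userAllowsCloud && linkQuality > QUALITY_THRESHOLD) {
            return Route.CLOUD;   // cloud picks the speech model and streams audio back
        }
        return Route.LOCAL;       // fall back to the locally preset/updated model
    }
}
```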
In addition, the method further comprises: when the vehicle-mounted application detects a play-pause instruction, sending the play-pause instruction to the TTS engine so as to control the TTS engine to pause audio synthesis.
The play-pause instruction may be generated as follows: the vehicle-mounted application generates a play-pause instruction after detecting a click on the pause virtual key in the playing interface or control interface it currently displays; or the vehicle-mounted application generates a play-pause instruction after detecting a press of a designated physical play-pause key in the vehicle.
In one embodiment, the method further comprises:
when a request for canceling audio playing sent by the first application is received, controlling to stop audio synthesis;
and/or when a request for canceling the audio output channel sent by the first application is received, controlling to release the audio output channel.
That is, the TTS invocation can be cancelled, and audio synthesis cancelled, according to the actual needs of the first application side.
An example of the present application, as shown in fig. 3, may include:
the vehicle-mounted application supports a voice simulation show mode, the vehicle-mounted application (such as an AI voice skill APP) has an AI (Artificial Intelligence) voice skill, and the running hardware basis can be a chip (IDCM) of the vehicle-mounted application.
The vehicle-mounted application may have multiple selectable functions (or selectable modules, such as A, B, C in the figure and a sound imitation-show module), which can be entered upon user selection.
After entering the page of the sound imitation-show module, the user can manually click to select either the "imitation-show mode" or the "repeater mode". After the user makes a selection, a voice recorder is invoked to capture the audio, or the speech capability is invoked to obtain an ASR result.
Specifically:
If the user selects the "imitation-show mode", the microphone is opened automatically, the user is prompted by voice or text to speak for a period of time, and the recorder is called to capture the audio of the user's speech, which is taken as the information to be converted.
The vehicle-mounted application then applies to call the TTS engine; if the TTS engine can currently be called, it feeds back a successful call to the vehicle-mounted application, otherwise it feeds back a call failure.
After receiving the feedback of a successful call, the vehicle-mounted application determines the corresponding information carrying the target sound style according to the audio and its special effect tag, and sends that information to the TTS engine for audio synthesis and playback.
If the user selects the "repeater mode", the microphone is opened automatically, the user is prompted by voice or text to speak for a period of time, and the recorder is called to capture the audio of the user's speech; the vehicle-mounted application then performs ASR processing on the captured audio to obtain the corresponding text information.
The vehicle-mounted application then applies to call the TTS engine; if the TTS engine can currently be called, it feeds back a successful call, otherwise it feeds back a call failure.
After receiving the feedback of a successful call, the vehicle-mounted application determines the corresponding information carrying the target sound style according to the text information and its special effect tag, and sends that information to the TTS engine for audio synthesis and playback.
In still another example of the present application, as shown in fig. 4, the method may include:
The vehicle-mounted application may have multiple selectable functions (or selectable modules, such as A, B, C in the figure and the strongest-drama-editing module); it enters the strongest-drama-editing module according to the user's selection and then displays a script-selection interface.
A script framework is selected on the script-selection interface. The selected script framework may determine a text context, so that text under the framework exhibits the output effect of the context corresponding to that framework; that is, the script framework may correspond to a certain style.
After the script framework is selected, the vehicle-mounted application may display an input page and prompt the user to enter text on it. The vehicle-mounted application combines the text and the script framework to obtain the final copy corresponding to the strongest-drama-editing module. The final copy is converted into the information carrying the target voice style (i.e., the target SSML).
After the copy is obtained, the user can manually operate the "begin performance" button; once the operation takes effect, the vehicle-mounted application applies to call the TTS engine.
If the TTS engine can currently be called, it feeds back a successful call to the vehicle-mounted application, otherwise it feeds back a call failure. After receiving the feedback of a successful call, the vehicle-mounted application sends the information carrying the target voice style (the copy, or target SSML) to the TTS engine, which synthesizes and plays the audio.
The information carrying the target voice style may be sent sentence by sentence (or word by word), with each sentence carrying its own voice-style information, so that the TTS engine performs sentence-by-sentence synthesis according to this information.
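A minimal sketch of this sentence-by-sentence sending follows; the sentence-splitting rule and the per-sentence style lookup are assumptions, since the application states only that each sentence is sent together with its own voice-style information.

```java
import java.util.List;
import java.util.function.Function;

/** Sketch of streaming the final copy to the TTS engine one sentence at a time. */
public class SentenceStreamer {

    public interface TtsSink {
        void synthesizeSentence(String sentence, String voiceStyle);
    }

    public static void stream(String finalCopy,
                              Function<String, String> perSentenceStyle,
                              TtsSink tts) {
        // Split on common Chinese/Latin sentence terminators (assumed rule).
        List<String> sentences = List.of(finalCopy.split("(?<=[。!?.!?])"));
        for (String sentence : sentences) {
            String s = sentence.strip();
            if (s.isEmpty()) continue;
            // Each sentence is sent together with the voice style it carries.
            tts.synthesizeSentence(s, perSentenceStyle.apply(s));
        }
    }
}
```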
In addition, after receiving feedback that TTS synthesis succeeded, the vehicle-mounted application may randomly select a track from the background sound library as background audio information and call the Android system's media player to play it, so that the background audio is mixed with the synthesized audio output by the TTS engine. It should be noted that during mixing the strongest-drama audio (i.e., the audio synthesized by the TTS engine) is primary and the background audio information is secondary; that is, the TTS audio output sounds louder than the played background audio information.
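Since the Android media player is named explicitly, the background playback can be sketched with android.media.MediaPlayer as below; the background library, the looping behavior, and the 0.3 volume ratio are assumptions chosen only so that the TTS output remains louder than the background audio, as the text requires.

```java
import android.content.Context;
import android.media.MediaPlayer;
import android.net.Uri;
import java.util.List;
import java.util.Random;

/** Sketch of background-audio playback alongside the TTS output. */
public class BackgroundMixer {

    private final Random random = new Random();

    public MediaPlayer playBackground(Context context, List<Uri> backgroundLibrary) {
        // Randomly pick one track from the background sound library.
        Uri track = backgroundLibrary.get(random.nextInt(backgroundLibrary.size()));
        MediaPlayer player = MediaPlayer.create(context, track);
        // Keep the background quieter than the TTS audio (assumed 0.3 ratio).
        player.setVolume(0.3f, 0.3f);
        player.setLooping(true);
        player.start();
        return player;
    }
}
```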
In the embodiments of the present application, the information to be converted can be converted by the vehicle-mounted application into information carrying a target voice style, after which a TTS engine is called to synthesize the audio information. In this process, the user may choose audio, speech, or a script as the information to be converted, so that the synthesized audio information can have a variety of styles, such as imitation show, strongest drama, and the like. The vehicle-mounted application thus gains richer audio playing styles, more personalized sound-style information can be output during audio playback, and personalized requirements are met.
An embodiment of the present application further provides an information processing apparatus, as shown in fig. 5, including:
the conversion module 51 is configured to acquire information to be converted, and convert the information to be converted into information carrying a target sound style;
the TTS invoking module 52 is configured to send the information with the target voice style to a TTS engine, so as to perform audio synthesis on the information with the target voice style through the TTS engine and output the synthesized audio information.
In one embodiment, as shown in fig. 6, the conversion module 51 comprises:
a first obtaining unit 61, configured to obtain audio information to be converted;
a first style unit 62, configured to determine a target sound style according to the audio special effect tag of the audio information to be converted;
and the first SSML unit 63 is configured to generate a first target SSML according to the labels of the audio-related attributes corresponding to the target sound style, and take the first target SSML as first information carrying the target sound style.
In one embodiment, as shown in fig. 7, the conversion module 51 further includes:
a second obtaining unit 71, configured to obtain the collected voice information;
the voice conversion unit 72 is configured to convert the acquired voice information to obtain text information corresponding to the voice information; and taking the text information as information to be converted.
In one embodiment, as shown in fig. 8, the conversion module further comprises:
the second style unit 81 is configured to determine a target sound style according to the audio special effect label of the information to be converted;
and the second SSML unit 82 is configured to perform audio-related attribute labeling on the information to be converted according to the target sound style to generate a second target SSML, and use the second target SSML as second information carrying the target sound style.
In one embodiment, as shown in fig. 9, the conversion module 51 further includes:
a third acquiring unit 91, configured to acquire a target scenario frame;
the text unit 92 is configured to acquire a target text, and use the target text as information to be converted.
In one embodiment, as shown in fig. 10, the conversion module 51 includes:
a third style unit 1001 for determining a target sound style according to the target scenario frame;
and a third SSML unit 1002, configured to perform audio-related attribute labeling on the information to be converted according to the target sound style to generate a third target SSML, and use the third target SSML as third information carrying a target sound style.
In one embodiment, as shown in fig. 11, the apparatus further comprises:
the calling module 1101 is configured to detect a play instruction and send a calling request to the TTS engine;
a sending module 1102, configured to determine whether to send the information carrying the target voice style to a TTS engine based on the information fed back by the TTS engine.
In one embodiment, the sending module is further configured to:
the vehicle-mounted application determines a voice output channel and sends a calling request aiming at the voice output channel to the TTS engine.
As shown in fig. 12, is a block diagram of a vehicle according to an information processing method of an embodiment of the present application. The vehicle is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The vehicle may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 12, the vehicle includes: one or more processors 1201, a memory 1202, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the vehicle, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used with multiple memories, as desired. Likewise, multiple vehicles may be connected, with each device providing some of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 12 illustrates an example with one processor 1201.
Memory 1202 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by the at least one processor, so that the at least one processor executes the information processing method provided by the application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the information processing method provided by the present application.
The memory 1202, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the information processing method in the embodiments of the present application (e.g., the units shown in fig. 5 and fig. 6). The processor 1201 executes the various functional applications and data processing by running the non-transitory software programs, instructions, and modules stored in the memory 1202, that is, implements the information processing method of the above method embodiments.
The memory 1202 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created according to the use of the vehicle, and the like. Further, the memory 1202 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory 1202 optionally includes memory located remotely from the processor 1201, which may be connected to the vehicle via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The vehicle of the information processing method may further include: an input device 1203 and an output device 1204. The processor 1201, the memory 1202, the input device 1203, and the output device 1204 may be connected by a bus or other means, and the bus connection is exemplified in fig. 12.
The input device 1203 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the vehicle, and may be, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, or a joystick. The output device 1204 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that the various forms of flow shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present application is not limited in this respect as long as the desired results of the technical solutions disclosed herein can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (18)

1. An information processing method applied to a vehicle in which an in-vehicle application is installed and in which a speech synthesis TTS engine is installed, characterized by comprising:
the method comprises the steps that a vehicle-mounted application obtains information to be converted and converts the information to be converted into information carrying a target voice style;
and the vehicle-mounted application sends the information carrying the target voice style to a TTS engine so as to carry out audio synthesis on the information carrying the target voice style through the TTS engine and output the synthesized audio information.
2. The method of claim 1, wherein the obtaining information to be converted comprises:
acquiring audio information to be converted;
correspondingly, the converting the information to be converted into the information carrying the target voice style includes:
determining a target sound style according to the audio special effect label of the audio information to be converted;
and generating a first target SSML according to the labels of the audio-related attributes corresponding to the target sound style, and taking the first target SSML as first information carrying the target sound style.
3. The method of claim 1, wherein the obtaining information to be converted comprises:
the vehicle-mounted application acquires collected voice information;
the vehicle-mounted application converts the collected voice information to obtain text information corresponding to the voice information; and taking the text information as information to be converted.
4. The method of claim 3, wherein the converting the information to be converted into information carrying a target voice style comprises:
determining a target sound style according to the audio special effect label of the information to be converted;
and labeling the audio-related attributes of the information to be converted according to the target sound style to generate a second target SSML, and taking the second target SSML as second information carrying the target sound style.
5. The method of claim 1, wherein the obtaining information to be converted comprises:
the vehicle-mounted application acquires a target script frame;
the vehicle-mounted application acquires a target text and takes the target text as information to be converted.
6. The method of claim 5, wherein the converting the information to be converted into information carrying a target voice style comprises:
determining a target sound style according to the target script frame;
and labeling the audio-related attributes of the information to be converted according to the target sound style to generate a third target SSML, and taking the third target SSML as third information carrying the target sound style.
7. The method of any of claims 1-6, wherein the method further comprises:
the vehicle-mounted application detects a playing instruction and sends a calling request to a TTS engine;
and the vehicle-mounted application determines whether to send the information carrying the target voice style to the TTS engine based on the information fed back by the TTS engine.
8. The method of claim 7, wherein said sending a call request to a TTS engine further comprises:
the vehicle-mounted application determines a voice output channel and sends a calling request aiming at the voice output channel to the TTS engine.
9. An information processing apparatus characterized by comprising:
the conversion module is used for acquiring information to be converted and converting the information to be converted into information carrying a target voice style;
and the TTS calling module is used for sending the information carrying the target voice style to a TTS engine so as to carry out audio synthesis on the information carrying the target voice style through the TTS engine and output the synthesized audio information.
10. The apparatus of claim 9, wherein the conversion module comprises:
the first acquisition unit is used for acquiring audio information to be converted;
the first style unit is used for determining a target sound style according to the audio special effect label of the audio information to be converted;
and the first SSML unit is used for generating a first target SSML according to the labels of the audio-related attributes corresponding to the target sound style, and taking the first target SSML as first information carrying the target sound style.
11. The apparatus of claim 9, wherein the conversion module further comprises:
the second acquisition unit is used for acquiring the acquired voice information;
the voice conversion unit is used for converting the collected voice information to obtain text information corresponding to the voice information; and taking the text information as information to be converted.
12. The apparatus of claim 11, wherein the conversion module further comprises:
the second style unit is used for determining a target sound style according to the audio special effect label of the information to be converted;
and the second SSML unit is used for labeling the audio-related attributes of the information to be converted according to the target voice style to generate a second target SSML, the second target SSML being used as second information carrying the target voice style.
13. The apparatus of claim 9, wherein the conversion module further comprises:
a third acquisition unit configured to acquire a target script frame;
and a text unit configured to acquire a target text and use the target text as the information to be converted.
14. The apparatus of claim 13, wherein the conversion module comprises:
a third style unit configured to determine a target voice style according to the target script frame;
and a third SSML unit configured to mark audio-related attributes of the information to be converted according to the target voice style to generate a third target SSML, and to use the third target SSML as third information carrying the target voice style.
15. The apparatus of any of claims 9-14, wherein the apparatus further comprises:
a calling module configured to detect a playing instruction and send a call request to the TTS engine;
and a sending module configured to determine whether to send the information carrying the target voice style to the TTS engine based on information fed back by the TTS engine.
16. The apparatus of claim 15, wherein the sending module is further configured to:
determine a voice output channel and send a call request for the voice output channel to the TTS engine.
17. A vehicle, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.
CN202010589864.7A 2020-06-24 2020-06-24 Information processing method, information processing device, vehicle and computer storage medium Active CN111768756B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010589864.7A CN111768756B (en) 2020-06-24 2020-06-24 Information processing method, information processing device, vehicle and computer storage medium


Publications (2)

Publication Number Publication Date
CN111768756A (en) 2020-10-13
CN111768756B CN111768756B (en) 2023-10-20

Family

ID=72721802

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010589864.7A Active CN111768756B (en) 2020-06-24 2020-06-24 Information processing method, information processing device, vehicle and computer storage medium

Country Status (1)

Country Link
CN (1) CN111768756B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007052043A (en) * 2005-08-15 2007-03-01 Nippon Telegr & Teleph Corp (NTT) Voice interactive scenario generation method and system, and voice interactive scenario generation program and recording medium
CN101814288A (en) * 2009-02-20 2010-08-25 富士通株式会社 Method and equipment for self-adaption of speech synthesis duration model
CN102201233A (en) * 2011-05-20 2011-09-28 北京捷通华声语音技术有限公司 Mixed and matched speech synthesis method and system thereof
CN103366731A (en) * 2012-03-31 2013-10-23 盛乐信息技术(上海)有限公司 Text to speech (TTS) method and system
CN104200803A (en) * 2014-09-16 2014-12-10 北京开元智信通软件有限公司 Voice broadcasting method, device and system
US20150228264A1 (en) * 2014-02-11 2015-08-13 Samsung Electronics Co., Ltd. Method and device for changing interpretation style of music, and equipment
JP2017117045A (en) * 2015-12-22 2017-06-29 日本電信電話株式会社 Method, device, and program for language probability calculation
CN107451115A (en) * 2017-07-11 2017-12-08 中国科学院自动化研究所 The construction method and system of Chinese Prosodic Hierarchy forecast model end to end
CN108231062A (en) * 2018-01-12 2018-06-29 iFLYTEK Co., Ltd. Voice translation method and device
CN108833460A (en) * 2018-04-10 2018-11-16 平安科技(深圳)有限公司 Music distribution method, apparatus and terminal device based on block chain
KR20190094314A (en) * 2019-05-21 2019-08-13 엘지전자 주식회사 An artificial intelligence apparatus for generating text or speech having content-based style and method for the same
WO2019213177A1 (en) * 2018-04-30 2019-11-07 Ramaci Jonathan E Vehicle telematic assistive apparatus and system
US20200152194A1 (en) * 2018-11-14 2020-05-14 Samsung Electronics Co., Ltd. Electronic apparatus and method for controlling thereof
KR20200056261A (en) * 2018-11-14 2020-05-22 삼성전자주식회사 Electronic apparatus and method for controlling thereof
CN111276119A (en) * 2020-01-17 2020-06-12 平安科技(深圳)有限公司 Voice generation method and system and computer equipment
CN111326136A (en) * 2020-02-13 2020-06-23 腾讯科技(深圳)有限公司 Voice processing method and device, electronic equipment and storage medium



Similar Documents

Publication Publication Date Title
KR102484967B1 (en) Voice conversion method, electronic device, and storage medium
US11468889B1 (en) Speech recognition services
US10614803B2 (en) Wake-on-voice method, terminal and storage medium
CN107112014B (en) Application focus in speech-based systems
US8725513B2 (en) Providing expressive user interaction with a multimodal application
US9530411B2 (en) Dynamically extending the speech prompts of a multimodal application
US8150698B2 (en) Invoking tapered prompts in a multimodal application
US20170046124A1 (en) Responding to Human Spoken Audio Based on User Input
US8909532B2 (en) Supporting multi-lingual user interaction with a multimodal application
CN110428825B (en) Method and system for ignoring trigger words in streaming media content
JP7331044B2 (en) Information processing method, device, system, electronic device, storage medium and computer program
KR20190005103A (en) Electronic device-awakening method and apparatus, device and computer-readable storage medium
KR101385316B1 (en) System and method for providing conversation service connected with advertisements and contents using robot
CN111768755A (en) Information processing method, information processing apparatus, vehicle, and computer storage medium
CN109460548B (en) Intelligent robot-oriented story data processing method and system
CN109195016B (en) Intelligent terminal equipment-oriented voice interaction method and terminal system for video barrage and intelligent terminal equipment
CN113066491A (en) Display device and voice interaction method
CN111739510A (en) Information processing method, information processing apparatus, vehicle, and computer storage medium
CN111768756B (en) Information processing method, information processing device, vehicle and computer storage medium
CN111754974B (en) Information processing method, device, equipment and computer storage medium
CN109524000A (en) Offline implementation method and device
CN109300472A Audio recognition method, device, equipment and medium
CN112433697B (en) Resource display method and device, electronic equipment and storage medium
US20140067398A1 (en) Method, system and processor-readable media for automatically vocalizing user pre-selected sporting event scores
CN113160782B (en) Audio processing method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant