CN111768755A - Information processing method, information processing apparatus, vehicle, and computer storage medium - Google Patents

Information processing method, information processing apparatus, vehicle, and computer storage medium Download PDF

Info

Publication number
CN111768755A
CN111768755A CN202010589862.8A
Authority
CN
China
Prior art keywords
information
vehicle
tts engine
target
styles
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010589862.8A
Other languages
Chinese (zh)
Inventor
丁磊
郭刘飞
黄骏
周宏波
郭昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Human Horizons Shanghai Internet Technology Co Ltd
Original Assignee
Human Horizons Shanghai Internet Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Human Horizons Shanghai Internet Technology Co Ltd filed Critical Human Horizons Shanghai Internet Technology Co Ltd
Priority to CN202010589862.8A priority Critical patent/CN111768755A/en
Publication of CN111768755A publication Critical patent/CN111768755A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/0335 Pitch control
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses an information processing method applied to a vehicle in which a vehicle-mounted application is installed and a speech synthesis (TTS) engine is provided. The information processing method comprises the following steps: the vehicle-mounted application obtains text information to be converted and converts it into a plurality of pieces of information carrying target sound styles, wherein different pieces of information carry different sound styles; and the vehicle-mounted application sends the plurality of pieces of information carrying target sound styles to the TTS engine, so that the TTS engine sequentially synthesizes them into audio and outputs the synthesized audio information.

Description

Information processing method, information processing apparatus, vehicle, and computer storage medium
Technical Field
The present application relates to the field of audio processing, and in particular, to an information processing method, apparatus, vehicle, and computer storage medium.
Background
With the development of vehicle intelligence, vehicle-mounted applications that raise the level of intelligence have been added to vehicles, including intelligent scenarios in which a vehicle-mounted application controls sound production. However, how to make the sound-production effect more personalized through the control of the vehicle-mounted application, so that audio playing scenarios become richer, remains a problem to be solved.
Disclosure of Invention
In order to solve at least one of the above problems in the prior art, embodiments of the present application provide an information processing method, apparatus, device and computer storage medium.
In a first aspect, an embodiment of the present application provides an information processing method applied to a vehicle in which a vehicle-mounted application is installed and a speech synthesis (TTS) engine is provided, the method including:
the vehicle-mounted application obtains text information to be converted and converts it into a plurality of pieces of information carrying target sound styles, wherein different pieces of information carry different sound styles; and
the vehicle-mounted application sends the plurality of pieces of information carrying target sound styles to the TTS engine, so that the TTS engine sequentially synthesizes them into audio and outputs the synthesized audio information.
In a second aspect, an embodiment of the present application provides an information processing apparatus, including:
the conversion module is used for acquiring text information to be converted and converting it into a plurality of pieces of information carrying target sound styles, wherein different pieces of information carry different sound styles;
the TTS calling module is used for sending the plurality of pieces of information carrying target sound styles to a TTS engine, so that the TTS engine sequentially synthesizes them into audio and outputs the synthesized audio information.
In a third aspect, an embodiment of the present application provides a vehicle, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a method provided by any one of the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method provided by any one of the embodiments of the present application.
One embodiment in the above application has the following advantages or beneficial effects: the text information to be converted can be converted by the vehicle-mounted application into a plurality of pieces of information carrying target sound styles, and the TTS engine can then be called to synthesize the audio information. The vehicle-mounted application thus supports richer audio playing scenarios: based on the same text, a plurality of pieces of information with different personalized sound styles can be output one by one during audio playing, meeting personalized needs and improving the user's listening experience.
Other effects of the above-described alternative will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic flow chart diagram of an information processing method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of generating information carrying a target sound style according to the present application;
FIG. 3 is a schematic view of another scenario processing according to the information processing method of the present application;
FIG. 4 is a first diagram illustrating an information processing apparatus according to an embodiment of the present application;
FIG. 5 is a diagram illustrating a second exemplary structure of an information processing apparatus according to an embodiment of the present application;
fig. 6 is a block diagram of a vehicle for implementing the information processing method of the embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The embodiment of the application provides an information processing method applied to a vehicle, in which a vehicle-mounted application is installed and a speech synthesis (TTS) engine is provided. As shown in fig. 1, the method comprises the following steps:
S101: the vehicle-mounted application obtains text information to be converted and converts it into a plurality of pieces of information carrying target sound styles, wherein different pieces of information carry different sound styles;
S102: the vehicle-mounted application sends the plurality of pieces of information carrying target sound styles to the TTS engine, so that the TTS engine sequentially synthesizes them into audio and outputs the synthesized audio information.
In S101, the in-vehicle application may be one of a plurality of applications installed in a vehicle.
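Steps S101 and S102 can be sketched as the following minimal flow. All names here (`convert_text`, `TTSEngine`, and the tuple representation of style-tagged information) are illustrative assumptions, not an API defined by this application:

```python
# Minimal sketch of S101/S102 (names and data shapes are assumptions).

def convert_text(text, target_styles):
    """S101: turn one text into several pieces of information,
    each carrying a different target sound style."""
    return [(style, text) for style in target_styles]

class TTSEngine:
    """Stub standing in for the vehicle's speech-synthesis engine."""
    def __init__(self):
        self.outputs = []

    def synthesize(self, piece):
        style, text = piece
        audio = f"audio[{style}]({text})"  # placeholder for synthesized audio
        self.outputs.append(audio)
        return audio

def process(text, target_styles, engine):
    """S102: send the pieces to the TTS engine one by one, in order."""
    return [engine.synthesize(p) for p in convert_text(text, target_styles)]
```

Calling `process("hello", ["serious", "humorous"], TTSEngine())` yields one synthesized item per target sound style, in order, mirroring the sequential synthesis described above.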
The method for acquiring the text information to be converted may include the following modes:
the first mode,
And the vehicle-mounted application acquires the text to be converted from the text information.
That is, the user may input text information at the information input interface of the in-vehicle application; alternatively, a stored text may be selected for the user from locally stored texts as the information input by the user.
The vehicle-mounted application displays the text information input by the user on the display interface, and can display the text information input by the user through the display interface if it is detected that the user clicks a preview key (which may be a virtual key or a physical key).
Mode two:
The vehicle-mounted application acquires text information input by a user and takes the text information input by the user as the text to be converted.
The difference from mode one is that in this mode the user is required to input the text information, and the text input by the user is directly converted as the text to be converted.
Besides the above two modes, the text information may also be obtained from the cloud; for example, the text information to be converted may be a piece of text information randomly obtained from the cloud.
In the above S101, the converting the text information to be converted into a plurality of pieces of information carrying target sound styles includes:
the vehicle-mounted application marks audio-related attributes of the text information to be converted according to a plurality of target sound styles to obtain a plurality of target SSML (Speech Synthesis Markup Language) documents, and the plurality of target SSMLs are used as the plurality of pieces of information carrying target sound styles. That is, the plurality of pieces of information carrying target sound styles may be a plurality of target SSMLs.
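As a rough sketch, marking audio-related attributes per style might look like the following. The attribute values and the simplified SSML shape are invented for illustration and are not prescribed by this application:

```python
# Hypothetical per-style audio-related attributes (invented values).
STYLE_ATTRS = {
    "serious":  {"rate": "slow", "pitch": "low"},
    "humorous": {"rate": "fast", "pitch": "high"},
}

def to_target_ssml(text, style):
    """Wrap the text in a simplified SSML document tagged with the
    audio-related attributes of one target sound style."""
    attrs = STYLE_ATTRS[style]
    return (f'<speak><prosody rate="{attrs["rate"]}" '
            f'pitch="{attrs["pitch"]}">{text}</prosody></speak>')

def convert_to_target_ssmls(text, styles):
    """One target SSML per target sound style."""
    return [to_target_ssml(text, s) for s in styles]
```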
The target sound styles may be determined according to user settings. For example, the vehicle-mounted application may provide a selection menu interface on which the user can select sound styles, and the styles selected by the user are taken as the plurality of target sound styles. In the embodiment of the present application, each target sound style may be one of multiple preset styles (for example, six preset styles), several of which may be selected by the user as the target sound styles. The converted pieces of information and the preset sound styles may thus be in a many-to-one selection relationship; for example, if 9 sound styles are preset, the converted information may correspond to 3 different target sound styles.
In addition, the vehicle-mounted application may also be provided with a plurality of default sound styles. Accordingly, determining the plurality of target sound styles may include: if the user does not select any sound style, the plurality of default sound styles may be directly adopted as the target sound styles (the default styles may be related to the type of the vehicle-mounted application); if the user selects a plurality of sound styles for the current processing, the styles selected by the user are used as the target sound styles; and if the user selects only one sound style, the style selected by the user together with the default sound styles may be used as the target sound styles.
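The selection rules above can be summarized in a short sketch; the function and variable names are assumptions, not part of the patent:

```python
def determine_target_styles(user_selected, default_styles):
    """Pick the target sound styles from the user's selection and the
    application's defaults, per the rules described above."""
    if not user_selected:
        # No selection: fall back to the application's default styles.
        return list(default_styles)
    if len(user_selected) > 1:
        # Several styles selected: use them directly.
        return list(user_selected)
    # A single style selected: combine it with the default styles.
    extras = [s for s in default_styles if s not in user_selected]
    return list(user_selected) + extras
```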
The default sound styles of the vehicle-mounted application may be preset default sound styles, or may be determined according to a preset mapping relationship for the vehicle-mounted application.
For example, a plurality of default sound styles may be pre-configured in the vehicle-mounted application: a news-type application may be pre-configured with a serious sound style and other default sound styles, an entertainment-type vehicle-mounted application may be pre-configured with a relaxed, lively default sound style and other default sound styles, and so on.
When the vehicle-mounted application determines a plurality of default sound styles according to a preset mapping relationship, it may first determine a plurality of default scenes corresponding to the vehicle-mounted application according to the identifier of the vehicle-mounted application and a preset mapping between identifiers and scenes; then, according to the mapping between scenes and styles, the corresponding plurality of default sound styles can be determined.
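A sketch of this two-level lookup (identifier to scenes, scenes to styles); the concrete identifiers and mapping tables are invented for illustration:

```python
# Hypothetical mapping tables (all entries invented).
APP_TO_SCENES = {
    "news_app": ["news_broadcast"],
    "fun_app":  ["entertainment", "chat"],
}
SCENE_TO_STYLES = {
    "news_broadcast": ["serious", "calm"],
    "entertainment":  ["lively", "humorous"],
    "chat":           ["calm"],
}

def default_styles_for(app_id):
    """Resolve default sound styles from the application identifier
    via the scene mapping, dropping duplicates while keeping order."""
    styles = []
    for scene in APP_TO_SCENES.get(app_id, []):
        for style in SCENE_TO_STYLES.get(scene, []):
            if style not in styles:
                styles.append(style)
    return styles
```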
In one implementation, a plurality of pieces of information carrying target sound styles for the same text to be converted may be generated according to the sound styles selected by the user; among these pieces of information, one may carry the default style described above.
In one embodiment, the information processing method includes steps S101 and S102 shown in fig. 1, and on the basis, the information processing method further includes:
the vehicle-mounted application detects a playing instruction and sends a calling request to a TTS engine;
and the vehicle-mounted application determines whether to send the information carrying the target voice styles to the TTS engine based on the information fed back by the TTS engine.
The playing instruction can be triggered by pressing a virtual key for controlling playing in a control key area contained in a display interface of the vehicle-mounted application or by pressing a certain designated physical key in the vehicle under the condition that the user confirms the text information to be converted.
The playing instruction may also be issued as voice information when the user confirms the text information to be converted. After the user utters voice information, it is collected through a sound collection unit; voice recognition is then performed to obtain voice instruction information, and if the voice instruction information indicates that the input text information should be played, the voice instruction information is interpreted as a playing instruction.
For example, assuming that a 'drama birth' module (a function or option) exists in the vehicle-mounted application, after the user clicks to enter the module, an information input interface may be displayed in which the user can input text information; for example, the user may compose a poem or an article.
After finishing the input, the user can click a 'play type broadcast' button to generate a playing instruction.
The vehicle-mounted application may respond to the playing instruction and determine the text information currently input by the user as the text information to be converted. Then, pieces of information with different target sound styles are generated according to the text to be converted and broadcast one by one, presenting different broadcasting effects of the same sentence.
In a specific implementation manner of the foregoing embodiment, the plurality of target sound styles may include serious, humorous, recreational, and the like. After the vehicle-mounted application sends the call request to the TTS engine, it determines, based on the feedback from the TTS engine on the call request, whether to send the plurality of pieces of information carrying target sound styles to the TTS engine, so that the TTS engine can perform audio synthesis according to these pieces of information and output the synthesized audio information.
Further, when sending a call request to the TTS engine, the vehicle-mounted application may select a corresponding audio output channel based on the vehicle-mounted application, and then send a call request for calling the audio output channel to the TTS engine; when the audio output channel can output audio, the vehicle-mounted application sends the plurality of pieces of information carrying target sound styles to the TTS engine.
Furthermore, in the processing of the TTS engine, the audio information corresponding to the plurality of pieces of information carrying target sound styles may be added to the output queue of the corresponding audio output channel, and the target audio information corresponding to the plurality of target sound styles in the output queue is then output through the audio output channel.
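The output-queue behavior can be sketched as follows; this is a simplification, and the channel and queue types are assumptions:

```python
from collections import deque

class AudioOutputChannel:
    """Toy model of one audio output channel with an output queue."""
    def __init__(self, name):
        self.name = name
        self.queue = deque()
        self.played = []

    def enqueue(self, audio):
        # Synthesized audio for each style-tagged piece joins the queue.
        self.queue.append(audio)

    def drain(self):
        # Target audio information is output in queue order.
        while self.queue:
            self.played.append(self.queue.popleft())
        return self.played
```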
It should be noted that the processing of the TTS engine may include performing speech synthesis in combination with speech models corresponding to the plurality of target sound styles. The speech models may be preset locally in the vehicle, or may reside in the cloud.
For example, the TTS engine of the vehicle may send the plurality of pieces of information carrying target sound styles to the cloud; the cloud selects corresponding speech models to synthesize them and sequentially feeds the synthesized audio back to the vehicle, where it is output through the audio output channel corresponding to the TTS engine. This mode may be adopted when the vehicle can connect to the cloud, when the communication quality between the vehicle and the cloud is good, or when the user has set that the vehicle may connect to the cloud and the communication quality is greater than a preset threshold.
For another example, the TTS engine of the vehicle may perform speech synthesis on the plurality of pieces of information carrying target sound styles directly using a local speech model. In this case, the local speech model may be a speech model that is updated when the cloud can be connected, and/or a speech model preset locally.
In addition, the method further comprises: and when the vehicle-mounted application detects a command of pausing the playing, sending the command of pausing the playing to the TTS engine so as to control the TTS engine to pause audio synthesis.
Wherein, the instruction for pausing the playing may be generated as follows: after the vehicle-mounted application detects the click operation of a pause virtual key in a current playing interface or a control interface displayed by the vehicle-mounted application, generating a play pause instruction; or after the vehicle-mounted application detects the click operation of a specified play pause key (physical key) in the vehicle, generating a play pause instruction; or after the vehicle-mounted application detects the voice command of playing pause, generating the command of playing pause.
In one embodiment, the sending the invocation request to the TTS engine further comprises:
the vehicle-mounted application determines a voice output channel and sends a calling request aiming at the voice output channel to the TTS engine.
In general, a vehicle may be provided with a plurality of sound output channels, which may be used to output a variety of different sounds. For example, a vehicle has 9 channels, of which 5 are used to output sounds of speakers of an in-vehicle computer in the vehicle interior, and the other 4 are used to output sounds of an alarm or the like.
In a specific implementation manner, the vehicle-mounted application may fixedly call one or more channels.
In step S102, the processing of the in-vehicle application invoking TTS may include: the vehicle-mounted application sends a calling request to a TTS engine and receives feedback information of the TTS engine; and if the feedback information of the TTS engine represents that the TTS call is successful, the vehicle-mounted application sends the information carrying the target voice style to the TTS engine.
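The call-request handshake described here might be sketched as below; the engine stub and the shape of the feedback are assumptions for illustration:

```python
class TTSEngineStub:
    """Stand-in for the TTS engine's call-request interface."""
    def __init__(self, available=True):
        self.available = available
        self.received = []

    def request_call(self):
        # Feedback indicating whether the call succeeded.
        return {"success": self.available}

    def accept(self, piece):
        self.received.append(piece)

def send_after_call(engine, pieces):
    """Send the style-tagged pieces only if the engine reports success."""
    feedback = engine.request_call()
    if not feedback["success"]:
        return False
    for piece in pieces:
        engine.accept(piece)
    return True
```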
Finally, the TTS engine can synthesize the plurality of pieces of information carrying target sound styles into a PCM (pulse-code modulation) voice stream, which can then be output.
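As a toy illustration of synthesizing the pieces one by one into a single PCM stream: the fake `synth_pcm` below emits fixed-size placeholder samples, whereas a real TTS engine would produce model-generated audio.

```python
import struct

def synth_pcm(text, style, n_samples=4):
    """Fake synthesizer: emit n 16-bit little-endian PCM samples.
    The sample values are placeholders, not real audio."""
    seed = (len(text) + len(style)) % 100
    return b"".join(struct.pack("<h", seed + i) for i in range(n_samples))

def synthesize_stream(pieces):
    """Synthesize each (style, text) piece and concatenate the PCM
    chunks into a single output stream, in order."""
    return b"".join(synth_pcm(text, style) for style, text in pieces)
```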
In one example, the process of converting the text to be converted into a plurality of pieces of information carrying the target sound style may be as shown in fig. 2, and includes the following steps:
s201: a plurality of SSMLs is created.
Creating the SSMLs may include adding information such as the version, the language, and a URI (Uniform Resource Identifier) to the created information carrying the target sound style. For example, the version of the specification used to interpret the document markup may be specified, the language of the root document may be specified, and the URI of the document that defines the markup vocabulary of the SSML documents may be specified.
S202: and adjusting the selected voice according to each target sound style in the plurality of target sound styles, and selecting service information corresponding to each target sound style.
For example, the speech corresponding to the text to be converted is selected, and the selected speech is adjusted according to the plurality of target sound styles, so that the style of the selected speech matches the plurality of target sound styles.
The service may be understood as the audio property related information of the present application, that is, a plurality of target sound styles may ultimately correspond to the audio property related information.
It is to be understood that different styles may also correspond to different services. A service may include at least one of speech rate, intonation, pitch, pause, and the like.
For example, the service information may include at least one of the following: adding or deleting interruptions/pauses in the speech; specifying paragraphs and sentences in the speech; using phonemes to improve pronunciation; using a user-defined dictionary to improve pronunciation; adjusting rhythm; changing the speech rate; changing the volume; changing the pitch; changing the pitch contour; adding recorded audio; adding background audio; and so on.
S203: A plurality of target SSMLs are generated based on the foregoing selection and adjustment results. In the embodiment of the present application, the plurality of pieces of information carrying target sound styles may include tags for certain audio-related attributes, where the audio-related attributes may be understood as the service information, such as speech rate, intonation, pitch, background music, and the like. The service information has fixed settings for each style; for example, the serious style and the entertainment style may differ in at least one of speech rate, intonation, pitch, background music, and so on.
Further, in another embodiment, on the basis of S101 and S102 shown in fig. 1, the scheme provided in this embodiment may further include mixing and playing the background audio information and the synthesized audio.
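Steps S201 to S203 above can be sketched end to end as follows. The per-style service values are invented for illustration, while the `xmlns` URI is the standard W3C SSML namespace:

```python
# Hypothetical service information per style (invented values).
SERVICES = {
    "serious": {"rate": "90%",  "volume": "medium", "pitch": "-2st"},
    "lively":  {"rate": "110%", "volume": "loud",   "pitch": "+2st"},
}

def create_shell(version="1.0", lang="zh-CN"):
    # S201: version, language, and markup-vocabulary URI of the document.
    return {"version": version, "lang": lang,
            "uri": "http://www.w3.org/2001/10/synthesis"}

def attach_service(shell, text, style):
    # S202: adjust the selected voice with per-style service information.
    return {**shell, "text": text, **SERVICES[style]}

def generate_target_ssmls(text, styles):
    # S203: emit one target SSML per target sound style.
    shell = create_shell()
    docs = []
    for style in styles:
        d = attach_service(shell, text, style)
        docs.append(
            f'<speak version="{d["version"]}" xml:lang="{d["lang"]}" '
            f'xmlns="{d["uri"]}"><prosody rate="{d["rate"]}" '
            f'volume="{d["volume"]}" pitch="{d["pitch"]}">'
            f'{d["text"]}</prosody></speak>'
        )
    return docs
```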
In one embodiment, the sending, by the in-vehicle application, the plurality of pieces of information carrying the target voice style to a TTS engine includes:
sequentially sending the information carrying the target voice styles to a TTS engine according to a first sequence;
wherein the first sequence is a preset sequence or a random sequence.
For example, the plurality of pieces of information carrying target sound styles are pieces carrying sound styles A1, A2, A3, A4 and A5, and the first order is determined as an order in which A1 to A5 are randomly sorted.
For another example, the plurality of pieces of information carrying target sound styles are pieces carrying the sound styles serious, lively, sad, happy and cool; in order to highlight the contrast of the broadcast, sound styles with larger contrast may be placed adjacent in the order. For example, the first order is set as: serious, lively, sad, happy and cool.
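The first order can be sketched as follows; a seeded `random.Random` keeps the random case reproducible, and the names are assumptions:

```python
import random

def first_order(styles, preset=None, rng=None):
    """Order the style-tagged pieces by a preset order when one is
    configured, otherwise by a random order."""
    if preset is not None:
        # Keep only styles that actually exist, in the preset order.
        return [s for s in preset if s in styles]
    rng = rng or random.Random()
    shuffled = list(styles)
    rng.shuffle(shuffled)
    return shuffled
```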
An example of the embodiment of the present application, as shown in fig. 3, may include:
the in-vehicle application supports a smart mode. The vehicle application has an AI (Artificial Intelligence) sound skill, and the hardware base of the operation may be a chip (IDCM) of the vehicle application.
The in-vehicle application may have a number of optional functions (or optional modules, such as the 'other skills' and 'drama birth' modules in the figure). For example, in the AI voice skill application of the IDCM, the 'drama birth' module may be entered according to a user selection.
After entering the page of the 'drama birth' module, the user can set the script to be broadcast manually or by voice. There are two setting modes: selecting a script from a script library, or inputting one manually; the script library may be a local library or a cloud library.
After the script is set, the user can manually click the 'play type broadcast' button, or send a voice command to confirm clicking it.
After the 'play type broadcast' control takes effect, the vehicle-mounted application requests a relevant script for the 'drama birth' module locally or from the cloud.
After the requested script is returned, the vehicle-mounted application converts the 'drama birth' script to obtain a plurality of pieces of information carrying target sound styles, and the vehicle-mounted application (such as the AI voice skill application) applies to call the TTS engine.
If the TTS engine can currently be called, it feeds back a successful call to the vehicle-mounted application; otherwise, it feeds back a failed call. After receiving the feedback of a successful call, the vehicle-mounted application sends the pieces of the script to be broadcast, each carrying a target sound style, to the TTS engine sentence by sentence, and the TTS engine synthesizes them into audio sentence by sentence and broadcasts the audio. For example, after receiving the feedback of a successful call, the AI voice skill application may prepare 7 copies of the script to be broadcast, each carrying a different target sound style, send them to the TTS engine one after another, and the TTS engine synthesizes them into audio sentence by sentence and broadcasts the audio.
In addition, during the sentence-by-sentence broadcasting process, the vehicle-mounted application can also support pausing and resuming the recitation. The specific processing has been described in the foregoing embodiments and is not repeated here.
In the embodiment of the application, the text information to be converted can be converted by the vehicle-mounted application into a plurality of pieces of information carrying different target sound styles, and the TTS engine is then called to synthesize the audio information of the plurality of target sound styles. The vehicle-mounted application thus combines a plurality of target sound styles with a specific audio playing scenario, so that more personalized information carrying target sound styles can be output during audio playing, presenting different broadcasting effects of the same sentence or script.
An embodiment of the present application further provides an information processing apparatus, a main component of which is shown in fig. 4, including:
the conversion module 51 is configured to acquire text information to be converted and convert it into a plurality of pieces of information carrying target sound styles, wherein different pieces of information carry different sound styles;
the TTS invoking module 52 is configured to send the multiple pieces of information with the target sound style to a TTS engine, so as to sequentially perform audio synthesis on the multiple pieces of information with the target sound style through the TTS engine and output synthesized audio information.
In one embodiment, the conversion module is further configured to,
acquiring the text to be converted from the text information;
alternatively,
and acquiring text information input by a user, and taking the text information input by the user as the text to be converted.
In one embodiment, the conversion module is further configured to:
and marking audio related attributes of the text information to be converted according to a plurality of target sound styles to obtain a plurality of target SSMLs, and taking the plurality of target SSMLs as a plurality of information carrying the target sound styles.
In one embodiment, as shown in fig. 5, the apparatus further comprises:
the call request module is configured to detect a playing instruction and send a call request to the TTS engine;
the sending module is configured to determine, based on information fed back by the TTS engine, whether to send the plurality of pieces of information carrying the target sound styles to the TTS engine.
In one embodiment, the call request module is further configured to determine a voice output channel and send a call request for the voice output channel to the TTS engine.
In one embodiment, the sending module is further configured to send the plurality of pieces of information carrying the target sound styles to the TTS engine if the information fed back by the TTS engine indicates that the call was successful.
In one embodiment, the TTS invoking module is further configured to sequentially send the plurality of pieces of information carrying the target sound styles to the TTS engine according to a first sequence, wherein the first sequence is a preset sequence or a random sequence.
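The call-request/feedback flow of these modules can be sketched as follows. The engine interface (`request_call`, `synthesize`) is a hypothetical stand-in, since the application does not define a concrete TTS engine API; only the control flow mirrors the description above.

```python
import random


class TTSClient:
    """Minimal sketch of the call-request/feedback flow described above.

    The engine methods used here are assumed names, not a real TTS API.
    """

    def __init__(self, engine):
        self.engine = engine

    def play(self, ssml_items, channel="media", order="preset"):
        # On detecting a playing instruction, send a call request for a
        # specific voice output channel to the TTS engine.
        feedback = self.engine.request_call(channel)
        # Send the style-marked items only if the feedback indicates that
        # the call was successful.
        if not feedback.get("success"):
            return False
        # Send the items according to a first sequence: the preset order
        # or a random order.
        items = list(ssml_items)
        if order == "random":
            random.shuffle(items)
        for item in items:
            # The engine synthesizes each item in turn and outputs audio.
            self.engine.synthesize(item)
        return True
```

With a preset order, the engine receives the items exactly as the vehicle-mounted application prepared them; with a random order, the same sentence can be broadcast in a different style each time it is played.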
Fig. 6 is a block diagram of a vehicle for the information processing method according to an embodiment of the present application. The vehicle is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The vehicle may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the present application described and/or claimed herein.
As shown in fig. 6, the vehicle includes: one or more processors 1201, a memory 1202, and interfaces for connecting the components, including a high-speed interface and a low-speed interface. The components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the vehicle, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used, as desired, along with multiple memories. Also, multiple vehicles may be connected, with each device providing some of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 1201 is taken as an example.
Memory 1202 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by the at least one processor, so that the at least one processor executes the information processing method provided by the application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the information processing method provided by the present application.
The memory 1202, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the information processing method in the embodiments of the present application (e.g., the units shown in fig. 4 and fig. 5). The processor 1201 executes the non-transitory software programs, instructions, and modules stored in the memory 1202 to perform various functional applications and data processing of the vehicle, that is, to implement the information processing method in the above method embodiment.
The memory 1202 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the vehicle, and the like. Further, the memory 1202 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 1202 optionally includes memory located remotely from the processor 1201, and such remote memory may be connected to the vehicle via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The vehicle implementing the information processing method may further include: an input device 1203 and an output device 1204. The processor 1201, the memory 1202, the input device 1203, and the output device 1204 may be connected by a bus or in other manners; in fig. 6, connection by a bus is taken as an example.
The input device 1203 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the vehicle, and may be, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or another input device. The output device 1204 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that steps may be reordered, added, or deleted in the various forms of the flows shown above. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; the present application is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (16)

1. An information processing method applied to a vehicle in which an in-vehicle application is installed and in which a speech synthesis TTS engine is installed, characterized by comprising:
the vehicle-mounted application acquires text information to be converted and converts the text information to be converted into a plurality of pieces of information carrying target sound styles; wherein different pieces of the information carrying the target sound styles correspond to different sound styles;
and the vehicle-mounted application sends the information carrying the target voice style to a TTS engine so as to sequentially carry out audio synthesis on the information carrying the target voice style through the TTS engine and output the synthesized audio information.
2. The method of claim 1, wherein the obtaining text information to be converted comprises:
the vehicle-mounted application acquires the text to be converted from the text information;
or,
and the vehicle-mounted application acquires text information input by a user, and takes the text information input by the user as the text to be converted.
3. The method of claim 1, wherein the converting the text information to be converted into a plurality of pieces of information carrying target sound styles comprises:
and the vehicle-mounted application marks audio-related attributes of the text information to be converted according to a plurality of target sound styles to obtain a plurality of target SSMLs, and the plurality of target SSMLs are used as a plurality of pieces of information carrying the target sound styles.
4. The method of claim 1, wherein the method further comprises:
the vehicle-mounted application detects a playing instruction and sends a calling request to a TTS engine;
and the vehicle-mounted application determines whether to send the information carrying the target voice styles to the TTS engine based on the information fed back by the TTS engine.
5. The method of claim 4, wherein said sending a call request to a TTS engine further comprises:
the vehicle-mounted application determines a voice output channel and sends a calling request aiming at the voice output channel to the TTS engine.
6. The method of claim 4, wherein the determining, by the in-vehicle application, whether to send the plurality of information carrying the target voice style to the TTS engine based on the information fed back by the TTS engine comprises:
and if the information fed back by the TTS engine indicates that the call was successful, sending the plurality of pieces of information carrying the target voice style to the TTS engine.
7. The method of claim 1, wherein the in-vehicle application sending the plurality of information carrying the target voice style to a TTS engine comprises:
sequentially sending the information carrying the target voice styles to a TTS engine according to a first sequence;
wherein the first sequence is a preset sequence or a random sequence.
8. An information processing apparatus comprising:
the conversion module is used for acquiring text information to be converted and converting the text information to be converted into a plurality of pieces of information carrying target sound styles; wherein, the sound styles corresponding to different information carrying the target sound style are different;
the TTS calling module is used for sending the information carrying the target voice style to a TTS engine so as to sequentially carry out audio synthesis on the information carrying the target voice style through the TTS engine and output the synthesized audio information.
9. The apparatus of claim 8, wherein the conversion module is further configured to,
acquiring the text to be converted from the text information;
or,
and acquiring text information input by a user, and taking the text information input by the user as the text to be converted.
10. The apparatus of claim 8, wherein the conversion module is further configured to:
and marking audio related attributes of the text information to be converted according to a plurality of target sound styles to obtain a plurality of target SSMLs, and taking the plurality of target SSMLs as a plurality of information carrying the target sound styles.
11. The apparatus of claim 8, wherein the apparatus further comprises:
the call request module is used for detecting a playing instruction and sending a call request to the TTS engine;
and the sending module is used for determining whether to send the information carrying the target voice styles to the TTS engine based on the information fed back by the TTS engine.
12. The apparatus of claim 11, wherein the call request module is further configured to,
determining a voice output channel, and sending a call request aiming at the voice output channel to the TTS engine.
13. The apparatus of claim 11, wherein the means for transmitting is further configured to,
and if the information fed back by the TTS engine indicates that the call was successful, sending the plurality of pieces of information carrying the target voice style to the TTS engine.
14. The apparatus of claim 8, wherein the TTS invocation module is further to,
sequentially sending the information carrying the target voice styles to a TTS engine according to a first sequence;
wherein the first sequence is a preset sequence or a random sequence.
15. A vehicle, characterized by comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
CN202010589862.8A 2020-06-24 2020-06-24 Information processing method, information processing apparatus, vehicle, and computer storage medium Pending CN111768755A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010589862.8A CN111768755A (en) 2020-06-24 2020-06-24 Information processing method, information processing apparatus, vehicle, and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010589862.8A CN111768755A (en) 2020-06-24 2020-06-24 Information processing method, information processing apparatus, vehicle, and computer storage medium

Publications (1)

Publication Number Publication Date
CN111768755A true CN111768755A (en) 2020-10-13

Family

ID=72721898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010589862.8A Pending CN111768755A (en) 2020-06-24 2020-06-24 Information processing method, information processing apparatus, vehicle, and computer storage medium

Country Status (1)

Country Link
CN (1) CN111768755A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112365877A (en) * 2020-11-27 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN115273808A (en) * 2021-04-14 2022-11-01 上海博泰悦臻网络技术服务有限公司 Sound processing method, storage medium and electronic device
US12008289B2 (en) 2021-07-07 2024-06-11 Honeywell International Inc. Methods and systems for transcription playback with variable emphasis

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPR380401A0 (en) * 2001-03-19 2001-04-12 Famoice Technology Pty Ltd Data template structure
US20020188449A1 (en) * 2001-06-11 2002-12-12 Nobuo Nukaga Voice synthesizing method and voice synthesizer performing the same
CN1938756A (en) * 2004-03-05 2007-03-28 莱塞克技术公司 Prosodic speech text codes and their use in computerized speech systems
CN102201233A (en) * 2011-05-20 2011-09-28 北京捷通华声语音技术有限公司 Mixed and matched speech synthesis method and system thereof
CN102222501A (en) * 2011-06-15 2011-10-19 中国科学院自动化研究所 Method for generating duration parameter in speech synthesis
US20110282668A1 (en) * 2010-05-14 2011-11-17 General Motors Llc Speech adaptation in speech synthesis
CN104200803A (en) * 2014-09-16 2014-12-10 北京开元智信通软件有限公司 Voice broadcasting method, device and system
CN105304080A (en) * 2015-09-22 2016-02-03 科大讯飞股份有限公司 Speech synthesis device and speech synthesis method
US20160071509A1 (en) * 2014-09-05 2016-03-10 General Motors Llc Text-to-speech processing based on network quality
CN106782494A (en) * 2016-09-13 2017-05-31 乐视控股(北京)有限公司 Phonetic synthesis processing method and processing device
WO2018171257A1 (en) * 2017-03-21 2018-09-27 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for speech information processing
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
WO2019005625A1 (en) * 2017-06-26 2019-01-03 Zya, Inc. System and method for automatically generating media
CN109147760A (en) * 2017-06-28 2019-01-04 阿里巴巴集团控股有限公司 Synthesize method, apparatus, system and the equipment of voice
KR20190104941A (en) * 2019-08-22 2019-09-11 엘지전자 주식회사 Speech synthesis method based on emotion information and apparatus therefor
KR20190106890A (en) * 2019-08-28 2019-09-18 엘지전자 주식회사 Speech synthesis method based on emotion information and apparatus therefor
CN110688834A (en) * 2019-08-22 2020-01-14 阿里巴巴集团控股有限公司 Method and equipment for rewriting intelligent manuscript style based on deep learning model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
REN PINGPING: "Intelligent Customer Service Robot", Chengdu Times Press, page 99 *


Similar Documents

Publication Publication Date Title
US11468889B1 (en) Speech recognition services
JP7181332B2 (en) Voice conversion method, device and electronic equipment
CN110633419B (en) Information pushing method and device
CN111768755A (en) Information processing method, information processing apparatus, vehicle, and computer storage medium
TWI249729B (en) Voice browser dialog enabler for a communication system
EP2112650B1 (en) Speech synthesis apparatus, speech synthesis method, speech synthesis program, portable information terminal, and speech synthesis system
CN107040452B (en) Information processing method and device and computer readable storage medium
US11586344B1 (en) Synchronizing media content streams for live broadcasts and listener interactivity
CN112533041A (en) Video playing method and device, electronic equipment and readable storage medium
US8340797B2 (en) Method and system for generating and processing digital content based on text-to-speech conversion
CN112165648B (en) Audio playing method, related device, equipment and storage medium
US11449301B1 (en) Interactive personalized audio
CN111142667A (en) System and method for generating voice based on text mark
CN110718221A (en) Voice skill control method, voice equipment, client and server
CN111935551A (en) Video processing method and device, electronic equipment and storage medium
CN111259125A (en) Voice broadcasting method and device, intelligent sound box, electronic equipment and storage medium
KR20210038278A (en) Speech control method and apparatus, electronic device, and readable storage medium
CN111818279A (en) Subtitle generating method, display method and interaction method
CN111739510A (en) Information processing method, information processing apparatus, vehicle, and computer storage medium
WO2023040820A1 (en) Audio playing method and apparatus, and computer-readable storage medium and electronic device
CN111754974B (en) Information processing method, device, equipment and computer storage medium
CN110633357A (en) Voice interaction method, device, equipment and medium
CN111768756B (en) Information processing method, information processing device, vehicle and computer storage medium
US20210392394A1 (en) Method and apparatus for processing video, electronic device and storage medium
CN113160782B (en) Audio processing method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination