CN115602171A - Voice interaction method, server and computer readable storage medium - Google Patents

Voice interaction method, server and computer readable storage medium

Info

Publication number
CN115602171A
Authority
CN
China
Prior art keywords
reusable
clauses
audio data
voice
clause
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211594190.5A
Other languages
Chinese (zh)
Other versions
CN115602171B (en)
Inventor
郭华鹏
孟令哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Xiaopeng Motors Technology Co Ltd
Original Assignee
Guangzhou Xiaopeng Motors Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Xiaopeng Motors Technology Co Ltd
Priority to CN202211594190.5A
Publication of CN115602171A
Application granted
Publication of CN115602171B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B60W40/08 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models related to drivers or passengers
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00 Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W50/08 Interaction between the driver and the control system
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B60W40/08 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models related to drivers or passengers
    • B60W2040/089 Driver voice
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2540/00 Input parameters relating to occupants
    • B60W2540/21 Voice
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Abstract

The application discloses a voice interaction method comprising: according to a first voice broadcast request of a vehicle at a first moment, when the first voice broadcast request is confirmed to belong to a preset field, adding to a cache record the audio data corresponding to the reusable clauses among a plurality of first clauses obtained by splitting the first voice broadcast request; according to a second voice broadcast request of the vehicle at a second moment after the first moment, when the second voice broadcast request is confirmed to belong to the preset field, splitting the second voice broadcast request; and, when it is confirmed that a reusable clause exists after the second voice broadcast request is split, acquiring the audio data of the reusable clause from the cache record and issuing it to the vehicle to complete the voice interaction. In this application, voice broadcast requests in the preset field are split into clauses and the reusable clauses are cached, so that the cached audio data can be used directly in subsequent broadcasts, which effectively reduces broadcast latency.

Description

Voice interaction method, server and computer readable storage medium
Technical Field
The present application relates to the field of vehicle-mounted voice technologies, and in particular, to a voice interaction method, a server, and a computer-readable storage medium.
Background
Currently, vehicle-mounted voice technology allows a user to interact by voice within the vehicle cabin, for example to control vehicle components or to interact with components in the user interface of the vehicle-mounted system, and the vehicle-mounted system can respond to the user with a voice broadcast. In the related art, an online voice synthesis service is usually required during voice broadcasting to ensure the tone quality of the broadcast. However, synthesis is performed in units of whole sentences, which can make voice broadcasts insufficiently timely in some scenarios and degrade the user experience.
Disclosure of Invention
The application provides a voice interaction method, a server and a computer readable storage medium.
The voice interaction method comprises the following steps:
according to a first voice broadcast request of a vehicle at a first moment, when the first voice broadcast request is confirmed to belong to a preset field, adding to a cache record the audio data corresponding to the reusable clauses among a plurality of first clauses obtained by splitting the first voice broadcast request;
according to a second voice broadcast request of the vehicle at a second moment after the first moment, splitting the second voice broadcast request when the second voice broadcast request is confirmed to belong to the preset field;
and, when it is confirmed that a reusable clause exists after the second voice broadcast request is split, acquiring the audio data of the reusable clause from the cache record and issuing it to the vehicle to complete the voice interaction.
Thus, in the present application, a voice broadcast request of the vehicle at an earlier moment can first be split and, when the request belongs to the preset field, the audio data of the reusable clauses among the resulting first clauses is added to the cache record for later use. When a voice broadcast request in the preset field arrives at a later moment and is split, and a reusable clause is confirmed, the corresponding audio data can be acquired from the cache record and issued to the vehicle, completing the voice interaction. The voice interaction method splits voice broadcast requests in the preset field into clauses and caches the clauses that can be reused, so that the cached audio data can be broadcast directly when those clauses are encountered again. This reduces the number of words that must be synthesized by the synthesis service, effectively reduces broadcast latency, keeps synthesis and cache storage costs under control, and improves the user experience.
The preset field includes a navigation field.
Thus, because the navigation field contains many sentences that may need to be broadcast repeatedly, setting the navigation field as the preset field to which the voice interaction method applies allows the vehicle to respond more quickly with voice broadcasts during navigation while maintaining good tone quality, which improves navigation safety and the user experience.
Splitting the second voice broadcast request comprises:
splitting the second voice broadcast request according to a preset separator.
Thus, the voice broadcast request can be split at pause separators that carry no tone to obtain a plurality of clauses. Shortening the sentences reduces the time spent waiting for the audio of the voice broadcast request to be returned and makes it easier to judge subsequently whether each clause is reusable.
The voice interaction method further comprises:
for the non-reusable clauses obtained after the second voice broadcast request is split, synthesizing the audio data of the non-reusable clauses through a voice synthesis service and issuing the audio data to the vehicle to complete the voice interaction.
Thus, for clauses that are not reusable after splitting, that is, clauses with no cached audio data, the audio data is synthesized by the voice synthesis service and issued to the vehicle to supply the remaining broadcast content and complete the voice interaction.
The voice interaction method further comprises:
confirming the issuing time sequence of a plurality of second clauses obtained by splitting the second voice broadcast request;
processing the plurality of second clauses in parallel, acquiring the audio data of the reusable clauses among the second clauses from the cache record, and obtaining the audio data of the non-reusable clauses among the second clauses through synthesis by a voice synthesis service;
and issuing the audio data of each second clause to the vehicle according to the issuing time sequence to complete the voice interaction.
Thus, the issuing time sequence of the reusable and non-reusable clauses obtained by splitting the voice broadcast request can be confirmed, so that the split request is issued completely and in order, which guarantees the integrity of the voice interaction content. Processing the clauses in parallel shortens the preparation time before the audio data is issued and improves the efficiency of the voice interaction.
The voice interaction method further comprises:
during the parallel processing, if the processing of at least one second clause is abnormal, stopping issuing the audio data of the second voice broadcast request.
Thus, when an abnormality occurs at any stage of the parallel processing, issuing of the audio data of the voice broadcast request is stopped, which handles abnormalities arising from parallel processing after splitting.
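The parallel preparation of clauses and the stop-on-abnormality rule can be sketched as follows. This is only an illustrative sketch: the names `prepare_clause` and `prepare_in_parallel`, the dictionary used as the cache record, and the `synthesize` callable are assumptions made for the example and are not taken from the application.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare_clause(clause, cache, synthesize):
    """Return cached audio for a clause, or synthesize it on a cache miss."""
    if clause in cache:
        return cache[clause]
    return synthesize(clause)


def prepare_in_parallel(clauses, cache, synthesize):
    """Prepare all clauses concurrently.

    Returns the audio in issuing order, or None if any clause fails,
    in which case nothing should be issued to the vehicle.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(clauses))) as pool:
        # The list position of each future is its issuing time sequence.
        futures = [pool.submit(prepare_clause, c, cache, synthesize)
                   for c in clauses]
        try:
            return [f.result() for f in futures]
        except Exception:
            # Any abnormal clause aborts the whole broadcast request.
            return None


def stub_synthesize(clause):
    """Hypothetical stand-in for the voice synthesis service."""
    if clause == "broken clause":
        raise RuntimeError("synthesis failed")
    return clause.encode("utf-8")


cache_record = {"four hundred fifty meters ahead": b"cached-audio"}
ok = prepare_in_parallel(
    ["four hundred fifty meters ahead", "violation camera"],
    cache_record, stub_synthesize)
failed = prepare_in_parallel(
    ["four hundred fifty meters ahead", "broken clause"],
    cache_record, stub_synthesize)
```

In this sketch the results are collected in submission order, so the issuing time sequence is preserved even when a later clause finishes first.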
Acquiring the audio data of the reusable clause from the cache record and issuing the audio data to the vehicle, when it is confirmed that a reusable clause exists after the second voice broadcast request is split, comprises:
when it is confirmed that a reusable clause exists after the second voice broadcast request is split, acquiring the issuing time sequence of the reusable clause;
and issuing the audio data of the reusable clause acquired from the cache record to the vehicle according to the issuing time sequence to complete the voice interaction.
Thus, the audio data of the reusable clauses acquired from the cache record can be issued to the vehicle according to the acquired issuing time sequence. Issuing each clause of the voice broadcast in order ensures the correctness and integrity of the voice interaction content.
Issuing the audio data of the reusable clause acquired from the cache record to the vehicle according to the issuing time sequence comprises:
if the current issuing time sequence is the same as the issuing time sequence of the reusable clause, issuing the audio data of the clause acquired from the cache record to the vehicle to complete the voice interaction.
Thus, when the current issuing time sequence matches that of the reusable clause, the audio data of the reusable clause is issued to the vehicle, which ensures that the voice broadcast is issued in the correct order and the voice interaction is completed.
Issuing the audio data of the reusable clause acquired from the cache record to the vehicle according to the issuing time sequence comprises:
if the current issuing time sequence is not the same as the issuing time sequence of the reusable clause, issuing the audio data of the reusable clause to the vehicle once the current issuing time sequence has been updated to the issuing time sequence of the reusable clause, to complete the voice interaction.
Thus, when the current issuing time sequence differs from that of the reusable clause, the audio data of the reusable clause is held until its issuing time sequence is reached and is only then issued to the vehicle, which ensures that the voice broadcast is issued in the correct order and the voice interaction is completed.
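The two cases above (issue immediately when the sequence matches, otherwise wait) can be illustrated with a small reorder buffer. This is a sketch under assumptions: the class name `OrderedIssuer`, the integer sequence numbers starting at 0, and the `send` callback are hypothetical and chosen only for illustration.

```python
class OrderedIssuer:
    """Issue clause audio strictly in issuing time sequence."""

    def __init__(self, send):
        self.send = send      # callback that delivers audio to the vehicle
        self.current = 0      # issuing time sequence expected next
        self.pending = {}     # sequence -> audio that is ready but must wait

    def submit(self, sequence, audio):
        """Register audio that is ready; issue everything that is now due."""
        self.pending[sequence] = audio
        while self.current in self.pending:
            self.send(self.pending.pop(self.current))
            self.current += 1


sent = []
issuer = OrderedIssuer(sent.append)

# A cached clause with sequence 1 is ready first, but must wait.
issuer.submit(1, b"cached clause")
waiting_snapshot = list(sent)

# Once the clause with sequence 0 is ready, both are issued in order.
issuer.submit(0, b"synthesized clause")
```

A cached clause is typically available sooner than a synthesized one, so holding it until its turn is what keeps the broadcast in its original order.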
Acquiring the audio data of the reusable clause from the cache record and issuing the audio data to the vehicle, when it is confirmed that a reusable clause exists after the second voice broadcast request is split, comprises:
when it is confirmed that a reusable clause exists after the second voice broadcast request is split, acquiring the audio data of the reusable clause from the cache record;
setting a pre-mute duration and/or a post-mute duration for the audio data of the reusable clause;
and issuing the audio data of the reusable clause, once the setting is complete, to the vehicle to complete the voice interaction.
Thus, a pre-mute duration and/or a post-mute duration can be set for the audio data of the reusable clauses obtained by splitting the voice broadcast request, which simulates the real pause durations in the whole sentence before splitting, and the voice interaction is completed.
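One possible way to apply a pre-mute and post-mute duration is to pad the clause audio with silent frames. The sketch below assumes raw signed 16-bit mono PCM at 16 kHz, where zero-valued bytes represent silence; the application does not specify an audio format, and the function names are hypothetical.

```python
def silence(duration_ms, sample_rate=16000, sample_width=2, channels=1):
    """Return zero-valued PCM bytes lasting duration_ms (signed PCM assumed)."""
    frames = sample_rate * duration_ms // 1000
    return b"\x00" * (frames * sample_width * channels)


def add_mute(pcm, pre_ms=0, post_ms=0, **fmt):
    """Surround clause audio with pre-mute and post-mute silence."""
    return silence(pre_ms, **fmt) + pcm + silence(post_ms, **fmt)


# 10 ms of placeholder audio: 160 frames * 2 bytes per frame.
clip = b"\x01\x02" * 160
padded = add_mute(clip, pre_ms=10, post_ms=20)
```

The durations themselves would have to come from the pause lengths observed in the unsplit sentence; how they are chosen is not covered by this sketch.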
The server of the present application comprises a processor and a memory, wherein the memory stores a computer program that, when executed by the processor, implements the method described above.
The computer-readable storage medium of the present application stores a computer program that, when executed by one or more processors, implements the method described above.
Additional aspects and advantages of embodiments of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of embodiments of the present application.
Drawings
The above and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a first schematic flowchart of the voice interaction method of the present application;
FIG. 2 is a second schematic flowchart of the voice interaction method of the present application;
FIG. 3 is a third schematic flowchart of the voice interaction method of the present application;
FIG. 4 is a fourth schematic flowchart of the voice interaction method of the present application;
FIG. 5 is a fifth schematic flowchart of the voice interaction method of the present application;
FIG. 6 is a sixth schematic flowchart of the voice interaction method of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the embodiments of the present application, and should not be construed as limiting them.
With the development and spread of vehicle electronics, a vehicle can interact with a user by voice: it can not only recognize a voice request issued by the user but also synthesize a voice broadcast as feedback to the user. This human-vehicle voice interaction serves a variety of needs of the driver and passengers while driving. Currently, to ensure the tone quality of voice broadcasts, the broadcast content is usually synthesized in units of whole sentences, and a third-party online voice synthesis service is requested whenever no cached audio data exists for the whole sentence. Without a cache hit, however, more content has to be synthesized, so in some scenarios the voice broadcast is not timely, the broadcasting process is less efficient, and the user experience suffers. For example, while a vehicle is navigating, the content to be broadcast may include different voice broadcasts such as "four hundred fifty meters ahead, violation camera", "four hundred fifty meters ahead, eighty speed-limit camera", and "four hundred fifty meters ahead, line-crossing camera". Road conditions usually change quickly while driving, and a road-condition reminder that is not broadcast in time may no longer be valid. In addition, synthesizing in units of whole sentences may cause the cached content to occupy a large amount of storage space and make requests to the third-party synthesis service costly.
In view of the problems above, and referring to FIG. 1 and FIG. 2, the present application provides a voice interaction method, which includes:
01: according to a first voice broadcast request of a vehicle at a first moment, when the first voice broadcast request is confirmed to belong to a preset field, adding to a cache record the audio data corresponding to the reusable clauses among a plurality of first clauses obtained by splitting the first voice broadcast request;
02: according to a second voice broadcast request of the vehicle at a second moment after the first moment, when the second voice broadcast request is confirmed to belong to the preset field, splitting the second voice broadcast request;
03: when it is confirmed that a reusable clause exists after the second voice broadcast request is split, acquiring the audio data of the reusable clause from the cache record and issuing it to the vehicle to complete the voice interaction.
The present application also provides a server comprising a memory and a processor, and the voice interaction method can be implemented by the server. Specifically, the memory stores a computer program, and the processor is configured to: according to a first voice broadcast request of the vehicle at a first moment, when the first voice broadcast request is confirmed to belong to a preset field, add to a cache record the audio data corresponding to the reusable clauses among a plurality of first clauses obtained by splitting the first voice broadcast request; according to a second voice broadcast request of the vehicle at a second moment after the first moment, when the second voice broadcast request is confirmed to belong to the preset field, split the second voice broadcast request; and, when it is confirmed that a reusable clause exists after the second voice broadcast request is split, acquire the audio data of the reusable clause from the cache record and issue it to the vehicle to complete the voice interaction.
In the present application, for preset fields in which reuse is likely during voice broadcasting, such as the navigation field, a voice broadcast request is split into clauses after it is received, and the clauses with a higher probability of reuse are added to the cache record in units of clauses rather than whole sentences. In subsequent broadcasts, the audio data in the cache record can be used directly for those clauses and issued to the vehicle to complete the voice interaction.
Specifically, the first moment is defined as a moment at which the vehicle has no stored audio data for any voice broadcast request. The first voice broadcast request is a voice broadcast request sent by a service party of the vehicle-mounted system at the first moment. For example, during navigation, when the road condition ahead needs to be announced, a voice broadcast request is sent to the server.
Splitting is the process of dividing a voice broadcast request that contains several clauses into individual clauses according to a predetermined rule, for example splitting at a preset separator or splitting by semantics. Reusable clauses are clauses that recur frequently within the preset field. For example, in road-condition reminders, a clause of the form "xx meters ahead" may recur across reminders for different road conditions, and the broadcasts differ only in the specific road condition being announced. In the example above, "four hundred fifty meters ahead" recurs in all three broadcasts, which differ only in "violation camera", "eighty speed-limit camera", and "line-crossing camera"; "four hundred fifty meters ahead" is therefore a reusable clause. Reusable clauses can be determined by collecting statistics on the broadcast content of the relevant service field: a clause whose number of repetitions exceeds a preset number is determined to be a reusable clause.
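The statistical rule for determining reusable clauses can be sketched as a simple frequency count. The function name `find_reusable_clauses`, the `min_count` threshold, the comma separator, and the sample broadcast strings are all assumptions made for illustration; the application does not prescribe a particular implementation.

```python
from collections import Counter


def find_reusable_clauses(broadcasts, min_count, separator=","):
    """Return clauses that occur more than min_count times across broadcasts."""
    counts = Counter(
        clause.strip()
        for text in broadcasts
        for clause in text.split(separator)
        if clause.strip()
    )
    return {clause for clause, n in counts.items() if n > min_count}


# Illustrative broadcast history for the navigation field.
history = [
    "four hundred fifty meters ahead, violation camera",
    "four hundred fifty meters ahead, eighty speed-limit camera",
    "four hundred fifty meters ahead, line-crossing camera",
]
reusable = find_reusable_clauses(history, min_count=2)
```

With this history, only the distance clause appears more than twice, so it is the only clause that would be treated as reusable.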
The second moment is a moment later than the first moment, at which the vehicle has stored the audio data of the reusable clauses of the first voice broadcast request cached at the first moment. The second voice broadcast request can be split in the same way as the first voice broadcast request, and if the resulting clauses include a clause cached at the first moment, the audio data of that clause can be acquired directly from the cache record. In addition, if splitting the second voice broadcast request yields a reusable clause that was not cached at the first moment, the audio data of that clause is also cached after it is synthesized.
It should be noted that, in the present application, a voice broadcast request is split into clauses only when it belongs to the preset field, which can be regarded as a service field in which clause reuse is highly likely. For fields outside the preset field, synthesis and caching are still performed in units of whole sentences.
In a preset field such as the navigation field, the first voice broadcast request at the first moment may contain at least one clause that is reusable in that field. The voice broadcast request in the preset field can be split to obtain the reusable clause, whose audio data is added to the cache record. When a second voice broadcast request arrives at a second moment after the first moment and the same splitting yields the reusable clause cached at the first moment, the audio data stored at the first moment can be used directly as the audio data of that clause and issued to the vehicle, completing the voice interaction without re-synthesis.
The present application supports splitting the voice broadcast requests of a vehicle-mounted voice assistant and broadcasting them clause by clause. In the preset field, voice broadcast requests often contain repeated clauses; splitting them yields reusable clauses whose audio data is added to the cache record, so that when a later voice broadcast request in the same field contains a reusable clause, the audio data can be fetched quickly from the cache, the synthesis step is skipped, and a smooth voice broadcast reply with good tone quality is produced.
In summary, in the present application, a voice broadcast request of the vehicle at an earlier moment can first be split and, when the request is confirmed to belong to the preset field, the audio data of the reusable clauses among the resulting first clauses is added to the cache record. When a voice broadcast request in the preset field arrives at a later moment and is split, and a reusable clause is confirmed, the corresponding audio data can be acquired from the cache record and issued to the vehicle, completing the voice interaction. The voice interaction method splits voice broadcast requests in the preset field into clauses and caches the clauses that can be reused, so that the cached audio data can be broadcast directly when those clauses are encountered again. This reduces the number of words that must be synthesized by the synthesis service, effectively reduces broadcast latency, keeps synthesis and cache storage costs under control, and improves the user experience.
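The overall flow summarized above can be sketched in a few lines. This is a minimal sketch, not the implementation of the application: the class `BroadcastServer`, the stub synthesizer, the comma-based splitting, and the in-memory dictionary standing in for the cache record are all assumptions introduced for illustration.

```python
def synthesize_stub(text):
    """Hypothetical stand-in for an online voice synthesis service."""
    return f"<audio:{text}>".encode("utf-8")


class BroadcastServer:
    def __init__(self, reusable_clauses, preset_fields=("navigation",)):
        self.reusable_clauses = set(reusable_clauses)
        self.preset_fields = set(preset_fields)
        self.cache = {}            # cache record: clause text -> audio data
        self.synthesis_calls = 0   # counts requests to the synthesis service

    def _synthesize(self, text):
        self.synthesis_calls += 1
        return synthesize_stub(text)

    def handle(self, text, field):
        """Return the audio segments for one voice broadcast request."""
        if field not in self.preset_fields:
            # Outside the preset field: synthesize the whole sentence.
            return [self._synthesize(text)]
        audio = []
        for clause in (c.strip() for c in text.split(",")):
            if not clause:
                continue
            if clause in self.cache:
                audio.append(self.cache[clause])   # reuse cached audio
                continue
            data = self._synthesize(clause)
            if clause in self.reusable_clauses:
                self.cache[clause] = data          # cache reusable clauses
            audio.append(data)
        return audio


server = BroadcastServer({"four hundred fifty meters ahead"})
server.handle("four hundred fifty meters ahead, violation camera", "navigation")
first_calls = server.synthesis_calls
server.handle("four hundred fifty meters ahead, eighty speed-limit camera",
              "navigation")
second_calls = server.synthesis_calls - first_calls
```

In this sketch the first request triggers two synthesis calls and the second only one, because the distance clause is served from the cache record.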
The preset field includes the navigation field.
Specifically, the preset field may be set to the navigation field, because repeated clauses are more likely in voice broadcasts in the navigation field. For example, for a navigation broadcast such as "four hundred fifty meters ahead, violation camera", different broadcasts may differ in the camera being announced while sharing the same relative position; that is, the clause "four hundred fifty meters ahead", or more generally "xx meters ahead", may be reused in different scenarios within the field. The clause "violation camera" can be replaced by another scene description, and the voice interaction method of the present application still applies.
Thus, because the navigation field contains many sentences that may need to be broadcast repeatedly, setting the navigation field as the preset field to which the voice interaction method applies allows the vehicle to respond more quickly with voice broadcasts during navigation while maintaining good tone quality, which improves navigation safety and the user experience.
Step 02 comprises:
splitting the second voice broadcast request according to a preset separator.
The processor is configured to split the second voice broadcast request according to the preset separator.
Specifically, the preset separator is a sentence separator that indicates a pause without any tone or emotional color, such as the comma in Chinese and English; the specific separator type is not limited herein.
In one example, the second voice broadcast request is "Good morning! Today is Thursday". The pause mark "!" in this request carries tone, so no split is made at it. In another example, the second voice broadcast request is "four hundred fifty meters ahead, violation camera". The pause mark "," carries no tone, so the request is split at ",", yielding the two second clauses "four hundred fifty meters ahead" and "violation camera".
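The two examples above can be sketched with a small splitting function. The separator set (English and Chinese commas) and the name `split_request` are assumptions for illustration; the application leaves the concrete separator types open.

```python
import re

# Assumed preset separators: toneless pauses only (English and Chinese comma).
PRESET_SEPARATORS = ",，"


def split_request(text, separators=PRESET_SEPARATORS):
    """Split a voice broadcast request at the preset separators."""
    pattern = "[" + re.escape(separators) + "]"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


greeting = split_request("Good morning! Today is Thursday")
reminder = split_request("four hundred fifty meters ahead, violation camera")
```

Because "!" is not in the separator set, the greeting stays whole, while the reminder is split into its two clauses.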
A second voice broadcast request that can be split into several second clauses may produce various combinations of results. Taking a second voice broadcast request that splits into three second clauses (t1/t2/t3) as an example, each clause either hits or misses the stored reusable clauses, that is, "t1 (cached or not) × t2 (cached or not) × t3 (cached or not)", which gives 2 × 2 × 2 = 8 possible combinations.
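The eight hit/miss combinations for three clauses can be enumerated directly; the snippet below is only an illustration of the counting argument, using the clause labels t1, t2, and t3 from the text.

```python
from itertools import product

clauses = ["t1", "t2", "t3"]

# Each clause is either found in the cache record (True) or not (False).
combinations = [dict(zip(clauses, hits))
                for hits in product([True, False], repeat=len(clauses))]

for combo in combinations:
    print(combo)
```

Each printed dictionary corresponds to one way the server may have to mix cached audio and newly synthesized audio for the same request.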
Thus, the voice broadcast request can be split at pause separators that carry no tone to obtain a plurality of clauses. Shortening the sentences reduces the time spent waiting for the audio of the voice broadcast request to be returned and makes it easier to judge subsequently whether each clause is reusable.
Referring to FIG. 2, the method further includes:
04: for the non-reusable clauses obtained after the second voice broadcast request is split, synthesizing the audio data of the non-reusable clauses through a voice synthesis service and issuing the audio data to the vehicle to complete the voice interaction.
The processor is configured to, for the non-reusable clauses obtained after the second voice broadcast request is split, synthesize the audio data of the non-reusable clauses through the voice synthesis service and issue the audio data to the vehicle to complete the voice interaction.
Specifically, at the second moment when the second voice broadcast request is sent, the audio data of the clauses of the first voice broadcast request that were confirmed to be reusable at the earlier first moment is already stored in the memory. After the second voice broadcast request arrives, it can be split at the toneless pause separators, and for the reusable clauses among the resulting clauses the audio data can be acquired from the cache record. The audio data of each non-reusable clause in the second voice broadcast request is then synthesized by the voice synthesis service, and the synthesized audio data is issued to the vehicle to complete the voice interaction. For the convenience of subsequent voice interaction, the synthesized audio data of the non-reusable clauses may also be cached, so that when the vehicle later travels to the same location again, cached data is available for the whole voice broadcast request.
In one example, the first voice broadcast request at the first time is "four hundred fifty meters away, there is illegal photography", and the split processing can obtain two clauses of "four hundred fifty meters away" and "there is illegal photography". Wherein, the 'four hundred and fifty meters' is confirmed to be a reusable clause, and the corresponding audio data is stored in the audio cache record. The second voice broadcast request at the second moment is that the four-hundred fifty-meter places and the limited-speed eighty photographing are performed, and two second clauses of the four-hundred fifty-meter places and the limited-speed eighty photographing are obtained through split processing. The clause 'four hundred and fifty meters' is a reusable clause, and can acquire audio data from the cache record, and the clause 'eighty-speed-limited photographing' is an un-reusable clause, and can not hit the reusable clause stored in the storage record, so that the audio data corresponding to the 'eighty-speed-limited photographing' is synthesized by the voice synthesis service according to the audio data, and is issued to the vehicle, and voice interaction is completed.
Therefore, for clauses that are not reusable after the sentence is split, that is, clauses with no cached audio data, the audio data is synthesized through the voice synthesis service and issued to supplement the broadcast content, completing the voice interaction.
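The cache lookup with a synthesis fallback described above can be illustrated with a minimal sketch. It is not the implementation of the present application: the separator set, the in-memory dictionary used as the cache record, and the names `split_clauses`, `fake_synthesize`, and `audio_for_request` are all assumptions introduced for illustration.

```python
import re

# Assumed pause separators (ASCII and full-width commas); the source names no concrete set.
SEPARATORS = re.compile(r"[,，]")

def split_clauses(request_text):
    """Split a broadcast request into clauses at pause separators."""
    return [part.strip() for part in SEPARATORS.split(request_text) if part.strip()]

def fake_synthesize(clause):
    """Hypothetical stand-in for the voice synthesis service."""
    return ("tts:" + clause).encode("utf-8")

def audio_for_request(request_text, cache, synthesize=fake_synthesize):
    """Return (clause, audio, source) triples in broadcast order."""
    results = []
    for clause in split_clauses(request_text):
        if clause in cache:
            # Reusable clause: take the audio from the cache record.
            results.append((clause, cache[clause], "cache"))
        else:
            # Non-reusable clause: synthesize, then optionally cache for later requests.
            audio = synthesize(clause)
            cache[clause] = audio
            results.append((clause, audio, "tts"))
    return results

cache = {"four hundred fifty meters away": b"cached-audio"}
parts = audio_for_request(
    "four hundred fifty meters away, speed limit eighty camera", cache
)
sources = [source for _, _, source in parts]
```

In this sketch the first clause is served from the cache and the second is synthesized and then cached, mirroring the optional caching of synthesized clauses mentioned above.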
Referring to fig. 3, the method further includes:
05: confirming the issuing time sequence of the second clauses according to the second clauses obtained by splitting the second voice broadcasting request;
06: processing the plurality of second clauses in parallel, acquiring audio data of the reusable clauses in the plurality of second clauses from the cache record, and synthesizing through a voice synthesis service to obtain audio data of the non-reusable clauses in the plurality of second clauses;
07: issuing the audio data of each second clause to the vehicle according to the issuing time sequence to complete the voice interaction.
The processor is configured to confirm the issuing time sequence of the second clauses according to the plurality of second clauses obtained by splitting the second voice broadcast request; to process the plurality of second clauses in parallel, acquiring the audio data of the reusable clauses among them from the cache record and obtaining the audio data of the non-reusable clauses among them through synthesis by the voice synthesis service; and to issue the audio data of each second clause to the vehicle according to the issuing time sequence to complete the voice interaction.
Specifically, after the second voice broadcast request is split into a plurality of second clauses, it enters a clause scheduling process. Splitting does not change the issuing time sequence: the second clauses keep the order they had in the second voice broadcast request. For the reusable clauses among the second clauses, the corresponding audio data can be found in the cache record. For the non-reusable clauses among the second clauses, the audio data is obtained through synthesis by the voice synthesis service. Finally, once the issuing time sequence of the second clauses is determined, the audio data of each second clause is issued to the vehicle in that sequence, completing the voice interaction process.
In an actual scenario, for a second voice broadcast request that can be split into three second clauses, each second clause can be marked with a timing label (1/2/3), and the number of clauses can be recorded as (3). According to the timing labels, after the clause of the current timing is issued, the audio data of the next timing is fetched in order and the current timing is updated, that is, timing (+1).
In the above example, the three clauses may be processed in parallel: the audio data of the reusable clauses among the second clauses is acquired from the cache record, while the audio data of the non-reusable clauses is obtained through synthesis by the voice synthesis service. The audio data of each second clause is then issued to the vehicle according to the confirmed issuing time sequence. If the current timing equals the pre-recorded number of clauses, an end mark can be issued to the vehicle after the current clause is issued; the issuing process then ends and the voice interaction is complete.
Therefore, the issuing time sequence of the reusable and non-reusable clauses obtained after the voice broadcast request is split can be confirmed, so that the split request is issued completely and in order, guaranteeing the integrity of the voice interaction content. Processing the clauses in parallel also shortens the preparation time before the audio data is issued and improves the efficiency of the voice interaction process.
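One possible way to combine parallel processing with in-order issuing is sketched below, using a thread pool whose results are consumed by timing label. The function names, the `END_MARK` value, and the simulated synthesis delay are assumptions for illustration only and are not taken from the present application.

```python
import time
from concurrent.futures import ThreadPoolExecutor

END_MARK = "END"  # hypothetical end-of-broadcast marker

def fetch_audio(clause, cache):
    """Cache hit returns at once; a miss simulates slower synthesis."""
    if clause in cache:
        return cache[clause]
    time.sleep(0.05)  # stand-in for voice synthesis latency
    return ("tts:" + clause).encode("utf-8")

def issue_in_order(clauses, cache, send):
    """Process clauses in parallel, then issue them by timing label 1..n."""
    total = len(clauses)  # recorded number of clauses
    with ThreadPoolExecutor(max_workers=max(1, total)) as pool:
        futures = [pool.submit(fetch_audio, clause, cache) for clause in clauses]
        for timing, future in enumerate(futures, start=1):
            send(timing, future.result())  # blocks until this timing is ready
            if timing == total:
                send(END_MARK, None)  # last clause issued: send the end mark

cache = {"four hundred fifty meters away": b"a1", "please slow down": b"a3"}
sent = []
issue_in_order(
    ["four hundred fifty meters away", "speed limit eighty camera", "please slow down"],
    cache,
    lambda timing, audio: sent.append((timing, audio)),
)
```

All three clauses are submitted at once, so the cached clause with timing (3) is ready before the synthesized clause with timing (2), yet the loop still issues them in the order 1, 2, 3 followed by the end mark.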
Referring to fig. 4, the method further includes:
08: during the parallel processing, if the processing of at least one second clause is abnormal, stopping issuing the audio data of the second voice broadcast request.
The processor is configured to stop issuing the audio data of the second voice broadcast request if the processing of at least one second clause is abnormal during the parallel processing.
Specifically, exceptions may occur while the second clauses are processed in parallel, for example the audio data packet corresponding to a second clause is not found within a preset time, or a third-party service exception or a server cache exception occurs while the audio data packet is being looked up.
Since the second clauses are scheduled in parallel, an exception, once it occurs, may arise in several parallel threads and at several stages. Although the second voice broadcast request is split into a plurality of second clauses, it is still regarded as a single voice broadcast request when issued to the vehicle, so the vehicle can receive only one abnormal result. A notification mechanism is therefore added: when the processing of any second clause is abnormal, the other threads are notified to stop issuing the audio data of the other second clauses in the second voice broadcast request.
Therefore, when any stage of the parallel processing is abnormal, issuing of the audio data of the second voice broadcast request is stopped, which handles the exceptions introduced by parallel processing after the sentence is split.
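The notification mechanism can be sketched with a shared event flag that any failing worker sets and that the issuing loop checks before each clause is sent. This is only an illustrative sketch: the function `issue_with_abort`, the `"ERROR"` marker, the one-second timeout, and the failing synthesis stub are assumptions, not details given in the present application.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def issue_with_abort(clauses, cache, synthesize, send, timeout_s=1.0):
    """Issue clauses in order; on any failure send one error and stop."""
    stop = threading.Event()  # notification shared by all worker threads

    def work(clause):
        if stop.is_set():
            return None  # another clause already failed; skip the work
        try:
            return cache[clause] if clause in cache else synthesize(clause)
        except Exception:
            stop.set()  # notify the other threads
            raise

    with ThreadPoolExecutor(max_workers=max(1, len(clauses))) as pool:
        futures = [pool.submit(work, clause) for clause in clauses]
        for timing, future in enumerate(futures, start=1):
            try:
                audio = future.result(timeout=timeout_s)
            except Exception:  # worker failure or timeout
                stop.set()
            if stop.is_set():
                send("ERROR", None)  # the vehicle receives a single abnormal result
                return False
            send(timing, audio)
    return True

def failing_synthesize(clause):
    """Hypothetical synthesis stub that always fails."""
    raise RuntimeError("synthesis service unavailable")

cache = {"four hundred fifty meters away": b"a1", "please slow down": b"a3"}
sent = []
ok = issue_with_abort(
    ["four hundred fifty meters away", "speed limit eighty camera", "please slow down"],
    cache,
    failing_synthesize,
    lambda timing, audio: sent.append((timing, audio)),
)
```

Depending on thread scheduling, the first clause may or may not have been issued before the failure is noticed, but the clauses after the failing one are never issued and exactly one error is reported.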
Referring to fig. 5, step 03 includes:
031: when it is determined that a reusable clause exists after the second voice broadcast request is split, acquiring the issuing time sequence of the reusable clause;
032: issuing the audio data of the reusable clause acquired from the cache record to the vehicle according to the issuing time sequence to complete the voice interaction.
The processor is configured to acquire the issuing time sequence of the reusable clause when it is determined that a reusable clause exists after the second voice broadcast request is split, and to issue the audio data of the reusable clause acquired from the cache record to the vehicle according to the issuing time sequence to complete the voice interaction.
Specifically, when the second voice broadcast request has a reusable clause after being split, the issuing timing sequence of the reusable clause can be acquired.
In actual parallel processing, because its audio data packet can be extracted directly from the cache record, a reusable clause is likely to be processed faster than a parallel non-reusable clause whose audio data is synthesized through the voice synthesis service, so a clause later in the time sequence may obtain its audio data earlier.
In this case, the audio data of the reusable clauses acquired from the cache record still needs to be issued to the vehicle according to the issuing time sequence to complete the voice interaction.
Therefore, the audio data of the reusable clauses acquired from the cache record can be issued to the vehicle according to the acquired issuing time sequence. Each clause in the voice broadcast is issued in order, which ensures the correctness and integrity of the voice interaction content.
Step 032 includes:
if the current issuing time sequence is the same as the issuing time sequence of the reusable clause, issuing the audio data of the reusable clause acquired from the cache record to the vehicle to complete the voice interaction.
The processor is configured to issue the audio data of the reusable clause acquired from the cache record to the vehicle to complete the voice interaction if the current issuing time sequence is the same as the issuing time sequence of the reusable clause.
Specifically, in actual parallel processing, because its audio data packet can be extracted directly from the cache record, a reusable clause is likely to be processed faster than a parallel non-reusable clause whose audio data is synthesized through the voice synthesis service, so a clause later in the time sequence may obtain its audio data earlier. A waiting mechanism for the reusable clause therefore needs to be introduced. If the timing of the clause currently being issued is the same as the stored issuing timing of the reusable clause, the audio data of the reusable clause is due to be issued now, so the audio data of the clause in the cache record can be issued directly to the vehicle, completing the voice interaction.
In one example, the voice broadcast request is "four hundred fifty meters away, speed limit eighty camera, please slow down", and splitting yields three clauses, of which "four hundred fifty meters away" and "please slow down" are reusable clauses and "speed limit eighty camera" is a non-reusable clause. During issuing, when the issuing timing is (1), the audio data of "four hundred fifty meters away" acquired from the cache record is issued; when the issuing timing is (2), the audio data of "speed limit eighty camera" synthesized by the voice synthesis service is issued; and when the issuing timing is (3), the audio data of "please slow down" acquired from the cache record is issued, completing the voice interaction.
Therefore, the audio data of a reusable clause acquired from the cache record can be issued to the vehicle when its issuing timing arrives, completing the voice interaction process.
Step 032 includes:
if the current issuing time sequence is not the same as the issuing time sequence of the reusable clause, waiting until the current issuing time sequence is updated to the issuing time sequence of the reusable clause, and then issuing the audio data of the reusable clause to the vehicle to complete the voice interaction.
The processor is configured, if the current issuing time sequence is not the same as the issuing time sequence of the reusable clause, to wait until the current issuing time sequence is updated to the issuing time sequence of the reusable clause and then to issue the audio data of the reusable clause to the vehicle to complete the voice interaction.
Specifically, in actual parallel processing, because its audio data packet can be extracted directly from the cache record, a reusable clause is likely to be processed faster than a parallel non-reusable clause whose audio data is synthesized through the voice synthesis service, so a clause later in the time sequence may obtain its audio data earlier. A waiting mechanism for the reusable clause therefore needs to be introduced. If the timing of the clause currently being issued differs from the stored issuing timing of the reusable clause, it is not yet that clause's turn: its audio data is held until the current issuing timing is updated to match, and only then issued to the vehicle, completing the voice interaction.
In one example, the voice broadcast request is "four hundred fifty meters away, speed limit eighty camera, please slow down", and splitting yields three clauses: "four hundred fifty meters away", "speed limit eighty camera", and "please slow down". The audio data of the two stored reusable clauses, "four hundred fifty meters away" and "please slow down", can be acquired directly from the cache, whereas the non-reusable clause "speed limit eighty camera" must be synthesized through the voice synthesis service, which is likely to take longer than acquiring audio data directly from the cache.
In the above example, when the issuing timing is (2), the audio data of the reusable clause with timing (3) may already have been extracted, but it must wait until the audio data synthesized by the voice synthesis service, that is, the non-reusable clause with timing (2), has been issued. Once the audio data with timing (2) has been issued, the audio data of the reusable clause with timing (3) can be issued. The multithreaded parallel processing thus does not affect the order in which audio data is issued, and the voice interaction completes smoothly.
Therefore, when the current issuing timing differs from that of a reusable clause, the clause's audio data waits for its own issuing timing before being issued to the vehicle, which ensures that the voice broadcast is issued in the correct order and that the voice interaction is finally completed.
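The waiting mechanism can also be sketched from the point of view of the worker threads themselves: each thread holds one clause and may issue it only when the current issuing timing equals the clause's own timing label. The sketch below uses a condition variable; the class name `OrderedIssuer`, the delays, and the dummy audio are assumptions introduced for illustration.

```python
import threading
import time

class OrderedIssuer:
    """Lets worker threads issue audio only when their timing label is current."""

    def __init__(self, send):
        self._send = send
        self._current = 1  # timing label that may be issued next
        self._cond = threading.Condition()

    def issue(self, timing, audio):
        with self._cond:
            # Wait until the current issuing timing equals this clause's timing.
            self._cond.wait_for(lambda: self._current == timing)
            self._send(timing, audio)
            self._current += 1  # timing (+1)
            self._cond.notify_all()

sent = []
issuer = OrderedIssuer(lambda timing, audio: sent.append(timing))

def worker(timing, delay_s):
    time.sleep(delay_s)  # simulated time to obtain the audio
    issuer.issue(timing, b"audio")

# Timing 2 is the slow, synthesized clause; timings 1 and 3 are cache hits.
threads = [
    threading.Thread(target=worker, args=(timing, delay_s))
    for timing, delay_s in [(3, 0.0), (2, 0.05), (1, 0.0)]
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

The clause with timing (3) is ready first but blocks until timings (1) and (2) have been issued. A production version would presumably add a timeout to the wait so that a failed clause cannot block the others indefinitely.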
Referring to fig. 6, step 03 includes:
033: when it is determined that a reusable clause exists after the second voice broadcast request is split, acquiring the audio data of the reusable clause from the cache record;
034: setting a pre-mute duration and/or a post-mute duration for the audio data of the reusable clause;
035: issuing the audio data of the reusable clause, after the setting is completed, to the vehicle to complete the voice interaction.
Specifically, when it is determined that a reusable clause exists after the second voice broadcast request is split, the audio data of the reusable clause is acquired from the cache record. Because the request has been split, a pre-mute duration and/or a post-mute duration needs to be set for the audio data of each reusable clause. The audio data, once set, is issued to the vehicle, completing the voice interaction.
In an actual scenario, the second voice broadcast request is one complete sentence before splitting, and the pause at each comma is short. After splitting, the request becomes a plurality of clauses, each of which is treated as a whole sentence, and the pause between whole sentences may be long, for example 2 seconds, 3 seconds, or longer. A pre-mute duration between each clause and the preceding clause, and a post-mute duration between each clause and the following clause, can therefore be set. These mute durations are generally short, such as 0.1 or 0.2 seconds, so that they simulate the comma pauses of the whole sentence before splitting.
Therefore, a pre-mute duration and/or a post-mute duration can be set for the audio data of the reusable clauses obtained after the voice broadcast request is split, simulating the real pause durations of the whole sentence before splitting and completing the voice interaction process.
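As an illustration of setting pre-mute and post-mute durations, the sketch below pads raw audio with zero-valued samples. It assumes signed 16-bit mono PCM at 16 kHz, for which zero bytes represent silence; the audio format, sample rate, and the names `silence` and `pad_clause` are assumptions, since the present application does not specify them.

```python
def silence(duration_s, sample_rate=16000, sample_width=2, channels=1):
    """Return zero-valued PCM bytes lasting duration_s seconds (signed PCM assumed)."""
    frames = int(round(duration_s * sample_rate))
    return b"\x00" * (frames * sample_width * channels)

def pad_clause(pcm, pre_s=0.0, post_s=0.0):
    """Prepend and append silence to one clause's PCM audio."""
    return silence(pre_s) + pcm + silence(post_s)

clause_pcm = b"\x01\x02" * 100  # 100 dummy frames standing in for cached audio
padded = pad_clause(clause_pcm, pre_s=0.1, post_s=0.2)
```

With these assumed parameters, 0.1 seconds of pre-mute adds 3200 bytes before the clause and 0.2 seconds of post-mute adds 6400 bytes after it.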
The computer-readable storage medium of the present application stores a computer program that, when executed by one or more processors, implements the method described above.
In the description of the present specification, a description with reference to the terms "above", "specifically", and the like means that a particular feature, structure, material, or characteristic described in connection with an embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, various embodiments or examples, and features of different embodiments or examples, described in this specification can be combined by one skilled in the art without contradiction.
Any process or method description in a flow chart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application includes alternate implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art to which the present application pertains.
Although embodiments of the present application have been shown and described above, it is to be understood that the above embodiments are exemplary and not to be construed as limiting the present application, and that changes, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (12)

1. A method of voice interaction, comprising:
according to a first voice broadcast request of a vehicle at a first time, when it is confirmed that the first voice broadcast request belongs to a preset domain, adding to a cache record the audio data corresponding to the reusable clauses among a plurality of first clauses obtained after the first voice broadcast request is split;
according to a second voice broadcast request of the vehicle at a second time after the first time, when it is confirmed that the second voice broadcast request belongs to the preset domain, splitting the second voice broadcast request;
and when it is determined that the reusable clauses exist after the second voice broadcast request is subjected to the splitting processing, acquiring the audio data of the reusable clauses from the cache record and issuing the audio data to the vehicle to complete voice interaction.
2. The voice interaction method of claim 1, wherein the preset domain comprises a navigation domain.
3. The voice interaction method according to claim 1, wherein the splitting the second voice broadcast request includes:
and splitting the second voice broadcast request according to a preset separator.
4. The voice interaction method according to claim 1, wherein the voice interaction method further comprises:
synthesizing, through a voice synthesis service, the audio data of the non-reusable clauses obtained after the second voice broadcast request is subjected to the splitting processing, and issuing the audio data to the vehicle to complete voice interaction.
5. The voice interaction method according to claim 1, wherein the voice interaction method comprises:
confirming the issuing time sequence of a plurality of second clauses according to the plurality of second clauses obtained by splitting the second voice broadcast request;
processing a plurality of second clauses in parallel, acquiring audio data of the reusable clauses in the second clauses from the cache record, and synthesizing through a voice synthesis service to obtain audio data of the non-reusable clauses in the second clauses;
and issuing the audio data of each second clause to the vehicle according to the issuing time sequence to finish voice interaction.
6. The voice interaction method of claim 5, wherein the interaction method further comprises:
and in the process of parallel processing, if the processing of at least one second clause is abnormal, stopping issuing the audio data of the second voice broadcast request.
7. The voice interaction method according to any one of claims 1 to 6, wherein when it is determined that the reusable clause exists after the second voice broadcast request is subjected to the splitting processing, acquiring audio data of the reusable clause from the cache record and issuing the audio data to a vehicle, includes:
under the condition that the second voice broadcast request is confirmed to have the reusable clause after being subjected to the splitting processing, acquiring an issuing time sequence of the reusable clause;
and issuing the audio data of the reusable clauses acquired from the cache records to a vehicle according to the issuing time sequence so as to finish voice interaction.
8. The voice interaction method according to claim 7, wherein the issuing, according to the issuing timing sequence, audio data of the reusable clauses obtained from the cache record to a vehicle includes:
and if the current issuing time sequence is the same as the issuing time sequence of the reusable clause, issuing the audio data of the clause acquired from the cache record to a vehicle to finish voice interaction.
9. The voice interaction method according to claim 8, wherein the issuing the audio data of the reusable clauses obtained from the cache record to a vehicle according to the issuing timing sequence comprises:
and if the current issuing time sequence is not the same as the issuing time sequence of the reusable clause, issuing the audio data of the reusable clause to a vehicle to finish voice interaction when the current issuing time sequence is updated to the issuing time sequence of the reusable clause.
10. The voice interaction method according to claim 1, wherein when it is determined that the reusable clause exists after the second voice broadcast request is subjected to the splitting processing, acquiring audio data of the reusable clause from the cache record and issuing the audio data to a vehicle, includes:
under the condition that the second voice broadcast request is determined to have the reusable clauses after being subjected to the splitting processing, acquiring audio data of the reusable clauses from the cache records;
setting a pre-mute duration and/or a post-mute duration for the audio data of the reusable clauses;
and transmitting the audio data of the reusable clauses after the setting is finished to the vehicle so as to finish the voice interaction.
11. A server, characterized in that the server comprises a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, carries out the method of any one of claims 1-10.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by one or more processors, implements the method of any one of claims 1-10.
CN202211594190.5A 2022-12-13 2022-12-13 Voice interaction method, server and computer readable storage medium Active CN115602171B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211594190.5A CN115602171B (en) 2022-12-13 2022-12-13 Voice interaction method, server and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211594190.5A CN115602171B (en) 2022-12-13 2022-12-13 Voice interaction method, server and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN115602171A (en) 2023-01-13
CN115602171B (en) 2023-03-31

Family

ID=84853833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211594190.5A Active CN115602171B (en) 2022-12-13 2022-12-13 Voice interaction method, server and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115602171B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH116743A (en) * 1997-04-22 1999-01-12 Toyota Motor Corp Mobile terminal device and voice output system for it
US20050144015A1 (en) * 2003-12-08 2005-06-30 International Business Machines Corporation Automatic identification of optimal audio segments for speech applications
CN1956056A (en) * 2006-10-16 2007-05-02 同济大学 Speech synthesis device, speech synthesis method and GPS speech guide system
CN111161747A (en) * 2020-04-03 2020-05-15 深圳市友杰智新科技有限公司 Prediction method and device based on Tensorflow awakening model and computer equipment
CN111627438A (en) * 2020-05-21 2020-09-04 四川虹美智能科技有限公司 Voice recognition method and device
CN112217734A (en) * 2019-07-10 2021-01-12 海能达通信股份有限公司 Voice information synchronization method and communication system
CN113113002A (en) * 2019-12-25 2021-07-13 斑马智行网络(香港)有限公司 Vehicle voice interaction method and system and voice updating system
CN115129293A (en) * 2021-03-26 2022-09-30 阿里巴巴新加坡控股有限公司 Navigation voice processing method and device, electronic equipment and program product

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孙佰虎: "《 导航模式对空间知识获取的影响及其认知机制》", 《中国优秀硕士学位论文全文数据库 (哲学与人文科学辑)》 *

Also Published As

Publication number Publication date
CN115602171B (en) 2023-03-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant