CN117133288A - Interactive processing method, vehicle machine and vehicle terminal - Google Patents

Interactive processing method, vehicle machine and vehicle terminal

Info

Publication number
CN117133288A
CN117133288A
Authority
CN
China
Prior art keywords
audio data
semantic
preset
vehicle
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310821796.6A
Other languages
Chinese (zh)
Inventor
王伟凯
吴尧
张涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zebred Network Technology Co Ltd
Original Assignee
Zebred Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zebred Network Technology Co Ltd filed Critical Zebred Network Technology Co Ltd
Priority to CN202310821796.6A priority Critical patent/CN117133288A/en
Publication of CN117133288A publication Critical patent/CN117133288A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60R: VEHICLES, VEHICLE FITTINGS, OR VEHICLE PARTS, NOT OTHERWISE PROVIDED FOR
    • B60R 16/00: Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for
    • B60R 16/02: Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for electric constitutive elements
    • B60R 16/037: Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for electric constitutive elements for occupant comfort, e.g. for automatic adjustment of appliances according to personal settings, e.g. seats, mirrors, steering wheel
    • B60R 16/0373: Voice control
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/1822: Parsing for meaning understanding
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command

Abstract

The application provides an interaction processing method, a vehicle machine and a vehicle terminal, and relates to interaction processing technology for an intelligent automobile cabin operating system. The method includes the following steps: acquiring audio data of an object; judging the audio data by using a preset semantic model and/or a preset retrieval model, and determining the semantic type of the audio data, where the preset semantic model and the preset retrieval model are obtained through training according to multiple rounds of interaction information between the object and the vehicle machine; and if the semantic type is determined to be a type of speaking to the vehicle machine, identifying the operation instruction corresponding to the audio data and executing the operation instruction to complete the interaction processing between the object and the vehicle machine. By introducing retrieval-model judgment on top of semantic-model judgment, the method effectively alleviates the problem that specific utterances cannot be recalled, ensures accurate skill experience while avoiding disturbance, and thereby solves the technical problem that the accuracy with which a vehicle machine identifies operation instructions is relatively low.

Description

Interactive processing method, vehicle machine and vehicle terminal
Technical Field
The application relates to interaction processing technology for intelligent automobile cabin operating systems, and in particular to an interaction processing method, a vehicle machine and a vehicle terminal.
Background
At present, the vehicle machine arranged in an intelligent automobile cabin carries the main functions through which a user operates vehicle-body equipment, audio-visual entertainment and the like, so the vehicle machine needs to accurately identify and respond to the user's operation instructions.
In the prior art, the vehicle machine stays in a full-duplex, wake-free state so that it can respond to audio uttered by the user. However, the audio uttered in this state is not limited to operation instructions; it also includes casual conversation between driver and passengers. That is, even when the driver and passengers have no intention of interacting with the vehicle machine, the vehicle machine may still treat their casual conversation as an operation instruction and respond to it. As a result, the vehicle machine cannot accurately identify the user's operation instructions and frequently interrupts or initiates conversations, which greatly disturbs the people in the vehicle. For these reasons, the accuracy with which existing vehicle machines identify operation instructions is relatively low, and the user experience is relatively poor.
Disclosure of Invention
The application provides an interaction processing method, a vehicle machine and a vehicle terminal, which are used to solve the technical problem that the accuracy with which a vehicle machine identifies operation instructions is relatively low.
In a first aspect, the present application provides an interaction processing method, including:
Acquiring audio data of an object;
judging the audio data by using a preset semantic model and/or a preset retrieval model, and determining the semantic type of the audio data; the preset semantic model and the search model are obtained through training according to multiple rounds of interaction information between the object and the vehicle;
and if the semantic type is determined to be the type of speaking to the vehicle, identifying an operation instruction corresponding to the audio data, and executing the operation instruction to complete interaction processing between the object and the vehicle.
Further, the judging the audio data by using a preset semantic model and/or a preset retrieval model and determining the semantic type of the audio data includes:
acquiring interaction information of preset rounds of the object closest to the current moment, and judging the interaction information of the preset rounds and the audio data by using the preset semantic model to obtain judgment result information; the judgment result information characterizes whether the semantic type of the audio data is a type of speaking to the vehicle machine;
and/or converting the audio data into a semantic vector according to a preset first coding model, and retrieving whether the semantic vector is in a preset candidate utterance table by using the preset retrieval model to obtain retrieval result information; the retrieval result information characterizes whether the semantic type of the audio data is a type of speaking to the vehicle machine, and the preset first coding model characterizes the correspondence between audio data and semantic vectors;
and if the judgment result information is inconsistent with the retrieval result information, determining the retrieval result information as the semantic type of the audio data.
Further, the interaction information comprises semantic information and event information aiming at object behaviors; wherein,
the semantic information includes first audio data most recently uttered by the object, second audio data most recently uttered by the vehicle machine, and third audio data having the largest similarity value with the audio data among the preset rounds closest to the current moment; the event information includes an event tag and a talk event for each of the preset rounds closest to the current moment.
Further, the step of judging the interaction information of the preset round and the audio data by using a preset semantic model to obtain judgment result information includes:
according to a preset second coding model, coding and converting the semantic information and the audio data to obtain semantic information coding vectors corresponding to the semantic information and the audio data, and coding and converting the event information to obtain behavior information coding vectors corresponding to the event information; the second coding model characterizes the corresponding relation between the data to be converted and the coding vector;
and judging the semantic information coding vector and the behavior information coding vector by using the preset semantic model to obtain the judgment result information.
Further, after the audio data is acquired, further comprising:
performing text conversion processing on the audio data to obtain text-converted audio data;
judging the audio data by using a preset semantic model and/or a preset retrieval model, and determining the semantic type of the audio data, wherein the method comprises the following steps:
and judging the text-converted audio data by using a preset semantic model and/or a preset retrieval model, and determining the semantic type of the audio data.
Further, the method further comprises:
if the semantic type is determined to be a type of not speaking to the vehicle machine, deleting the audio data and acquiring new audio data.
In a second aspect, the present application provides a vehicle machine comprising:
an acquisition unit configured to acquire audio data of an object;
the first determining unit is used for judging the audio data by utilizing a preset semantic model and/or a preset retrieval model and determining the semantic type of the audio data; the preset semantic model and the search model are obtained through training according to multiple rounds of interaction information between the object and the vehicle;
the second determining unit, used for identifying the operation instruction corresponding to the audio data if the semantic type is determined to be a type of speaking to the vehicle machine;
and the execution unit is used for executing the operation instruction to complete the interaction processing between the object and the vehicle.
Further, the first determining unit includes:
the acquisition module is used for acquiring interaction information of the preset round of which the object is nearest to the current moment;
the judging module, used for judging the interaction information of the preset rounds and the audio data by using the preset semantic model to obtain judgment result information; the judgment result information characterizes whether the semantic type of the audio data is a type of speaking to the vehicle machine;
and/or a retrieval module, used for converting the audio data into a semantic vector according to a preset first coding model, and retrieving whether the semantic vector is in a preset candidate utterance table by using the preset retrieval model to obtain retrieval result information; the retrieval result information characterizes whether the semantic type of the audio data is a type of speaking to the vehicle machine, and the preset first coding model characterizes the correspondence between audio data and semantic vectors;
and the determining module, used for determining the retrieval result information as the semantic type of the audio data if the judgment result information is inconsistent with the retrieval result information.
Further, the interaction information comprises semantic information and event information aiming at object behaviors; wherein,
the semantic information includes first audio data most recently uttered by the object, second audio data most recently uttered by the vehicle machine, and third audio data having the largest similarity value with the audio data among the preset rounds closest to the current moment; the event information includes an event tag and a talk event for each of the preset rounds closest to the current moment.
Further, the judging module includes:
the coding sub-module is used for carrying out coding conversion on the semantic information and the audio data according to a preset second coding model to obtain semantic information coding vectors corresponding to the semantic information and the audio data, and carrying out coding conversion on the event information to obtain behavior information coding vectors corresponding to the event information; the second coding model characterizes the corresponding relation between the data to be converted and the coding vector;
and the judging sub-module, used for judging the semantic information coding vector and the behavior information coding vector by using the preset semantic model to obtain the judgment result information.
Further, the method further comprises the following steps:
the conversion unit is used for performing text conversion processing on the audio data after the audio data are acquired to obtain text-converted audio data;
the first determining unit is specifically configured to:
and judging the text-converted audio data by using a preset semantic model and/or a preset retrieval model, and determining the semantic type of the audio data.
Further, the vehicle machine further includes:
and the deleting unit is used for deleting the audio data and acquiring new audio data if the semantic type is determined to be the type which is not uttered to the car machine.
In a third aspect, the present application provides a vehicle terminal in which a vehicle machine is disposed, where the vehicle machine is the vehicle machine in the second aspect.
In a fourth aspect, the present application provides a vehicle machine, including a memory and a processor, where the memory stores a computer program that can be run on the processor, and the processor implements the method of the first aspect when executing the computer program.
In a fifth aspect, the present application provides a computer readable storage medium having stored therein computer executable instructions for implementing the method of the first aspect when executed by a processor.
In a sixth aspect, the application provides a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect.
The application provides an interaction processing method, a vehicle machine and a vehicle terminal. The audio data of an object is acquired and judged by using a preset semantic model and/or a preset retrieval model to determine its semantic type; the preset semantic model and the preset retrieval model are obtained through training according to multiple rounds of interaction information between the object and the vehicle machine. If the semantic type is determined to be a type of speaking to the vehicle machine, the operation instruction corresponding to the audio data is identified and executed to complete the interaction processing between the object and the vehicle machine. In this scheme, the audio data of the object is acquired in real time and its semantic type is determined by the preset semantic model and/or the preset retrieval model; if the semantic type is speaking to the vehicle machine, the corresponding operation instruction is further determined and executed, thereby completing the interaction processing between the object and the vehicle machine.
Therefore, retrieval-model judgment is introduced on the basis of semantic-model judgment: the semantic model and the retrieval model jointly determine whether the audio data is an operation instruction, and the audio data is responded to only when it is. This effectively alleviates the problem that specific utterances cannot be recalled, enables the retrieval model to identify command-type instructions uttered by the user during continuous speech, ensures accurate skill experience while avoiding disturbance, and thereby solves the technical problem that the accuracy with which a vehicle machine identifies operation instructions is relatively low.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic flow chart of an interaction processing method according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for determining a semantic model according to an embodiment of the present application;
FIG. 3 is a flowchart of a retrieval method of a retrieval model according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of another interactive processing method according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating another interactive processing method according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating another interactive processing method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a vehicle machine according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of another vehicle machine according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a vehicle machine according to an embodiment of the present application.
Specific embodiments of the present disclosure have been shown by way of the above drawings and will be described in more detail below. These drawings and the written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the disclosed concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure.
At present, the vehicle machine arranged in an intelligent automobile cabin carries the main functions through which a user operates vehicle-body equipment, audio-visual entertainment and the like, so the vehicle machine needs to accurately identify and respond to the user's operation instructions.
In one example, the vehicle machine may stay in a full-duplex, wake-free state so that it can respond to audio uttered by the user. However, the audio uttered in this state is not limited to operation instructions; it also includes casual conversation between driver and passengers. That is, even when the driver and passengers have no intention of interacting with the vehicle machine, the vehicle machine may still treat their casual conversation as an operation instruction and respond to it. As a result, the vehicle machine cannot accurately identify the user's operation instructions and frequently interrupts or initiates conversations, which greatly disturbs the people in the vehicle. For these reasons, the accuracy with which existing vehicle machines identify operation instructions is relatively low, and the user experience is relatively poor.
In another example, whether the user is speaking to the device may be detected by a visual signal. However, this approach is not suitable for the in-vehicle scenario and may pose a safety hazard.
The application provides an interaction processing method, a vehicle machine and a vehicle terminal, and aims to solve the technical problems in the prior art.
The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of an interaction processing method according to an embodiment of the present application, as shown in fig. 1, where the method includes:
step 101, obtaining audio data of an object.
Illustratively, the execution body of this embodiment may be a vehicle machine, i.e. an in-vehicle infotainment device deployed in a vehicle terminal. First, the audio data of the object needs to be acquired. When in a non-awakened state, the vehicle machine can receive the audio data of the object and be awakened by it; when in the awakened state, it can acquire the audio data of the object in real time.
Step 102, judging the audio data by using a preset semantic model and/or a preset retrieval model, and determining the semantic type of the audio data; the preset semantic model and the preset retrieval model are obtained through training according to multiple rounds of interaction information between the object and the vehicle machine.
The preset semantic model and the preset retrieval model are obtained through training according to multiple rounds of interaction information between the object and the vehicle machine, where the interaction information includes semantic information and event information for the object's behavior. Specifically, the semantic information includes the first audio data most recently uttered by the object, the second audio data most recently uttered by the vehicle machine, and the third audio data having the largest similarity value with the audio data among the preset rounds closest to the current moment. The event information includes an event tag and a talk event for each of the preset rounds closest to the current moment, where the event tags include: responding based on a physical key; reply broadcasting based on the object's speech; ending a broadcast early because the object pressed a physical key to interrupt it; and the vehicle machine spontaneously broadcasting when the object is not speaking, such as greeting the object or announcing the time. Since a timestamp is generated each time the object utters audio data and each time the vehicle machine responds, the data of each round can be determined according to the timestamps.
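The multi-round interaction information and its timestamp-based round splitting described above can be sketched as follows (a minimal Python illustration; the field names and tag strings are hypothetical, not taken from the patent):

```python
from dataclasses import dataclass

@dataclass
class InteractionRound:
    """One round of interaction between the object (user) and the vehicle machine."""
    timestamp: float   # generated each time the object speaks or the machine responds
    user_text: str     # what the object said in this round
    machine_text: str  # the vehicle machine's response
    event_tag: str     # e.g. "physical_key_response", "reply_broadcast",
                       # "broadcast_interrupted", "auto_broadcast" (illustrative labels)
    talk_event: str    # "to_machine" or "not_to_machine"

def last_n_rounds(history: list, n: int) -> list:
    """Return the n rounds closest to the current moment, ordered by timestamp."""
    return sorted(history, key=lambda r: r.timestamp)[-n:]
```

The timestamps make round boundaries recoverable even when rounds arrive out of order, which is why each utterance and response carries one.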
The retrieval model is used to retrieve the audio data. Specifically, a candidate utterance table is pre-deployed in the cloud; it stores the utterances that should be considered as speaking to the vehicle machine in any scenario, each of which is pre-encoded into a vector representation by a Roformer model (an open-source model). The retrieval model retrieves the audio data in the candidate utterance table and judges whether the currently acquired audio data is in the table.
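A minimal sketch of the candidate-utterance-table lookup, assuming pre-computed embedding vectors (plain float arrays stand in for the Roformer encodings; the class name and threshold are hypothetical):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class CandidateTable:
    """Candidate utterances pre-encoded into vectors; membership is decided by
    whether any stored vector is similar enough to the query vector."""
    def __init__(self, vectors: np.ndarray, threshold: float = 0.9):
        self.vectors = vectors       # one row per pre-encoded candidate utterance
        self.threshold = threshold   # hypothetical similarity cutoff

    def contains(self, query_vec: np.ndarray) -> bool:
        """True if the query matches any candidate above the threshold."""
        return any(cosine_sim(query_vec, v) >= self.threshold for v in self.vectors)
```

In the patent's setting the table lives in the cloud and the vectors come from the Roformer model; the threshold-based membership test here is only one plausible realization.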
For example, fig. 2 is a flow chart of a determination method of the semantic model according to an embodiment of the present application. As shown in fig. 2, the method includes: acquiring multiple rounds of semantic information and event information for the object's behavior; performing semantic-information encoding on the semantic information and user-behavior-sequence encoding on the event tags in the event information, to obtain a semantic information coding vector corresponding to the semantic information and a behavior information coding vector corresponding to the event information; fusing the two vectors through a preset feature fusion module; and obtaining the result of "whether the current user is speaking to the vehicle machine" through a speech classification module, so that the judgment result information can be determined.
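The Fig. 2 pipeline (encode the semantic information, encode the behavior sequence, fuse, classify) might be sketched as below; the toy encoders and linear classifier are placeholders for the trained RoBERTa-style models, not the patent's actual implementation:

```python
import numpy as np

def encode_semantic(texts: list, dim: int = 8) -> np.ndarray:
    """Stand-in semantic encoder: folds character codes into a fixed-size,
    normalized vector. Purely illustrative; a real system would use a
    trained language model here."""
    vec = np.zeros(dim)
    for t in texts:
        for i, ch in enumerate(t):
            vec[i % dim] += ord(ch)
    return vec / (np.linalg.norm(vec) or 1.0)

def encode_behavior(event_tags: list, dim: int = 8) -> np.ndarray:
    """Stand-in user-behavior-sequence encoder over the event tags."""
    return encode_semantic(event_tags, dim)

def judge_spoken_to_machine(sem_vec: np.ndarray, beh_vec: np.ndarray,
                            weights: np.ndarray, bias: float = 0.0) -> bool:
    """Feature fusion (concatenation) followed by a linear 'speaking to the
    vehicle machine' classifier, mirroring the Fig. 2 stages."""
    fused = np.concatenate([sem_vec, beh_vec])
    return float(weights @ fused + bias) > 0.0
```

The fusion step is shown as simple concatenation; the patent only names a "feature fusion module", so any learned fusion could take its place.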
Fig. 3 is a flow chart of a retrieval method of the retrieval model. As shown in fig. 3, a retrieval function is added to the model; to solve the problem of a low recall rate for new and generalized skills, the recall capability for these skills can be increased by the retrieval model. The method includes: acquiring the current user's utterance; performing similarity-model encoding to obtain the coding vector of the utterance; performing a vector recall service in a preset semantic vector library; and determining whether a preset utterance is retrieved in the preset candidate utterance table. If a preset utterance is retrieved, it is determined that the user is speaking to the vehicle machine; if not, the vehicle machine continues to acquire the current user's utterances. The retrieval result information is thereby obtained.
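The vector recall service over a preset semantic vector library can be illustrated as a top-1 cosine-similarity search (a sketch assuming dense row vectors; not the production service):

```python
import numpy as np

def vector_recall(query: np.ndarray, library: np.ndarray):
    """Return (best_index, best_cosine_similarity) of the library vector
    closest to the query, i.e. the recall step of the Fig. 3 flow."""
    q = query / np.linalg.norm(query)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    sims = lib @ q                    # cosine similarity to every library row
    i = int(np.argmax(sims))
    return i, float(sims[i])
```

A downstream threshold on the returned similarity would then decide whether a "preset utterance is retrieved" in the sense of Fig. 3.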
In this step, the vehicle machine judges the audio data by using the preset semantic model to determine the semantic type of the audio data; or judges the audio data by using the preset retrieval model to determine the semantic type; or judges the audio data by using both the preset semantic model and the preset retrieval model. In the last case, if the judgment result information obtained by the semantic model is inconsistent with the retrieval result information obtained by the retrieval model, the retrieval result information is preferentially determined as the semantic type of the audio data.
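The combination rule in this step, with the retrieval result preferred when the two models disagree, can be expressed as a small function (the result strings are illustrative):

```python
from typing import Optional

def combine_results(judgment: Optional[str], retrieval: Optional[str]) -> Optional[str]:
    """Combine the semantic model's judgment result with the retrieval model's
    result. When both ran and they disagree, the retrieval result wins; when
    only one model ran, its result is used directly."""
    if judgment is not None and retrieval is not None:
        return retrieval if judgment != retrieval else judgment
    return retrieval if retrieval is not None else judgment
```

Preferring the retrieval result on conflict is what lets command-type utterances recorded in the candidate table override a semantic-model miss.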
The judgment of the audio data is not limited to the steps indicated above; joint judgment logic based on multi-modal signals may also be used, for example a scheme that jointly models the audio data and its text, or a scheme that uses visual detection of whether the user shows lip movement, a phone-call gesture and the like as an auxiliary judgment.
Step 103, if the semantic type is determined to be a type of speaking to the vehicle machine, identifying the operation instruction corresponding to the audio data, and executing the operation instruction to complete the interaction processing between the object and the vehicle machine.
Illustratively, the vehicle machine judges the meaning of the semantic type. If the semantic type is determined to be a type of speaking to the vehicle machine, the operation instruction corresponding to the audio data is further identified, and the operation instruction is executed, thereby realizing the interaction processing between the object and the vehicle machine. If the semantic type is determined to be a type of not speaking to the vehicle machine, the current utterance is discarded, and the vehicle machine continues to listen to the user.
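The execute-or-discard behavior of this step can be sketched as a simple loop; `classify` and `execute` are hypothetical callables standing in for the models and the instruction executor:

```python
def interaction_loop(utterances, classify, execute):
    """Execute the operation instruction when the semantic type is 'speaking
    to the vehicle machine'; otherwise discard the utterance and keep
    listening for the next one."""
    results = []
    for utterance in utterances:
        if classify(utterance) == "to_machine":
            results.append(execute(utterance))
        # not spoken to the vehicle machine: discard and continue monitoring
    return results
```

In a deployed system the loop would run over a live audio stream rather than a finished list, but the control flow is the same.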
In the embodiment of the application, the audio data of the object is acquired and judged by using a preset semantic model and/or a preset retrieval model to determine its semantic type, where the two models are obtained through training according to multiple rounds of interaction information between the object and the vehicle machine. If the semantic type is determined to be a type of speaking to the vehicle machine, the operation instruction corresponding to the audio data is identified and executed to complete the interaction processing between the object and the vehicle machine. In this scheme, the audio data of the object is acquired in real time; its semantic type is determined by the preset semantic model and/or the preset retrieval model; and if the semantic type is speaking to the vehicle machine, the corresponding operation instruction is further determined and executed, thereby completing the interaction between the object and the vehicle machine. Therefore, retrieval-model judgment is introduced on the basis of semantic-model judgment: the two models jointly determine whether the audio data is an operation instruction, and the audio data is responded to only when it is. This effectively alleviates the problem that specific utterances cannot be recalled, enables the retrieval model to identify command-type instructions uttered by the user during continuous speech, ensures accurate skill experience while avoiding disturbance, and thereby solves the technical problem that the accuracy with which a vehicle machine identifies operation instructions is relatively low.
Fig. 4 is a flow chart of another interactive processing method provided in an embodiment of the present application, as shown in fig. 4, the method includes:
step 201, obtaining audio data of an object.
Illustratively, this step may refer to step 101 in fig. 1, and will not be described in detail.
Step 202, performing text conversion processing on the audio data to obtain text-converted audio data.
Illustratively, the real-time audio data of the object is processed by an acoustic front end and ASR (automatic speech recognition, i.e. audio-to-text conversion, well known in the art) to generate the text information to be judged, i.e. the text-converted audio data.
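This conversion stage might be sketched as below; `acoustic_frontend` and `asr` are hypothetical placeholders, since the patent treats both as well-known black boxes:

```python
def to_text(audio_frames, acoustic_frontend, asr) -> str:
    """Pass raw audio frames through an acoustic front end (e.g. denoising,
    echo cancellation), then through ASR, yielding the text to be judged."""
    cleaned = [acoustic_frontend(frame) for frame in audio_frames]
    return asr(cleaned).strip()
```

The returned text is what the semantic and retrieval models consume in the following steps.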
Step 203, obtaining interaction information of a preset round of which the object is nearest to the current moment.
In one example, the interaction information includes semantic information and event information for object behavior; the semantic information comprises first audio data sent by an object closest to the current moment, second audio data sent by a vehicle machine closest to the current moment and third audio data with the largest similarity value with the audio data in a preset round closest to the current moment; the event information includes an event tag for each of the preset runs closest to the current time, and a talk event for each of the preset runs closest to the current time.
Illustratively, the vehicle machine acquires the interaction information of the preset rounds of the object closest to the current moment. The interaction information includes semantic information and event information for the object's behavior. The semantic information includes the first audio data most recently uttered by the object, the second audio data most recently uttered by the vehicle machine, and the third audio data having the largest similarity value with the audio data among the preset rounds closest to the current moment. The event information includes an event tag and a talk event for each of the preset rounds closest to the current moment. For example, the event information may include the event tags and talk events of the five rounds nearest to the current moment, where the event tags include steering-wheel wake-up, vehicle-machine broadcast, the user interrupting a broadcast, and the like, and the talk event records whether the historical round was of the "speaking to the vehicle machine" type or the "not speaking to the vehicle machine" type.
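The five-round event information described above might be represented as follows (the tag and event strings are illustrative, not the patent's labels):

```python
# Event information for the five rounds nearest to the current moment:
# each round carries an event tag and a talk event.
recent_events = [
    {"event_tag": "steering_wheel_wakeup", "talk_event": "to_machine"},
    {"event_tag": "vehicle_broadcast",     "talk_event": "to_machine"},
    {"event_tag": "user_interrupts",       "talk_event": "to_machine"},
    {"event_tag": "auto_broadcast",        "talk_event": "not_to_machine"},
    {"event_tag": "reply_broadcast",       "talk_event": "to_machine"},
]

def behavior_sequence(events):
    """Flatten the event tags into the sequence fed to the behavior encoder."""
    return [e["event_tag"] for e in events]
```

This flattened tag sequence is what the user-behavior-sequence encoding of Fig. 2 would operate on.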
Step 204, judging the interaction information of the preset rounds and the audio data by using a preset semantic model to obtain judgment result information; the judgment result information represents whether the semantic type of the audio data is the type of speaking to the vehicle machine.
In one example, step 204 includes: performing coding conversion on the semantic information and the audio data according to a preset second coding model to obtain a semantic information coding vector corresponding to both the semantic information and the audio data, and performing coding conversion on the event information to obtain a behavior information coding vector corresponding to the event information; the second coding model characterizes the correspondence between the data to be converted and the coding vector; and judging the semantic information coding vector and the behavior information coding vector by using the preset semantic model to obtain the judgment result information.
Illustratively, the second coding model (namely, a RoBERTa model) characterizes the correspondence between the data to be converted and the coding vector, and is used to encode the data to be converted into the corresponding coding vector. According to the preset second coding model, the semantic information and the audio data are coded and converted to obtain the semantic information coding vector corresponding to both, and the event information is coded and converted to obtain the behavior information coding vector corresponding to the event information. The semantic information coding vector and the behavior information coding vector are then comprehensively judged by using the preset semantic model to obtain the judgment result information, which represents whether the semantic type of the audio data is the type of speaking to the vehicle machine.
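A minimal, self-contained illustration of the encode-then-judge step: the bag-of-words encoder and linear scorer below are toy stand-ins for the RoBERTa encoder and the trained semantic model, and the vocabulary, event encoding, and weights are all invented for the example.

```python
# Toy stand-in for the second coding model and the semantic model. A real
# system would use transformer embeddings and a trained classifier.

def encode(text, vocab):
    # Map text to a fixed-size count vector over a small vocabulary.
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def semantic_judge(semantic_vec, behavior_vec, weights):
    # A linear score over the concatenated coding vectors stands in for the
    # semantic model; a positive score means "spoken to the vehicle machine".
    score = sum(x * w for x, w in zip(semantic_vec + behavior_vec, weights))
    return score > 0

vocab = ["navigate", "open", "window", "weather", "maybe"]
sem = encode("open the window", vocab)
beh = [1, 0, 0, 0, 0]  # e.g. one-hot "steering-wheel wake-up" event tag
weights = [1, 1, 1, -1, -1] + [1, 0, 0, 0, 0]  # invented for illustration
print(semantic_judge(sem, beh, weights))  # → True
```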
And/or, step 205, converting the audio data into a semantic vector according to a preset first coding model, and retrieving, by using a preset retrieval model, whether the semantic vector is in a preset candidate speech table to obtain retrieval result information; the retrieval result information represents whether the semantic type of the audio data is the type of speaking to the vehicle machine, and the preset first coding model characterizes the correspondence between the audio data and the semantic vector.
Illustratively, the preset first coding model (a Roformer model) characterizes the correspondence between the audio data and the semantic vector, and is used to convert the audio data into the semantic vector. The text information to be judged is encoded into a semantic vector by the first coding model, and whether the semantic vector is in the pre-constructed candidate speech table is retrieved by way of semantic-vector similarity matching. If it is found in the candidate speech table, the retrieval result information of the "spoken to the vehicle machine" type is output directly; if not, the retrieval result information of the "not spoken to the vehicle machine" type is output, and the final decision depends on the judgment result of the semantic model.
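The retrieval path can be illustrated with a toy encoder: the character-trigram vector below is a stand-in for the Roformer semantic vector, cosine similarity implements the similarity matching, and the 0.8 threshold and table entries are assumed values.

```python
import math

# Toy sketch of the retrieval model: encode the text to a vector and match it
# against a candidate speech table by cosine similarity.

def trigram_vector(text):
    # Character-trigram counts stand in for a learned semantic vector.
    t = f"  {text.lower()}  "
    vec = {}
    for i in range(len(t) - 2):
        g = t[i:i + 3]
        vec[g] = vec.get(g, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(text, candidate_table, threshold=0.8):
    # True ("spoken to the vehicle machine") when the semantic vector of
    # `text` is close enough to any entry in the candidate speech table.
    q = trigram_vector(text)
    return any(cosine(q, trigram_vector(c)) >= threshold for c in candidate_table)

table = ["turn on the radio", "open the sunroof"]  # illustrative candidate table
print(retrieve("turn on the radio", table))  # → True
print(retrieve("what a nice day", table))    # → False
```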
Step 206, if the judgment result information is inconsistent with the retrieval result information, determining the retrieval result information as the semantic type of the audio data.
Illustratively, the vehicle machine compares the judgment result information with the retrieval result information; if it determines that the two are inconsistent, it takes the retrieval result information as the semantic type of the audio data.
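The arbitration rule of step 206 reduces to a small function: when the semantic-model judgment and the retrieval result disagree, the retrieval result is taken as final.

```python
# Step 206 as code: arbitrate between the semantic-model judgment and the
# retrieval result. Booleans mean "spoken to the vehicle machine".

def final_semantic_type(judgment_is_spoken, retrieval_is_spoken):
    # If the two results disagree, the retrieval result wins; if they agree,
    # there is nothing to arbitrate.
    if judgment_is_spoken != retrieval_is_spoken:
        return retrieval_is_spoken
    return judgment_is_spoken

print(final_semantic_type(False, True))  # → True (retrieval recalls a missed command)
print(final_semantic_type(True, True))   # → True
```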
Step 207, if the semantic type is determined to be the type of speaking to the vehicle machine, identifying the operation instruction corresponding to the audio data and executing the operation instruction, thereby completing the interaction processing between the object and the vehicle machine.
Illustratively, the vehicle machine evaluates the semantic type; if the semantic type is determined to be the type of speaking to the vehicle machine, it further identifies the operation instruction corresponding to the audio data and executes the operation instruction, thereby completing the interaction processing between the object and the vehicle machine.
And step 208, if the semantic type is determined to be the type of not speaking to the vehicle machine, deleting the audio data and acquiring new audio data.
Illustratively, if the semantic type is determined to be the type of not speaking to the vehicle machine, the audio data is deleted, i.e., the current speech is discarded; the vehicle machine continues listening to the user's speech and acquires new audio data, so that step 201 is performed again.
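Steps 201-208 together form a listen-judge-act loop, sketched below with placeholder classifier and executor callables; the command set is invented for the example.

```python
# End-to-end sketch of steps 201-208: classify each utterance, execute
# commands, and discard chit-chat before listening again.

def interaction_loop(utterances, is_spoken_to_vehicle, execute):
    executed, discarded = [], []
    for text in utterances:                 # step 201: new audio data each turn
        if is_spoken_to_vehicle(text):      # steps 203-206: decide semantic type
            execute(text)                   # step 207: run the operation instruction
            executed.append(text)
        else:
            discarded.append(text)          # step 208: delete and keep listening
    return executed, discarded

commands = {"open the window", "play music"}  # illustrative command set
done, dropped = interaction_loop(
    ["open the window", "we had lunch earlier", "play music"],
    lambda t: t in commands,   # placeholder for the semantic/retrieval decision
    lambda t: None,            # placeholder executor
)
print(done)     # → ['open the window', 'play music']
print(dropped)  # → ['we had lunch earlier']
```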
In the embodiment of the application, the audio data of the object is acquired, and text conversion processing is performed on the audio data to obtain the text-converted audio data. Interaction information of the preset rounds of the object closest to the current moment is acquired. The interaction information of the preset rounds and the audio data are judged by using a preset semantic model to obtain judgment result information, which represents whether the semantic type of the audio data is the type of speaking to the vehicle machine. The audio data is converted into a semantic vector according to a preset first coding model, and whether the semantic vector is in a preset candidate speech table is retrieved by using a preset retrieval model to obtain retrieval result information, which represents whether the semantic type of the audio data is the type of speaking to the vehicle machine; the preset first coding model characterizes the correspondence between the audio data and the semantic vector. If the judgment result information is inconsistent with the retrieval result information, the retrieval result information is determined as the semantic type of the audio data. If the semantic type is determined to be the type of speaking to the vehicle machine, the operation instruction corresponding to the audio data is identified and executed, completing the interaction processing between the object and the vehicle machine. If the semantic type is determined to be the type of not speaking to the vehicle machine, the audio data is deleted and new audio data is acquired.
Therefore, retrieval-model judgment is introduced on the basis of semantic-model judgment: whether the audio data is an operation instruction is determined through the comprehensive judgment of the semantic model and the retrieval model, and the audio data is responded to only when it is an operation instruction. This effectively alleviates the problem that specific utterances cannot be recalled; the retrieval model can effectively identify command-type instructions issued by the user during continuous speech, ensuring the accuracy of the skill experience while avoiding disturbance, and thus solving the technical problem of the relatively low accuracy of vehicle-machine recognition of operation instructions. In addition, the full set of skills is accessible in the full-duplex interaction scenario; by using the event information and the semantic information, the probability that the current speech to be judged is spoken to the vehicle machine can be accurately identified, more feature sources are obtained for judging whether the speech is addressed to the vehicle machine, and differences in skill-richness experience between full-duplex and non-full-duplex modes are avoided. The judging process does not adopt a visual scheme and therefore meets the driving-safety requirements of the vehicle-mounted scenario. The method also fully considers how the user's speech is understood in different interaction scenarios: the same sentence to be judged has different probabilities of being spoken to the vehicle machine or not in different interaction scenarios, which accords with the user's usage habits.
Fig. 5 is a schematic flow chart of another interaction processing method according to an embodiment of the present application. As shown in fig. 5, the method includes: the audio data of the user is input, and the vehicle machine continuously receives the user's speech and recognizes it as text; whether the received audio data is spoken to the vehicle machine is judged, and if so, the audio data is sent to a semantic understanding module. The semantic understanding module is used for speech understanding, i.e., determining the operation instruction corresponding to the audio data; the operation instruction is then sent in turn to a dialogue management module, an action execution module, and a voice broadcast module, where the action execution module executes the operation instruction and the voice broadcast module broadcasts the response content of the vehicle machine. Meanwhile, the semantic understanding module intercepts utterances of the "not spoken to the vehicle machine" type, ensuring the correctness of instruction execution while ensuring that the user is not disturbed.
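The Fig. 5 module chain can be sketched as a linear pipeline; the module behaviour below is illustrative only, and the trace of (module, payload) pairs merely shows the order in which the modules fire.

```python
# Sketch of the Fig. 5 pipeline: once an utterance is judged as spoken to the
# vehicle machine, it flows through semantic understanding, dialogue
# management, action execution, and voice broadcast in order.

def pipeline(text, spoken_to_vehicle):
    trace = []
    if not spoken_to_vehicle:
        return trace  # intercepted: the user is not disturbed
    trace.append(("semantic_understanding", text))
    trace.append(("dialogue_management", text))
    trace.append(("action_execution", text))
    trace.append(("voice_broadcast", f"done: {text}"))
    return trace

print(len(pipeline("open the sunroof", True)))  # → 4
print(pipeline("just chatting", False))         # → []
```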
Fig. 6 is a schematic flow chart of another interaction processing method according to an embodiment of the present application. The "semantic understanding module", "dialogue management module", "action execution module", and "voice broadcast module" in fig. 5 are all general knowledge in the industry and are not the focus of the present application; the present application focuses on the "judge whether spoken to the vehicle machine" module, which mainly includes a semantic module and a retrieval module, where the semantic model in the semantic module is shown in fig. 2 and the retrieval model in the retrieval module is shown in fig. 3. As shown in fig. 6, the "judge whether spoken to the vehicle machine" module works as follows: the user's speech to be judged (namely, the audio data) is judged by using the semantic model and the retrieval model to decide whether it is spoken to the vehicle machine; if so, the speech is transmitted to the semantic understanding module; if not, the current speech is discarded and the flow returns to the audio receiving portion.
Fig. 7 is a schematic structural diagram of a vehicle machine according to an embodiment of the present application, as shown in fig. 7, the vehicle machine includes:
an acquisition unit 31 for acquiring audio data of the object.
A first determining unit 32, configured to judge the audio data by using a preset semantic model and/or a preset retrieval model to determine the semantic type of the audio data; the preset semantic model and the retrieval model are obtained through training according to multiple rounds of interaction information between the object and the vehicle machine.
The second determining unit 33 is configured to identify the operation instruction corresponding to the audio data if the semantic type is determined to be the type of speaking to the vehicle machine.
And the execution unit 34 is used for executing the operation instruction to complete the interaction processing between the object and the vehicle machine.
The device of the embodiment may execute the technical scheme in the above method, and the specific implementation process and the technical principle are the same and are not described herein again.
Fig. 8 is a schematic structural diagram of another vehicle machine according to an embodiment of the present application, and on the basis of the embodiment shown in fig. 7, as shown in fig. 8, the first determining unit 32 includes:
the obtaining module 321 is configured to obtain interaction information of a preset round of time that the object is closest to the current time.
The judging module 322 is configured to judge the interaction information of the preset rounds and the audio data by using the preset semantic model to obtain judgment result information; the judgment result information represents whether the semantic type of the audio data is the type of speaking to the vehicle machine.
And/or, a retrieving module 323, configured to convert the audio data into a semantic vector according to a preset first coding model, and retrieve, by using a preset retrieval model, whether the semantic vector is in a preset candidate speech table to obtain retrieval result information; the retrieval result information represents whether the semantic type of the audio data is the type of speaking to the vehicle machine, and the preset first coding model characterizes the correspondence between the audio data and the semantic vector.
The determining module 324 is configured to determine that the search result information is a semantic type of the audio data if the determination result information is inconsistent with the search result information.
In one example, the interaction information includes semantic information and event information for object behavior; the semantic information includes the first audio data sent by the object closest to the current moment, the second audio data sent by the vehicle machine closest to the current moment, and the third audio data having the largest similarity value with the audio data among the preset rounds closest to the current moment; the event information includes an event tag of each of the preset rounds closest to the current moment and a speech event of each of the preset rounds closest to the current moment.
In one example, the determination module 322 includes:
the coding sub-module 3221 is configured to perform coding conversion on the semantic information and the audio data according to a preset second coding model to obtain a semantic information coding vector corresponding to both the semantic information and the audio data, and to perform coding conversion on the event information to obtain a behavior information coding vector corresponding to the event information; the second coding model characterizes the correspondence between the data to be converted and the coding vector.
The judging sub-module 3222 is configured to judge the semantic information coding vector and the behavior information coding vector by using the preset semantic model to obtain the judgment result information.
In one example, the vehicle further comprises:
the conversion unit 41 is configured to perform text conversion processing on the audio data after the audio data is acquired, and obtain text-converted audio data.
The first determining unit 32 is specifically configured to:
judging the text-converted audio data by using a preset semantic model and/or a preset retrieval model, and determining the semantic type of the audio data.
In one example, the vehicle further comprises:
and the deleting unit 42, configured to delete the audio data and acquire new audio data if the semantic type is determined to be the type of not speaking to the vehicle machine.
The device of the embodiment may execute the technical scheme in the above method, and the specific implementation process and the technical principle are the same and are not described herein again.
On the basis of the above embodiments, the embodiment of the present application provides a vehicle terminal, in which a vehicle machine is provided, the vehicle machine being the vehicle machine described in the above embodiments.
Fig. 9 is a schematic structural diagram of a vehicle machine according to an embodiment of the present application, where, as shown in fig. 9, the vehicle machine includes: a memory 51, and a processor 52.
The memory 51 stores a computer program executable on the processor 52.
The processor 52 is configured to perform the method as provided by the above-described embodiments.
The vehicle further comprises a receiver 53 and a transmitter 54. The receiver 53 is for receiving instructions and data transmitted from an external device, and the transmitter 54 is for transmitting instructions and data to the external device.
The embodiments of the present application also provide a non-transitory computer-readable storage medium storing instructions which, when executed by a processor of a vehicle machine, enable the vehicle machine to perform the method provided by the above embodiments.
The embodiment of the application also provides a computer program product, including a computer program stored in a readable storage medium, from which at least one processor of the vehicle machine can read the computer program; the at least one processor executes the computer program to cause the vehicle machine to perform the solution provided in any one of the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (14)

1. An interactive processing method, comprising:
acquiring audio data of an object;
judging the audio data by using a preset semantic model and/or a preset retrieval model, and determining the semantic type of the audio data; the preset semantic model and the preset retrieval model are obtained through training according to multiple rounds of interaction information between the object and the vehicle machine;
and if the semantic type is determined to be the type of speaking to the vehicle machine, identifying an operation instruction corresponding to the audio data, and executing the operation instruction to complete the interaction processing between the object and the vehicle machine.
2. The method according to claim 1, wherein the determining the semantic type of the audio data by determining the audio data using a preset semantic model and/or a preset retrieval model comprises:
acquiring interaction information of a preset round of the object closest to the current moment, and judging the interaction information of the preset round and the audio data by utilizing a preset semantic model to obtain judgment result information; the judgment result information represents whether the semantic type of the audio data is a type of speaking into a vehicle machine;
and/or,
converting the audio data into semantic vectors according to a preset first coding model, and searching whether the semantic vectors are in a preset candidate speech table or not by using a preset searching model to obtain searching result information; the retrieval result information represents whether the semantic type of the audio data is a type of speaking to a vehicle, and the preset first coding model represents the corresponding relation between the audio data and the semantic vector;
and if the judgment result information is inconsistent with the retrieval result information, determining the retrieval result information as the semantic type of the audio data.
3. The method of claim 2, wherein the interaction information includes semantic information and event information for object behavior; wherein,
the semantic information comprises first audio data sent by an object closest to the current moment, second audio data sent by a vehicle machine closest to the current moment and third audio data with the largest similarity value with the audio data in a preset round closest to the current moment; the event information comprises an event tag of each of the preset turns closest to the current moment and a conversation event of each of the preset turns closest to the current moment.
4. The method of claim 3, wherein the determining the interaction information of the preset round and the audio data by using a preset semantic model to obtain determination result information includes:
according to a preset second coding model, coding and converting the semantic information and the audio data to obtain semantic information coding vectors corresponding to the semantic information and the audio data, and coding and converting the event information to obtain behavior information coding vectors corresponding to the event information; the second coding model characterizes the corresponding relation between the data to be converted and the coding vector;
and judging the semantic information coding vector and the behavior information coding vector by using a preset semantic model to obtain judgment result information.
5. The method according to any one of claims 1-4, further comprising:
if the semantic type is determined to be the type which is not speaking to the car machine, deleting the audio data and acquiring new audio data.
6. A vehicle machine, comprising:
an acquisition unit configured to acquire audio data of an object;
the first determining unit is used for judging the audio data by utilizing a preset semantic model and/or a preset retrieval model and determining the semantic type of the audio data; the preset semantic model and the search model are obtained through training according to multiple rounds of interaction information between the object and the vehicle;
the second determining unit is used for identifying an operation instruction corresponding to the audio data if the semantic type is determined to be the type of speaking into the vehicle-to-machine;
and the execution unit is used for executing the operation instruction to complete the interaction processing between the object and the vehicle.
7. The vehicle according to claim 6, characterized in that the first determining unit includes:
The acquisition module is used for acquiring interaction information of the preset round of which the object is nearest to the current moment;
the judging module is used for judging the interaction information of the preset rounds and the audio data by utilizing a preset semantic model to obtain judging result information; the judgment result information represents whether the semantic type of the audio data is a type of speaking into a vehicle machine;
and/or a retrieval module, which is used for converting the audio data into semantic vectors according to a preset first coding model, and retrieving whether the semantic vectors are in a preset candidate speech table or not by utilizing a preset retrieval model to obtain retrieval result information; the retrieval result information represents whether the semantic type of the audio data is a type of speaking to a vehicle, and the preset first coding model represents the corresponding relation between the audio data and the semantic vector;
and the determining module is used for determining that the search result information is the semantic type of the audio data if the judgment result information is inconsistent with the search result information.
8. The vehicle of claim 7, wherein the interaction information includes semantic information and event information for object behavior; wherein,
The semantic information comprises first audio data sent by an object closest to the current moment, second audio data sent by a vehicle machine closest to the current moment and third audio data with the largest similarity value with the audio data in a preset round closest to the current moment; the event information comprises an event tag of each of the preset turns closest to the current moment and a conversation event of each of the preset turns closest to the current moment.
9. The vehicle of claim 8, wherein the determination module comprises:
the coding sub-module is used for carrying out coding conversion on the semantic information and the audio data according to a preset second coding model to obtain semantic information coding vectors corresponding to the semantic information and the audio data, and carrying out coding conversion on the event information to obtain behavior information coding vectors corresponding to the event information; the second coding model characterizes the corresponding relation between the data to be converted and the coding vector;
and the judging sub-module is used for judging the semantic information coding vector and the behavior information coding vector by using a preset semantic model to obtain judgment result information.
10. The vehicle according to any one of claims 6-9, characterized in that the vehicle further comprises:
and the deleting unit is used for deleting the audio data and acquiring new audio data if the semantic type is determined to be the type of not speaking to the vehicle machine.
11. A vehicle terminal, characterized in that a vehicle machine is provided in the vehicle terminal, the vehicle machine being as claimed in any one of claims 6-10.
12. A vehicle machine comprising a memory and a processor, the memory having stored thereon a computer program executable on the processor, the processor implementing the method of any one of claims 1-5 when executing the computer program.
13. A computer readable storage medium having stored therein computer executable instructions which when executed by a processor are adapted to carry out the method of any one of claims 1-5.
14. A computer program product comprising a computer program which, when executed by a processor, implements the method of any of claims 1-5.
CN202310821796.6A 2023-07-05 2023-07-05 Interactive processing method, vehicle machine and vehicle terminal Pending CN117133288A (en)


Publication: CN117133288A, published 2023-11-28.


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination