CN117877470A - Voice association method, device, equipment and storage medium


Info

Publication number: CN117877470A
Application number: CN202311740472.6A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: voice, intention, determining, analysis result, association rule
Inventor: 周文欢
Current and Original Assignee: Apollo Zhilian Beijing Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

The disclosure provides a voice association method, device, equipment and storage medium, relates to the field of voice processing, in particular to technical fields such as voice recognition and voice interaction, and can be applied to multi-voice-zone interaction scenarios. The specific implementation scheme includes the following steps: receiving a first voice from a first voice zone; obtaining a first analysis result of the first voice according to the first voice; acquiring a recognition text corresponding to a second voice from a second voice zone and a second analysis result of the second voice; determining a third voice from the second voice according to the intention of the first voice in the first analysis result and an association rule corresponding to the second voice; and determining a target control instruction according to the first voice and the third voice. According to the disclosure, the current voice can be accurately associated with the historical voices of different voice zones in an offline state, and user experience is improved.

Description

Voice association method, device, equipment and storage medium
Technical Field
The disclosure relates to the field of voice processing, in particular to technical fields such as voice recognition and voice interaction, can be applied to multi-voice-zone interaction scenarios, and in particular relates to a voice association method, device, equipment and storage medium.
Background
An in-vehicle voice assistant needs a large amount of online computing resources in the cloud to connect a newly received voice, by context, with the historical voices of other voice zones in order to understand the user's intended goal behind the newly received voice. However, when the vehicle-mounted voice assistant cannot communicate with the cloud, local computing resources are insufficient to connect the current voice with the historical voices of other voice zones by context, and the user's intended goal cannot be completed.
Currently, a vehicle-mounted voice assistant determines whether the current voice is associated with the historical voices of other voice zones by means of a whitelist, so that the vehicle is controlled according to the current voice and the associated historical voice to complete the user's intended goal.
However, the whitelist approach scales poorly, so the vehicle cannot accurately complete the goal the user expects, and the user experience is poor.
Disclosure of Invention
The present disclosure provides a voice association method, device, equipment and storage medium, which can accurately associate the current voice with the historical voices of different voice zones in an offline state and improve user experience.
According to a first aspect of the present disclosure, there is provided a voice association method, including:
receiving a first voice from a first voice zone; obtaining a first analysis result of the first voice according to the first voice, where the first analysis result is used to indicate the intention of the first voice; acquiring a recognition text corresponding to a second voice from a second voice zone and a second analysis result of the second voice, where the second voice is a historical voice and the second analysis result is used to indicate the control category and the intention of the second voice; determining a third voice from the second voice according to the intention of the first voice in the first analysis result and an association rule corresponding to the second voice, where the association rule corresponding to the second voice is determined according to the control category in the second analysis result of the second voice, and the third voice is a second voice whose corresponding association rule includes an association rule that the intention of the first voice can satisfy; and determining a target control instruction according to the first voice and the third voice.
According to a second aspect of the present disclosure, there is provided a voice association apparatus, the apparatus comprising:
a receiving module, configured to receive a first voice from a first voice zone;
an analysis module, configured to obtain a first analysis result of the first voice according to the first voice, where the first analysis result is used to indicate the intention of the first voice;
an acquisition module, configured to acquire a recognition text corresponding to a second voice from a second voice zone and a second analysis result of the second voice, where the second voice is a historical voice and the second analysis result is used to indicate the control category and the intention of the second voice;
a determining module, configured to determine a third voice from the second voice according to the intention of the first voice in the first analysis result and an association rule corresponding to the second voice, where the association rule corresponding to the second voice is determined according to the control category in the second analysis result of the second voice, and the third voice is a second voice whose corresponding association rule includes an association rule that the intention of the first voice can satisfy;
and a control module, configured to determine a target control instruction according to the first voice and the third voice.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as in the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method according to the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to the first aspect.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flow chart of a voice association method according to an embodiment of the disclosure;
fig. 2 is another flow chart of a voice association method according to an embodiment of the disclosure;
fig. 3 is a further schematic flow chart of a voice association method according to an embodiment of the disclosure;
fig. 4 is a schematic diagram of a voice association apparatus according to an embodiment of the disclosure;
fig. 5 is a schematic diagram of the composition of an electronic device according to an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be appreciated that in embodiments of the present disclosure, the character "/" generally indicates that the objects associated before and after it are in an "or" relationship. The terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated.
An in-vehicle voice assistant needs a large amount of online computing resources in the cloud to connect a newly received voice, by context, with the historical voices of other voice zones in order to understand the user's intended goal behind the newly received voice. However, when the vehicle-mounted voice assistant cannot communicate with the cloud, local computing resources are insufficient to connect the current voice with the historical voices of other voice zones by context, and the user's intended goal cannot be completed.
Currently, a vehicle-mounted voice assistant determines whether the current voice is associated with the historical voices of other voice zones by means of a whitelist, so that the vehicle is controlled according to the current voice and the associated historical voice to complete the user's intended goal.
For example, suppose the historical voice is "open the window" from the secondary driving voice zone and the preset whitelist includes "open it wider"; when the newly received voice from the primary driving voice zone is "a bit wider", a phrasing not on the whitelist, "open the window" and "a bit wider" cannot be associated, that is, the vehicle cannot lower the window further.
For another example, suppose the historical voice is "open the window" from the primary driving voice zone and the preset whitelist includes "open it a little"; when the received voice from the secondary driving voice zone is "me too", the "open the window" of the primary driving user cannot be associated with the "me too" of the secondary driving user, that is, the vehicle cannot open the window at the secondary driving position.
However, the whitelist approach scales poorly, so the vehicle cannot accurately complete the goal the user expects, and the user experience is poor.
Because different users have different language habits and ways of speaking, a preset whitelist cannot cover all possible voice texts, so the historical voice associated with the current voice cannot be accurately determined, the user's intended goal cannot be accurately completed, and the user experience is poor.
Against this background, the present disclosure provides a voice association method, which can accurately associate the current voice with the historical voices of different voice zones in an offline state and improve user experience.
The execution body of the voice association method provided by the embodiments of the disclosure may be a computer or a server, or another electronic device with data processing capability; alternatively, it may be a processor (e.g., a central processing unit (CPU)) in the above electronic device; alternatively, it may be an application (APP) installed in the electronic device and capable of implementing the function of the method; alternatively, it may be a functional module, unit, or the like in the electronic device that has the function of the method. The execution body of the method is not limited herein.
For example, in some implementations, the voice association method provided by the embodiments of the present disclosure may be applied to a vehicle-mounted terminal.
The voice association method is exemplarily described below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a voice association method according to an embodiment of the disclosure. As shown in fig. 1, the method may include:
s101, receiving first voice from a first voice zone.
Illustratively, taking a vehicle as the application scenario, a first voice from a user inside the vehicle may be received through an audio input device on the vehicle.
For example, taking a vehicle as the application scenario, the first voice zone may be any voice zone on the vehicle, which is not limited here. For example, the first voice zone may be any one of the primary driving voice zone, the secondary driving voice zone, the left rear voice zone, and the right rear voice zone.
S102, obtaining a first analysis result of the first voice according to the first voice, wherein the first analysis result is used for indicating the intention of the first voice.
For example, after the first voice from the first voice zone is received, voice recognition may be performed on the first voice to obtain the recognition text corresponding to the first voice, and natural language processing may then be performed on that recognition text to obtain the first analysis result of the first voice.
Illustratively, the control categories may include, without limitation, a vehicle control category, a media control category, a system control category, a navigation category, an encyclopedia category, and the like.
For example, taking the recognition text corresponding to the first voice being "open the window" as an example, the control category of the first voice is the vehicle control category, and the intention of the first voice is to control the window to open.
For example, the first analysis result of the first voice may include the control category of the first voice, or the control category may be absent from it; this is not limited here.
For example, taking the recognition text corresponding to the first voice being "zoom in a bit" as an example, the control category of "zoom in a bit" cannot be accurately obtained through natural language processing, and only the intention of the first voice, namely controlling some object to be enlarged, is obtained.
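As a minimal illustration of the analysis result in S102, the sketch below models it as a small data structure with an optional control category. It is a sketch under stated assumptions: the `parse` helper, the intent names, and the category labels are hypothetical placeholders, not the recognizer or natural-language-processing component actually used.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ParseResult:
    control_category: Optional[str]  # e.g. "vehicle_control"; None when it cannot be resolved
    intent: str                      # e.g. "open_window", "copy", "incomplete_control"

def parse(recognized_text: str) -> ParseResult:
    """Toy stand-in for the natural-language-processing step (all labels assumed)."""
    if recognized_text == "open the window":
        return ParseResult("vehicle_control", "open_window")
    if recognized_text == "me too":
        return ParseResult(None, "copy")                # copy intent, no category
    if recognized_text == "zoom in a bit":
        return ParseResult(None, "incomplete_control")  # category cannot be resolved
    return ParseResult(None, "unknown")

print(parse("zoom in a bit"))
# ParseResult(control_category=None, intent='incomplete_control')
```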
S103, acquiring a recognition text corresponding to the second voice from the second voice zone and a second analysis result of the second voice.
The second voice is a historical voice, and the second analysis result is used for indicating the control category and the intention of the second voice.
For example, taking a vehicle as the application scenario, the second voice zone may be any voice zone on the vehicle other than the first voice zone. For example, taking the primary driving voice zone as the first voice zone, the secondary driving voice zone, the left rear voice zone, and the right rear voice zone are all second voice zones.
For example, the recognition text corresponding to the second voice and the second analysis result of the second voice may both be stored in a storage medium of the vehicle-mounted terminal, from which the vehicle-mounted terminal reads them directly when executing S103. Alternatively, only the recognition text corresponding to the second voice is stored in the storage medium; after obtaining the recognition text from the storage medium, the vehicle-mounted terminal performs natural language processing on it to obtain the second analysis result of the second voice.
For example, the second voice is any voice received before the first voice. For example, when the first voice is received at 20:00:00, all voices received before 20:00:00 are second voices.
S104, determining a third voice from the second voice according to the intention of the first voice in the first analysis result and the association rule corresponding to the second voice.
The association rule corresponding to the second voice is determined according to the control category in the second analysis result of the second voice, and the third voice is a second voice whose corresponding association rule includes an association rule that the intention of the first voice can satisfy.
For example, the association rule corresponding to the second voice may be determined according to the control category of the second voice in the second analysis result and a preset correspondence between control categories and association rules; the association rule that the intention of the first voice can satisfy may be determined according to that intention; and a second voice whose corresponding association rule includes the association rule that the intention of the first voice can satisfy may be determined as the third voice.
For example, the association rules may include a copy rule, an inheritance rule, and a rest rule. Suppose the second voice includes "open the window" from the primary driving voice zone and "play a song" from the left rear voice zone; the control category of the former is the vehicle control category and that of the latter is the media control category, the association rules associated with the vehicle control category include the copy rule, the inheritance rule, and the rest rule, and the association rules associated with the media control category include the inheritance rule and the rest rule. When the first voice received from the secondary driving voice zone is "me too", the intention of the first voice is a copy intention and can satisfy the copy rule of the vehicle control category, so "open the window" of the primary driving voice zone is determined as the third voice.
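To make this matching concrete, here is a minimal sketch of S104. The category-to-rule table, the intent-to-rule table, and all labels are illustrative assumptions rather than the actual rule tables of the disclosure.

```python
# Assumed mapping from a control category to its association rules.
CATEGORY_RULES = {
    "vehicle_control": {"copy", "inherit", "rest"},
    "media_control":   {"inherit", "rest"},
}

# Assumed mapping from an intention to the rule that intention satisfies.
INTENT_TO_RULE = {
    "copy": "copy",                  # e.g. "me too"
    "incomplete_control": "inherit", # e.g. "open it a bit"
    "stop": "rest",                  # e.g. "cancel"
}

def find_third_voices(first_intent: str, second_voices: list) -> list:
    """Return the second (historical) voices whose rules the first voice's intention satisfies."""
    target_rule = INTENT_TO_RULE.get(first_intent)
    return [v for v in second_voices
            if target_rule in CATEGORY_RULES.get(v["control_category"], set())]

history = [
    {"text": "open the window", "zone": "driver",    "control_category": "vehicle_control"},
    {"text": "play a song",     "zone": "left_rear", "control_category": "media_control"},
]
print(find_third_voices("copy", history))  # only "open the window" satisfies the copy rule
```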
For example, after the third voice is determined, S105 may be continued.
S105, determining a target control instruction according to the first voice and the third voice.
For example, the recognition text of the first voice and the recognition text of the third voice may be combined, natural language processing may be performed on the combined voice text to obtain an analysis result of the combined text, and a target control instruction may be generated according to the intention of the combined voice text and the association rule satisfied by the first voice, so as to control software or hardware and complete the user's intended goal.
For example, taking the first voice being "me too" from the secondary driving voice zone and the third voice being "open the window" from the primary driving voice zone as an example, the combined voice text of the first voice and the third voice is "open the window, me too". According to the intention of "open the window, me too" and the copy rule satisfied by the first voice, a target control instruction for opening the window at the secondary driving position is generated, so that the vehicle is controlled to open the window at the secondary driving position and the intended goal of the user at the secondary driving position is completed.
For another example, taking the first voice being "open it a bit" from the secondary driving voice zone and the third voice being "open the window" from the primary driving voice zone as an example, the combined voice text is "open the window, open it a bit". According to the intention of "open the window, open it a bit" and the inheritance rule satisfied by the first voice, a target control instruction for lowering the window at the primary driving position a little further is generated, so that the vehicle is controlled accordingly and the intended goal of the user at the secondary driving position is completed.
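The two examples above differ only in which voice zone the instruction targets. The sketch below captures that zone-selection step of S105; the policy shown (a copy intention re-executes in the speaker's zone, an inheritance intention refines the original zone) is an assumed reading of the examples, and the field names are hypothetical.

```python
def build_target_instruction(first: dict, third: dict, satisfied_rule: str) -> dict:
    """Merge the two recognized texts and pick the zone the instruction applies to."""
    combined_text = f"{third['text']}, {first['text']}"
    # Assumed policy: a copy intention acts in the speaker's own zone,
    # while an inheritance intention refines the action in the original zone.
    target_zone = first["zone"] if satisfied_rule == "copy" else third["zone"]
    return {"utterance": combined_text, "zone": target_zone}

first = {"text": "me too", "zone": "front_passenger"}
third = {"text": "open the window", "zone": "driver"}
print(build_target_instruction(first, third, "copy"))
# {'utterance': 'open the window, me too', 'zone': 'front_passenger'}
```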
In the embodiments of the disclosure, a first voice from a first voice zone is received; a first analysis result of the first voice is obtained according to the first voice; the recognition text corresponding to a second voice from a second voice zone and a second analysis result of the second voice are acquired; a third voice is determined from the second voice according to the intention of the first voice in the first analysis result and the association rule corresponding to the second voice; and a target control instruction is determined according to the first voice and the third voice. Because no semantic context linking is required, the current voice can be accurately associated with the historical voices of different voice zones in an offline state, and user experience is improved.
Fig. 2 is another flow chart of a voice association method according to an embodiment of the disclosure. As shown in fig. 2, according to the intention of the first voice in the first analysis result and the association rule corresponding to the second voice, determining the third voice from the second voice may include:
s201, determining an association rule corresponding to the second voice according to the second analysis result and a preset corresponding relation.
The preset corresponding relation is used for indicating the corresponding relation between the control category of the second voice and the association rule.
The preset correspondence may be a mapping table of a control class of the second voice and an association rule, and the association rule corresponding to the second voice may be determined by looking up a table according to the control class of the second voice in the second analysis result.
S202, determining, according to the intention of the first voice, a target association rule that the intention of the first voice can satisfy.
For example, the association rules may include a copy rule, an inheritance rule, and a rest rule. The intention that satisfies the copy rule is a copy intention, the intention that satisfies the inheritance rule is an incomplete control intention, and the intention that satisfies the rest rule is a stop intention.
For example, when the first voice is "me too", its intention is a copy intention and can satisfy the copy rule; when the first voice is "open it a bit" or "turn it down a bit", its intention is an incomplete control intention and can satisfy the inheritance rule; and when the first voice is "stop" or "cancel", its intention is a stop intention and can satisfy the rest rule.
S203, determining, according to the association rule corresponding to the second voice, a second voice whose corresponding association rule includes the target association rule as the third voice.
For example, suppose the second voice includes "open the window" from the primary driving voice zone and "play a song" from the left rear voice zone, and the first voice is "me too" from the secondary driving voice zone. The intention of the first voice is a copy intention and can satisfy the copy rule, and the association rule corresponding to "open the window" of the primary driving voice zone includes the copy rule, so "open the window" of the primary driving voice zone is determined as the third voice.
In this embodiment, the association rule corresponding to the second voice is determined according to the second analysis result and the preset correspondence; the target association rule that the intention of the first voice can satisfy is determined according to that intention; and a second voice whose corresponding association rule includes the target association rule is determined as the third voice. The third voice can thus be accurately determined from the second voice.
Fig. 3 is a schematic flow chart of a voice association method according to an embodiment of the disclosure. As shown in fig. 3, when the third voice includes a plurality of voices, determining a target control instruction according to the first voice and the third voice includes:
s301, generating a reply text according to a preset reply template, the first voice and the third voice.
For example, when a plurality of second voices have corresponding association rules that the intention of the first voice can satisfy, a plurality of third voices are determined. For example, suppose the second voice includes "raise the air-conditioning temperature" from the primary driving voice zone and "raise the music volume" from the left rear voice zone, and the first voice is "a bit more" from the secondary driving voice zone. The intention of the first voice is an incomplete control intention and can satisfy the inheritance rule, and the association rules corresponding to both "raise the air-conditioning temperature" and "raise the music volume" include the inheritance rule, so both are third voices.
Illustratively, taking the case in which the third voice includes two voices as an example, the preset reply template may be "Do you mean 'AAA' or 'BBB'?", where AAA may be the combined text of the recognition text corresponding to the first voice and the recognition text corresponding to one third voice, and BBB may be the combined text of the recognition text corresponding to the first voice and the recognition text corresponding to the other third voice.
For example, taking the third voice including "raise the air-conditioning temperature" and "raise the music volume" and the first voice being "a bit more" as an example, the reply text may be "Do you mean 'raise the air-conditioning temperature, a bit more' or 'raise the music volume, a bit more'?".
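A minimal sketch of S301 under the two-slot template above; the template wording and function names are assumptions, and the sketch generalizes to any number of candidate third voices.

```python
def build_reply(first_text: str, third_texts: list) -> str:
    """Fill the assumed reply template with one combined text per candidate third voice."""
    options = [f"'{text}, {first_text}'" for text in third_texts]
    return f"Do you mean {' or '.join(options)}?"

print(build_reply("a bit more",
                  ["raise the air-conditioning temperature", "raise the music volume"]))
# Do you mean 'raise the air-conditioning temperature, a bit more'
# or 'raise the music volume, a bit more'?
```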
S302, fourth voice from the first voice zone is received.
For example, the specific embodiment of S302 may refer to the specific embodiment of S101, which is not described herein.
S303, obtaining a fourth analysis result of the fourth voice according to the fourth voice.
The fourth analysis result is used for indicating the intention of the fourth voice.
For example, the specific embodiment of S303 may refer to the specific embodiment of S102, which is not described herein.
S304, determining the target voice from the third voice according to the similarity between the intention of the fourth voice and the intention of the third voice.
For example, the similarity between the intention of the fourth voice and the intention of the third voice may be determined by calculating the Euclidean distance between the feature vector corresponding to the intention of the fourth voice and the feature vector corresponding to the intention of the third voice.
For example, the third voice whose intention has the greatest similarity to the intention of the fourth voice, with that similarity greater than a preset similarity threshold, may be determined as the target voice. The value of the preset similarity threshold is not limited here.
For example, taking the third voice including "raise the air-conditioning temperature" and "raise the music volume" and the fourth voice being "a higher temperature" as an example, the intention of the fourth voice is to raise the temperature; its similarity to the intention of "raise the air-conditioning temperature" is the greatest, so "raise the air-conditioning temperature" is determined as the target voice.
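The sketch below illustrates S304. The source only states that similarity is derived from the Euclidean distance between intention feature vectors; the 1/(1 + distance) conversion, the threshold value, and the toy vectors are assumptions made for the sake of a runnable example.

```python
import math

def euclidean(a: list, b: list) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pick_target(fourth_vec: list, third_voices: list, sim_threshold: float = 0.5):
    """Return the third voice most similar to the fourth voice's intention,
    or None if no similarity exceeds the preset threshold."""
    best, best_sim = None, -1.0
    for voice in third_voices:
        sim = 1.0 / (1.0 + euclidean(fourth_vec, voice["intent_vec"]))  # assumed mapping
        if sim > best_sim:
            best, best_sim = voice, sim
    return best if best_sim > sim_threshold else None

candidates = [
    {"text": "raise the air-conditioning temperature", "intent_vec": [1.0, 0.1]},
    {"text": "raise the music volume",                 "intent_vec": [0.1, 1.0]},
]
print(pick_target([0.9, 0.2], candidates)["text"])  # raise the air-conditioning temperature
```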
S305, determining a target control instruction according to the first voice and the target voice.
For example, the specific embodiment of S305 may refer to the specific embodiment of S105, which is not described herein.
In this embodiment, a reply text is generated according to the preset reply template, the first voice, and the third voice; a fourth voice from the first voice zone is received; a fourth analysis result of the fourth voice is obtained according to the fourth voice; a target voice is determined from the third voices according to the similarity between the intention of the fourth voice and the intentions of the third voices; and the target control instruction is determined according to the first voice and the target voice. When the third voice includes a plurality of voices, the target voice most relevant to the first voice can be accurately determined, the current voice is accurately associated with the historical voices of different voice zones, and user experience is improved.
In some possible embodiments, after obtaining the second voice of the second voice zone and the second analysis result of the second voice, the method further includes:
A fifth voice is determined from the second voices according to the receiving moment of the first voice and a preset time threshold.
The fifth voice is a second voice for which the time elapsed since it was received satisfies the preset time threshold.
Illustratively, the preset time threshold may be 30 seconds or 60 seconds, and the size of the preset time threshold is not limited.
For example, taking a preset time threshold of 60 seconds as an example, suppose the first voice is received at 20:01:00 and the second voices include voice A received at 20:00:33, voice B received at 20:00:15, and voice C received at 19:59:45. Voice A was received 27 seconds ago, voice B 45 seconds ago, and voice C 75 seconds ago, so voice A and voice B are determined as fifth voices.
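A minimal sketch of this time screen, reproducing the numbers from the example above; the timestamp representation is an assumption.

```python
from datetime import datetime, timedelta

def filter_fifth_voices(first_received_at: datetime, second_voices: list,
                        threshold_s: int = 60) -> list:
    """Keep only the historical voices received within the preset time threshold."""
    limit = timedelta(seconds=threshold_s)
    return [v for v in second_voices if first_received_at - v["received_at"] <= limit]

t = datetime(2023, 12, 1, 20, 1, 0)
history = [
    {"text": "voice A", "received_at": datetime(2023, 12, 1, 20, 0, 33)},   # 27 s ago
    {"text": "voice B", "received_at": datetime(2023, 12, 1, 20, 0, 15)},   # 45 s ago
    {"text": "voice C", "received_at": datetime(2023, 12, 1, 19, 59, 45)},  # 75 s ago
]
print([v["text"] for v in filter_fifth_voices(t, history)])  # ['voice A', 'voice B']
```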
According to the intention of the first voice in the first analysis result and the association rule corresponding to the second voice, determining the third voice from the second voice may include:
determining a third voice from the fifth voice according to the intention of the first voice in the first analysis result and the association rule corresponding to the fifth voice.
For example, reference may be made to the specific embodiment of S104, which is not described herein.
In this embodiment, the fifth voice is determined from the second voices according to the receiving moment of the first voice and the preset time threshold, so the fifth voices satisfying the preset time threshold can be screened out of the second voices, which improves the efficiency of determining the third voice.
In some possible embodiments, the recognition text corresponding to the historical voice of the first voice zone is stored in a first array container, the recognition text corresponding to the second voice from the second voice zone is stored in a second array container, and the first array container and the second array container are set as single instances.
Illustratively, the second voice zone may include a plurality of voice zones; when it does, there are also a plurality of second array containers, in one-to-one correspondence with the plurality of voice zones. For example, taking the second voice zone including the secondary driving voice zone, the left rear voice zone, and the right rear voice zone as an example, the second array containers may include A, B, and C, where the second voice of the secondary driving voice zone is stored in array container A, the second voice of the left rear voice zone is stored in array container B, and the second voice of the right rear voice zone is stored in array container C.
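One plausible reading of "set as single instances" is the singleton pattern: a single shared registry of per-zone containers that every component sees. The sketch below assumes that reading; the class and method names are hypothetical.

```python
class ZoneHistory:
    """Singleton registry mapping each voice zone to its history container (assumed design)."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.containers = {}  # zone name -> list of recognized texts
        return cls._instance

    def container(self, zone: str) -> list:
        return self.containers.setdefault(zone, [])

ZoneHistory().container("driver").append("open the window")
# Any later construction returns the same instance, so the history is shared:
print(ZoneHistory().container("driver"))  # ['open the window']
```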
In this embodiment, the recognition texts corresponding to the historical voices of different voice zones are stored in different array containers set as single instances, so the historical voices of different voice zones can be shared.
In some possible embodiments, after determining the third voice from the second voice according to the intention of the first voice in the first parsing result and the association rule corresponding to the second voice, the method may further include:
The recognition texts corresponding to the first voice and the third voice are stored in the storage location of the historical voice of the first voice zone.
For example, taking the first voice being "open it a bit" from the secondary driving voice zone and the third voice being "open the window" from the primary driving voice zone as an example, the recognition texts corresponding to the first voice and the third voice may be combined into "open the window, open it a bit" and stored in the storage location of the historical voice corresponding to the secondary driving voice zone.
In this embodiment, the recognition texts corresponding to the first voice and the third voice are stored in the storage location of the historical voice of the first voice zone, so that when a new first voice is subsequently received, the second analysis result of the acquired second voice necessarily contains the control category of the second voice. The third voice associated with the first voice can thus be determined more quickly, further improving the efficiency of determining the third voice.
For example, "open it a bit" from the secondary driving voice zone is received at 20:01:30, "open the window" from the primary driving voice zone received at 20:01:00 is determined as its associated voice, and "open the window, open it a bit" is stored in the storage location of the historical voice corresponding to the secondary driving voice zone. When "a bit more" from the left rear voice zone is received at 20:01:50, the analysis result of "open the window, open it a bit" of the secondary driving voice zone, which includes a control category, can be used to determine the voice associated with "a bit more" of the left rear voice zone, instead of treating the bare "open it a bit" of the secondary driving voice zone at 20:01:30 as the historical voice. This avoids the situation in which the analysis result corresponding to "open it a bit" has no control category and the historical voice must be acquired again.
The foregoing description of the embodiments of the present disclosure has been presented primarily in terms of methods. To achieve the above functions, the solution includes corresponding hardware structures and/or software modules that perform the respective functions. Those skilled in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as combinations of hardware and computer software. Whether a function is implemented as hardware or as computer-software-driven hardware depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present disclosure.
In an exemplary embodiment, the embodiments of the disclosure further provide a voice association apparatus, which may be used to implement the voice association method of the foregoing embodiments.
Fig. 4 is a schematic diagram of a voice association apparatus according to an embodiment of the present disclosure. As shown in fig. 4, the apparatus may include: a receiving module 401, a parsing module 402, an obtaining module 403, a determining module 405, and a control module 407.
A receiving module 401, configured to receive a first voice from a first voice zone;
the parsing module 402 is configured to obtain a first parsing result of the first voice according to the first voice, where the first parsing result is used to indicate an intention of the first voice;
the obtaining module 403 is configured to obtain a recognition text corresponding to a second voice from the second voice zone, and a second analysis result of the second voice, where the second voice is a history voice, and the second analysis result is used to indicate a control category and an intention of the second voice;
the determining module 405 is configured to determine a third voice from the second voice according to the intention of the first voice in the first analysis result and an association rule corresponding to the second voice, where the association rule corresponding to the second voice is determined according to the control category in the second analysis result of the second voice, and the third voice is a second voice whose corresponding association rule includes an association rule that the intention of the first voice can satisfy;
the control module 407 is configured to determine a target control instruction according to the first voice and the third voice.
In some possible embodiments, the determining module 405 is specifically configured to:
determine an association rule corresponding to the second voice according to the second analysis result and a preset correspondence, where the preset correspondence is used to indicate the correspondence between the control category of the second voice and the association rule;
determine a target association rule that the intention of the first voice can satisfy according to the intention of the first voice;
and determine, according to the association rule corresponding to the second voice, a second voice whose corresponding association rule includes the target association rule as the third voice.
In some possible embodiments, the third voice includes a plurality of voices, and the control module 407 is specifically configured to:
generate a reply text according to a preset reply template, the first voice, and the third voice;
receive a fourth voice from the first voice zone;
obtain a fourth analysis result of the fourth voice according to the fourth voice, where the fourth analysis result is used to indicate the intention of the fourth voice;
determine a target voice from the third voices according to the similarity between the intention of the fourth voice and the intention of the third voice;
and determine a target control instruction according to the first voice and the target voice.
In some possible embodiments, the apparatus further comprises:
the screening module 404 is configured to determine, according to a preset time threshold, a fifth voice from the second voices after obtaining the second voice of the second voice zone and the second analysis result of the second voice, where the fifth voice is the second voice whose receiving duration meets the preset time threshold;
The determining module 405 is specifically configured to:
determine a third voice from the fifth voice according to the intention of the first voice in the first analysis result and the association rule corresponding to the fifth voice.
In some possible embodiments, the recognition text corresponding to the historical voice of the first voice zone is stored in a first array container, the recognition text corresponding to the second voice from the second voice zone is stored in a second array container, and the first array container and the second array container are set as single instances.
In some possible embodiments, the apparatus further comprises:
the storage module 406 is configured to store the recognition text corresponding to the first voice and the third voice in a storage location of the history voice in the first voice zone after determining the third voice from the second voice according to the intention of the first voice in the first analysis result and the association rule corresponding to the second voice.
It should be noted that the division of the modules in fig. 4 is schematic and merely a division of logical functions; other division manners may be used in actual implementation. For example, two or more functions may be integrated into one processing module. This is not limited in the embodiments of the present disclosure. The integrated modules may be implemented in the form of hardware or in the form of software functional modules.
In the technical solution of the present disclosure, the acquisition, storage, application, and other processing of the user personal information involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
In an exemplary embodiment, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the above embodiments. The electronic device may be the computer or server described above.
In an exemplary embodiment, the readable storage medium may be a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method according to the above embodiment.
In an exemplary embodiment, the computer program product comprises a computer program which, when executed by a processor, implements the method according to the above embodiments.
Fig. 5 illustrates a schematic block diagram of an example electronic device 500 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit the implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the electronic device 500 includes a computing unit 501 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic device 500 may also be stored. The computing unit 501, ROM 502, and RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
A number of components in electronic device 500 are connected to I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, etc.; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508 such as a magnetic disk, an optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the electronic device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the various methods and processes described above, such as the voice association method. For example, in some embodiments, the voice association method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the voice association method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the voice association method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (15)

1. A voice association method, the method comprising:
receiving a first voice from a first voice zone;
obtaining a first analysis result of the first voice according to the first voice, wherein the first analysis result is used for indicating the intention of the first voice;
acquiring a recognition text corresponding to a second voice from a second voice zone and a second analysis result of the second voice, wherein the second voice is a historical voice, and the second analysis result is used for indicating the control category and the intention of the second voice;
determining a third voice from the second voice according to the intention of the first voice in the first analysis result and the association rule corresponding to the second voice, wherein the association rule corresponding to the second voice is determined according to the control category in the second analysis result of the second voice, and the third voice is a second voice whose corresponding association rule comprises an association rule that the intention of the first voice can satisfy;
and determining a target control instruction according to the first voice and the third voice.
2. The method according to claim 1, wherein the determining a third voice from the second voice according to the intention of the first voice in the first analysis result and the association rule corresponding to the second voice comprises:
determining an association rule corresponding to the second voice according to the second analysis result and a preset correspondence, wherein the preset correspondence is used to indicate the correspondence between the control category of the second voice and the association rule;
determining a target association rule that the intention of the first voice can satisfy according to the intention of the first voice;
and determining, according to the association rule corresponding to the second voice, a second voice whose corresponding association rule comprises the target association rule as the third voice.
3. The method according to claim 1 or 2, wherein the third voice comprises a plurality of voices, and the determining a target control instruction according to the first voice and the third voice comprises:
generating a reply text according to a preset reply template, the first voice and the third voice;
receiving a fourth voice from the first voice zone;
obtaining a fourth analysis result of the fourth voice according to the fourth voice, wherein the fourth analysis result is used for indicating the intention of the fourth voice;
determining a target voice from the third voices according to the similarity between the intention of the fourth voice and the intention of the third voice;
and determining the target control instruction according to the first voice and the target voice.
4. The method according to any one of claims 1-3, wherein after the acquiring a recognition text corresponding to a second voice from a second voice zone and a second analysis result of the second voice, the method further comprises:
determining a fifth voice from the second voices according to a preset time threshold, wherein the fifth voice is a second voice for which the time elapsed since it was received satisfies the preset time threshold;
wherein the determining a third voice from the second voice according to the intention of the first voice in the first analysis result and the association rule corresponding to the second voice comprises:
determining a third voice from the fifth voice according to the intention of the first voice in the first analysis result and the association rule corresponding to the fifth voice.
5. The method according to any one of claims 1-4, wherein the recognition text corresponding to the historical voice of the first voice zone is stored in a first array container, the recognition text corresponding to the second voice from the second voice zone is stored in a second array container, and the first array container and the second array container are set as single instances.
6. The method according to any one of claims 1-5, wherein after the determining a third voice from the second voice according to the intention of the first voice in the first analysis result and the association rule corresponding to the second voice, the method further comprises:
storing the recognition texts corresponding to the first voice and the third voice in a storage location of the historical voice of the first voice zone.
7. A speech-related device, the device comprising:
the receiving module is used for receiving the first voice from the first voice zone;
the analysis module is used for obtaining a first analysis result of the first voice according to the first voice, wherein the first analysis result is used for indicating the intention of the first voice;
The system comprises an acquisition module, a control module and a control module, wherein the acquisition module is used for acquiring a recognition text corresponding to a second voice from a second voice zone and a second analysis result of the second voice, the second voice is a historical voice, and the second analysis result is used for indicating the control category and the intention of the second voice;
the determining module is used for determining a third voice from the second voice according to the intention of the first voice in the first analysis result and the association rule corresponding to the second voice, wherein the association rule corresponding to the second voice is determined according to the control category of the second analysis result of the second voice, and the third voice is the second voice of which the association rule corresponding to the association rule comprises the association rule which can be met by the intention of the first voice;
and the control module is used for determining a target control instruction according to the first voice and the third voice.
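For illustration only: one way to wire the five modules of claim 7 into a single pipeline. The keyword parser and rule table below are stand-in stubs, not the patent's analysis or rules.

```python
# Hypothetical wiring of the claim-7 modules; stubs are invented.
class VoiceAssociationDevice:
    def __init__(self, parser, category_rules):
        self.parser = parser                  # analysis-module stand-in
        self.category_rules = category_rules  # control category -> rules

    def handle(self, first_voice, second_voices):
        first = self.parser(first_voice)                        # analysis
        history = [(v, self.parser(v)) for v in second_voices]  # acquisition
        third = [v for v, res in history                        # determining
                 if first["intent"] in self.category_rules.get(res["category"], ())]
        return {"command": first_voice, "context": third}       # control

def toy_parser(text):
    # Keyword-based stand-in for the real intent/category analysis.
    if "window" in text:
        return {"category": "window_control", "intent": "adjust_window"}
    return {"category": "other", "intent": "unknown"}

rules = {"window_control": {"adjust_window"}}
device = VoiceAssociationDevice(toy_parser, rules)
print(device.handle("roll the window down more",
                    ["open the window", "play some music"]))
# context keeps only the window-related historical voice
```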
8. The apparatus according to claim 7, wherein the determining module is specifically configured to:
determine an association rule corresponding to the second voice according to the second analysis result and a preset correspondence, wherein the preset correspondence is used for indicating the correspondence between the control category of the second voice and the association rule;
determine, according to the intention of the first voice, a target association rule that can be met by the intention of the first voice;
and determine, as the third voice, the second voice whose corresponding association rule comprises the target association rule.
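For illustration only: the three steps of claim 8 as a rule lookup, assuming the preset correspondence maps a control category to its association rules and that an intent label names the rule it can meet. Every table below is invented.

```python
# Hypothetical sketch of the claim-8 matching steps.
CATEGORY_TO_RULES = {  # stand-in "preset correspondence"
    "ac_control": {"adjust_temperature", "adjust_fan"},
    "media_control": {"change_track", "adjust_volume"},
}

def find_third_voices(first_intent: str, second_voices: list) -> list:
    # second_voices: list of (recognition_text, control_category) pairs.
    # Step 1: each second voice's rules come from its control category.
    # Step 2: the target rule is the one the first voice's intention meets;
    # here we treat the intent label itself as that rule.
    target_rule = first_intent
    # Step 3: keep the second voices whose rules include the target rule.
    return [text for text, category in second_voices
            if target_rule in CATEGORY_TO_RULES.get(category, ())]

third = find_third_voices(
    "adjust_temperature",
    [("turn on the AC", "ac_control"), ("play jazz", "media_control")],
)
print(third)  # ['turn on the AC']
```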
9. The apparatus according to claim 7 or 8, wherein the third voice comprises a plurality of voices, and the control module is specifically configured to:
generate a reply text according to a preset reply template, the first voice and the third voice;
receive a fourth voice from the first voice zone;
obtain a fourth analysis result of the fourth voice according to the fourth voice, wherein the fourth analysis result is used for indicating the intention of the fourth voice;
determine a target voice from the third voices according to the similarity between the intention of the fourth voice and the intention of each third voice;
and determine the target control instruction according to the first voice and the target voice.
10. The apparatus according to any one of claims 7-9, further comprising:
the screening module is used for determining a fifth voice from the second voice according to a preset time threshold after the second voice of the second voice zone and the second analysis result of the second voice are obtained, wherein the fifth voice is the second voice of which the receiving duration meets the preset time threshold;
The determining module is specifically configured to:
determine a third voice from the fifth voice according to the intention of the first voice in the first analysis result and the association rule corresponding to the fifth voice.
11. The apparatus according to any one of claims 7-10, wherein the recognition text corresponding to the historical voice of the first voice zone is stored in a first array container, the recognition text corresponding to the second voice from the second voice zone is stored in a second array container, and the first array container and the second array container are provided as a single instance.
12. The apparatus according to any one of claims 7-11, further comprising:
the storage module is used for storing, after the third voice is determined from the second voice according to the intention of the first voice in the first analysis result and the association rule corresponding to the second voice, the recognition texts corresponding to the first voice and the third voice at the storage location of the historical voice of the first voice zone.
13. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-6.
CN202311740472.6A, filed 2023-12-18 with priority date 2023-12-18: Voice association method, device, equipment and storage medium (status: Pending; published as CN117877470A)

Priority Applications (1)

Application Number: CN202311740472.6A; Priority Date: 2023-12-18; Filing Date: 2023-12-18; Title: Voice association method, device, equipment and storage medium

Publications (1)

Publication Number: CN117877470A; Publication Date: 2024-04-12

Family

ID=90591017

Family Applications (1)

Application Number: CN202311740472.6A (Pending); Publication: CN117877470A (en); Title: Voice association method, device, equipment and storage medium

Country Status (1)

Country: CN; Publication: CN117877470A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination