CN114495921A - Voice processing method and device and computer storage medium - Google Patents


Info

Publication number
CN114495921A
CN114495921A (application CN202011256851.4A)
Authority
CN
China
Prior art keywords
voice
account
receiver
information
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011256851.4A
Other languages
Chinese (zh)
Inventor
时红仁
应臻恺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Qwik Smart Technology Co Ltd
Original Assignee
Shanghai Qwik Smart Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Qwik Smart Technology Co Ltd
Priority to CN202011256851.4A
Publication of CN114495921A

Classifications

    • G10L15/22 — Speech recognition: procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06Q10/1093 — Time management: calendar-based scheduling for persons or groups
    • G06Q10/1095 — Calendar-based scheduling: meeting or appointment
    • G10L15/30 — Speech recognition: distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L17/22 — Speaker identification or verification: interactive procedures; man-machine interfaces
    • H04L51/10 — User-to-user messaging in packet-switching networks: multimedia information
    • H04L67/12 — Network protocols specially adapted for proprietary or special-purpose networking environments, e.g. networks in vehicles
    • G10L2015/225 — Speech recognition: feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Economics (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a voice processing method, a voice processing apparatus, and a computer storage medium. The voice processing method comprises the following steps: acquiring a first voice, wherein the first voice carries receiver information; determining a receiver account based on the receiver information; and, when a preset condition is met, playing the first voice through a target speaker device, wherein the target speaker device is a speaker device logged into by the receiver account. The voice processing method, apparatus, and computer storage medium enable simple and fast voice interaction between different users through speaker devices, improving the user experience.

Description

Voice processing method and device and computer storage medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech processing method and apparatus, and a computer storage medium.
Background
The smart speaker builds on the traditional speaker by incorporating communication technology, sensing technology, and other Internet technologies; beyond its basic audio-playback function, it keeps gaining new capabilities suited to modern life. The smart speaker is regarded not only as a core interaction element of the future smart-home Internet of Things, but also as an aggregation point for a wide range of AI content and skills. However, existing smart speakers support only simple conversations with a single user and cannot provide voice interaction between different users; how to achieve such interaction through speaker devices is under active research.
The foregoing description is provided for general background information and is not admitted to be prior art.
Disclosure of Invention
An object of the present invention is to provide a voice processing method, apparatus, and computer storage medium that enable voice interaction between different users through speaker devices.
Another object of the present invention is to provide a voice processing method, apparatus, and computer storage medium in which the speaker device used for voice output can be determined based on the user's real-time location, ensuring the timeliness of voice interaction between different users.
Another object of the present invention is to provide a voice processing method, apparatus, and computer storage medium that can quickly determine the receiver account corresponding to an acquired user voice.
Another object of the present invention is to provide a voice processing method, apparatus, and computer storage medium that can automatically detect whether a reminder condition derived from user input is satisfied, and output the reminder voice accurately and in a timely manner.
Another object of the present invention is to provide a voice processing method, apparatus, and computer storage medium that play a voice only when user-privacy conditions are met, thereby avoiding information leakage and better protecting user privacy.
Additional advantages and features of the invention will become apparent from the following detailed description and may be realized by means of the instruments and combinations particularly pointed out in the appended claims.
In accordance with one aspect of the present invention, the foregoing and other objects and advantages are achieved by a speech processing method comprising the steps of:
acquiring a first voice, wherein the first voice carries receiver information;
determining a receiver account based on the receiver information; and
when a preset condition is met, playing the first voice through a target speaker device, wherein the target speaker device is a speaker device logged into by the receiver account.
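As an illustrative sketch only (not part of the claims), the three claimed steps could be modeled as follows. All names (`process_voice`, `logins`, `address_book`, `condition_met`) are hypothetical and chosen for illustration:

```python
# Illustrative sketch of the claimed three-step flow (all names hypothetical).

def process_voice(first_voice, logins, address_book, condition_met):
    """first_voice: dict with 'device_id' and parsed 'receiver_info' (e.g. a nickname).
    logins: maps device_id -> logged-in account (the stored login correspondence).
    address_book: maps account -> {contact name/nickname -> contact account}.
    condition_met: callable(device_id, account) -> bool (the preset condition)."""
    receiver_info = first_voice["receiver_info"]

    # Steps 1-2: resolve the receiver account via the sender's address book.
    sender_account = logins[first_voice["device_id"]]
    receiver_account = address_book[sender_account].get(receiver_info)
    if receiver_account is None:
        return None  # no match; disambiguation would be needed

    # Step 3: play on every speaker device the receiver account is logged into,
    # but only where the preset condition is satisfied.
    targets = [dev for dev, acc in logins.items() if acc == receiver_account]
    return [dev for dev in targets if condition_met(dev, receiver_account)]
```

This abstracts away audio transport entirely; it only shows the account-resolution and gating logic the steps describe.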
According to an embodiment of the present invention, determining the receiver account based on the receiver information comprises the following steps:
querying, based on the device identifier of the speaker device that collected the first voice, the stored login correspondence between the device identifiers of different speaker devices and accounts, to obtain the target account corresponding to that device identifier;
acquiring the address book information of the target account; and
determining the receiver account based on the address book information of the target account and the receiver information. In this way, the user does not need to separately enter the receiver account: the receiver account can be determined quickly, further improving the user experience.
According to an embodiment of the present invention, determining the receiver account based on the address book information of the target account and the receiver information comprises the following steps:
when, based on the address book information of the target account, it is determined that multiple users match the receiver information, or that no user matches it, selecting as the receiver account the account of a user who satisfies a preset rule, based on preset information, and/or selecting as the receiver account the account of the user indicated by an input selection instruction. Thus, when the receiver account cannot be determined directly from the address book information and the receiver information, it is determined by combining preset information and/or an input selection operation, which is both flexible and accurate.
According to an embodiment of the invention, the preset condition comprises at least one of the following conditions:
detecting that the receiver account has logged in on the speaker device;
detecting that the receiver user corresponding to the receiver account is near the target speaker device; and
detecting that only the receiver user corresponding to the receiver account and designated users are near the target speaker device, wherein the designated users are preset by the receiver user. This ensures that the voice reaches the real receiver in time, reduces the risk of information leakage, and improves the security and privacy of voice interaction.
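A minimal sketch of evaluating these three alternative conditions, assuming presence detection is abstracted as a set of nearby user accounts (the `condition` keys and parameter names are hypothetical):

```python
# Hypothetical evaluation of the three alternative preset conditions.

def preset_condition_met(condition, receiver_account, device, nearby_users, allowed):
    """condition: which of the three checks to apply.
    device: dict with 'logged_in_account' for the target speaker device.
    nearby_users: set of user accounts detected near the target device.
    allowed: users the receiver has pre-designated as permitted listeners."""
    if condition == "logged_in":
        # Condition 1: the receiver account has logged in on the speaker device.
        return device["logged_in_account"] == receiver_account
    if condition == "receiver_nearby":
        # Condition 2: the receiver user is detected near the target device.
        return receiver_account in nearby_users
    if condition == "only_receiver_and_designated":
        # Condition 3: nobody is nearby except the receiver and designated users.
        return (receiver_account in nearby_users
                and nearby_users <= {receiver_account} | set(allowed))
    raise ValueError(f"unknown condition: {condition}")
```

How presence ("near the target device") is actually detected is left open by the text; voiceprint, camera, or paired-phone proximity would all fit this interface.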
Accordingly, the present invention provides a speech processing apparatus for performing the above method, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps: acquiring a first voice, wherein the first voice carries receiver information; determining a receiver account based on the receiver information; and, when a preset condition is met, playing the first voice through a target speaker device, wherein the target speaker device is a speaker device logged into by the receiver account.
Accordingly, the present invention provides a computer storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the above-described speech processing method.
Drawings
Fig. 1 is a schematic diagram of an application environment of a speech processing method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a speech processing method according to an embodiment of the present invention;
Fig. 3 is an interaction diagram of a speech processing method according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, reciting an element with the phrase "comprising a(n)" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element. Further, where similarly named elements or features appear in different embodiments of the disclosure, they may have the same or different meanings; the particular meaning is determined by their interpretation in, or the context of, the specific embodiment.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information and, similarly, second information may also be referred to as first information, without departing from the scope herein. The word "if" as used herein may be interpreted as "upon," "when," or "in response to determining," depending on the context. Likewise, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including," when used in this specification, specify the presence of stated features, steps, operations, elements, components, items, species, and/or groups, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, species, and/or groups thereof. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination: thus, "A, B or C" or "A, B and/or C" means any of the following: A; B; C; A and B; A and C; B and C; A, B and C. An exception to this definition occurs only when a combination of elements, functions, steps, or operations is inherently mutually exclusive in some way.
It should be understood that, although the steps in the flowcharts in the embodiments of the present application are shown in the order indicated by the arrows, they are not necessarily performed in that order; unless explicitly stated herein, they may be performed in other orders. Moreover, at least some of the steps in the figures may comprise multiple sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times and in different orders, and may be performed alternately or interleaved with other steps or with sub-steps or stages of other steps.
It should be noted that step numbers such as S101 and S102 are used herein for the purpose of more clearly and briefly describing the corresponding contents, and do not constitute a substantial limitation on the sequence, and those skilled in the art may perform S102 first and then S101 in specific implementations, but these steps should be within the scope of the present application.
It should be understood that the specific embodiments described herein are merely illustrative of and not restrictive on the broad application.
In the following description, suffixes such as "module," "component," or "unit" used to denote elements are used only for convenience of description and have no specific meaning in themselves; thus "module," "component," and "unit" may be used interchangeably.
Fig. 1 is a schematic diagram of an application environment of a voice processing method according to an embodiment of the present invention, in which a speaker device 1 and a server 2 exchange data through a network 3. After a user logs in to the server 2, the speaker device 1 receives an input first voice carrying receiver information and sends the first voice to the server 2; after acquiring the first voice, the server 2 determines the receiver account based on the receiver information and, when a preset condition is met, plays the first voice through the speaker device 1 logged into by the receiver account, thereby achieving simple and fast voice interaction between different users through speaker devices. The speaker device 1 that receives the first voice and the speaker device 1 logged into by the receiver account may be the same speaker device or different speaker devices. It should be noted that the speaker device 1 may be an in-vehicle head unit, an in-vehicle speaker, a smart speaker, or the like. In addition, fig. 1 shows only two speaker devices for illustrative purposes; one or more speaker devices may be used in practical applications. Optionally, the speaker device 1 or the server 2 may also perform the following steps: after acquiring a first voice carrying receiver information, determining a receiver account based on the receiver information and, when a preset condition is met, playing the first voice through a target speaker device, wherein the target speaker device is a speaker device logged into by the receiver account.
Referring to fig. 2, which is a schematic flowchart of a voice processing method according to an embodiment of the present invention, the voice processing method may be executed by a voice processing apparatus according to an embodiment of the present invention. The voice processing apparatus may be implemented in software and/or hardware, and may specifically be the speaker device 1 or the server 2 in fig. 1. The voice processing method comprises the following steps:
step S101: acquiring a first voice, wherein the first voice carries information of a receiver;
in an embodiment, the main execution body of the voice processing method is a sound box device, and the sound box device may acquire the first voice in response to detecting that the preset wake-up key is triggered, and/or acquire the first voice in response to receiving the preset wake-up voice or the preset wake-up word. Therefore, the voice is rapidly acquired through the rapid awakening operation, and the voice processing efficiency is improved. Here, the sound box device may collect, by using a voice collecting device such as a microphone, a first voice carrying the receiver information input by the user. In addition, after the user wakes up the sound box device, the user can control the sound box device to log in. The sound box device may support one or more login manners, for example, the sound box device may support voiceprint login, fingerprint login, face login, voice instruction login, and the like, and specifically, for different users, different voice instructions may be set as login verification information, for example, a first user may use a voice instruction "voicelet, sesame open door" as login verification information, and a second user may use a voice instruction "voicelet, tomaymi" as login verification information. Of course, the speaker device may also log in through a connected mobile terminal having a management application. In order to guide the user to log in the account, the sound box device can play a login prompt audio or display a login prompt message and the like.
In one embodiment, before the speaker device acquires the first voice, the method may further comprise: obtaining login verification information, and sending the device identifier of the speaker device together with the login verification information to a server in order to log in to the target account. The device identifier of the speaker device may be unique identification information of the speaker device, such as a barcode, a two-dimensional code, or an electronic product code preset on the speaker device. Taking login verification voice as an example: after the speaker device acquires the login verification voice input by the user, it sends the login verification voice and the device identifier of the speaker device to the server, requesting the server to verify the login verification voice and log in to the corresponding target account. It can be understood that the purpose of sending the device identifier to the server is so that, after the login verification information passes verification, the server can associate the device identifier of the speaker device with the target account, completing the login and facilitating subsequent voice interaction. In addition, after the login verification information passes verification and the target account is allowed to log in on the speaker device, the server may send the speaker device a verification-passed response message together with information related to the target account, such as the user name.
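The login request described above could be sketched as a simple message carrying the device identifier and the verification data. The wire format and all field names here are assumptions for illustration; the specification does not define a protocol:

```python
import json

# Hypothetical login request a speaker device might send to the server:
# its unique device identifier plus the user's login verification data
# (a voiceprint sample, fingerprint, face image, or spoken passphrase).

def build_login_request(device_id, method, credential_bytes):
    assert method in {"voiceprint", "fingerprint", "face", "voice_command"}
    return json.dumps({
        "type": "login",
        "device_id": device_id,                # e.g. value from the device's barcode/QR code
        "method": method,
        "credential": credential_bytes.hex(),  # raw verification data, hex-encoded
    })

req = json.loads(build_login_request("SPK-0001", "voice_command", b"open sesame"))
```

On success, the server's verification-passed response would carry the target-account information (e.g. the user name) back to the device, as the paragraph above describes.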
In another embodiment, the execution body of the voice processing method is a server, and the server acquiring the first voice may mean receiving the first voice sent by a speaker device. Here, the speaker device that sends the first voice to the server may be referred to as the speaker device logged into by the sender account or the sender user.
It should be noted that the receiver information may include at least one of the following: the receiver's name, nickname, account, etc.; for example, the first voice may contain a nickname such as "mom" or "dad." Of course, there may be one piece of receiver information or several. In addition, the first voice may also carry the device identifier of the speaker device logged into by the sender account or sender user.
Step S102: determining a receiver account based on the receiver information.
Specifically: based on the device identifier of the speaker device that collected the first voice, the stored login correspondence between the device identifiers of different speaker devices and accounts is queried to obtain the target account corresponding to that device identifier; the address book information of the target account is acquired; and the receiver account is determined based on the address book information of the target account and the receiver information. In this way, the user does not need to separately enter the receiver account: the receiver account can be determined quickly, further improving the user experience.
In one embodiment, the execution body of the voice processing method is a speaker device, and the speaker device determining the receiver account based on the receiver information may specifically comprise the following steps: the speaker device queries local address book information and takes the account corresponding to the user who matches the receiver information as the receiver account; or the speaker device sends the server a query request carrying the receiver information and the device identifier of the speaker device, so as to query the login correspondence, stored on the server, between the device identifiers of different speaker devices and accounts; it then receives the address book information of the target account corresponding to the device identifier, returned by the server in response to the query request, and determines the receiver account based on the address book information of the target account and the receiver information.
In another embodiment, the execution body of the voice processing method is a server, and the server determining the receiver account based on the receiver information specifically comprises the following steps: based on the device identifier of the speaker device, the server queries the stored login correspondence between the device identifiers of different speaker devices and accounts to obtain the target account corresponding to that device identifier; acquires the address book information of the target account; and determines the receiver account based on the address book information of the target account and the receiver information. It can be understood that, after receiving the first voice carrying the receiver information sent by the speaker device, the server learns that the speaker device wants to send the first voice to the receiver user corresponding to the receiver information; at this point it must first determine the target account, which may also be called the sender account, logged in on that speaker device. As different users, or the same user with different accounts, log in to the same or different speaker devices, the server can record the login correspondence between each account and the device identifier of the speaker device it is logged into, making it easy to manage speaker devices and accounts.
In one embodiment, the method may further comprise: the server receives the device identifier of the speaker device and the login verification information sent by the speaker device, verifies the login verification information against the stored correspondence between different login verification information and accounts, logs the speaker device in to the target account corresponding to the login verification information after verification passes, and stores the login correspondence between the device identifier of the speaker device and the target account. Here, during account login by a speaker device, after receiving the device identifier and login verification information sent by the speaker device, the server verifies the login verification information against the stored correspondence between different login verification information and accounts — that is, it checks whether the stored login verification information includes the login verification information sent by the speaker device. If it does, verification passes; the server can then identify the target account corresponding to the login verification information and store the login correspondence between the device identifier of the speaker device and the target account. It should be noted that the login verification information may be voice information, fingerprint information, or a face image. In addition, after the login verification information sent by the speaker device passes verification, the server may also send the speaker device a verification-passed response message indicating that the speaker device has logged in to the target account; this response message may also carry the target account information. Further, the correspondence between different login verification information and accounts stored by the server is acquired when users register their accounts.
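The server-side behavior described above amounts to two lookups: credential → account (populated at registration) and device identifier → account (populated at login). A minimal sketch, with all class and attribute names hypothetical:

```python
# Hypothetical server-side handling: verify the credential against the stored
# credential -> account table, then record the device-id -> account login mapping.

class LoginServer:
    def __init__(self, credential_to_account):
        # Correspondence between login verification info and accounts,
        # acquired when users register their accounts.
        self.credential_to_account = credential_to_account
        self.device_logins = {}  # device_id -> account (login correspondence)

    def handle_login(self, device_id, credential):
        account = self.credential_to_account.get(credential)
        if account is None:
            return {"ok": False}                   # verification failed
        self.device_logins[device_id] = account    # store the login correspondence
        return {"ok": True, "account": account}    # verification-passed response
```

The stored `device_logins` mapping is exactly what Step S102 later queries to resolve the sender's target account from the device identifier.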
Similar to a phone's contact list, a WeChat address book, or a QQ contact list, the address book information of the target account includes the names or nicknames of different users and their corresponding accounts — for example, friends' names and accounts, or family members' nicknames and accounts. Here, the corresponding account includes, but is not limited to, a mobile phone number, SIM card information, a social account, or a payment account. Taking registration with a mobile phone number as an example: a first user registers with the server using mobile phone number A as an account, and the first user's dad registers using mobile phone number B; the first user can add dad's registered account — mobile phone number B — to the address book of the first user's own account through a friend-adding operation or the like, and set the corresponding contact name to the nickname "dad." The server may also extract the contact data on the first user's mobile phone, determine which of the first user's contacts are registered with the server, and recommend that the first user add one or more of those registered contacts to the address book of the first user's account.
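An address-book lookup of this kind can be sketched as matching the receiver info against each entry's name or nickname and collecting every candidate account, since a nickname may match several contacts (the entry field names are assumptions):

```python
# Hypothetical address-book lookup: match the receiver info (a name or nickname)
# and return every candidate account, since a nickname may match several contacts.

def match_receiver(address_book_entries, receiver_info):
    """address_book_entries: list of {'name': ..., 'nickname': ..., 'account': ...}."""
    return [e["account"] for e in address_book_entries
            if receiver_info in (e["name"], e["nickname"])]
```

A single match determines the receiver account directly; zero or multiple matches trigger the disambiguation rules of the following paragraphs.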
In one embodiment, determining the recipient account based on the address book information of the target account and the recipient information may include: querying the address book information of the target account, and taking the account corresponding to the user whose name or nickname matches the recipient information as the recipient account. For example, if the recipient information includes the name "Zhang San", the address book information of the target account is queried, and the account of the user named "Zhang San" in the address book is used as the recipient account.
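As a sketch, the address-book lookup described in this embodiment might look like the following, where the entry fields `name` and `account` are illustrative assumptions:

```python
# Illustrative sketch (assumed data shapes): look up the recipient account by
# matching the carried name/nickname against the target account's address book.

def find_recipient_account(address_book, recipient_name):
    """address_book: list of {"name": ..., "account": ...} entries."""
    matches = [e["account"] for e in address_book if e["name"] == recipient_name]
    if len(matches) == 1:
        return matches[0]
    # zero or multiple matches fall through to the screening rules described
    # in the next embodiment
    return None
```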
In another embodiment, determining the recipient account based on the address book information of the target account and the recipient information may include: when, based on the address book information of the target account, it is determined that multiple users match the recipient information or that no user matches it, selecting as the recipient account the account of a user meeting a preset rule based on preset information, and/or the account of the user indicated by an input selection instruction. Here, the preset information and the preset rule may be set according to actual needs. For example, if the preset information includes chat records, the preset rule may include at least one of the following: the user has a chat record with the sender account within a preset time length before the current time; and the name of the chat object in the chat record matches the recipient information. In this way, the recipient user can be determined quickly and accurately by a simple screening operation. For example: assuming the recipient information includes the nickname "teacher" and the address book information of the target account records multiple users named or nicknamed "teacher", the account of the teacher who has a chat record within a preset time range (e.g., 10 minutes or 24 hours) before the current time may be used as the recipient account.
Or, assuming the recipient information includes the nickname "sixpania" and the address book information of the target account records no user with that name or nickname, the chat messages with different chat objects in the chat records may be queried; if a chat message addresses a chat object as "sixpania" (for example, "sixpania, shall we play basketball tomorrow?"), the account of that chat object may be taken as the recipient account. It should be noted that when multiple users are determined to match the recipient information based on the address book information of the target account, the information of the matching users may be output for the user to choose from, and the selection instruction input by the user may be a voice selection instruction or a touch selection instruction. For example, if multiple users match the recipient information "Li Si", the information of those users may be output by voice so that the user can select the intended recipient by voice. Therefore, when the recipient account cannot be determined directly from the address book information and the recipient information, it is determined with the help of preset information and/or an input selection operation, which is flexible in operation and highly accurate.
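The screening rule based on chat-record recency can be illustrated with the following sketch; the data shapes (a candidate account list and a map from account to last chat timestamp) are assumptions for illustration:

```python
import time

# Hedged sketch of the screening rule described above: when several
# address-book entries match the recipient nickname, prefer the one with a
# chat record with the sender within a preset window before the current time.

def pick_by_recent_chat(candidates, chat_records, window_seconds, now=None):
    """candidates: list of accounts; chat_records: {account: last_chat_timestamp}."""
    now = now if now is not None else time.time()
    recent = [a for a in candidates
              if now - chat_records.get(a, float("-inf")) <= window_seconds]
    # a single survivor can be used directly; otherwise ask the user to choose
    return recent[0] if len(recent) == 1 else None
```

When the filter leaves zero or more than one candidate, the embodiment falls back to outputting the candidates and letting the user select one by voice or touch.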
Step S103: when a preset condition is met, playing the first voice through target sound box equipment; and the target loudspeaker box device is the loudspeaker box device logged in by the account of the receiver.
In one embodiment, when the first voice is a reminder voice, that is, the first voice is used to remind a user, the preset condition may include at least one of the following: the current time meets the time reminder condition indicated by the first voice; the current position meets the position reminder condition indicated by the first voice; and the occurrence of the event indicated by the first voice is detected. For example, if the first voice is "remind me at half past two in the afternoon to attend the meeting at three pm on time", then when the current time reaches half past two in the afternoon, the time reminder condition indicated by the first voice is met, and the first voice is played through the target speaker device. The current position meeting the position reminder condition indicated by the first voice may mean that the current position is at the reminder location indicated by the first voice, or that the distance between the current position and that location is less than a preset distance. For example, assuming the first voice is "remind me when we arrive at Hotel A", if the distance between the current position and Hotel A is less than the preset distance, the position reminder condition indicated by the first voice is met, and the first voice is played through the target speaker device. Detecting the occurrence of the event indicated by the first voice may mean detecting that the event has occurred or is occurring; for example, assuming the first voice is "remind me when the weather forecast is broadcast", the first voice is played through the target speaker device while the weather forecast is being broadcast.
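A minimal sketch of evaluating the three reminder-type preset conditions, assuming a simple reminder record with optional `remind_time`, `remind_pos`, and `remind_event` fields and a planar distance check (the field names and the distance threshold are illustrative assumptions):

```python
# Minimal sketch of the reminder-type preset conditions (time / location / event).

def reminder_due(reminder, current_time, current_pos=None, event_active=False,
                 max_distance_m=200.0):
    # time condition: current time has reached the indicated reminder time
    if reminder.get("remind_time") is not None and current_time >= reminder["remind_time"]:
        return True
    # location condition: current position is within a preset distance of the
    # indicated reminder location
    if reminder.get("remind_pos") is not None and current_pos is not None:
        (x1, y1), (x2, y2) = reminder["remind_pos"], current_pos
        if ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5 <= max_distance_m:
            return True
    # event condition: the indicated event has been detected as occurring
    if reminder.get("remind_event") is not None and event_active:
        return True
    return False
```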
In this embodiment, the first voice is determined to be a reminder voice when it meets at least one of the following: it carries no recipient information; the recipient it carries is the sender itself; or keywords such as a time, a place, or "remind" are detected in it. Here, after the first voice is acquired, it may be directly recognized as a reminder voice or another type of voice, or it may first be checked whether an instruction message indicating that a reminder voice is about to be input was acquired before the first voice. In this way, the voice can reach the recipient in time, improving the effectiveness and convenience of voice interaction.
In another embodiment, when the first voice is a chat voice, that is, the first voice is intended as a conversation or chat with a user, the preset condition may include at least one of the following: detecting that the recipient account has logged in to the speaker device; detecting that the recipient user corresponding to the recipient account is near the target speaker device; and detecting that only the recipient user corresponding to the recipient account and a designated user are near the target speaker device, where the designated user is preset by the recipient user or set by default by the system. In this way, the voice reaches the real recipient in time, the risk of information leakage is reduced, and the security and privacy of voice interaction are improved. Here, in an embodiment where the execution body of the voice processing method is a speaker device, the speaker device may detect whether the recipient user and/or the designated user corresponding to the recipient account is nearby in one or more ways, for example through at least one of voiceprint recognition, image recognition, infrared detection, and Bluetooth signal strength. It can be understood that the speaker device may collect sounds in the surrounding environment through a built-in sound collection device, such as a microphone, and perform voiceprint recognition on the collected sounds to detect whether the recipient user and/or the designated user corresponding to the recipient account is near the speaker device.
For example, assume the recipient user corresponding to the recipient account logs in to the speaker device by voice. After acquiring the first voice, the speaker device collects sounds in the surrounding environment and performs voiceprint recognition on them based on the voice the recipient user input when logging in; if recognition succeeds, the recipient user is near the speaker device. The speaker device may also detect whether the recipient user is nearby through a built-in human-presence sensor such as an infrared detector. In addition, the recipient user may use a mobile terminal or wearable device to establish a Bluetooth connection with the speaker device; the detected Bluetooth signal strength between the speaker device and the mobile terminal or wearable device then represents the distance between the recipient user and the speaker device. If the Bluetooth signal strength is greater than a preset strength threshold, the recipient user is detected to be near the speaker device; otherwise, the recipient user is not nearby.
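The Bluetooth-signal-strength check described above reduces to a threshold comparison on the measured RSSI; the threshold value below is an assumed example, since the patent leaves the preset strength threshold unspecified:

```python
# Illustrative proximity check using Bluetooth signal strength: the recipient
# is considered near the speaker when the measured RSSI exceeds a preset
# threshold. The threshold value here is an assumption for illustration.

RSSI_THRESHOLD_DBM = -70  # assumed preset strength threshold

def recipient_nearby(rssi_dbm, threshold_dbm=RSSI_THRESHOLD_DBM):
    # stronger (less negative) signal means the paired phone/wearable -- and
    # so presumably the recipient user -- is close to the speaker device
    return rssi_dbm > threshold_dbm
```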
Examples are as follows: supposing that a user inputs voice ' mom ' through a vehicle-mounted sound box, the bank card password is 123456 ', the mom of the user logs in an intelligent sound box at home by using a corresponding account of the mom and the intelligent sound box acquires the voice, the intelligent sound box determines whether to output the voice according to different occasions, for example, the voice is output when only the mom of the user is detected to be at home or only the mom of the user and other appointed family members are detected to be at home, and the voice is not output when the guest is at home except the mom of the user and/or other appointed family members are detected to be at home, so that the voice can be timely known by a receiver and the information leakage risk is reduced.
In another embodiment, when the execution body of the voice processing method is a server, the server may detect whether the recipient account has logged in to a speaker device by checking whether login verification information related to the recipient account has been received. The server may also determine, from detection results sent by the speaker device, whether the recipient user corresponding to the recipient account is near the target speaker device, and whether only the recipient user and the designated user are near it. For example, after receiving a voice sent by a first speaker device, if the server detects that the recipient account corresponding to the voice has logged in to a second speaker device, the server may send the voice to the second speaker device so that the second speaker device plays it.
In summary, in the voice processing method provided in the above embodiments, after a first voice is acquired, the recipient account is determined based on the recipient information carried in the first voice, and when a preset condition is met, the first voice is played through the speaker device logged in with the recipient account, so that voice interaction between different users is realized through speaker devices. Meanwhile, the speaker device that outputs the voice can be determined based on the user's real-time position, ensuring timely voice interaction between different users. Moreover, the recipient account corresponding to a user voice can be determined quickly from the acquired voice, the type of voice the user inputs can be detected automatically, and a reminder voice can be output accurately and in time once the reminder condition is met, improving the user experience. Finally, the voice is played only when the user privacy conditions are met, avoiding information leakage and better protecting user privacy.
In another embodiment, the first voice further carries a sender account, and the method further includes: when a second voice is acquired within a preset time length, playing the second voice through the speaker device logged in with the sender account. Here, the execution body of the voice processing method is a speaker device: when the speaker device logged in with the recipient account acquires a second voice within the preset time length, the speaker device logged in with the sender account plays the second voice, realizing quick voice interaction. For example, suppose the speaker devices include a first speaker device and a second speaker device, and the sender user inputs a first voice to the first speaker device, which is logged in with the sender account; through the server, the first speaker device has the second speaker device, logged in with the recipient account, play the first voice. The second speaker device can remain in a wake, listening, or activated state for a preset time after playing the first voice. If the recipient user inputs a second voice to the second speaker device within that time, the second speaker device takes the sender account of the first voice as the recipient account of the second voice, and sends the second voice together with that account to the server, so that the first speaker device logged in with the sender account plays the second voice through the server. The first speaker device and the second speaker device may be the same device or different devices.
In this way, when a second voice is acquired within the preset time after the first voice is played, it is played directly through the speaker device logged in with the sender account, without waking the speaker device again or repeating the recipient information; the operation is convenient, and voice interaction efficiency is improved.
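The reply window described above can be sketched as follows, with illustrative names: a second voice captured before the preset duration elapses is routed back to the sender account of the first voice, and anything later is treated as a new first voice.

```python
import time

# Sketch of the reply window: after the second speaker plays the first voice
# it stays active for a preset duration; a voice captured inside that window
# is routed back to the original sender's account without re-waking the
# device or repeating recipient information. Names are assumptions.

class ReplyWindow:
    def __init__(self, sender_account, duration_s, opened_at=None):
        self.sender_account = sender_account
        self.duration_s = duration_s
        self.opened_at = opened_at if opened_at is not None else time.time()

    def route_reply(self, captured_at=None):
        captured_at = captured_at if captured_at is not None else time.time()
        if captured_at - self.opened_at <= self.duration_s:
            # reply goes straight back to the sender of the first voice
            return self.sender_account
        return None  # window expired; treat as a fresh first voice
```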
Based on the same inventive concept as the foregoing embodiments, the technical solution above is described in detail below through a specific application scenario. In this embodiment, the execution bodies of the voice processing method include a speaker device and a server; the speaker device includes a first speaker device and a second speaker device, the first speaker device being a car machine and the second a smart speaker. Referring to fig. 3, an interaction diagram of the voice processing method provided by the embodiment of the present invention, the method includes the following steps:
S201, the car machine acquires a login verification voice input by a user;
in this embodiment, the login authentication voice is "123" as an example.
S202, the vehicle machine sends login verification voice to a server;
S203, the server performs voiceprint recognition on the login verification voice;
specifically, the server performs voiceprint recognition on the login verification voice "123" to determine whether a registered user's voiceprint matches the voiceprint of the speaker who uttered "123".
S204, the server successfully identifies and controls the vehicle machine to log in the target account;
here, after the server successfully recognizes the voiceprint of the login verification voice "123", it obtains the target account corresponding to the matched user, controls the car machine to log in to the target account, and returns a login response message to the car machine.
S205, the car machine acquires a first voice input by a user;
in this embodiment, the first voice is "mom, get home at night" as an example for explanation.
S206, the car machine sends a first voice to a server;
here, the car machine sends the first voice "mom, late-time go home" to the server.
S207, the server searches for a receiver account corresponding to the first voice in the address book of the target account;
here, the server looks up the account of mom of the user in the address book of the target account, and takes the account as the account of the receiving party.
S208, the server controls the intelligent sound box to log in an account of the receiver;
here, the user of the smart speaker may input a login authentication voice to the smart speaker, the smart speaker transmits the login authentication voice to the server to request to log in the account of the recipient, and the server transmits a response message to the smart speaker to log in the account of the recipient after the login authentication voice passes, so that the smart speaker successfully logs in the account of the recipient.
S209, the server sends a first voice to the intelligent sound box;
here, when the server detects that the login account of the smart speaker is the account of the receiver, the server sends the first voice to the smart speaker.
S210, when the intelligent sound box detects that a preset condition is met, playing the first voice;
here, the smart speaker plays the first voice when detecting that the mother of the user is nearby.
S211, the intelligent sound box obtains a second voice in a preset time length;
in this embodiment, the second voice is the reply voice input by the user's mother: "OK, I'm going for a walk in the park." For example, assuming the preset time length is 10 seconds: after the smart speaker finishes playing the first voice "mom, get home at night", the user's mother replies with the second voice within 5 seconds, so the smart speaker acquires the second voice.
S212, the intelligent sound box sends the second voice to a server;
s213, the server sends the second voice to the vehicle machine;
and S214, the car machine plays the second voice.
In this way, voice interaction between users is realized simply and quickly through the car machine and the smart speaker, improving the user experience.
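The S201-S214 interaction can be condensed into the following sketch, with the server modelled as plain dictionaries; all names and data shapes are illustrative assumptions:

```python
# End-to-end sketch of the S205-S210 delivery path: resolve the recipient
# account from the sender's address book (S207), forward the voice to the
# speaker logged in with that account (S209), and play it only when the
# preset condition holds (S210). All names here are assumptions.

def deliver_first_voice(server_logins, address_book, voice, recipient_name,
                        preset_condition_met):
    """server_logins: {account: device}; returns (device, voice) to play, or None."""
    # S207: resolve recipient account from the sender's address book
    account = address_book.get(recipient_name)
    if account is None:
        return None
    # S209: forward the voice to the speaker logged in with that account
    target_device = server_logins.get(account)
    # S210: the target speaker only plays when the preset condition holds
    return (target_device, voice) if (target_device and preset_condition_met) else None
```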
Based on the same inventive concept as the foregoing embodiments, an embodiment of the present invention provides a voice processing apparatus. As shown in fig. 4, the voice processing apparatus may be disposed on a server, a car machine, a smart speaker, or another device; its specific location of installation is not limited here. The voice processing apparatus includes: a processor 110 and a memory 111 for storing a computer program capable of running on the processor 110. The single processor 110 illustrated in fig. 4 does not indicate that the number of processors is one, but only the position of the processor 110 relative to other components; in practice, there may be one or more processors 110. Similarly, the single memory 111 illustrated in fig. 4 only indicates its position relative to other components; in practice, there may be one or more memories 111.
Wherein, when the processor 110 is used to run the computer program, the following steps are executed:
acquiring a first voice, wherein the first voice carries information of a receiving party;
determining a receiver account based on the receiver information; and
when a preset condition is met, playing the first voice through target sound box equipment; and the target loudspeaker box device is the loudspeaker box device logged in by the account of the receiver.
Therefore, voice interaction among different users is realized through the sound box equipment; meanwhile, the loudspeaker box equipment for outputting the voice can be determined based on the real-time position of the user, so that the timeliness of voice interaction among different users is ensured; moreover, the account of the receiving party corresponding to the user voice can be quickly determined according to the acquired user voice, the requirement of the user input can be automatically detected, the reminding condition can be automatically met, and the reminding voice can be accurately and timely output, so that the use experience of the user is improved; and the voice is played under the condition of meeting the user privacy conditions, so that information leakage is avoided, and the user privacy is better protected.
In an embodiment, the first voice further carries a sender account, and the processor 110 is configured to execute the following steps when running the computer program:
and when the second voice is acquired within the preset time length, playing the second voice through the sound box equipment logged in by the account of the sender.
In another embodiment, the processor 110 is configured to execute the following steps when executing the computer program:
inquiring the stored login corresponding relation between the equipment identifications of different sound box equipment and the account based on the equipment identification of the sound box equipment for collecting the first voice, and obtaining a target account corresponding to the equipment identification;
acquiring address book information of the target account; and
and determining the account of the receiver based on the address book information of the target account and the information of the receiver.
Therefore, the user does not need to additionally input the account of the receiving party, the account of the receiving party can be quickly determined, and the use experience of the user is further improved.
In another embodiment, the processor 110 is configured to execute the computer program to perform the following steps:
when it is determined that a plurality of users matched with the receiver information exist or no users matched with the receiver information exist based on the address book information of the target account, an account corresponding to a user meeting a preset rule is selected as a receiver account based on preset information, and/or an account corresponding to a user indicated by a selection instruction is selected as a receiver account based on an input selection instruction.
Therefore, when the account of the receiver cannot be directly determined based on the address book information and the receiver information, the account of the receiver is determined by combining preset information and/or input selection operation, and the method is flexible in operation and high in accuracy.
In another embodiment, the preset information comprises a chat log; the preset rule comprises at least one of the following rules:
a chat record is recorded with the account of the sender within a preset time length before the current time; and
and matching the name of the chat object corresponding to the chat record with the receiver information.
Therefore, the user of the receiving party can be determined quickly and accurately through simple screening operation.
In another embodiment, the preset condition comprises at least one of the following conditions:
the current time meets the time reminding condition indicated by the first voice;
the current position meets the position reminding condition indicated by the first voice; and
an event occurrence indicated by the first voice is detected.
Therefore, the voice can be ensured to be known by the receiver in time, and the effectiveness and convenience of voice interaction are improved.
In another embodiment, the preset condition comprises at least one of the following conditions:
detecting that the account of the receiving party logs in the loudspeaker box device;
detecting that a receiver user corresponding to the receiver account is near the target loudspeaker box device; and
and detecting that only a receiver user and an appointed user corresponding to the receiver account are near the target sound box device, wherein the appointed user is preset by the receiver user.
Therefore, the voice can be ensured to be timely known by a real receiver, the information leakage risk is reduced, and the voice interaction safety and privacy are improved.
In another embodiment, the processor 110 is configured to execute the computer program to perform the following steps:
acquiring a first voice in response to detecting that a preset wake-up key is triggered; and/or
acquiring a first voice in response to receiving a preset wake-up voice or a preset wake-up word.
Therefore, the voice is rapidly acquired through the rapid awakening operation, and the voice processing efficiency is improved.
The speech processing apparatus may further include: at least one network interface 112. The various components of the speech processing apparatus are coupled together by a bus system 113. It will be appreciated that the bus system 113 is used to enable communications among the components. The bus system 113 includes a power bus, a control bus, and a status signal bus in addition to the data bus. For clarity of illustration, however, the various buses are labeled as bus system 113 in FIG. 4.
The memory 111 may be a volatile memory, a nonvolatile memory, or include both volatile and nonvolatile memories. The nonvolatile memory may be a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a ferroelectric random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 111 described in the embodiments of the invention is intended to comprise, without being limited to, these and any other suitable types of memory.
The memory 111 in the embodiment of the present invention is used to store various types of data to support the operation of the voice processing apparatus. Examples of such data include: any computer program for operating on the speech processing apparatus, such as an operating system and application programs; contact data; telephone directory data; a message; a picture; video, etc. The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs may include various application programs such as a Media Player (Media Player), a Browser (Browser), etc. for implementing various application services. Here, a program that implements the method of the embodiment of the present invention may be included in the application program.
Based on the same inventive concept as the foregoing embodiments, this embodiment further provides a computer storage medium storing a computer program. The computer storage medium may be a memory such as a ferroelectric random access memory (FRAM), a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); or it may be a device including one or any combination of the above memories, such as a mobile phone, computer, tablet device, or personal digital assistant.
Wherein the computer program, when executed by a processor, performs the steps of:
acquiring a first voice, wherein the first voice carries information of a receiver;
determining a recipient account based on the recipient information; and
when a preset condition is met, playing the first voice through target sound box equipment; and the target loudspeaker box device is the loudspeaker box device logged in by the account of the receiver.
Therefore, voice interaction among different users is realized through the sound box equipment; meanwhile, the loudspeaker box equipment for outputting the voice can be determined based on the real-time position of the user, so that the timeliness of voice interaction among different users is ensured; moreover, the account of the receiving party corresponding to the user voice can be quickly determined according to the acquired user voice, the requirement of the user input can be automatically detected, the reminding condition can be automatically met, and the reminding voice can be accurately and timely output, so that the use experience of the user is improved; and the voice is played under the condition that the user privacy condition is met, so that information leakage is avoided, and the user privacy is better protected.
In one embodiment, the first voice further carries a sender account, and the computer program, when executed by the processor, performs the following steps:
and when the second voice is acquired within the preset time length, playing the second voice through the sound box equipment logged in by the account of the sender.
In another embodiment, the computer program, when executed by a processor, performs the steps of:
inquiring the stored login corresponding relation between the equipment identifications of different sound box equipment and the account based on the equipment identification of the sound box equipment for collecting the first voice, and obtaining a target account corresponding to the equipment identification;
acquiring address book information of the target account; and
and determining the account of the receiver based on the address book information of the target account and the information of the receiver.
Therefore, the user does not need to additionally input the account of the receiving party, the account of the receiving party can be quickly determined, and the use experience of the user is further improved.
In another embodiment, the computer program, when executed by a processor, performs the steps of:
when it is determined that a plurality of users matched with the receiver information exist or no users matched with the receiver information exist based on the address book information of the target account, an account corresponding to a user meeting a preset rule is selected as a receiver account based on preset information, and/or an account corresponding to a user indicated by a selection instruction is selected as a receiver account based on an input selection instruction.
In this way, when the receiver account cannot be determined directly from the address book information and the receiver information, it is determined by combining preset information and/or an input selection operation, which is flexible to operate and highly accurate.
In another embodiment, the preset information comprises a chat log; the preset rule comprises at least one of the following rules:
a chat record exists with the sender account within a preset time length before the current time; and
the name of the chat object in the chat record matches the receiver information.
In this way, the receiver user can be determined quickly and accurately through a simple screening operation.
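The chat-log screening rules above can be sketched as a simple filter over the sender's chat records. `RECENT_S` and the record fields are assumptions for illustration, not details taken from the patent.

```python
# Illustrative sketch: among ambiguous candidates, pick the one who has a
# chat record with the sender within a preset time length and whose
# chat-object name matches the receiver information.

RECENT_S = 24 * 3600  # preset time length before the current moment (seconds)

def pick_receiver(candidates, chat_records, receiver_info, now):
    """chat_records: sender's chats, each {"account", "name", "timestamp"}."""
    for rec in chat_records:
        recent = now - rec["timestamp"] <= RECENT_S
        if recent and rec["account"] in candidates and rec["name"] == receiver_info:
            return rec["account"]
    return None  # fall back to an input selection instruction from the user
```

Returning `None` here corresponds to the case where the preset rule does not single out a user, so the embodiment falls back to an input selection instruction.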
In another embodiment, the preset condition comprises at least one of the following conditions:
the current time meets the time reminding condition indicated by the first voice;
the current position meets the position reminding condition indicated by the first voice; and
the occurrence of an event indicated by the first voice is detected.
This ensures that the receiver learns of the voice in time, improving the effectiveness and convenience of voice interaction.
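The reminder gate can be sketched as a disjunction of the three conditions listed above. The condition-dictionary layout is an assumption made for illustration.

```python
# Hedged sketch of the reminder gate: the first voice is played once any of
# the time, position, or event conditions it indicates is satisfied.

def reminder_due(condition, now, position, observed_events):
    """Return True when at least one preset reminder condition is met."""
    if "time" in condition and now >= condition["time"]:
        return True                                   # time reminder condition
    if "position" in condition and position == condition["position"]:
        return True                                   # position reminder condition
    if "event" in condition and condition["event"] in observed_events:
        return True                                   # indicated event detected
    return False
```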
In another embodiment, the preset condition comprises at least one of the following conditions:
detecting that the receiver account is logged in on the target speaker device;
detecting that the receiver user corresponding to the receiver account is near the target speaker device; and
detecting that only the receiver user corresponding to the receiver account and a designated user are near the target speaker device, the designated user being preset by the receiver user.
This ensures that the voice is learned of in time by the genuine receiver, reducing the risk of information leakage and improving the security and privacy of voice interaction.
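One strict combination of the privacy conditions above can be sketched as follows; the function and parameter names are illustrative, and an embodiment may use any subset of the listed conditions.

```python
# Sketch of a strict privacy variant: play only when the receiver account is
# logged in on the device, or when only the receiver user (plus users the
# receiver has designated in advance) is detected near the target speaker.

def may_play(receiver, logged_in_account, nearby_users, designated):
    """Return True when a privacy condition of this embodiment holds."""
    if logged_in_account == receiver:
        return True                           # receiver logged in on the device
    if receiver in nearby_users:
        bystanders = set(nearby_users) - {receiver} - set(designated)
        return not bystanders                 # no unapproved users nearby
    return False
```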
In another embodiment, the computer program, when executed by a processor, performs the steps of:
acquiring the first voice in response to detecting that a preset wake-up key is triggered; and/or
acquiring the first voice in response to receiving a preset wake-up voice or a preset wake-up word.
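The two wake-up triggers can be sketched as a single predicate; `WAKE_WORD` and `should_acquire` are assumed names, not anything specified in the patent.

```python
# Illustrative sketch of the wake-up triggers: voice acquisition starts on
# a preset key press and/or on receiving a preset wake-up word.

WAKE_WORD = "hello speaker"  # hypothetical preset wake-up word

def should_acquire(key_pressed, utterance=None):
    """Return True when a wake-up key press or wake-up word is detected."""
    heard_wake = utterance is not None and WAKE_WORD in utterance.lower()
    return key_pressed or heard_wake
```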
In this way, the voice is acquired quickly through a fast wake-up operation, improving voice processing efficiency.

The technical features of the embodiments described above may be combined arbitrarily. For brevity, not every possible combination of these technical features is described; nevertheless, as long as a combination contains no contradiction, it should be considered within the scope of this specification.
As used herein, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion that includes not only the listed elements but also other elements not expressly listed.
The above description covers only specific embodiments of the present invention, but the scope of the invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within its scope. Therefore, the protection scope of the present invention shall be subject to the appended claims.

Claims (10)

1. A method of speech processing, the method comprising the steps of:
acquiring a first voice, wherein the first voice carries information of a receiver;
determining a receiver account based on the receiver information; and
when a preset condition is met, playing the first voice through a target speaker device, wherein the target speaker device is a speaker device on which the receiver account is logged in.
2. The method of claim 1, wherein the first voice also carries a sender account, the method further comprising:
when a second voice is acquired within a preset time length, playing the second voice through the speaker device on which the sender account is logged in.
3. The method of claim 1, wherein the determining a receiver account based on the receiver information comprises:
querying, based on the device identifier of the speaker device that collected the first voice, a stored login correspondence between device identifiers of different speaker devices and accounts, to obtain a target account corresponding to the device identifier;
acquiring address book information of the target account; and
determining the receiver account based on the address book information of the target account and the receiver information.
4. The method of claim 3, wherein the determining a receiver account based on the address book information of the target account and the receiver information comprises:
when it is determined, based on the address book information of the target account, that several users match the receiver information or that no user matches it, selecting as the receiver account the account of a user who satisfies a preset rule, based on preset information, and/or the account of the user indicated by an input selection instruction.
5. The method of claim 4, wherein the preset information comprises a chat log; the preset rule comprises at least one of the following rules:
a chat record exists with the sender account within a preset time length before the current time; and
the name of the chat object in the chat record matches the receiver information.
6. The method of claim 1, wherein the preset condition comprises at least one of:
the current time meets the time reminding condition indicated by the first voice;
the current position meets the position reminding condition indicated by the first voice; and
the occurrence of an event indicated by the first voice is detected.
7. The method of claim 1, wherein the preset condition comprises at least one of:
detecting that the receiver account is logged in on the target speaker device;
detecting that the receiver user corresponding to the receiver account is near the target speaker device; and
detecting that only the receiver user corresponding to the receiver account and a designated user are near the target speaker device, the designated user being preset by the receiver user.
8. The method of claim 1, wherein said obtaining a first voice comprises the steps of:
acquiring the first voice in response to detecting that a preset wake-up key is triggered; and/or
acquiring the first voice in response to receiving a preset wake-up voice or a preset wake-up word.
9. A speech processing apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the speech processing method according to any of claims 1 to 8 when executing the computer program.
10. A computer storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the speech processing method according to any one of claims 1 to 8.
CN202011256851.4A 2020-11-11 2020-11-11 Voice processing method and device and computer storage medium Pending CN114495921A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011256851.4A CN114495921A (en) 2020-11-11 2020-11-11 Voice processing method and device and computer storage medium

Publications (1)

Publication Number Publication Date
CN114495921A true CN114495921A (en) 2022-05-13

Family

ID=81489969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011256851.4A Pending CN114495921A (en) 2020-11-11 2020-11-11 Voice processing method and device and computer storage medium

Country Status (1)

Country Link
CN (1) CN114495921A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112447179A (en) * 2019-08-29 2021-03-05 中国移动通信有限公司研究院 Voice interaction method, device, equipment and computer readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104301399A (en) * 2014-09-28 2015-01-21 深圳市星盘科技有限公司 System and method for remotely controlling loudspeaker box through voice
CN106791024A (en) * 2016-11-30 2017-05-31 广东欧珀移动通信有限公司 Voice messaging player method, device and terminal
CN107770047A (en) * 2017-10-12 2018-03-06 上海斐讯数据通信技术有限公司 Intelligent sound box, the system and method for realizing based on intelligent sound box social functions
CN108648754A (en) * 2018-04-26 2018-10-12 北京小米移动软件有限公司 Sound control method and device
CN108809809A (en) * 2018-06-08 2018-11-13 腾讯科技(武汉)有限公司 Message method, computer equipment and storage medium
CN108924033A (en) * 2018-06-19 2018-11-30 四川斐讯信息技术有限公司 A kind of the koinotropic type's intelligent sound box exchange method and system of polygonal color participation
CN109493866A (en) * 2018-10-29 2019-03-19 苏州乐轩科技有限公司 Intelligent sound box and its operating method

Similar Documents

Publication Publication Date Title
US10805470B2 (en) Voice-controlled audio communication system
JP7063990B2 (en) Personal domain for virtual assistant system on shared device
US9159324B2 (en) Identifying people that are proximate to a mobile device user via social graphs, speech models, and user context
US11989219B2 (en) Profile disambiguation
WO2018188587A1 (en) Voice response method and device, and smart device
US9898882B1 (en) System and method for customized message playback
US11068529B2 (en) Information output system, information output method, and program
CN109145560A (en) The method and device of accessing monitoring equipment
CN109313897A (en) Utilize the communication of multiple virtual assistant services
CN104580412A (en) Ad hoc networking based on content and location
CN105610700A (en) Group creating method and apparatus and electronic device
US10424186B2 (en) System and method for customized message playback
CN110457882B (en) Identity recognition preprocessing and identity recognition method and system
CN104035995A (en) Method and device for generating group tags
CN112017663B (en) Voice generalization method and device and computer storage medium
US20150347829A1 (en) Monitoring individuals using distributed data sources
CN114495921A (en) Voice processing method and device and computer storage medium
CN107729737B (en) Identity information acquisition method and wearable device
KR101125096B1 (en) System and method for making, managing and evaluating a meeting using the position information of a mobile terminal and computer readable medium processing the method
KR20060102601A (en) Service system and method for providing robot contents
CN113077803B (en) Voice processing method and device, readable storage medium and electronic equipment
CN115019806A (en) Voiceprint recognition method and device
CN108694782B (en) Control method of health sharing cabin and health sharing cabin system
CN112491690A (en) Method for transmitting voice information, mobile terminal, computer storage medium and system
CN110853633A (en) Awakening method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination