CN108694947B - Voice control method, device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN108694947B
Authority
CN
China
Prior art keywords
voice
target
segment
preset
application
Prior art date
Legal status
Active
Application number
CN201810681095.6A
Other languages
Chinese (zh)
Other versions
CN108694947A (en)
Inventor
Chen Yan (陈岩)
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201810681095.6A
Publication of CN108694947A
Priority to PCT/CN2019/085720 (WO2020001165A1)
Application granted
Publication of CN108694947B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11C - STATIC STORES
    • G11C7/00 - Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/16 - Storage of analogue signals in digital stores using an arrangement comprising analogue/digital [A/D] converters, digital memories and digital/analogue [D/A] converters

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)
  • Telephone Function (AREA)

Abstract

The application discloses a voice control method, apparatus, storage medium, and electronic device. The voice control method is applied to an electronic device and comprises the following steps: when the electronic device is in a voice monitoring state, updating a stored voice segment with the monitored voice data, where the voice segment is the voice data monitored within the latest preset duration; obtaining the current voice segment while the voice segment is being updated; extracting voiceprint features and keywords from the current voice segment; starting a voice recording function according to the voiceprint features and the keywords, and obtaining the voice segment at the moment the start succeeds as a target voice segment; and controlling the electronic device accordingly according to the recorded voice data and the target voice segment. In this way, the electronic device can be woken up and given an interactive instruction directly through one continuous utterance, without the speech being interrupted by the system preparation time, and the method is simple.

Description

Voice control method, device, storage medium and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a voice control method and apparatus, a storage medium, and an electronic device.
Background
With the wide adoption of mobile terminals, their voice assistants are also in common use. A user can use the voice assistant function of a mobile terminal to interact with the machine assistant by voice, so that under the user's voice control the assistant completes various operations on the mobile terminal, including operations on its application programs, such as setting a schedule, starting an alarm clock, creating a to-do item, opening an application, or making a call.
The start-up process of an existing voice assistant mainly comprises two stages: a wake-up stage and a preparation stage. For example, a terminal may monitor the user's voice in real time; when it detects that the user has spoken a wake-up word, the system performs the related preparation work to start the voice assistant. At present, after speaking the wake-up word, the user must wait until the system is ready before issuing a voice instruction; the waiting time is long and the continuity of speech is poor.
Disclosure of Invention
The embodiments of the application provide a voice control method, apparatus, storage medium, and electronic device, with which a voice instruction can be issued without waiting for the system to become ready, giving good speech continuity.
The embodiment of the application provides a voice control method, which is applied to electronic equipment and comprises the following steps:
when the electronic equipment is in a voice monitoring state, updating a stored voice segment by using monitored voice data, wherein the voice segment is a segment of voice data monitored in a latest preset time;
acquiring a current voice section in the updating process of the voice section;
extracting voiceprint features and keywords from the current voice section;
starting a voice recording function according to the voiceprint features and the keywords, and acquiring a voice segment at the successful starting moment as a target voice segment;
and correspondingly controlling the electronic equipment according to the recorded voice data and the target voice segment.
The embodiment of the present application further provides a voice control apparatus, which is applied to an electronic device, and includes:
the updating module is used for updating a stored voice segment by using the monitored voice data when the electronic equipment is in a voice monitoring state, wherein the voice segment is a segment of voice data monitored in the latest preset time;
the acquisition module is used for acquiring the current voice section in the updating process of the voice section;
the extraction module is used for extracting voiceprint features and keywords from the current voice section;
the starting module is used for starting a voice recording function according to the voiceprint features and the keywords and acquiring a voice segment at the successful starting moment as a target voice segment;
and the control module is used for correspondingly controlling the electronic equipment according to the recorded voice data and the target voice segment.
The embodiment of the application also provides a computer-readable storage medium, wherein a plurality of instructions are stored in the storage medium, and the instructions are suitable for being loaded by a processor to execute any one of the voice control methods.
An embodiment of the present application further provides an electronic device, which includes a processor and a memory electrically connected to each other; the memory is used to store instructions and data, and the processor loads the instructions to execute the steps of any one of the voice control methods described above.
The voice control method, apparatus, storage medium, and electronic device of the application are applied to an electronic device. When the electronic device is in a voice monitoring state, the stored voice segment is updated with the monitored voice data, the voice segment being the voice data monitored within the latest preset duration. Then, while the voice segment is being updated, the current voice segment is obtained and the voiceprint features and keywords are extracted from it. The voice recording function is then started according to the voiceprint features and the keywords, and the voice segment at the moment the start succeeds is obtained as the target voice segment. Finally, the electronic device is controlled accordingly according to the recorded voice data and the target voice segment. The electronic device can thus be woken up and given an interactive instruction directly through one continuous utterance, without the speech being interrupted by the system preparation time. The method is simple, can effectively improve the efficiency of voice interaction, and gives a good voice interaction effect.
Drawings
The technical solution and other advantages of the present application will become apparent from the detailed description of the embodiments of the present application with reference to the accompanying drawings.
Fig. 1 is a schematic view of an application scenario of a voice control system according to an embodiment of the present application.
Fig. 2 is a schematic flow chart of a voice control method according to an embodiment of the present application.
Fig. 3 is another schematic flow chart of a voice control method according to an embodiment of the present application.
Fig. 4 is a schematic diagram of a framework of a voice control process according to an embodiment of the present application.
Fig. 5 is a schematic diagram of a server parsing process provided in the embodiment of the present application.
Fig. 6 is a schematic structural diagram of a voice control apparatus according to an embodiment of the present application.
Fig. 7 is a schematic structural diagram of a control module according to an embodiment of the present application.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Fig. 9 is another schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a voice control method, a voice control device, a storage medium and electronic equipment.
Referring to fig. 1, fig. 1 provides a schematic view of an application scenario of a voice control system, where the voice control system may include any one of the voice control apparatuses provided in the embodiments of the present application, and the voice control apparatus may be integrated in an electronic device, and the electronic device may include a smart phone, a tablet computer, and other devices with a touch function.
When the electronic equipment is in a voice monitoring state, the electronic equipment can update a stored voice segment by using the monitored voice data, wherein the voice segment is a section of voice data monitored in the latest preset time; in the updating process of the voice section, acquiring a current voice section, and extracting voiceprint features and keywords from the current voice section; starting a voice recording function according to the voiceprint features and the keywords, and acquiring a voice segment at the successful starting moment as a target voice segment; and correspondingly controlling the electronic equipment according to the recorded voice data and the target voice segment.
For example, the preset duration may be manually set to 3 seconds, which is usually as long as the system preparation time needed to start the voice recording function. In fig. 1, the electronic device can monitor the user's voice in real time and, during monitoring, keep saving the last 3 seconds of voice data for voiceprint analysis and keyword matching. Once the user's voiceprint and keyword satisfy the condition, recording begins. For example, on the mobile phone of user A, suppose user A speaks the sentence "small x, play xxx song in xx application": the phone can start recording as soon as user A utters "small x", acquire the voice segment buffered at the moment the start succeeds, such as "play in xx application", and splice it with the subsequently recorded voice "xxx song" to obtain the continuous utterance "play xxx song in xx application", according to which the phone is then controlled.
As shown in fig. 2, fig. 2 is a schematic flow chart of a voice control method provided in an embodiment of the present application, which is applied to an electronic device, and the specific flow may be as follows:
101. when the electronic equipment is in a voice monitoring state, the stored voice segment is updated by using the monitored voice data, and the voice segment is a section of voice data monitored in the latest preset time length.
In this embodiment, the voice may be monitored by a device such as a microphone. The preset duration may be set manually, for example, 3 seconds, which is generally as long as the system preparation time for activating the voice recording function. Specifically, the electronic device may monitor the user voice in a low power consumption state, and store the voice data monitored within the latest preset time period in real time.
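As an illustration, the rolling "last preset duration" buffer described in step 101 can be sketched in Python; the frame size, the callback names, and the 3-second window are assumptions for the sketch, not the patent's implementation:

```python
from collections import deque

FRAME_MS = 20          # assumed frame size delivered by the audio capture callback
BUFFER_MS = 3000       # the "preset duration": keep only the last 3 seconds

# A deque with maxlen drops the oldest frame automatically, so the stored
# segment is always the most recently monitored BUFFER_MS of audio.
ring = deque(maxlen=BUFFER_MS // FRAME_MS)

def on_audio_frame(frame: bytes) -> None:
    """Called by a hypothetical microphone driver for every 20 ms frame."""
    ring.append(frame)

def current_segment() -> bytes:
    """Snapshot of the rolling 3-second voice segment."""
    return b"".join(ring)
```

Because the deque evicts old frames on its own, the update in step 101 reduces to appending each new frame.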
102. And in the updating process of the voice section, acquiring the current voice section.
In this embodiment, the voice monitoring and the voice segment updating are performed in real time, and in this process, the electronic device may also perform real-time analysis on the voice segment.
103. Voiceprint features and keywords are extracted from the current speech segment.
In this embodiment, the voiceprint feature is mainly a spectral feature, which may include information such as frequency and amplitude and generally reflects characteristics of the sound such as loudness, pitch, and timbre; different people generally have different voiceprint features. Specifically, a voice segment may be converted into spectrum data by Fourier transform, and the relevant information may then be extracted from the spectrum data as the corresponding voiceprint feature. The keyword may include at least one word, which may be in English, Chinese, or another language.
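The Fourier-transform step can be sketched as follows. The 16-band energy summary and the cosine-similarity comparison are simplifying assumptions made for this sketch; a real system would more likely use MFCCs or a speaker-embedding model:

```python
import numpy as np

def voiceprint_features(samples: np.ndarray) -> np.ndarray:
    """Summarize the magnitude spectrum of a voice segment into 16 band energies."""
    spectrum = np.abs(np.fft.rfft(samples))      # frequency/amplitude information
    bands = np.array_split(spectrum, 16)         # 16 coarse frequency bands
    energy = np.array([band.mean() for band in bands])
    return energy / (energy.sum() + 1e-9)        # normalize so vectors are comparable

def voiceprint_match(feature: np.ndarray, preset: np.ndarray,
                     threshold: float = 0.9) -> bool:
    """Cosine similarity against the enrolled (preset) feature vector."""
    sim = float(feature @ preset /
                (np.linalg.norm(feature) * np.linalg.norm(preset) + 1e-9))
    return sim >= threshold
```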
104. And starting a voice recording function according to the voiceprint features and the keywords, and acquiring a voice segment at the successful starting moment as a target voice segment.
For example, the step 104 may specifically include:
judging whether the voiceprint features are matched with preset features or not, and whether the keywords are matched with preset keywords or not;
if yes, starting a voice recording function;
if not, returning to execute the operation of obtaining the current voice section.
In this embodiment, the preset feature is mainly the voiceprint feature of a bound user and is usually set in advance; for example, the user may be required to record a segment of voice beforehand, from which the voiceprint feature is extracted as the preset feature and bound to the user. The preset keyword is mainly used to trigger the voice interaction function. It may be set by default (that is, set by the manufacturer when the electronic device leaves the factory), or set by the user according to personal preference; for example, the user may open the corresponding settings window from the relevant settings interface to configure it.
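Steps 102 to 104, together with the retry branch above, amount to a loop that re-reads the rolling segment until both checks pass. A minimal sketch follows; every callable here is an assumed hook rather than a real API:

```python
def wait_for_wakeup(get_segment, extract, preset_keyword, preset_feature,
                    features_match, start_recording):
    """Keep re-reading the rolling voice segment until both the keyword and the
    voiceprint match, then start recording and return the segment buffered at
    the moment the start succeeded (the target voice segment)."""
    while True:
        text, feature = extract(get_segment())
        if preset_keyword in text and features_match(feature, preset_feature):
            start_recording()          # system preparation work happens here
            return get_segment()       # buffer contents at successful-start time
```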
It should be noted that, in the process of starting the voice recording function, the electronic device needs to perform a series of preparation operations, such as interrupting the running of the foreground application and setting the calling parameters of the voice recording component.
105. And correspondingly controlling the electronic equipment according to the recorded voice data and the target voice segment.
For example, the step 105 may specifically include:
1-1, splicing the target voice segment and the recorded voice data to obtain spliced voice.
In this embodiment, because the duration of the voice segment is about the same as the system preparation duration, the voice segment stored at the moment preparation completes (that is, the moment the voice recording function starts successfully) may be exactly the continuous utterance the user made after speaking the preset keyword. The voice segment can therefore be used directly as the initial content of the recorded voice and spliced with the subsequently recorded voice data into one continuous piece of speech.
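The splicing of step 1-1 can be sketched as follows; representing audio as byte strings and the streaming helper are assumptions of the sketch:

```python
def spliced_stream(target_segment: bytes, recorded_chunks):
    """Yield the continuous utterance for real-time upload: first the segment
    buffered at startup time, then each newly recorded chunk as it arrives."""
    yield target_segment
    yield from recorded_chunks

def splice(target_segment: bytes, recorded_chunks) -> bytes:
    """Offline equivalent of step 1-1: one continuous byte string."""
    return b"".join(spliced_stream(target_segment, recorded_chunks))
```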
And 1-2, determining a control instruction according to the spliced voice.
For example, the step 1-2 may specifically include:
sending the spliced voice to a server so that the server performs semantic analysis on the spliced voice and returns a corresponding target analysis word;
determining a target application and an application event according to the returned target analytic words;
and generating a control instruction according to the target application and the application event.
In this embodiment, the server is mainly used for semantic parsing. During voice recording, the electronic device can stream the spliced voice to the server in real time, and the server can parse it with a trained semantic analysis model, which may be a deep learning model; the server can collect different voice samples in advance to train the deep learning model.
Further, the step of "determining the target application according to the returned target parsing word" may specifically include:
judging whether the target analytic word is matched with a stored analytic word set;
if yes, determining the target application according to the successfully matched analytic words;
if not, judging whether the target analytic word is matched with a preset word set or not; and when the matching is successful, determining the target application according to the successfully matched preset word, adding the successfully matched preset word into the analysis word set, and deleting the successfully matched preset word from the preset word set.
In this embodiment, the analytic word set and the preset word set mainly contain application-related words, such as application names or application types. The analytic word set holds the words obtained when the electronic device requested semantic parsing from the server in the past, while the preset word set is mainly set by default; for example, each time an application is installed on the terminal, the application-related words of that application can be added. Specifically, when the target analytic word is an application name, the application corresponding to that name can be used directly as the target application. When the analytic word is an application type, all applications of that type on the electronic device can first be determined, and then the most frequently used one taken as the target application; alternatively, a selection interface listing those applications can be presented to the user, and the target application determined by the user's selection.
It should be noted that, before any parsing has occurred, the preset word set may contain all the words set by default by the system and the analytic word set may be empty. Each time the electronic device obtains a new analytic word, it can then be stored in the analytic word set and deleted from the preset word set, so that both sets are continuously updated. By matching the target analytic word against the previous parsing records after each parse, the matching range can be narrowed according to the user's interaction habits, which improves the matching speed.
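The two-set lookup and promotion described above (analytic word set first, then preset word set) can be sketched as follows; `app_for`, a lookup from a matched word to an application, is a hypothetical helper:

```python
def resolve_app(target_word, analytic_words, preset_words, app_for):
    """Check the history (analytic word) set first; on a miss, fall back to the
    preset set and promote the hit so future lookups take the fast path.
    Returns None when neither set matches (the user is asked to repeat)."""
    if target_word in analytic_words:
        return app_for(target_word)
    if target_word in preset_words:
        preset_words.discard(target_word)
        analytic_words.add(target_word)   # promote into the history set
        return app_for(target_word)
    return None
```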
1-3, controlling the electronic equipment to execute the control instruction.
For example, the steps 1 to 3 may specifically include:
the application event is executed with the target application.
In this embodiment, if the target application is in the closed state, the target application may be started first, and then the corresponding application event may be executed by using the target application.
As can be seen from the above, the voice control method provided in this embodiment is applied to an electronic device. When the electronic device is in a voice monitoring state, the stored voice segment is updated with the monitored voice data, the voice segment being the voice data monitored within the latest preset duration. While the voice segment is being updated, the current voice segment is obtained and the voiceprint features and keywords are extracted from it. The voice recording function is then started according to the voiceprint features and the keywords, and the voice segment at the moment the start succeeds is obtained as the target voice segment. The electronic device is then controlled accordingly according to the recorded voice data and the target voice segment. The electronic device can thus be woken up and given an interactive instruction directly through one continuous utterance, without the speech being interrupted by the system preparation time. The method is simple, can effectively improve the efficiency of voice interaction, and gives a good voice interaction effect.
In the present embodiment, a description will be given from the perspective of a voice control apparatus, and in particular, a detailed description will be given taking an example in which the voice control apparatus is integrated in an electronic device.
Referring to fig. 3, a specific flow of a voice control method may be as follows:
201. when the electronic equipment is in a voice monitoring state, the electronic equipment updates the stored voice segment by using the monitored voice data, wherein the voice segment is a section of voice data monitored in the latest preset time length.
For example, the preset duration may be manually set to 3 seconds. As long as the electronic device is powered on, voice monitoring can run continuously, and of the monitored voice data the electronic device stores only the last 3 seconds.
202. In the updating process of the voice section, the electronic equipment acquires the current voice section and extracts the voiceprint features and the keywords from the current voice section.
For example, referring to fig. 4, suppose the user says to the electronic device "small x, play xxx song in xx application". Because the voice monitoring and voice-segment updating run in real time, the first 3 seconds of speech, such as "small x", are stored as the initial voice segment. The electronic device then extracts the voiceprint features and keywords from the stored voice segment, for example by converting it into spectrum data with a Fourier transform and extracting the relevant information from the spectrum data as the voiceprint features, while at the same time extracting the content of the voice segment to obtain the keywords.
203. The electronic device determines whether the voiceprint feature matches a preset feature and whether the keyword matches a preset keyword, if yes, the following step 204 is executed, and if not, the step 202 is executed in a return mode.
For example, the preset feature may be a voiceprint feature entered by the user in advance, and the preset keyword may be a default phrase set by the system, which may include at least two words or phrases. Of course, so that a voice segment of the preset duration can always contain the complete preset keyword, the preset keyword should be short, such as "small x".
204. The electronic equipment starts a voice recording function and acquires a voice segment at the successful starting moment as a target voice segment.
For example, in fig. 4, when analysis of the initial voice segment "small x, small x" finds that the user's voice meets the condition, the electronic device can be notified immediately to perform the related preparation tasks, such as stopping the foreground application and setting the calling parameters of the voice recording component.
205. And the electronic equipment splices the target voice segment and the recorded voice data to obtain spliced voice, and then sends the spliced voice to a server so that the server carries out semantic analysis on the spliced voice and returns corresponding target analysis words.
For example, after the voice recording function starts successfully, the user's subsequent speech is already stored by the recording itself, so it no longer needs to be saved again through the voice-segment update; at this point the updating of the voice segment can be stopped. The current voice segment, such as "play in xx application", is used as the initial content of the recorded voice and spliced with the subsequently recorded voice data into one continuous utterance; for example, within the first 2 seconds of recording the spliced voice may be "play in xx application". The spliced voice is meanwhile streamed to the server in real time for semantic parsing.
206. The electronic device determines whether the target parsing word is matched with the stored parsing word set, if so, determines a target application according to the successfully matched parsing word, and determines an application event according to the target parsing word, otherwise, executes the following step 207.
207. The electronic equipment judges whether the target analytic word is matched with a preset word set or not, if yes, the target application is determined according to the successfully matched preset word, an application event is determined according to the target analytic word, then the successfully matched preset word is added into the analytic word set, meanwhile, the successfully matched preset word is deleted from the preset word set, and if not, the user is prompted to input voice again.
For example, referring to fig. 5, for the target analytic word returned by the server, the electronic device may first match it against the previous parsing records, and only when that match fails does it match against the preset word set. When the target analytic word is an application name, the application corresponding to that name can be used directly as the target application; for example, the target analytic word obtained from the speech "play xxx song in xx application" may be "xx application", and the application event may be: play xxx song. When the target analytic word is an application type, all applications of that type on the electronic device can first be determined and the most frequently used one taken as the target application; for example, the target analytic word obtained from the speech "play xxx song" may be "music playing application", and the application event may be: play xxx song. In that case the applications C1, C2, and C3 belonging to music playing applications may be found on the electronic device, and if the C1 application is used most frequently, the target application is the C1 application.
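The "most frequently used application of the matching type" rule above can be sketched as follows; the app-record shape (`name`, `type`, `uses`) is an assumption for the sketch:

```python
def choose_target_app(app_type, installed_apps):
    """Pick the most frequently used installed app whose type matches the
    target analytic word; None if no app of that type is installed."""
    candidates = [app for app in installed_apps if app["type"] == app_type]
    if not candidates:
        return None
    return max(candidates, key=lambda app: app["uses"])["name"]
```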
208. The electronic device executes the application event with the target application.
For example, for the voice segment "play xxx song in xx application", if the xx application is not open at that moment, it can be opened first, after which the xx application finds xxx song and plays it.
As can be seen from the above, the voice control method provided in this embodiment is applied to an electronic device. When the electronic device is in a voice monitoring state, it updates the stored voice segment with the monitored voice data, the voice segment being the voice data monitored within the latest preset duration. While the voice segment is being updated, it obtains the current voice segment and extracts the voiceprint features and keywords from it, then judges whether the voiceprint features match the preset feature and whether the keywords match the preset keyword. If so, it starts the voice recording function, obtains the voice segment at the moment the start succeeds as the target voice segment, and splices the target voice segment with the recorded voice data to obtain the spliced voice. The device can thus be woken up and given an interactive instruction directly through one continuous utterance, without the speech being interrupted by the system preparation time, and the method is simple.
The spliced voice is then sent to a server, which performs semantic parsing on it and returns the corresponding target analytic word. The electronic device judges whether the target analytic word matches the stored analytic word set; if so, it determines the target application from the successfully matched analytic word and the application event from the target analytic word. If not, it judges whether the target analytic word matches the preset word set; on a successful match it determines the target application from the matched preset word and the application event from the target analytic word, then adds the matched preset word to the analytic word set and deletes it from the preset word set. Finally, it executes the application event with the target application. Matching can thus exploit the user's previous voice interaction habits, improving the matching efficiency of the analytic words, effectively improving the efficiency of voice interaction, and improving the user experience.
According to the method described in the foregoing embodiment, the embodiment will be further described from the perspective of a voice control apparatus, and the voice control apparatus may be specifically implemented as an independent entity, or may be implemented by being integrated in an electronic device, such as a terminal, where the terminal may include a mobile phone, a tablet computer, and the like.
Referring to fig. 6, fig. 6 specifically illustrates a voice control apparatus provided in an embodiment of the present application, which is applied to an electronic device, and the voice control apparatus may include: an update module 10, an acquisition module 20, an extraction module 30, a start module 40 and a control module 50, wherein:
(1) update module 10
The updating module 10 is configured to update a stored voice segment with the monitored voice data when the electronic device is in a voice monitoring state, where the voice segment is a segment of voice data monitored within a latest preset time duration.
In this embodiment, the voice may be monitored by a device such as a microphone. The preset duration may be set manually, for example to 3 seconds, and is generally equal to the system preparation time needed to start the voice recording function. Specifically, the electronic device may monitor the user's voice in a low-power-consumption state, and the updating module 10 stores the voice data monitored within the latest preset duration in real time.
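The rolling update described above, in which only the most recent preset duration of monitored voice data is retained, can be sketched as follows. This is an illustrative Python sketch only; the class name, frame format, and sample rate are assumptions, not part of this embodiment.

```python
import collections

class RollingVoiceBuffer:
    """Keeps only the most recent `window_s` seconds of monitored audio.

    Illustrative sketch: the sample rate and raw-sample frame format
    are assumptions for demonstration, not values from the patent.
    """

    def __init__(self, window_s=3.0, sample_rate=16000):
        self.max_samples = int(window_s * sample_rate)
        # A bounded deque silently drops the oldest samples once full,
        # so the buffer always holds the latest preset-duration window.
        self._buf = collections.deque(maxlen=self.max_samples)

    def update(self, frame):
        # Append newly monitored samples; old samples fall off the front.
        self._buf.extend(frame)

    def current_segment(self):
        # Snapshot of the current voice segment for feature extraction.
        return list(self._buf)
```

A buffer like this lets the extraction step read a consistent "current voice segment" at any time while monitoring continues.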
(2) Acquisition module 20
The obtaining module 20 is configured to obtain the current voice segment during the updating of the voice segment.
In this embodiment, the voice monitoring and the voice segment updating are performed in real time, and in this process, the electronic device may also perform real-time analysis on the voice segment.
(3) Extraction module 30
The extracting module 30 is configured to extract voiceprint features and keywords from the current voice segment.
In this embodiment, the voiceprint feature is mainly a spectral feature, which may include information such as frequency and amplitude and generally reflects characteristics of the voice such as loudness, pitch, and timbre; different people generally have different voiceprint features. Specifically, the voice segment may be converted into spectral data by a Fourier transform, and the relevant information may then be extracted from the spectral data as the corresponding voiceprint feature. The keyword may include at least one word, which may be in English, Chinese, or another language.
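The Fourier-transform step above can be sketched as follows. This is a deliberately toy feature (peak frequency and its amplitude); real voiceprint systems use richer spectral features, and the function name and parameters here are illustrative assumptions.

```python
import numpy as np

def voiceprint_features(segment, sample_rate=16000):
    """Toy spectral voiceprint: dominant frequency and its amplitude.

    Illustrative only -- the patent describes extracting frequency and
    amplitude information from spectral data; this sketch reduces that
    to the single strongest spectral peak.
    """
    spectrum = np.fft.rfft(segment)                      # Fourier transform
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sample_rate)
    magnitudes = np.abs(spectrum)                        # amplitude per bin
    peak = int(np.argmax(magnitudes))                    # dominant component
    return {"peak_freq_hz": float(freqs[peak]),
            "peak_amplitude": float(magnitudes[peak])}
```

For a pure 440 Hz tone, the returned `peak_freq_hz` lands on (or very near) 440.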
(4) Start module 40
The starting module 40 is configured to start a voice recording function according to the voiceprint features and the keywords, and to obtain the voice segment at the moment of successful start as the target voice segment.
For example, the starting module 40 may be specifically configured to:
judging whether the voiceprint features are matched with preset features or not, and whether the keywords are matched with preset keywords or not;
if yes, starting a voice recording function;
if not, triggering the obtaining module to execute the operation of obtaining the current voice section.
In this embodiment, the preset feature is mainly the voiceprint feature of a bound user and is usually set in advance; for example, the user may be asked to record a segment of voice beforehand, from which a voiceprint feature is extracted as the preset feature and bound to that user. The preset keyword is mainly used to trigger the voice interaction function; it may be set by default (that is, set by the manufacturer before the electronic device leaves the factory), or set by the user according to personal preference, for example by entering the corresponding setting window through the relevant setting interface.
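The two-condition gate described above (voiceprint matches the bound user's preset feature, and keyword matches a preset wake word) can be sketched as follows. The frequency-distance comparison and `tolerance_hz` threshold are illustrative assumptions, not the patent's actual matching criterion.

```python
def should_start_recording(voiceprint, keyword,
                           preset_feature, preset_keywords,
                           tolerance_hz=20.0):
    """Start recording only when BOTH conditions hold: the extracted
    voiceprint matches the bound user's preset feature, and the
    extracted keyword matches one of the preset wake keywords.

    The peak-frequency-distance match is a toy criterion for this
    sketch; a real system would compare richer voiceprint features.
    """
    feature_ok = abs(voiceprint["peak_freq_hz"]
                     - preset_feature["peak_freq_hz"]) <= tolerance_hz
    keyword_ok = keyword in preset_keywords
    return feature_ok and keyword_ok
```

If the gate returns false, the flow simply goes back to obtaining the next current voice segment, as the module description states.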
It should be noted that, in the process of starting the voice recording function, the electronic device needs to perform a series of preparation operations, such as interrupting the running of a foreground application and setting the calling parameters of the voice recording component.
(5) Control module 50
The control module 50 is configured to control the electronic device according to the recorded voice data and the target voice segment.
For example, referring to fig. 7, the control module 50 may specifically include:
A splicing unit 51, configured to splice the target voice segment and the recorded voice data to obtain a spliced voice.
In this embodiment, because the duration of the voice segment is about the same as the system preparation duration, the voice segment stored at the system preparation time (that is, the time at which the voice recording function is successfully started) is likely to be exactly the continuous utterance made by the user immediately after speaking the preset keyword. This voice segment can therefore be used directly as the initial content of the recording and spliced with the subsequently recorded voice data to form one continuous piece of voice.
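The splice itself is a simple prefix operation, sketched below with raw sample lists. A real implementation would also handle audio encodings and frame boundaries; those details are omitted here.

```python
def splice_voice(target_segment, recorded_data):
    """Prefix the buffered target segment (the speech captured while
    the recorder was still starting up) to the freshly recorded data,
    yielding one continuous utterance for semantic parsing.
    """
    return list(target_segment) + list(recorded_data)
```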
A determining unit 52, configured to determine a control instruction according to the spliced voice.
For example, the determining unit 52 may be specifically configured to:
judging whether the target analytic word is matched with a stored analytic word set;
if yes, determining the target application according to the successfully matched analytic words;
if not, judging whether the target analytic word is matched with a preset word set or not; and when the matching is successful, determining the target application according to the successfully matched preset word, adding the successfully matched preset word into the analysis word set, and deleting the successfully matched preset word from the preset word set.
In this embodiment, the server is mainly used for semantic parsing. During voice recording, the electronic device can transmit the spliced voice to the server in real time, and the server can parse it through a trained semantic parsing model, which may be a deep learning model; the server can collect different voice samples in advance to train this deep learning model.
Further, the determining unit 52 may specifically be configured to:
sending the spliced voice to a server so that the server performs semantic analysis on the spliced voice and returns a corresponding target analysis word;
determining a target application and an application event according to the returned target analytic words;
and generating a control instruction according to the target application and the application event.
In this embodiment, the parsed word set and the preset word set mainly contain application-related words, such as application names or application types. The parsed word set consists mainly of words parsed when the electronic device requested semantic parsing from the server during a historical period, while the preset word set is mainly set by default; for example, each time an application is installed on the terminal, the application-related words of that application can be obtained. Specifically, when the target parsed word is an application name, the application corresponding to that name may be used directly as the target application. When the parsed word is an application type, all applications in the electronic device belonging to that type may first be determined, and then the most frequently used one may be taken as the target application, or a selection interface may be presented to the user so that the target application is determined by the user's selection operation.
It should be noted that, at the time of the first parsing, the preset word set may contain all the words set by default by the system while the parsed word set may be empty; thereafter, each time the electronic device obtains a new parsed word, it can store that word in the parsed word set and delete it from the preset word set, thereby continuously updating both sets. By matching the target parsed word against previous parsing records first, the matching range can be narrowed according to the user's interaction habits, and the matching speed improved.
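The two-stage matching and the migration between the two word sets described above can be sketched as follows. This is an illustrative Python sketch; exact set-membership matching and the function name are assumptions, not the patent's implementation.

```python
def resolve_target_word(target_word, parsed_words, preset_words):
    """Match the server-returned parsed word first against the history
    of previously parsed words (the user's habits), then against the
    default preset set, migrating a new hit from preset_words into
    parsed_words so future lookups take the fast path.

    Returns the matched word, or None if neither set matches.
    """
    if target_word in parsed_words:
        return target_word                  # fast path: seen before
    if target_word in preset_words:
        parsed_words.add(target_word)       # remember for next time
        preset_words.discard(target_word)   # stop scanning it in defaults
        return target_word
    return None
```

Because each new word moves out of the preset set once matched, the frequently used words accumulate in the (habit-based) parsed set and the default set shrinks over time.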
An execution unit 53, configured to control the electronic device to execute the control instruction.
Further, the execution unit 53 may specifically be configured to:
the application event is executed with the target application.
In this embodiment, if the target application is in the closed state, the execution unit 53 may start the target application first, and then execute the corresponding application event by using the target application.
In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
As can be seen from the above description, the voice control apparatus provided in this embodiment is applied to an electronic device. When the electronic device is in a voice monitoring state, the updating module 10 updates a stored voice segment with the monitored voice data, where the voice segment is a segment of voice data monitored within the latest preset duration. During the updating of the voice segment, the obtaining module 20 obtains the current voice segment, and the extracting module 30 extracts a voiceprint feature and a keyword from it. The starting module 40 then starts the voice recording function according to the voiceprint feature and the keyword and obtains the voice segment at the moment of successful start as the target voice segment, after which the control module 50 controls the electronic device according to the recorded voice data and the target voice segment. In this way, the electronic device can be awakened and given an interactive instruction in one continuous utterance, without the voice being cut off by the system preparation period; the method is simple, can effectively improve voice interaction efficiency, and achieves a good voice interaction effect.
In addition, the embodiment of the application further provides an electronic device, and the electronic device can be a smart phone, a tablet computer and other devices. As shown in fig. 8, the electronic device 400 includes a processor 401, a memory 402. The processor 401 is electrically connected to the memory 402.
The processor 401 is a control center of the electronic device 400, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or loading an application program stored in the memory 402 and calling data stored in the memory 402, thereby integrally monitoring the electronic device.
In this embodiment, the processor 401 in the electronic device 400 loads instructions corresponding to the processes of one or more application programs into the memory 402, and the processor 401 runs the application programs stored in the memory 402 to implement the following functions:
when the electronic equipment is in a voice monitoring state, updating a stored voice segment by using the monitored voice data, wherein the voice segment is a section of voice data monitored in the latest preset time;
in the updating process of the voice section, acquiring a current voice section;
extracting voiceprint features and keywords from the current voice section;
starting a voice recording function according to the voiceprint features and the keywords, and acquiring a voice segment at the successful starting moment as a target voice segment;
and correspondingly controlling the electronic equipment according to the recorded voice data and the target voice segment.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. The electronic device 500 may include a Radio Frequency (RF) circuit 501, a memory 502 including one or more computer-readable storage media, an input unit 503, a display unit 504, a sensor 505, an audio circuit 506, a Wireless Fidelity (WiFi) module 507, a processor 508 including one or more processing cores, and a power supply 509. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 9 does not constitute a limitation of the electronic device, which may include more or fewer components than those shown, combine some components, or arrange the components differently.
The RF circuit 501 may be used to receive and transmit information, or to receive and transmit signals during a call; in particular, it receives downlink information from a base station and passes it to one or more processors 508 for processing, and it transmits uplink data to the base station. In general, the RF circuit 501 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 501 may also communicate with a network and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Message Service (SMS), and the like.
The memory 502 may be used to store application programs and data. The application programs stored in the memory 502 contain executable code and may constitute various functional modules. The processor 508 executes various functional applications and performs data processing by running the application programs stored in the memory 502. The memory 502 may mainly include a program storage area and a data storage area: the program storage area may store the operating system and the application program required by at least one function (such as a sound playing function or an image playing function), while the data storage area may store data created according to the use of the electronic device (such as audio data or a phone book). Further, the memory 502 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 502 may also include a memory controller to provide the processor 508 and the input unit 503 with access to the memory 502.
The input unit 503 may be used to receive input numbers, character information, or user characteristic information (such as a fingerprint), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. In particular, in one specific embodiment, the input unit 503 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or touch pad, may collect touch operations performed by the user on or near it (for example, operations performed by the user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) and drive the corresponding connection device according to a preset program. Optionally, the touch-sensitive surface may include two parts: a touch detection device and a touch controller. The touch detection device detects the position touched by the user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch-point coordinates, and sends them to the processor 508, and it can also receive and execute commands sent by the processor 508.
The display unit 504 may be used to display information input by or provided to a user and various graphical user interfaces of the electronic device, which may be made up of graphics, text, icons, video, and any combination thereof. The display unit 504 may include a display panel. Alternatively, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation is transmitted to the processor 508 to determine the type of touch event, and then the processor 508 provides a corresponding visual output on the display panel according to the type of touch event. Although in FIG. 9 the touch sensitive surface and the display panel are two separate components to implement input and output functions, in some embodiments the touch sensitive surface may be integrated with the display panel to implement input and output functions.
The electronic device may also include at least one sensor 505, such as light sensors, motion sensors, and other sensors. In particular, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel according to the brightness of ambient light, and a proximity sensor that may turn off the display panel and/or the backlight when the electronic device is moved to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when the mobile phone is stationary, and can be used for applications of recognizing the posture of the mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which may be further configured to the electronic device, detailed descriptions thereof are omitted.
The audio circuit 506 may provide an audio interface between the user and the electronic device through a speaker and a microphone. The audio circuit 506 can convert received audio data into an electrical signal and transmit it to the speaker, which converts it into a sound signal for output; conversely, the microphone converts a collected sound signal into an electrical signal, which the audio circuit 506 receives and converts into audio data. The audio data is then output to the processor 508 for processing and either sent to another electronic device via the RF circuit 501 or output to the memory 502 for further processing. The audio circuit 506 may also include an earphone jack to allow a peripheral headset to communicate with the electronic device.
Wireless fidelity (WiFi) is a short-range wireless transmission technology; through the wireless fidelity module 507, the electronic device can help the user send and receive e-mails, browse web pages, access streaming media, and the like, providing the user with wireless broadband Internet access. Although fig. 9 shows the wireless fidelity module 507, it is understood that it is not an essential component of the electronic device and may be omitted as needed without changing the essence of the invention.
The processor 508 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing an application program stored in the memory 502 and calling data stored in the memory 502, thereby integrally monitoring the electronic device. Optionally, processor 508 may include one or more processing cores; preferably, the processor 508 may integrate an application processor, which primarily handles operating systems, user interfaces, application programs, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 508.
The electronic device also includes a power supply 509 (such as a battery) to power the various components. Preferably, the power source may be logically connected to the processor 508 through a power management system, so that the power management system may manage charging, discharging, and power consumption management functions. The power supply 509 may also include any component such as one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
Although not shown in fig. 9, the electronic device may further include a camera, a bluetooth module, and the like, which are not described in detail herein.
In specific implementation, the above modules may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and specific implementation of the above modules may refer to the foregoing method embodiments, which are not described herein again.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be implemented by instructions, or by instructions controlling relevant hardware, and the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor. To this end, embodiments of the present invention provide a storage medium in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps of any of the voice control methods provided by the embodiments of the present invention.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any of the voice control methods provided in the embodiments of the present invention, the beneficial effects that can be achieved by any of the voice control methods provided in the embodiments of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
In summary, although the present application has been described with reference to the preferred embodiments, the above-described preferred embodiments are not intended to limit the present application, and those skilled in the art can make various changes and modifications without departing from the spirit and scope of the present application, so that the scope of the present application shall be determined by the appended claims.

Claims (12)

1. A voice control method is applied to electronic equipment and is characterized by comprising the following steps:
when the electronic equipment is in a voice monitoring state, updating a stored voice segment by using monitored voice data, wherein the voice segment is a segment of voice data monitored in a latest preset time;
acquiring a current voice section in the updating process of the voice section;
extracting voiceprint features and keywords from the current voice section;
starting a voice recording function according to the voiceprint features and the keywords, and acquiring a voice section at the successful starting moment as a target voice section, wherein the voice time of the target voice section is the same as the system preparation time for starting the voice recording function, and the target voice section is voice data obtained by monitoring in the system preparation process for starting the voice recording function;
and correspondingly controlling the electronic equipment according to the recorded voice data and the target voice segment.
2. The voice control method according to claim 1, wherein the starting a voice recording function according to the voiceprint features and the keywords comprises:
judging whether the voiceprint features are matched with preset features or not, and whether the keywords are matched with preset keywords or not;
if yes, starting a voice recording function;
if not, returning to execute the operation of obtaining the current voice section.
3. The voice control method according to claim 1, wherein the controlling the electronic device according to the recorded voice data and the target voice segment comprises:
splicing the target voice segment and the recorded voice data to obtain spliced voice;
determining a control instruction according to the spliced voice;
and controlling the electronic equipment to execute the control instruction.
4. The voice control method according to claim 3, wherein the determining a control instruction according to the spliced voice comprises:
sending the spliced voice to a server so that the server performs semantic analysis on the spliced voice and returns a corresponding target analysis word;
determining a target application and an application event according to the returned target analytic words;
generating a control instruction according to the target application and the application event;
the controlling the electronic device to execute the control instruction includes: executing the application event with the target application.
5. The voice control method according to claim 4, wherein the determining a target application according to the target parsing word comprises:
judging whether the target analytic words are matched with a stored analytic word set;
if yes, determining the target application according to the successfully matched analytic words;
if not, judging whether the target analytic words are matched with a preset word set or not; and when the matching is successful, determining a target application according to the successfully matched preset words, adding the successfully matched preset words into the analysis word set, and deleting the successfully matched preset words from the preset word set.
6. A voice control device applied to electronic equipment is characterized by comprising:
the updating module is used for updating a stored voice segment by using the monitored voice data when the electronic equipment is in a voice monitoring state, wherein the voice segment is a segment of voice data monitored in the latest preset time;
the acquisition module is used for acquiring the current voice section in the updating process of the voice section;
the extraction module is used for extracting voiceprint features and keywords from the current voice section;
the starting module is used for starting a voice recording function according to the voiceprint features and the keywords and acquiring a voice section at the successful starting moment as a target voice section, wherein the voice time of the target voice section is the same as the system preparation time for starting the voice recording function, and the target voice section is voice data obtained by monitoring in the system preparation process for starting the voice recording function;
and the control module is used for correspondingly controlling the electronic equipment according to the recorded voice data and the target voice segment.
7. The voice control device according to claim 6, wherein the start module is specifically configured to:
judging whether the voiceprint features are matched with preset features or not, and whether the keywords are matched with preset keywords or not;
if yes, starting a voice recording function;
if not, triggering the acquisition module to execute the operation of acquiring the current voice section.
8. The voice control device according to claim 6, wherein the control module specifically comprises:
the splicing unit is used for splicing the target voice segment and the recorded voice data to obtain spliced voice;
the determining unit is used for determining a control instruction according to the spliced voice;
and the execution unit is used for controlling the electronic equipment to execute the control instruction.
9. The speech control apparatus according to claim 8, wherein the determining unit is specifically configured to:
sending the spliced voice to a server so that the server performs semantic analysis on the spliced voice and returns a corresponding target analysis word;
determining a target application and an application event according to the returned target analytic words;
generating a control instruction according to the target application and the application event;
the execution unit is to: executing the application event with the target application.
10. The speech control apparatus according to claim 9, wherein the determining unit is specifically configured to:
judging whether the target analytic words are matched with a stored analytic word set;
if yes, determining the target application according to the successfully matched analytic words;
if not, judging whether the target analytic words are matched with a preset word set or not; and when the matching is successful, determining a target application according to the successfully matched preset words, adding the successfully matched preset words into the analysis word set, and deleting the successfully matched preset words from the preset word set.
11. A computer-readable storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor to perform the voice control method of any of claims 1 to 5.
12. An electronic device comprising a processor and a memory, the processor being electrically connected to the memory, the memory being configured to store instructions and data, the processor being configured to perform the steps of the voice control method of any one of claims 1 to 5.
CN201810681095.6A 2018-06-27 2018-06-27 Voice control method, device, storage medium and electronic equipment Active CN108694947B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810681095.6A CN108694947B (en) 2018-06-27 2018-06-27 Voice control method, device, storage medium and electronic equipment
PCT/CN2019/085720 WO2020001165A1 (en) 2018-06-27 2019-05-06 Voice control method and apparatus, and storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810681095.6A CN108694947B (en) 2018-06-27 2018-06-27 Voice control method, device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN108694947A CN108694947A (en) 2018-10-23
CN108694947B true CN108694947B (en) 2020-06-19

Family

ID=63849986

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810681095.6A Active CN108694947B (en) 2018-06-27 2018-06-27 Voice control method, device, storage medium and electronic equipment

Country Status (2)

Country Link
CN (1) CN108694947B (en)
WO (1) WO2020001165A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108694947B (en) * 2018-06-27 2020-06-19 Oppo广东移动通信有限公司 Voice control method, device, storage medium and electronic equipment
CN110060693A (en) * 2019-04-16 2019-07-26 Oppo广东移动通信有限公司 Model training method, device, electronic equipment and storage medium
CN112053696A (en) * 2019-06-05 2020-12-08 Tcl集团股份有限公司 Voice interaction method and device and terminal equipment
CN112397062A (en) * 2019-08-15 2021-02-23 华为技术有限公司 Voice interaction method, device, terminal and storage medium
CN113129893B (en) * 2019-12-30 2022-09-02 Oppo(重庆)智能科技有限公司 Voice recognition method, device, equipment and storage medium
CN111583929A (en) * 2020-05-13 2020-08-25 军事科学院系统工程研究院后勤科学与技术研究所 Control method and device using offline voice and readable equipment
CN112581957B (en) * 2020-12-04 2023-04-11 浪潮电子信息产业股份有限公司 Computer voice control method, system and related device
CN112634916A (en) * 2020-12-21 2021-04-09 久心医疗科技(苏州)有限公司 Automatic voice adjusting method and device for defibrillator

Citations (6)

Publication number Priority date Publication date Assignee Title
CN102510426A (en) * 2011-11-29 2012-06-20 Anhui USTC iFlytek Information Technology Co., Ltd. Personal assistant application access method and system
CN104575504A (en) * 2014-12-24 2015-04-29 Shanghai Normal University Method for personalized television voice wake-up by voiceprint and voice identification
CN106653021A (en) * 2016-12-27 2017-05-10 Shanghai Zhizhen Intelligent Network Technology Co., Ltd. Voice wake-up control method and device, and terminal
CN107147618A (en) * 2017-04-10 2017-09-08 Beijing Orion Star Technology Co., Ltd. User registration method, device and electronic device
CN107464557A (en) * 2017-09-11 2017-12-12 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Call recording method, device, mobile terminal and storage medium
CN108154882A (en) * 2017-12-13 2018-06-12 Guangdong Midea Refrigeration Equipment Co., Ltd. Control method and control device of remote control equipment, storage medium and remote control equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006197115A (en) * 2005-01-12 2006-07-27 Fuji Photo Film Co Ltd Imaging device and image output device
US20080256613A1 (en) * 2007-03-13 2008-10-16 Grover Noel J Voice print identification portal
CN108694947B (en) * 2018-06-27 2020-06-19 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Voice control method, device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN108694947A (en) 2018-10-23
WO2020001165A1 (en) 2020-01-02

Similar Documents

Publication Publication Date Title
CN108694947B (en) Voice control method, device, storage medium and electronic equipment
US11670302B2 (en) Voice processing method and electronic device supporting the same
CN108829235B (en) Voice data processing method and electronic device supporting the same
US10964300B2 (en) Audio signal processing method and apparatus, and storage medium thereof
US11042703B2 (en) Method and device for generating natural language expression by using framework
CN106782600B (en) Scoring method and device for audio files
KR20180060328A (en) Electronic apparatus for processing multi-modal input, method for processing multi-modal input and sever for processing multi-modal input
US11537360B2 (en) System for processing user utterance and control method of same
CN109284144B (en) Fast application processing method and mobile terminal
CN108804070B (en) Music playing method and device, storage medium and electronic equipment
CN107731241B (en) Method, apparatus and storage medium for processing audio signal
CN109032491B (en) Data processing method and device and mobile terminal
US11170764B2 (en) Electronic device for processing user utterance
CN111522592A (en) Intelligent terminal awakening method and device based on artificial intelligence
CN109040444B (en) Call recording method, terminal and computer readable storage medium
US11194545B2 (en) Electronic device for performing operation according to user input after partial landing
CN110335629B (en) Pitch recognition method and device of audio file and storage medium
CN113225624A (en) Time-consuming determination method and device for voice recognition
CN115150501A (en) Voice interaction method and electronic equipment
CN109086448B (en) Voice question searching method based on gender characteristic information and family education equipment
CN108711428B (en) Instruction execution method and device, storage medium and electronic equipment
CN106933626B (en) Application association method and device
CN111897916B (en) Voice instruction recognition method, device, terminal equipment and storage medium
CN114999457A (en) Voice system testing method and device, storage medium and electronic equipment
KR101993368B1 (en) Electronic apparatus for processing multi-modal input, method for processing multi-modal input and sever for processing multi-modal input

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant