CN112581937A - Method and device for acquiring voice instruction - Google Patents


Info

Publication number
CN112581937A
CN112581937A (application CN201910947282.9A)
Authority
CN
China
Prior art keywords
voice
user
information
breakpoint
duration
Prior art date
Legal status
Withdrawn
Application number
CN201910947282.9A
Other languages
Chinese (zh)
Inventor
杜国威
Current Assignee
Beijing Anyun Century Technology Co Ltd
Original Assignee
Beijing Anyun Century Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Anyun Century Technology Co Ltd filed Critical Beijing Anyun Century Technology Co Ltd
Priority to CN201910947282.9A
Publication of CN112581937A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/22 - Interactive procedures; Man-machine interfaces
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/226 - Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a method for acquiring a voice instruction, which is applied to electronic equipment and comprises the following steps: collecting first voice information input by a user; determining age information of the user; obtaining a breakpoint duration based on the first voice information and the age information of the user, wherein the breakpoint duration is obtained after processing the first voice information by using a target voice breakpoint model corresponding to the age information of the user, and the target voice breakpoint model is selected from a plurality of voice breakpoint models; and acquiring a voice instruction which is input by the user and is related to the first voice information based on the breakpoint duration, wherein the voice instruction comprises N sections of voice information, the N sections of voice information comprise the first voice information, and N is a positive integer greater than 1.

Description

Method and device for acquiring voice instruction
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for obtaining a speech command.
Background
VAD: voice Activity Detection aims at accurately positioning the start and end points of Voice from Voice information and removing silence and noise as interference information from original Voice so as to save Voice channel resources under the condition of not reducing service quality.
The smart speaker with a screen is a household device whose main users are the elderly and children at home. The younger a child is, the more pronounced a common trait becomes: because their logical thinking is immature, their expression often differs greatly from adults', and when they express an intention through a voice instruction there are many sentence breaks and similar problems, for example a child haltingly saying "Pep-pa ... Pig ..." in fragments. When facing children, existing VAD cannot completely locate children's voice commands, so existing speech recognition systems are inaccurate and have low accuracy when recognizing children's intentions.
Disclosure of Invention
The embodiments of the application provide a method and a device for obtaining a voice instruction, an electronic device, and a computer storage medium, which solve the prior-art technical problems that a speech recognition system is inaccurate and has low detection precision when facing a child user group.
In a first aspect, the present application provides the following technical solutions through an embodiment of the present application:
a method for obtaining a voice instruction is applied to electronic equipment and comprises the following steps: collecting first voice information input by a user; determining age information of the user; obtaining a breakpoint duration based on the first voice information and the age information of the user, wherein the breakpoint duration is obtained after processing the first voice information by using a target voice breakpoint model corresponding to the age information of the user, and the target voice breakpoint model is selected from a plurality of voice breakpoint models; and acquiring a voice instruction which is input by the user and is related to the first voice information based on the breakpoint duration, wherein the voice instruction comprises N sections of voice information, the N sections of voice information comprise the first voice information, and N is a positive integer greater than 1.
In one embodiment, before said determining the age information of the user, further comprising: judging whether the first voice information has complete understandable semantics, wherein the complete understandable semantics comprise a preset grammar structure; when the first voice message does not have the complete understandable semantics, judging whether the user is a specific user, wherein the age of the specific user is less than a preset age; when the user is the specific user, the step of determining the age information of the user is performed.
In one embodiment, the determining whether the user is a specific user includes one or any combination of the following manners: performing feature extraction and analysis on the first voice information based on a voiceprint recognition technology, and determining whether the user is the specific user; collecting the face information of the user, and performing feature extraction and analysis on the face information of the user based on a face recognition technology to determine whether the user is the specific user; determining whether the user is a specific user based on whether a current mode of the electronic device is the specific mode.
In one embodiment, the determining age information of the user includes: collecting face information of the user, and performing feature extraction and analysis on the face information based on a face recognition technology to determine age information of the user; and/or performing feature extraction and analysis on the first voice information based on a voiceprint recognition technology to determine the age information of the user.
In one embodiment, said obtaining a breakpoint duration based on said first speech information and said age information of said user comprises: uploading the first voice information and the age information of the user to a server, so that the server selects the target voice breakpoint model from a plurality of voice breakpoint models based on the age information of the user, and inputs the first voice information into the target voice breakpoint model to obtain the breakpoint duration, wherein the server stores the plurality of voice breakpoint models, the plurality of voice breakpoint models correspond to different age information, and the target voice breakpoint model corresponds to the age information of the user; and receiving the breakpoint duration returned by the server.
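A minimal sketch of the server side of this embodiment: models are stored per age bracket, the target model is selected by the uploaded age information, and the uploaded first voice information is run through it to produce the breakpoint duration. The brackets, durations, and constant-output "models" here are invented placeholders, not values from the patent.

```python
# Hypothetical per-age-bracket breakpoint models kept on the server.
# Each "model" maps the first voice information to a pause tolerance (s);
# real models would be trained on speech from each age group.
BREAKPOINT_MODELS = {
    (0, 4): lambda voice: 2.0,    # toddlers: longest tolerated pause
    (4, 7): lambda voice: 1.5,    # pre-school children
    (7, 200): lambda voice: 0.6,  # school age and adults
}

def handle_breakpoint_request(voice_samples, age_years):
    """Select the target voice breakpoint model by the user's age, input
    the first voice information into it, and return the breakpoint
    duration to the client."""
    for (lo, hi), model in BREAKPOINT_MODELS.items():
        if lo <= age_years < hi:
            return model(voice_samples)
    raise ValueError(f"no breakpoint model for age {age_years}")
```

The on-device variant in the next embodiment is the same selection logic, just with the model table stored on the electronic device instead of the server.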
In one embodiment, said obtaining a breakpoint duration based on said first speech information and said age information of said user comprises: selecting the target voice breakpoint model from a plurality of voice breakpoint models based on the age information of the user, wherein the plurality of voice breakpoint models are stored in the electronic equipment, the plurality of voice breakpoint models correspond to different age information, and the target voice breakpoint model corresponds to the age information of the user; and inputting the first voice information into the target voice breakpoint model to obtain the breakpoint duration.
In one embodiment, before the obtaining of the voice instruction related to the first voice information, which is input by the user, based on the breakpoint duration, the method further includes: acquiring the use duration information of the user and/or proficiency level information of the voice instruction input by the user, wherein the use duration information is used for representing the total duration of the user using the electronic equipment; adjusting the breakpoint duration based on the use duration information and/or proficiency level information of the voice instruction input by the user to obtain the adjusted breakpoint duration; the obtaining of the voice instruction related to the first voice information, which is input by the user, based on the breakpoint duration includes: and acquiring the voice instruction which is input by the user and is related to the first voice information based on the adjusted breakpoint duration.
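The adjustment step above can be sketched as a function that shrinks the model's breakpoint duration as the user's total usage time and proficiency grow. The decay rates and the 20% floor are illustrative assumptions, not figures from the patent.

```python
def adjust_breakpoint_duration(duration, usage_hours=0.0, proficiency=0.0):
    """Shrink the breakpoint duration for practiced users: the longer the
    total time using the device and the higher the measured proficiency
    (clamped to 0..1), the less pause tolerance is needed. A floor of 20%
    of the original duration prevents cutting speech off entirely."""
    usage_factor = 1.0 / (1.0 + 0.01 * usage_hours)       # decays with usage
    prof_factor = 1.0 - 0.5 * min(max(proficiency, 0.0), 1.0)
    return max(0.2 * duration, duration * usage_factor * prof_factor)
```

For a brand-new user the duration is unchanged; a highly practiced user is clamped to the floor rather than driven to zero.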
In one embodiment, obtaining, based on the breakpoint duration, the voice instruction input by the user and related to the first voice information includes: determining the starting point of the first voice information as the start endpoint of the voice instruction; determining a first end point of the first voice information, and extending the first end point by the breakpoint duration to obtain a second end point; and detecting whether there is audio input between the first end point and the second end point. If yes, second voice information is collected; a third end point of the second voice information is determined and extended by the breakpoint duration to obtain a fourth end point; and whether there is audio input between the third end point and the fourth end point is detected. If not, the fourth end point is determined as the end endpoint of the voice instruction, and the voice instruction, comprising the first voice information and the second voice information, is obtained based on its start endpoint and end endpoint. If yes, third voice information is collected; a fifth end point of the third voice information is determined and extended by the breakpoint duration to obtain a sixth end point; and whether there is audio input between the fifth end point and the sixth end point is detected. If not, the sixth end point is determined as the end endpoint of the voice instruction, and the voice instruction, comprising the first voice information through the third voice information, is obtained based on its start endpoint and end endpoint. If yes, fourth voice information continues to be collected, and so on.
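The iterative endpoint procedure above generalizes naturally to N segments and can be sketched as a loop: after each segment ends, wait up to the breakpoint duration for more audio; if a new segment arrives, append it and repeat; otherwise the extended point becomes the instruction's end endpoint. `audio_source` is a hypothetical callable standing in for the microphone.

```python
def collect_instruction(first_segment, breakpoint_duration, audio_source):
    """Collect one voice instruction as a list of voice segments.

    `audio_source(timeout)` returns the next voice segment if any audio
    input occurs within `timeout` seconds of the last segment's end point,
    or None if the window stays silent."""
    segments = [first_segment]  # start endpoint = start of first segment
    while True:
        nxt = audio_source(breakpoint_duration)
        if nxt is None:          # no audio between end point and end point
            return segments      # + breakpoint duration: instruction ends
        segments.append(nxt)     # audio detected: extend and keep going
```

A child saying "Peppa ... Pig ... song" with pauses shorter than the breakpoint duration is thus collected as one three-segment instruction rather than three fragments.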
In a second aspect, based on the same inventive concept, the present application provides the following technical solutions through an embodiment of the present application:
an apparatus for obtaining a voice command, applied to an electronic device, the apparatus comprising: the acquisition module is used for acquiring first voice information input by a user; a determining module for determining age information of the user; a first obtaining module, configured to obtain a breakpoint duration based on the first voice information and the age information of the user, where the breakpoint duration is obtained after processing the first voice information by using a target voice breakpoint model corresponding to the age information of the user, and the target voice breakpoint model is selected from multiple voice breakpoint models; and the second obtaining module is used for obtaining the voice instruction which is input by the user and is related to the first voice information based on the breakpoint duration, wherein the voice instruction comprises N sections of voice information, the N sections of voice information comprise the first voice information, and N is a positive integer greater than 1.
In one embodiment, further comprising: the first judging module is used for judging whether the first voice information has complete understandable semantics before determining the age information of the user, wherein the complete understandable semantics comprises a preset syntactic structure; the second judging module is used for judging whether the user is a specific user when the first voice information does not have the complete understandable semantics, and the age of the specific user is smaller than a preset age; when the user is the specific user, the step of determining the age information of the user is executed by the determining module.
In one embodiment, the second judging module includes one or any combination of the following sub-modules: the first determining sub-module, used for performing feature extraction and analysis on the first voice information based on a voiceprint recognition technology to determine whether the user is the specific user; the second determining sub-module, used for collecting the face information of the user and performing feature extraction and analysis on the face information based on a face recognition technology to determine whether the user is the specific user; the third determining sub-module, used for determining whether the user is the specific user based on whether the current mode of the electronic device is a specific mode.
In one embodiment, the determining module comprises: the fourth determining submodule is used for acquiring the face information of the user, extracting and analyzing the features of the face information based on a face recognition technology and determining the age information of the user; and/or a fifth determining submodule for performing feature extraction and analysis on the first voice information based on a voiceprint recognition technology to determine the age information of the user.
In one embodiment, the first obtaining module includes: the uploading sub-module is used for uploading the first voice information and the age information of the user to a server, so that the server selects the target voice breakpoint model from a plurality of voice breakpoint models based on the age information of the user and inputs the first voice information into the target voice breakpoint model to obtain the breakpoint duration, wherein the server stores the plurality of voice breakpoint models, the plurality of voice breakpoint models correspond to different age information, and the target voice breakpoint model corresponds to the age information of the user; and the receiving submodule is used for receiving the breakpoint duration returned by the server.
In one embodiment, the first obtaining module includes: the selection submodule is used for selecting the target voice breakpoint model from a plurality of voice breakpoint models based on the age information of the user, wherein the plurality of voice breakpoint models are stored in the electronic equipment, the plurality of voice breakpoint models correspond to different ages of information, and the target voice breakpoint model corresponds to the age information of the user; and the obtaining submodule is used for inputting the first voice information into the target voice breakpoint model to obtain the breakpoint duration.
In one embodiment, further comprising: the first acquisition module is used for acquiring the use duration information of the user and/or proficiency level information of the voice instruction input by the user, wherein the use duration information is used for representing the total duration of the user using the electronic equipment; the obtaining module is used for adjusting the breakpoint duration based on the use duration information and/or proficiency level information of the voice instruction input by the user to obtain the adjusted breakpoint duration; the second obtaining module is further configured to obtain, based on the adjusted breakpoint duration, a voice instruction related to the first voice information, which is input by the user.
In one embodiment, the second obtaining module includes: a sixth determining submodule, configured to determine the starting point of the first voice information as the start endpoint of the voice instruction; a second obtaining submodule, configured to determine a first end point of the first voice information and adjust the first end point based on the breakpoint duration to obtain a second end point; a first detection submodule, configured to detect whether there is audio input between the first end point and the second end point; a first acquisition submodule, configured to collect second voice information when there is audio input between the first end point and the second end point; a third obtaining submodule, configured to determine a third end point of the second voice information and adjust the third end point based on the breakpoint duration to obtain a fourth end point; a second detection submodule, configured to detect whether there is audio input between the third end point and the fourth end point; a first determination submodule, configured to, when there is no audio input between the third end point and the fourth end point, determine the fourth end point as the end endpoint of the voice instruction and obtain the voice instruction based on its start endpoint and end endpoint, the voice instruction comprising the first voice information and the second voice information; a second acquisition submodule, configured to collect third voice information when there is audio input between the third end point and the fourth end point; a fourth obtaining submodule, configured to determine a fifth end point of the third voice information and adjust the fifth end point based on the breakpoint duration to obtain a sixth end point; a third detection submodule, configured to detect whether there is audio input between the fifth end point and the sixth end point; a second determination submodule, configured to, when there is no audio input between the fifth end point and the sixth end point, determine the sixth end point as the end endpoint of the voice instruction and obtain the voice instruction based on its start endpoint and end endpoint, the voice instruction comprising the first voice information through the third voice information; and a third acquisition submodule, configured to continue collecting fourth voice information when there is audio input between the fifth end point and the sixth end point.
In a third aspect, based on the same inventive concept, the present application provides the following technical solutions through an embodiment of the present application:
an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method steps of any of the above embodiments when executing the program.
In a fourth aspect, based on the same inventive concept, the present application provides the following technical solutions through an embodiment of the present application:
a computer storage medium having a computer program stored thereon, comprising: which when executed by a processor may carry out the method steps as described in any of the embodiments above.
One or more technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages:
according to the embodiment of the application, the first voice information input by the user is processed based on the target voice breakpoint model associated with the age information of the user to obtain the breakpoint duration, and the voice instruction is obtained based on the obtained breakpoint duration. Compared with the prior art, the method can accurately obtain the breakpoint duration of the user according to the age information input by the user and the first voice information input by the user, and the subsequent voice instruction extracted based on the accurate breakpoint duration can be more accurate, even if a large number of sentences exist when the child expresses the intention by using the voice instruction, as long as the subsequent voice information can be basically collected in the breakpoint duration, and the breakpoint duration is accurately analyzed for the child and accords with the breakpoint habit of the child, so that compared with the prior art, the method can collect the voice information of the child more completely, and only collect the voice information of the child more completely, the intention of the child can be more accurately identified by the subsequent voice identification system, and the problem that the voice identification system in the prior art faces the child user group is solved, the method has the technical problems of inaccuracy and low detection precision.
The foregoing is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be more clearly understood, and that the above and other objects, features, and advantages of the present invention may be more comprehensible, embodiments of the present invention are described below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a flow diagram of a method of obtaining a voice instruction according to one embodiment of the invention;
FIG. 2 is an architecture diagram of a voice command obtaining apparatus according to an embodiment of the present invention;
FIG. 3 shows a block diagram of an electronic device in accordance with one embodiment of the invention;
FIG. 4 shows a block diagram of a computer storage medium, in accordance with one embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiments of the invention provide a method and a device for obtaining a voice instruction, an electronic device, and a computer storage medium, which solve the prior-art technical problems that a speech recognition system is inaccurate and has low detection precision when facing a child user group.
In order to solve the technical problems, the general idea of the embodiment of the application is as follows:
a method for obtaining a voice instruction, applied to an electronic device, includes: collecting first voice information input by a user; determining age information of the user; obtaining a breakpoint duration based on the first voice information and the age information of the user, wherein the breakpoint duration is obtained after processing the first voice information with a target voice breakpoint model corresponding to the age information of the user, the target voice breakpoint model being selected from a plurality of voice breakpoint models; and obtaining, based on the breakpoint duration, a voice instruction input by the user and related to the first voice information, wherein the voice instruction comprises N segments of voice information including the first voice information, and N is a positive integer greater than 1. Compared with the prior art, the breakpoint duration can be obtained accurately from the user's age information and the first voice information the user inputs, so the voice instruction subsequently extracted based on this accurate breakpoint duration is more accurate. Even if a child breaks sentences frequently when expressing an intention through a voice instruction, the subsequent voice information can essentially be collected within the breakpoint duration, because the breakpoint duration is analyzed specifically for children and matches children's pausing habits. The child's voice information is therefore collected more completely than in the prior art, and only when it is collected completely can the subsequent speech recognition system identify the child's intention more accurately. This solves the prior-art technical problems that a speech recognition system facing a child user group is inaccurate and has low detection precision.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
First, it should be noted that the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the preceding and following objects.
Example one
As shown in fig. 1, the present embodiment provides a method for obtaining a voice instruction, which is applied to an electronic device, and the method includes:
step S101: first voice information input by a user is collected.
Specifically, the electronic device may be a mobile phone, a tablet, a computer, a smart speaker, or the like, provided with a sound collection unit. The embodiments of the application mainly take the smart speaker with a screen as the application object for description. The corresponding actual product usually includes a sound collection unit, a touch display screen, and a sound output unit with high sound quality and a stereo effect. The smart speaker with a screen mainly provides matching services based on the user's voice commands, for example playing music and video, making video calls, and so on.

When a user inputs a voice instruction, the smart speaker with a screen collects, through its microphone, the first voice information input by the user.
Step S102: age information of the user is determined.
To prevent the user's private information from being collected freely, a preliminary judgment of whether the intention expressed by the first voice information is complete may be set. In an optional embodiment, before determining the age information of the user, the method further includes:
and judging whether the first voice information has complete understandable semantics, wherein the complete understandable semantics comprise a preset grammar structure.
Specifically, the preset grammar structure includes at least the two components of predicate and object; it may be a predicate + object structure or an object + predicate structure, both of which fall within the range of the preset grammar structure. The preset grammar structure may also be configured with more complete components, such as subject + predicate + object. In actual implementation, the first voice information may be segmented by a word-segmentation model, and the segmentation result analyzed to determine the grammar components.
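The completeness check can be sketched as: segment the utterance, tag each token, and require both a predicate (verb) and an object (noun) in either order. The tiny lexicon below is an invented stand-in for a trained word-segmentation/POS model.

```python
# Hypothetical token -> part-of-speech lexicon standing in for a real
# word-segmentation and POS-tagging model ("v" = verb, "n" = noun).
POS_LEXICON = {"play": "v", "sing": "v", "song": "n", "video": "n", "peppa": "n"}

def has_complete_semantics(utterance):
    """True if the segmented utterance contains both a predicate and an
    object, i.e. it matches the preset grammar structure (predicate+object
    or object+predicate)."""
    tags = [POS_LEXICON.get(tok, "x") for tok in utterance.lower().split()]
    return "v" in tags and "n" in tags
```

An utterance like "play song" passes the check and needs no further processing, while a fragment like "peppa" fails it and triggers the specific-user judgment below.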
When the first voice message does not have completely understandable semantics, whether the user is a specific user is judged, and the age of the specific user is smaller than a preset age.
Specifically, the preset age can be adjusted as needed. Generally, it is assumed by default that children under the age of 7 have not yet been exposed to the study of grammar structures, so logical confusion and sentence breaking in their expression are more severe. On this basis, the specific user here refers to a child under the age of 7.
When the user is a specific user, step S102 of determining age information of the user is performed.
In an optional embodiment, determining whether the user is a specific user includes one or any combination of the following manners:
and performing feature extraction and analysis on the first voice information based on a voiceprint recognition technology to determine whether the user is a specific user.
Voiceprint recognition technology, one of the biometric recognition technologies, is a very mature technology for speaker identification and is not elaborated here.
The method comprises the steps of collecting face information of a user, extracting and analyzing features of the face information of the user based on a face recognition technology, and determining whether the user is a specific user.
Specifically, the smart sound box with a screen includes an image acquisition unit, for example a camera, which is used to collect the face information of the user; the collected face information is then subjected to face-age analysis of the speaker using existing, mature face recognition technology, which is not elaborated here.
Whether the user is a specific user is determined based on whether the current mode of the electronic device is a specific mode.
Specifically, the electronic device further includes an input module, for example a touch screen, through which the user selects the current mode of the electronic device.
When the user provides input while the working mode is the child mode, the user is directly determined to be a child. Of course, the scheme can also be used for other groups with strong sentence-break habits, such as the elderly and persons with limited capacity, so the working modes of the electronic device may also include an elderly mode; when the user provides input in the elderly mode, the user is determined to be elderly.
In the actual implementation process, when determining whether the user is a specific user, any one, any two, or all three of the above manners may be used. For example, face recognition technology and voiceprint recognition technology may be combined to judge whether the user is a specific user.
In an alternative embodiment, determining age information for the user comprises:
the method comprises the steps of collecting face information of a user, and carrying out feature extraction and analysis on the face information based on a face recognition technology to determine age information of the user.
Specifically, the smart sound box with a screen collects the face information of the user with the camera and performs face-age analysis of the speaker on the collected face information using existing, mature face recognition technology, thereby determining the age information of the user; this is not elaborated here.
And/or performing feature extraction and analysis on the first voice information based on a voiceprint recognition technology to determine the age information of the user.
Voiceprint recognition technology, one of the biometric recognition technologies, is also a very mature technology for identifying the age of a speaker and is not elaborated here.
Step S103: acquiring, based on the first voice information and the age information of the user, the breakpoint duration corresponding to the user of the first voice information, wherein the breakpoint duration is obtained after the first voice information is processed by a target voice breakpoint model corresponding to the age information of the user, and the target voice breakpoint model is selected from a plurality of voice breakpoint models.
Specifically, the process of obtaining the breakpoint duration through the target speech breakpoint model is as follows:
After the target voice breakpoint model corresponding to the age information of the user is obtained, the target voice breakpoint model analyzes the first voice information to obtain its part-of-speech features, predicts from those features the grammar structures the user may be using when the expressed instruction begins with such a part of speech, and processes the predicted grammar structures to obtain and output the breakpoint duration. For example: the user wants to express "watch Peppa Pig", and the first voice information is "Peppa Pig", whose part-of-speech feature is a noun. The grammar structures beginning with a noun may be object + predicate, with probability coefficient m, and subject + predicate, with probability coefficient n; the target voice breakpoint model then estimates the breakpoint duration of a user in the corresponding age group based on the probability coefficients m and n.
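The patent does not state how the coefficients m and n are combined into one duration; a probability-weighted average over the candidate grammar structures is one plausible reading, sketched here with invented example numbers:

```python
def estimate_breakpoint(candidates):
    """candidates: list of (probability_coefficient, breakpoint_duration)
    pairs, one per predicted grammar structure. The weighted-average
    combination rule is an illustrative assumption, not from the patent."""
    total = sum(p for p, _ in candidates)
    return sum(p * d for p, d in candidates) / total

# e.g. object + predicate (m = 0.7, 2.5 s), subject + predicate (n = 0.3, 1.8 s)
```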
The model establishing process comprises the following steps:
First, a plurality of voice sample sets are obtained, each corresponding to a different age group; each voice sample set includes a plurality of pieces of voice sample information, and each piece of voice sample information includes a plurality of pieces of voice information. To make the final voice breakpoint model more accurate, the voice sample information in each set should preferably cover a variety of grammar structures and be sufficiently large in quantity when each voice sample set is screened.
Secondly, big-data analysis is performed on each piece of voice sample information in the voice sample set of each age group to obtain its grammar structure and corresponding breakpoint duration, yielding all the grammar structures habitually used by children of that age group and the corresponding breakpoint durations. The grammar structures of children of an age group and their corresponding breakpoint durations can be obtained with existing machine learning methods.
Then, big-data analysis is performed on the part-of-speech features of the first voice information of each piece of voice sample information and on the various grammar structures containing those features, to obtain the probability coefficients with which children of that age group may use each grammar structure beginning with the part-of-speech features of the first voice information.
Finally, modeling is performed on the obtained grammar structures used by children of the age group, the corresponding breakpoint durations, and the probability coefficients of the various grammar structures containing the part-of-speech features of the first voice information, to obtain the voice breakpoint model corresponding to that age group. When the user inputs the first voice information, this model can predict the various grammar structures of the whole voice instruction the user wants to express, the breakpoint durations corresponding to those structures, and their probabilities of occurrence, and it outputs the breakpoint duration associated with the first voice information together with the occurrence probabilities of the various grammar structures.
Specifically, each of the plurality of voice breakpoint models corresponds to different age information. In the actual implementation process: children under 1 year of age cannot yet express their intentions through language, and the speaking logic of children aged 1-2 is very disordered, so both may correspond to voice breakpoint model A; ages 2-3 are a period of rapid language development, and speaking habits change greatly compared with ages 1-2, corresponding to voice breakpoint model B; children aged 3-4 begin kindergarten, and their speaking habits take a qualitative leap, corresponding to voice breakpoint model C; the speaking level of children aged 4-6 is in a relatively stable state of development, corresponding to voice breakpoint model D. The breakpoint habits of children differ greatly across age groups, while those of children within the same subdivided age group are roughly the same; establishing a corresponding voice breakpoint model for each age group therefore makes the obtained breakpoint duration more accurate. For the same user, as age increases the speaking logic becomes clearer, sentence breaks become fewer, and the breakpoint duration becomes shorter, so a new voice breakpoint model is invoked as the target voice breakpoint model based on the user's current age information.
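The age-to-model mapping described above can be sketched as a simple dispatch; the bucket boundaries follow the text, and the dict of model handles is an assumed representation:

```python
def select_breakpoint_model(age_years, models):
    """Map a child's age to the breakpoint model described in the text.
    models is assumed to be a dict {"A": ..., "B": ..., "C": ..., "D": ...}."""
    if age_years < 2:    # under 1 plus the disordered 1-2 range: model A
        return models["A"]
    if age_years < 3:    # rapid language development: model B
        return models["B"]
    if age_years < 4:    # kindergarten entry: model C
        return models["C"]
    return models["D"]   # ages 4-6, relatively stable development: model D
```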
In an alternative embodiment, step S103 includes:
uploading the first voice information and the age information of the user to a server, so that the server selects the target voice breakpoint model from the plurality of voice breakpoint models based on the age information of the user and inputs the first voice information into the target voice breakpoint model to obtain the breakpoint duration corresponding to the user of the first voice information, wherein the server stores the plurality of voice breakpoint models, the plurality of voice breakpoint models correspond to different age information, and the target voice breakpoint model corresponds to the age information of the user; and receiving the breakpoint duration corresponding to the user of the first voice information returned by the server.
This embodiment places the computation-heavy work on the server, while the local end only sends and receives data. This reduces the computational load of the local smart sound box with a screen and prevents it from stalling while collecting voice information.
In an optional embodiment, a target voice breakpoint model can be selected from a plurality of voice breakpoint models based on age information of a user, wherein the electronic device stores the plurality of voice breakpoint models, the plurality of voice breakpoint models correspond to different pieces of age information, and the target voice breakpoint model corresponds to the age information of the user; and inputting the first voice information into the target voice breakpoint model to obtain the breakpoint duration corresponding to the first voice information user.
According to actual needs, and provided that the operation speed and storage space of the local processor allow, the work of obtaining the breakpoint duration through the target voice breakpoint model can be placed on the local smart sound box with a screen, which saves data transmission time and yields a quicker response.
Step S104: based on the breakpoint duration, obtaining a voice instruction which is input by a user and related to first voice information, wherein the voice instruction comprises N sections of voice information, the N sections of voice information comprise the first voice information, and N is a positive integer greater than 1.
Specifically, a segment of voice information should be understood as audio information formed by consecutive speech frames, and a speech frame should be understood as an audio frame whose energy is greater than a threshold K and which is speech intended by the user, as distinguished from silence and noise. For example: "Peppa Pig" spoken continuously is one segment of voice information, while "piggy … peppa" comprises two segments of voice information, "piggy" and "peppa", which are not continuous because silence or external noise with energy below the threshold K lies between them.
The breakpoint duration is the pause time that the user habitually leaves between two adjacent segments of voice information when inputting a voice instruction.
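On per-frame VAD decisions (1 = speech frame with energy above K, 0 = silence/noise), the segment definition above can be sketched as grouping consecutive speech frames; the frame-index representation is an illustrative simplification:

```python
def split_segments(labels):
    """Group consecutive speech frames (label 1) into segments.
    Returns a list of (first_frame, last_frame) index pairs."""
    segments, start = [], None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i                        # a new segment begins
        elif label == 0 and start is not None:
            segments.append((start, i - 1))  # the segment just ended
            start = None
    if start is not None:                    # segment runs to the end
        segments.append((start, len(labels) - 1))
    return segments

# "piggy ... peppa" with a sub-threshold gap yields two segments:
# split_segments([1, 1, 0, 0, 1, 1]) -> [(0, 1), (4, 5)]
```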
In an optional implementation, before step S104, the method further includes:
acquiring the use duration information of a user and/or proficiency level information of a voice instruction input by the user, wherein the use duration information is used for representing the total duration of the user using the electronic equipment; adjusting the breakpoint duration based on the use duration information and/or proficiency level information of the voice command input by the user to obtain the adjusted breakpoint duration;
at this time, step S104 includes: and acquiring a voice instruction which is input by a user and is related to the first voice information based on the adjusted breakpoint duration.
Specifically, as one optional mode, the usage duration information of the user is stored on the server; after the server obtains the breakpoint duration from the target voice breakpoint model, it adjusts the breakpoint duration according to the usage duration information and returns the adjusted breakpoint duration to the smart sound box with a screen, which then acquires, based on the adjusted breakpoint duration, the voice instruction related to the first voice information input by the user. As another optional mode, the usage duration information of the user is stored in a database of the smart sound box with a screen; after the smart sound box with a screen or the server obtains the breakpoint duration from the target voice breakpoint model, the smart sound box with a screen adjusts the breakpoint duration according to the usage duration information and then acquires, based on the adjusted breakpoint duration, the voice instruction related to the first voice information input by the user.
Specifically, the proficiency information of the voice instructions input by the user is obtained from the user's historical voice instructions. Each time the user finishes using the smart sound box with a screen, it analyzes the grammar structure of the acquired voice instruction to obtain the proficiency information of the user's current voice instruction, and updates the stored proficiency information of the user's voice instructions accordingly, so that the breakpoint duration can be adjusted the next time.
In the actual implementation process, the proficiency information of voice instructions can be represented as points or as a level. Specifically: the similarity between the grammar structure of the user's voice instruction and the preset grammar structure is judged; the higher the similarity, the higher the level or the more points are added, yielding the proficiency information of the user's current voice instruction. The proficiency information of the voice instructions input by the user is then updated using the proficiency information of the current voice instruction, with the model: Yn = (1 − a) × Xn + a × Y(n−1), where Xn is the proficiency information of the user's current voice instruction, Y(n−1) is the proficiency information of the user's historical voice instructions, and Yn is the updated proficiency information. Updating with this model avoids chance fluctuations from a single instruction.
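The update model above is exponential smoothing; a direct transcription (the smoothing factor a is left as a parameter, since the patent does not fix its value):

```python
def update_proficiency(x_n, y_prev, a):
    """Yn = (1 - a) * Xn + a * Y(n-1): blend the current instruction's
    proficiency score Xn with the historical score Y(n-1), so a single
    unusually good or bad instruction does not swing the stored value."""
    return (1 - a) * x_n + a * y_prev
```

With a close to 1 the history dominates and one-off results are damped, which is the "avoid contingency" behaviour the text describes.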
Furthermore, each time the user finishes using the smart sound box with a screen, the usage duration information stored in the database is updated according to the user's usage time.
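The patent states that the breakpoint duration is adjusted using usage duration and/or proficiency, but gives no formula; one hypothetical scheme, in which longer usage and higher proficiency shrink the duration toward a floor, is sketched below (the coefficients and the floor are invented for illustration):

```python
def adjust_breakpoint(base_s, usage_hours, proficiency, floor_s=0.5):
    """Shrink the model's breakpoint duration for practised users.
    The scaling factor below is an assumed example, not from the patent."""
    factor = 1.0 / (1.0 + 0.01 * usage_hours + 0.05 * proficiency)
    return max(floor_s, base_s * factor)
```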
In an alternative implementation, step S104 includes:
determining a starting point of the first voice information as a starting endpoint of the voice instruction;
Specifically, the method for determining the starting point of the first voice information is as follows: the VAD model sets a threshold k and calculates the energy of the input at each moment; if the energy is greater than the threshold k, it outputs 1 (the point is speech), otherwise it outputs 0 (the point is silence or noise). Thus, when the user inputs the first voice information and the VAD model detects that the energy at some point exceeds the threshold k, it recognizes that point as the starting point of the first voice information and determines it as the starting endpoint of the voice instruction, i.e. the starting endpoint from which the VAD model continuously acquires the voice instruction.
Determining a first end point of the first voice information, and adjusting the first end point based on the breakpoint duration to obtain a second end point;
detecting whether there is an audio input between the first end point and the second end point;
if yes, acquiring second voice information;
determining a third end point of the second voice information, and adjusting the third end point based on the breakpoint duration to obtain a fourth end point;
detecting whether audio input exists between the third end point and the fourth end point;
if not, determining the fourth end point as an end point of the voice command, and obtaining the voice command based on the start end point of the voice command and the end point of the voice command, wherein the voice command comprises first voice information and second voice information;
if yes, acquiring third voice information;
determining a fifth end point of the third voice information, and adjusting the fifth end point based on the breakpoint duration to obtain a sixth end point;
detecting whether audio input exists between the fifth end point and the sixth end point;
if not, determining the sixth end point as an end point of the voice command, and obtaining the voice command based on the start end point of the voice command and the end point of the voice command, wherein the voice command comprises first voice information to third voice information;
if yes, continuing to collect the fourth voice information.
Specifically, the method for determining the termination point of the i-th voice information is similar: the VAD model sets a threshold k and calculates the energy of the i-th voice information at each moment; if the energy is greater than the threshold k, it outputs 1 (the point is speech), otherwise it outputs 0 (the point is silence or noise). Thus, when the VAD model detects that the energy at some point has dropped below the threshold k while the user is inputting the i-th voice information, it takes that point as the termination point of the i-th voice information. Wherein i is a positive integer greater than 1.
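The per-moment energy decision used for both the starting point and the termination points can be sketched at frame granularity; frame length and the threshold k are left to the caller, and mean-square energy is an assumed energy measure:

```python
def frame_energy(frame):
    """Mean-square energy of one audio frame (a list of samples)."""
    return sum(s * s for s in frame) / len(frame)

def vad_labels(frames, k):
    """Per-frame VAD decision: 1 = speech (energy > k), 0 = silence/noise."""
    return [1 if frame_energy(f) > k else 0 for f in frames]
```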
It should be noted that, in this embodiment, the termination point of the i-th voice information is not equivalent to the termination endpoint of the voice instruction (i.e. the termination endpoint finally extracted by the VAD model). The reason is that when a child speaks there is serious logic confusion and sentence breaking; if the termination point of the i-th voice information detected by the VAD model were taken directly as the termination endpoint of the extracted voice instruction, a large amount of the haltingly spoken subsequent voice information would inevitably be lost, and the child's intention could not be accurately identified. In this scheme, the breakpoint duration based on the child's habits is superimposed on the termination point of the i-th voice information; that is, the termination endpoint of the voice instruction extracted by the VAD model is the point obtained by superimposing the breakpoint duration on the detected termination point of the i-th voice information. If, during that interval, the VAD model detects audio input, i is incremented, the termination point of the i-th voice information is located anew, adjusted with the breakpoint duration, and the termination endpoint of the voice instruction is repositioned accordingly. The breakpoint duration is produced by a trained target voice breakpoint model, which is trained on the grammar-structure habits of children of each age, so it matches the child's habits; using it to extend the termination point of the current voice information ensures that, if the child does break a sentence, the subsequent voice information can still be detected.
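The extension-and-reposition loop described above can be sketched on per-frame VAD labels; expressing the breakpoint duration as a frame count is an illustrative simplification of the time-domain description:

```python
def extract_command(labels, gap):
    """labels: per-frame VAD decisions (1 = speech, 0 = silence/noise);
    gap: breakpoint duration expressed in frames.
    Returns (start, end) frame indices of the collected instruction
    (speech part only, trailing silence trimmed), or None if no speech."""
    if 1 not in labels:
        return None
    start = labels.index(1)              # starting endpoint of the instruction
    end = start
    i = start
    while i < len(labels):
        if labels[i] == 1:
            end = i                      # tentative termination point
            i += 1
        else:
            window = labels[i:i + gap]   # breakpoint-duration window
            if 1 in window:              # more audio arrives in time:
                i += window.index(1)     # keep collecting the next segment
            else:
                break                    # no audio within the window: done
    return (start, end)
```

With gap = 3 frames the sketch bridges a two-frame pause and collects both segments; with gap = 2 it stops at the first segment, mirroring how the breakpoint duration decides whether later speech is kept.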
It should be noted that, between the termination point of the i-th voice information and its adjusted termination point (i.e. the point after the breakpoint duration is superimposed on the termination point of the i-th voice information), the VAD model detects whether there is audio input in the same way: it sets a threshold k and calculates the energy at each moment; if the energy is greater than the threshold k it outputs 1 (the point is speech), otherwise 0 (the point is silence or noise). Thus, when the VAD model detects that the energy at some point exceeds the threshold k, this represents that there is audio input.
Specifically, once the VAD model has extracted the starting endpoint and the termination endpoint of the voice instruction, it removes the silent parts outside the two endpoints, keeps only the speech part between them, determines that part as the voice instruction, and sends it to the subsequent speech recognition system for processing.
It should be noted that, in addition to children, this embodiment can also be applied to people of other age groups with confused logic and frequent sentence breaks, for example the elderly or people who stutter. In that case the voice breakpoint models need to be trained on voice information samples of the elderly or of people who stutter; given the idea of adjusting the termination endpoint of the voice instruction obtained by the VAD model with the breakpoint duration so as to accurately capture the user's intention, a skilled person can easily adapt this embodiment, and a method so adapted according to the inventive concept still falls within the protection scope of the present invention.
The following describes in detail the process of obtaining the voice command by the smart sound box with screen through a specific example for understanding.
A two-year-old child wants to watch Peppa Pig; however, when expressing this intent, the spoken voice may be: "piggy … (pause 2 s) … peppa … (pause 1.5 s) … watch …". The smart sound box with a screen includes at least: a touch display screen, a camera, a microphone, a processor, a communication module, a memory, and a loudspeaker with good sound quality.
When the child inputs the first voice information "piggy", the smart sound box with a screen collects it and judges it: on its own, the grammar component of "piggy" is a subject, object, or complement, which does not satisfy the preset grammar structure (the preset grammar structure includes an object and a predicate), so the child's intention cannot be recognized. It is therefore determined that the first voice information does not have complete, understandable semantics. The camera is then invoked to acquire the child's face information; face recognition determines that the user is a child and further obtains the child's age information as 2 years old. Through the communication module, the smart sound box with a screen uploads the first voice information "piggy" and the age information "2 years old" to the server, so as to obtain the breakpoint duration of 2.3 s corresponding to the first voice information returned by the server.
Next, the VAD model extracts the voice instruction based on the obtained breakpoint duration of 2.3 s. When the VAD model detects that the energy of the first sound frame of the first voice information "piggy" is greater than the threshold K, it determines the starting endpoint of the voice instruction and keeps detecting. When it detects that the frame following the last sound frame of "piggy" falls below the threshold K, it determines the termination point of the first voice information and extends it by the breakpoint duration of 2.3 s to obtain the adjusted termination point. The VAD model continues to monitor the energy at each moment after the termination point of the first voice information and detects that the child inputs the second voice information "peppa" before the adjusted termination point. Detection continues: when the frame following the last sound frame of "peppa" falls below the threshold K, the termination point of the second voice information is determined and extended by 2.3 s to obtain its adjusted termination point. The VAD model then detects that the child inputs the third voice information "watch" before the adjusted termination point of the second voice information; when the frame following the last sound frame of "watch" falls below the threshold K, the termination point of the third voice information is determined and extended by 2.3 s. This time no audio input is detected up to the adjusted termination point of the third voice information, so the adjusted termination point (i.e. the point reached by superimposing the 2.3 s breakpoint duration on the last sound frame of "watch") is determined as the termination endpoint of the voice instruction. The speech part between the starting endpoint and the termination endpoint is extracted to obtain the voice instruction, which is sent to the subsequent speech recognition system for recognition.
The technical scheme in the embodiment of the application at least has the following technical effects or advantages:
According to the method and the device, the first voice information input by the user is processed with the target voice breakpoint model associated with the user's age information to obtain the pause time the user habitually leaves between two adjacent segments of voice information when inputting a voice instruction, i.e. the breakpoint duration, and the voice instruction is acquired based on it. Compared with the prior art, the breakpoint duration of the user can be obtained accurately from the user's age information and the first voice information, and the voice instruction subsequently extracted with this accurate breakpoint duration is more accurate. Even if a child breaks sentences heavily when expressing an intention by voice, the subsequent voice information can still essentially be collected within the breakpoint duration, because the breakpoint duration is analyzed specifically for children and matches their breakpoint habits. The scheme therefore collects a child's voice information more completely than the prior art, and only with the voice information collected completely can the subsequent speech recognition system identify the child's intention more accurately, thereby solving the technical problem in the prior art that speech recognition for child user groups is inaccurate and of low detection precision.
Example two
As shown in fig. 2, based on the same inventive concept, the present embodiment provides an apparatus for obtaining a voice command, which is applied to an electronic device, and includes:
the acquisition module 201 is used for acquiring first voice information input by a user;
a determining module 202, configured to determine age information of a user;
a first obtaining module 203, configured to obtain a breakpoint duration based on the first voice information and the age information of the user, where the breakpoint duration is obtained after processing the first voice information by using a target voice breakpoint model corresponding to the age information of the user, and the target voice breakpoint model is selected from multiple voice breakpoint models;
the second obtaining module 204 is configured to obtain, based on the breakpoint duration, a voice instruction related to the first voice information input by the user, where the voice instruction includes N segments of voice information, the N segments of voice information include the first voice information, and N is a positive integer greater than 1.
In an optional implementation, the obtaining apparatus of the voice instruction further includes:
the first judging module is used for judging whether the first voice information has complete understandable semantics before determining the age information of the user, wherein the complete understandable semantics comprises a preset grammar structure;
the second judging module is used for judging whether the user is a specific user when the first voice information does not have complete understandable semantics, and the age of the specific user is less than a preset age; when the user is a specific user, the step of determining the age information of the user is performed by the determining module 202.
In an optional implementation, the second judging module includes one or any combination of the following sub-modules:
the first determining submodule is used for extracting and analyzing the characteristics of the first voice information based on the voiceprint recognition technology and determining whether the user is a specific user;
the second determining submodule is used for acquiring the face information of the user, extracting and analyzing the features of the face information of the user based on a face recognition technology and determining whether the user is a specific user;
and a third determining sub-module which determines whether the user is a specific user based on whether the current mode of the electronic device is the specific mode.
In an alternative implementation, the determining module 202 includes:
the fourth determining submodule is used for acquiring the face information of the user, extracting and analyzing the features of the face information based on a face recognition technology and determining the age information of the user; and/or
And the fifth determining submodule is used for extracting and analyzing the characteristics of the first voice information based on the voiceprint recognition technology and determining the age information of the user.
In an alternative implementation, the first obtaining module 203 includes:
the uploading sub-module is used for uploading the first voice information and the age information of the user to the server so that the server can select a target voice breakpoint model from the multiple voice breakpoint models based on the age information of the user and input the first voice information into the target voice breakpoint model to obtain breakpoint duration, wherein the server stores the multiple voice breakpoint models which correspond to different age information, and the target voice breakpoint model corresponds to the age information of the user;
and the receiving submodule is used for receiving the breakpoint duration returned by the server.
In an alternative implementation, the first obtaining module 203 includes:
the selection submodule selects a target voice breakpoint model from a plurality of voice breakpoint models based on the age information of the user, wherein the electronic equipment stores the plurality of voice breakpoint models, the plurality of voice breakpoint models correspond to different age information, and the target voice breakpoint model corresponds to the age information of the user;
and the obtaining submodule is used for inputting the first voice information into the target voice breakpoint model to obtain the breakpoint duration.
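The on-device selection-and-inference flow described by the selection submodule and the obtaining submodule might be sketched as follows. The age brackets, the `BreakpointModel` interface, and the base pause values are illustrative assumptions, not details from the disclosure:

```python
class BreakpointModel:
    """Stand-in for a trained voice breakpoint model."""
    def __init__(self, base_pause_s):
        self.base_pause_s = base_pause_s

    def predict(self, voice_features):
        # A real model would analyze speech rate, pause statistics, etc.;
        # here a base pause is simply scaled by an assumed tempo feature.
        return self.base_pause_s * voice_features.get("tempo_factor", 1.0)

# One model per age bracket, stored on the electronic device.
MODELS = {
    "child":  BreakpointModel(base_pause_s=1.8),   # children pause longer
    "adult":  BreakpointModel(base_pause_s=0.8),
    "senior": BreakpointModel(base_pause_s=1.2),
}

def select_model(age):
    # Bracket boundaries are illustrative, not from the disclosure.
    if age < 12:
        return MODELS["child"]
    if age < 60:
        return MODELS["adult"]
    return MODELS["senior"]

def breakpoint_duration(age, voice_features):
    # Select the age-matched target model, then run the first voice
    # segment's features through it to obtain the breakpoint duration.
    return select_model(age).predict(voice_features)
```

The server-based variant described earlier follows the same selection logic; only the location of the model bank changes.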
In an optional implementation, the obtaining apparatus of the voice instruction further includes:
the first acquisition module is used for acquiring the use duration information of a user and/or proficiency level information of a voice instruction input by the user, wherein the use duration information is used for representing the total duration of the user using the electronic equipment;
the obtaining module is used for adjusting the breakpoint duration based on the use duration information and/or the proficiency level information of the voice instruction input by the user to obtain the adjusted breakpoint duration;
the second obtaining module 204 is further configured to obtain, based on the adjusted breakpoint duration, a voice instruction related to the first voice information, which is input by the user.
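The adjustment performed by the obtaining module could be sketched as below. The decay rate, the 60% floor, and the clamping of proficiency to [0, 1] are illustrative assumptions; the disclosure does not fix the adjustment formula:

```python
def adjust_breakpoint(base_s, usage_hours=0.0, proficiency=0.0):
    # usage_hours: total time the user has spent with the device;
    # proficiency: familiarity with voice commands, clamped to [0, 1].
    # Practiced users pause less, so the window can safely shrink;
    # the floor keeps the window from collapsing entirely.
    usage_factor = max(0.6, 1.0 - 0.005 * usage_hours)
    prof = min(max(proficiency, 0.0), 1.0)
    return base_s * usage_factor * (1.0 - 0.3 * prof)
```

A brand-new user keeps the model's full breakpoint duration, while a long-time, proficient user gets a tighter window and therefore a snappier end-of-instruction decision.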
In an alternative implementation, the second obtaining module 204 includes:
a sixth determining submodule, configured to determine a starting point of the first voice message as a starting endpoint of the voice instruction;
the second obtaining submodule is used for determining a first end point of the first voice information and adjusting the first end point based on the breakpoint duration to obtain a second end point;
a first detection submodule for detecting whether there is an audio input between a first end point and a second end point;
the first acquisition submodule is used for acquiring second voice information when audio input exists between the first end point and the second end point;
the third obtaining submodule is used for determining a third end point of the second voice information and adjusting the third end point based on the breakpoint duration to obtain a fourth end point;
a second detection submodule for detecting whether there is an audio input between the third end point and the fourth end point;
the first obtaining submodule is used for determining the fourth end point as the ending end point of the voice instruction when there is no audio input between the third end point and the fourth end point; and obtaining the voice instruction based on the starting end point of the voice instruction and the ending end point of the voice instruction, wherein the voice instruction comprises the first voice information and the second voice information;
the second acquisition submodule is used for acquiring third voice information when audio is input between the third end point and the fourth end point;
a fourth obtaining submodule, configured to determine a fifth end point of the third voice information, and adjust the fifth end point based on the breakpoint duration to obtain a sixth end point;
the third detection submodule is used for detecting whether audio input exists between the fifth end point and the sixth end point;
the second obtaining submodule is used for determining the sixth end point as the ending end point of the voice instruction when there is no audio input between the fifth end point and the sixth end point; and obtaining the voice instruction based on the starting end point of the voice instruction and the ending end point of the voice instruction, wherein the voice instruction comprises the first voice information to the third voice information;
and the third acquisition submodule is used for continuously acquiring fourth voice information when audio is input between the fifth end point and the sixth end point.
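The endpoint-extension logic carried out by the submodules above can be condensed into one small loop. Representing detected speech as (start, end) time pairs and the `collect_instruction` helper are assumptions made for this sketch only:

```python
def collect_instruction(segments, breakpoint_s):
    # segments: time-ordered (start_s, end_s) pairs of detected speech.
    # The instruction's starting end point is the first segment's start.
    # After each segment ends, the end point is extended by breakpoint_s;
    # if the next segment begins inside that window it is absorbed into
    # the instruction, otherwise the extended end point closes it.
    if not segments:
        return None
    included = [segments[0]]
    for prev, nxt in zip(segments, segments[1:]):
        if nxt[0] <= prev[1] + breakpoint_s:
            included.append(nxt)       # audio arrived within the window
        else:
            break                       # silence outlasted the window
    start = included[0][0]
    end = included[-1][1] + breakpoint_s
    return start, end, included
```

With a child-sized breakpoint duration, a hesitant second phrase 0.5 s after the first is absorbed, while speech arriving well outside the window starts a new instruction.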
The technical scheme in the embodiment of the application at least has the following technical effects or advantages:
according to the method and the device, the first voice information input by the user is processed based on the target voice breakpoint model associated with the age information of the user, so that the habitual pause time between two adjacent pieces of voice information when the user inputs a voice instruction, namely the breakpoint duration, is obtained, and the voice instruction is then obtained based on this breakpoint duration. Compared with the prior art, the breakpoint duration of the user can be obtained accurately from the age information of the user and the first voice information input by the user, so the voice instruction subsequently extracted based on this accurate breakpoint duration is also more accurate. Even if a child breaks an utterance into many segments when expressing an intention through a voice instruction, the subsequent voice information can still be collected within the breakpoint duration, because the breakpoint duration is analyzed specifically for the child and accords with the child's pausing habits. The voice information of the child is therefore collected more completely than in the prior art, and only when the voice information of the child is collected completely can the subsequent voice recognition system identify the child's intention accurately. This solves the technical problem in the prior art that voice recognition systems facing the child user group are inaccurate and have low detection precision.
Example three
Based on the same inventive concept, as shown in fig. 3, the present embodiment provides an electronic device 300, which includes a memory 310, a processor 320, and a computer program 311 stored in the memory 310 and executable on the processor 320, wherein the processor 320 executes the computer program 311 to implement the following method steps:
collecting first voice information input by a user; determining age information of the user; obtaining a breakpoint duration based on the first voice information and the age information of the user, wherein the breakpoint duration is obtained after processing the first voice information by using a target voice breakpoint model corresponding to the age information of the user, and the target voice breakpoint model is selected from a plurality of voice breakpoint models; and obtaining, based on the breakpoint duration, a voice instruction which is input by the user and related to the first voice information, wherein the voice instruction comprises N sections of voice information, the N sections of voice information comprise the first voice information, and N is a positive integer greater than 1.
In a specific implementation, when the processor 320 executes the program 311, any method steps in the first embodiment may also be implemented.
Example four
Based on the same inventive concept, as shown in fig. 4, the present embodiment provides a computer-readable storage medium 400, on which a computer program 411 is stored, the computer program 411 implementing the following steps when being executed by a processor:
collecting first voice information input by a user; determining age information of the user; obtaining a breakpoint duration based on the first voice information and the age information of the user, wherein the breakpoint duration is obtained after processing the first voice information by using a target voice breakpoint model corresponding to the age information of the user, and the target voice breakpoint model is selected from a plurality of voice breakpoint models; and obtaining, based on the breakpoint duration, a voice instruction which is input by the user and related to the first voice information, wherein the voice instruction comprises N sections of voice information, the N sections of voice information comprise the first voice information, and N is a positive integer greater than 1.
In a specific implementation, the computer program 411, when executed by a processor, may implement the method steps of the second embodiment.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the apparatus for obtaining voice instructions, the electronic device, and the computer storage medium according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second, third, etcetera does not indicate any ordering. These words may be interpreted as names.
The invention discloses A1, a method for obtaining a voice command, which is applied to electronic equipment and is characterized by comprising the following steps:
collecting first voice information input by a user;
determining age information of the user;
obtaining a breakpoint duration based on the first voice information and the age information of the user, wherein the breakpoint duration is obtained after processing the first voice information by using a target voice breakpoint model corresponding to the age information of the user, and the target voice breakpoint model is selected from a plurality of voice breakpoint models;
and acquiring a voice instruction which is input by the user and is related to the first voice information based on the breakpoint duration, wherein the voice instruction comprises N sections of voice information, the N sections of voice information comprise the first voice information, and N is a positive integer greater than 1.
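The four steps of A1 can be run end to end with stand-in components. `FakeMic`, `FakeModelBank`, and every parameter value below are illustrative assumptions introduced only so the flow is executable:

```python
class FakeMic:
    """Queue of (text, gap_before_s) pairs standing in for a microphone."""
    def __init__(self, segments):
        self._queue = list(segments)

    def next_segment(self):
        return self._queue.pop(0)[0]

    def next_segment_within(self, window_s):
        # Return the next segment only if its audio begins within the
        # breakpoint window; otherwise report silence (None).
        if self._queue and self._queue[0][1] <= window_s:
            return self._queue.pop(0)[0]
        return None

class FakeModelBank:
    def for_age(self, age):
        return self

    def infer(self, segment):
        return 1.5     # assumed breakpoint duration for the demo

def acquire_voice_instruction(mic, estimate_age, models):
    first = mic.next_segment()                   # step 1: first segment
    age = estimate_age(first)                    # step 2: age information
    pause_s = models.for_age(age).infer(first)   # step 3: breakpoint duration
    segments = [first]                           # step 4: absorb follow-ups
    while (nxt := mic.next_segment_within(pause_s)) is not None:
        segments.append(nxt)
    return segments
```

For example, with segments arriving 1.0 s and 3.0 s after their predecessors and a 1.5 s window, the first follow-up is absorbed into the instruction and the second is left for a later instruction.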
A2, the method for obtaining voice instructions according to a1, further comprising, before the determining age information of the user:
judging whether the first voice information has complete understandable semantics, wherein the complete understandable semantics comprise a preset grammar structure;
when the first voice message does not have the complete understandable semantics, judging whether the user is a specific user, wherein the age of the specific user is less than a preset age;
when the user is the specific user, the step of determining the age information of the user is performed.
A3, the method for obtaining voice command as in a2, wherein the determining whether the user is a specific user comprises one or any combination of the following ways:
performing feature extraction and analysis on the first voice information based on a voiceprint recognition technology, and determining whether the user is the specific user;
collecting the face information of the user, and performing feature extraction and analysis on the face information of the user based on a face recognition technology to determine whether the user is the specific user;
determining whether the user is a specific user based on whether a current mode of the electronic device is a specific mode.
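A3's "one or any combination" of checks could be composed as below. The age threshold, the pitch heuristic, and the mode name are all illustrative assumptions; the disclosure does not specify how each recognition technology reaches its verdict:

```python
AGE_THRESHOLD = 12   # assumed "preset age"; the disclosure leaves it open

def check_voiceprint(features):
    # Stand-in for a voiceprint classifier; a high fundamental
    # frequency is used here as a crude child indicator.
    return features.get("pitch_hz", 0.0) > 250.0

def check_face(face_info):
    # Stand-in for face-recognition-based age estimation.
    return face_info.get("estimated_age", 99) < AGE_THRESHOLD

def check_device_mode(mode):
    return mode == "child_mode"

def is_specific_user(voice=None, face=None, mode=None):
    # Each check is applied only when its input is available, and any
    # positive check marks the speaker as the specific user.
    checks = []
    if voice is not None:
        checks.append(check_voiceprint(voice))
    if face is not None:
        checks.append(check_face(face))
    if mode is not None:
        checks.append(check_device_mode(mode))
    return any(checks)
```

Using `any` reflects the claim's permissive wording: a single positive signal suffices, but callers may still supply several inputs for a more robust decision.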
A4, the method for obtaining voice commands according to a1, wherein the determining age information of the user comprises:
collecting face information of the user, and performing feature extraction and analysis on the face information based on a face recognition technology to determine age information of the user; and/or
And performing feature extraction and analysis on the first voice information based on a voiceprint recognition technology to determine the age information of the user.
A5, the method for obtaining voice command according to a1, wherein the obtaining a breakpoint duration based on the first voice message and the age information of the user comprises:
uploading the first voice information and the age information of the user to a server, so that the server selects the target voice breakpoint model from a plurality of voice breakpoint models based on the age information of the user, and inputs the first voice information into the target voice breakpoint model to obtain the breakpoint duration, wherein the server stores the plurality of voice breakpoint models, the plurality of voice breakpoint models correspond to different age information, and the target voice breakpoint model corresponds to the age information of the user;
and receiving the breakpoint duration returned by the server.
A6, the method for obtaining voice command according to a1, wherein the obtaining a breakpoint duration based on the first voice message and the age information of the user comprises:
selecting the target voice breakpoint model from a plurality of voice breakpoint models based on the age information of the user, wherein the plurality of voice breakpoint models are stored in the electronic equipment, the plurality of voice breakpoint models correspond to different age information, and the target voice breakpoint model corresponds to the age information of the user;
and inputting the first voice information into the target voice breakpoint model to obtain the breakpoint duration.
A7, the method for obtaining the voice command according to a1, wherein before the obtaining the voice command related to the first voice information input by the user based on the breakpoint duration, the method further comprises:
acquiring the use duration information of the user and/or proficiency level information of the voice instruction input by the user, wherein the use duration information is used for representing the total duration of the user using the electronic equipment;
adjusting the breakpoint duration based on the use duration information and/or proficiency level information of the voice instruction input by the user to obtain the adjusted breakpoint duration;
the obtaining of the voice instruction related to the first voice information, which is input by the user, based on the breakpoint duration includes:
and obtaining the voice instruction based on the adjusted breakpoint duration.
A8, the method for obtaining the voice command according to any one of a1-a7, wherein the obtaining the voice command related to the first voice information input by the user based on the breakpoint duration includes:
determining a starting point of the first voice message as a starting endpoint of the voice instruction;
determining a first end point of the first voice message, and adjusting the first end point based on the breakpoint duration to obtain a second end point;
detecting whether there is an audio input between the first end point and the second end point;
if yes, acquiring second voice information;
determining a third end point of the second voice information, and adjusting the third end point based on the breakpoint duration to obtain a fourth end point;
detecting whether there is an audio input between the third end point and the fourth end point;
if not, determining the fourth end point as an end point of the voice instruction, and obtaining the voice instruction based on a start end point of the voice instruction and the end point of the voice instruction, wherein the voice instruction comprises the first voice information and the second voice information;
if yes, acquiring third voice information;
determining a fifth endpoint of the third voice message, and adjusting the fifth endpoint based on the breakpoint duration to obtain a sixth endpoint;
detecting whether there is an audio input between the fifth end point and the sixth end point;
if not, determining the sixth end point as an end point of the voice instruction, and obtaining the voice instruction based on a start end point of the voice instruction and the end point of the voice instruction, wherein the voice instruction comprises the first voice information to the third voice information;
if yes, continuing to collect the fourth voice information.
B9, an apparatus for obtaining voice command, applied to an electronic device, the apparatus comprising:
the acquisition module is used for acquiring first voice information input by a user;
a determining module for determining age information of the user;
a first obtaining module, configured to obtain a breakpoint duration based on the first voice information and the age information of the user, where the breakpoint duration is obtained after processing the first voice information by using a target voice breakpoint model corresponding to the age information of the user, and the target voice breakpoint model is selected from multiple voice breakpoint models;
and the second obtaining module is used for obtaining the voice instruction which is input by the user and is related to the first voice information based on the breakpoint duration, wherein the voice instruction comprises N sections of voice information, the N sections of voice information comprise the first voice information, and N is a positive integer greater than 1.
B10, the apparatus for obtaining voice command according to B9, further comprising:
the first judging module is used for judging whether the first voice information has complete understandable semantics before determining the age information of the user, wherein the complete understandable semantics comprises a preset syntactic structure;
the second judging module is used for judging whether the user is a specific user when the first voice information does not have the complete understandable semantics, and the age of the specific user is smaller than a preset age; when the user is the specific user, the step of determining the age information of the user is executed by the determining module.
B11, the apparatus for obtaining voice command according to B10, wherein the second judging module comprises one or any combination of the following modules:
the first determining sub-module is used for extracting and analyzing the characteristics of the first voice information based on a voiceprint recognition technology and determining whether the user is the specific user or not;
the second determining submodule is used for acquiring the face information of the user, extracting and analyzing the features of the face information of the user based on a face recognition technology and determining whether the user is the specific user;
the third determining submodule is used for determining whether the user is a specific user based on whether a current mode of the electronic device is a specific mode.
B12, the device for obtaining voice commands according to B9, wherein the determining module comprises:
the fourth determining submodule is used for acquiring the face information of the user, extracting and analyzing the features of the face information based on a face recognition technology and determining the age information of the user; and/or
And the fifth determining submodule is used for performing feature extraction and analysis on the first voice information based on a voiceprint recognition technology to determine the age information of the user.
B13, the apparatus for obtaining voice command according to B9, wherein the first obtaining module comprises:
the uploading sub-module is used for uploading the first voice information and the age information of the user to a server, so that the server selects the target voice breakpoint model from a plurality of voice breakpoint models based on the age information of the user and inputs the first voice information into the target voice breakpoint model to obtain the breakpoint duration, wherein the server stores the plurality of voice breakpoint models, the plurality of voice breakpoint models correspond to different age information, and the target voice breakpoint model corresponds to the age information of the user;
and the receiving submodule is used for receiving the breakpoint duration returned by the server.
B14, the apparatus for obtaining voice command according to B9, wherein the first obtaining module comprises:
the selection submodule is used for selecting the target voice breakpoint model from a plurality of voice breakpoint models based on the age information of the user, wherein the plurality of voice breakpoint models are stored in the electronic equipment, the plurality of voice breakpoint models correspond to different age information, and the target voice breakpoint model corresponds to the age information of the user;
and the first obtaining submodule is used for inputting the first voice information into the target voice breakpoint model to obtain the breakpoint duration.
B15, the apparatus for obtaining voice command according to B9, further comprising:
the first acquisition module is used for acquiring the use duration information of the user and/or proficiency level information of the voice instruction input by the user, wherein the use duration information is used for representing the total duration of the user using the electronic equipment;
the obtaining module is used for adjusting the breakpoint duration based on the use duration information and/or the proficiency level information of the voice instruction input by the user to obtain the adjusted breakpoint duration;
the second obtaining module is further configured to obtain the voice instruction related to the first voice information, which is input by the user, based on the adjusted breakpoint duration.
B16, the apparatus for obtaining voice command according to any one of B9-B15, wherein the second obtaining module comprises:
a sixth determining submodule, configured to determine a starting point of the first voice message as a starting endpoint of the voice instruction;
the second obtaining submodule is used for determining a first end point of the first voice information and adjusting the first end point based on the breakpoint duration to obtain a second end point;
a first detection submodule for detecting whether there is an audio input between the first end point and the second end point;
the first acquisition submodule is used for acquiring second voice information when audio input exists between the first end point and the second end point;
a third obtaining submodule, configured to determine a third end point of the second voice information, and adjust the third end point based on the breakpoint duration to obtain a fourth end point;
a second detection submodule for detecting whether there is an audio input between the third end point and the fourth end point;
a first obtaining submodule, configured to determine the fourth end point as an ending end point of the voice instruction when there is no audio input between the third end point and the fourth end point; and obtain the voice instruction based on a starting end point of the voice instruction and an ending end point of the voice instruction, wherein the voice instruction comprises the first voice information and the second voice information;
the second acquisition submodule is used for acquiring third voice information when audio input exists between the third end point and the fourth end point;
a fourth obtaining submodule, configured to determine a fifth end point of the third voice information, and adjust the fifth end point based on the breakpoint duration to obtain a sixth end point;
a third detection submodule, configured to detect whether there is an audio input between the fifth end point and the sixth end point;
a second obtaining submodule, configured to determine the sixth end point as an ending end point of the voice instruction when there is no audio input between the fifth end point and the sixth end point; and obtain the voice instruction based on a starting end point of the voice instruction and an ending end point of the voice instruction, wherein the voice instruction comprises the first voice information to the third voice information;
and the third acquisition submodule is used for continuously acquiring fourth voice information when audio input exists between the fifth end point and the sixth end point.
C17, an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method steps of any one of A1 to A8.
D18, a computer storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method steps of any one of A1 to A8.

Claims (10)

1. A method for obtaining a voice command, applied to an electronic device, is characterized in that the method comprises:
collecting first voice information input by a user;
determining age information of the user;
obtaining a breakpoint duration based on the first voice information and the age information of the user, wherein the breakpoint duration is obtained after processing the first voice information by using a target voice breakpoint model corresponding to the age information of the user, and the target voice breakpoint model is selected from a plurality of voice breakpoint models;
and acquiring a voice instruction which is input by the user and is related to the first voice information based on the breakpoint duration, wherein the voice instruction comprises N sections of voice information, the N sections of voice information comprise the first voice information, and N is a positive integer greater than 1.
2. The method for obtaining a voice instruction according to claim 1, further comprising, before said determining age information of said user:
judging whether the first voice information has complete understandable semantics, wherein the complete understandable semantics comprise a preset grammar structure;
when the first voice message does not have the complete understandable semantics, judging whether the user is a specific user, wherein the age of the specific user is less than a preset age;
when the user is the specific user, the step of determining the age information of the user is performed.
3. The method for obtaining the voice command according to claim 2, wherein the determining whether the user is a specific user includes one or any combination of the following manners:
performing feature extraction and analysis on the first voice information based on a voiceprint recognition technology, and determining whether the user is the specific user;
collecting the face information of the user, and performing feature extraction and analysis on the face information of the user based on a face recognition technology to determine whether the user is the specific user;
determining whether the user is a specific user based on whether a current mode of the electronic device is a specific mode.
4. The method for obtaining voice instructions according to claim 1, wherein the determining age information of the user comprises:
collecting face information of the user, and performing feature extraction and analysis on the face information based on a face recognition technology to determine age information of the user; and/or
And performing feature extraction and analysis on the first voice information based on a voiceprint recognition technology to determine the age information of the user.
5. The method for obtaining the voice command according to claim 1, wherein the obtaining a breakpoint duration based on the first voice information and the age information of the user includes:
uploading the first voice information and the age information of the user to a server, so that the server selects the target voice breakpoint model from a plurality of voice breakpoint models based on the age information of the user, and inputs the first voice information into the target voice breakpoint model to obtain the breakpoint duration, wherein the server stores the plurality of voice breakpoint models, the plurality of voice breakpoint models correspond to different age information, and the target voice breakpoint model corresponds to the age information of the user;
and receiving the breakpoint duration returned by the server.
6. The method for obtaining the voice command according to claim 1, wherein the obtaining a breakpoint duration based on the first voice information and the age information of the user includes:
selecting the target voice breakpoint model from a plurality of voice breakpoint models based on the age information of the user, wherein the plurality of voice breakpoint models are stored in the electronic device, the plurality of voice breakpoint models correspond to different age information, and the target voice breakpoint model corresponds to the age information of the user;
and inputting the first voice information into the target voice breakpoint model to obtain the breakpoint duration.
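The on-device flow of claim 6 — pick the breakpoint model matching the user's age bracket, then run the first voice segment through it to obtain a breakpoint (pause-tolerance) duration — can be sketched as below. The age brackets, `base_duration` values, and the toy `predict` rule are all illustrative assumptions; the patent does not specify the model internals:

```python
from dataclasses import dataclass

@dataclass
class BreakpointModel:
    age_range: range       # ages this model covers
    base_duration: float   # baseline tolerated pause, in seconds

    def predict(self, first_voice_info: dict) -> float:
        # Toy stand-in for model inference: slower speech rates
        # (fewer syllables per second) yield a longer tolerated pause.
        speech_rate = first_voice_info.get("speech_rate", 4.0)
        return round(self.base_duration * (4.0 / speech_rate), 2)

# One model per age bracket, stored on the device (values are examples only).
MODELS = [
    BreakpointModel(range(0, 13), 2.0),    # children: longer pauses
    BreakpointModel(range(13, 60), 1.0),   # adults
    BreakpointModel(range(60, 120), 1.8),  # elderly users
]

def select_model(age: int) -> BreakpointModel:
    """Select the target voice breakpoint model for the user's age."""
    for model in MODELS:
        if age in model.age_range:
            return model
    raise ValueError(f"no breakpoint model for age {age}")
```

Claim 5 follows the same selection logic, except that the models live on a server and the device uploads the first voice information and age, then receives the duration back.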
7. The method for obtaining the voice command according to claim 1, wherein before the obtaining the voice command related to the first voice information input by the user based on the breakpoint duration, the method further comprises:
acquiring usage duration information of the user and/or proficiency information of the user in inputting voice instructions, wherein the usage duration information represents the total duration for which the user has used the electronic device;
adjusting the breakpoint duration based on the usage duration information and/or the proficiency information to obtain an adjusted breakpoint duration;
the obtaining of the voice instruction related to the first voice information, which is input by the user, based on the breakpoint duration includes:
and obtaining the voice instruction based on the adjusted breakpoint duration.
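The adjustment step of claim 7 can be sketched as a simple correction applied to the model's output: experienced and proficient users tend to issue commands with shorter pauses, so the breakpoint window can be tightened for them. The scaling factors, caps, and the 0.3 s floor below are invented for illustration; the patent claims the adjustment but not a formula:

```python
def adjust_breakpoint(duration: float,
                      total_use_hours: float = 0.0,
                      proficiency: int = 0) -> float:
    """Shrink the breakpoint duration for experienced/proficient users."""
    factor = 1.0
    factor -= min(total_use_hours / 100.0, 0.3)  # up to -30% for heavy use
    factor -= 0.05 * min(proficiency, 4)         # up to -20% for proficiency
    # Keep a floor so even expert users get a usable pause window.
    return round(max(duration * factor, 0.3), 2)
```

A novice (`total_use_hours=0, proficiency=0`) keeps the model's duration unchanged, while a long-time proficient user gets a window about half as long.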
8. An apparatus for obtaining a voice command, applied to an electronic device, the apparatus comprising:
the acquisition module is used for acquiring first voice information input by a user;
a determining module for determining age information of the user;
a first obtaining module, configured to obtain a breakpoint duration based on the first voice information and the age information of the user, where the breakpoint duration is obtained after processing the first voice information by using a target voice breakpoint model corresponding to the age information of the user, and the target voice breakpoint model is selected from multiple voice breakpoint models;
and the second obtaining module is used for obtaining the voice instruction which is input by the user and is related to the first voice information based on the breakpoint duration, wherein the voice instruction comprises N sections of voice information, the N sections of voice information comprise the first voice information, and N is a positive integer greater than 1.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, is adapted to carry out the method steps of any of claims 1 to 8.
10. A computer storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method steps of any one of claims 1 to 8.
CN201910947282.9A 2019-09-29 2019-09-29 Method and device for acquiring voice instruction Withdrawn CN112581937A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910947282.9A CN112581937A (en) 2019-09-29 2019-09-29 Method and device for acquiring voice instruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910947282.9A CN112581937A (en) 2019-09-29 2019-09-29 Method and device for acquiring voice instruction

Publications (1)

Publication Number Publication Date
CN112581937A true CN112581937A (en) 2021-03-30

Family

ID=75117149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910947282.9A Withdrawn CN112581937A (en) 2019-09-29 2019-09-29 Method and device for acquiring voice instruction

Country Status (1)

Country Link
CN (1) CN112581937A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113707154A (en) * 2021-09-03 2021-11-26 上海瑾盛通信科技有限公司 Model training method and device, electronic equipment and readable storage medium
CN113707154B (en) * 2021-09-03 2023-11-10 上海瑾盛通信科技有限公司 Model training method, device, electronic equipment and readable storage medium
CN114420103A (en) * 2022-01-24 2022-04-29 中国第一汽车股份有限公司 Voice processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109509470B (en) Voice interaction method and device, computer readable storage medium and terminal equipment
CN111128223B (en) Text information-based auxiliary speaker separation method and related device
US10825470B2 (en) Method and apparatus for detecting starting point and finishing point of speech, computer device and storage medium
CN108735201B (en) Continuous speech recognition method, device, equipment and storage medium
CN112102850B (en) Emotion recognition processing method and device, medium and electronic equipment
CN112825248B (en) Voice processing method, model training method, interface display method and equipment
CN106875936A (en) Voice recognition method and device
CN115062143A (en) Voice recognition and classification method, device, equipment, refrigerator and storage medium
WO2023222089A1 (en) Item classification method and apparatus based on deep learning
CN113823323B (en) Audio processing method and device based on convolutional neural network and related equipment
KR20180106817A (en) Electronic device and controlling method thereof
CN114121006A (en) Image output method, device, equipment and storage medium of virtual character
WO2024140434A1 (en) Text classification method based on multi-modal knowledge graph, and device and storage medium
WO2024140430A1 (en) Text classification method based on multimodal deep learning, device, and storage medium
CN116070020A (en) Food material recommendation method, equipment and storage medium based on knowledge graph
CN114242064A (en) Speech recognition method and device, and training method and device of speech recognition model
CN115098765A (en) Information pushing method, device and equipment based on deep learning and storage medium
CN110853669B (en) Audio identification method, device and equipment
CN112581937A (en) Method and device for acquiring voice instruction
CN112309398B (en) Method and device for monitoring working time, electronic equipment and storage medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN110099332B (en) Audio environment display method and device
CN112185357A (en) Device and method for simultaneously recognizing human voice and non-human voice
CN114495981A (en) Method, device, equipment, storage medium and product for judging voice endpoint
CN112837688B (en) Voice transcription method, device, related system and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210330