CN111081257A - Voice acquisition method, device, equipment and storage medium - Google Patents
Voice acquisition method, device, equipment and storage medium
- Publication number
- CN111081257A (application CN201811223872.9A)
- Authority
- CN
- China
- Prior art keywords
- voice
- voice information
- determining
- voiceprint
- speech
- Prior art date
- 2018-10-19
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/04—Training, enrolment or model building
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Telephonic Communication Services (AREA)
Abstract
The invention discloses a voice acquisition method, apparatus, device and storage medium that provide a way to determine the effective voice in a multi-person conversation scenario, so that the effective voice information can be identified from the mixed voice information of several speakers, voice authority control in the multi-person scenario is realized, and the user experience is improved. The method comprises the following steps: determining the voice information corresponding to each user from the obtained mixed voice information of at least two users; extracting the voiceprint features of each user's voice information to obtain at least two corresponding voiceprint features; determining the target voiceprint features that belong to a predetermined voiceprint feature set; inputting the voice characteristic parameters corresponding to each target voiceprint feature into a preset effective voice recognition model to obtain a voice effective weight for each target voiceprint feature; and determining the effective voice information from the voice information corresponding to the target voiceprint features according to the obtained voice effective weights.
Description
Technical Field
The present invention relates to the field of voice processing technologies, and in particular to a voice acquisition method, apparatus, device, and storage medium.
Background
Voice acquisition technology is used ever more widely in modern society, for example in education, meetings, and household appliance control. Fields such as speech recognition, voice control, and intelligent interaction place high demands on it: the acquired voice must be clear and accurate.
However, voice captured in a complex environment, for example one containing several users, is usually the mixed voice information of those users, and it is difficult to determine which user's voice information carries the dominant voice authority (e.g., the speaker authority). Determining the effective voice information in a multi-person conversation scenario is therefore necessary, since it decides who holds the dominant authority in that scenario, yet such a determination scheme is currently lacking.
Disclosure of Invention
The embodiments of the invention provide a voice acquisition method, apparatus, device and storage medium that supply a way to determine the effective voice in a multi-person conversation scenario, so that the effective voice information can be identified from the mixed voice information of several speakers, voice authority control in the multi-person scenario is realized, and the user experience is improved.
In a first aspect, a method for acquiring speech is provided, the method comprising:
determining voice information corresponding to each user from the obtained mixed voice information of at least two users;
extracting the voiceprint characteristics of the voice information of each user to obtain at least two corresponding voiceprint characteristics;
determining a target voiceprint feature belonging to a predetermined voiceprint feature set;
inputting the voice characteristic parameters corresponding to each target voiceprint characteristic into a preset effective voice recognition model to obtain a voice effective weight corresponding to each target voiceprint characteristic;
and determining effective voice information from the voice information corresponding to the target voiceprint characteristics according to the obtained voice effective weight.
Optionally, the voice characteristic parameters include at least one of the voice frequency corresponding to the voiceprint feature, the duration of the current utterance, and the speaking order of the current utterance.
Optionally, the preset effective voice recognition model includes at least one of the following voice effective weight calculation rules: the higher the voice frequency, the larger the corresponding weight; the longer the duration of the current utterance, the larger the corresponding weight; and the earlier the speaking order of the current utterance, the larger the corresponding weight.
Optionally, the method further includes:
and if the at least two voiceprint features do not belong to the preset voiceprint feature set, determining effective voice information from the voice information corresponding to the at least two voiceprint features according to an additional determination rule.
Optionally, determining valid voice information from the voice information corresponding to the at least two voiceprint features according to an additional determination rule, including:
determining the voice information with the longest speaking time in the current session as the effective voice information; or,
determining the position of each user according to the received signal strength of each user's voice information, and determining the voice information of the user closest to the voice acquisition device as the effective voice information.
Optionally, determining valid voice information from the voice information corresponding to the at least two voiceprint features according to an additional determination rule, including:
respectively carrying out voice recognition on the voice information of each user to obtain corresponding voice content;
and determining the voice information comprising the preset keywords as the effective voice information.
Optionally, the method further includes:
after the valid voice information is determined, determining that a user corresponding to the valid voice information has the speaker authority, and/or executing corresponding voice control operation according to the valid voice.
In a second aspect, a speech acquisition apparatus is provided, the apparatus comprising:
the first determining module is used for determining voice information corresponding to each user from the obtained mixed voice information of at least two users;
the first obtaining module is used for extracting the voiceprint features of the voice information of each user so as to obtain at least two corresponding voiceprint features;
a second determining module for determining a target voiceprint feature belonging to a predetermined voiceprint feature set;
the second obtaining module is used for inputting the voice characteristic parameters corresponding to each target voiceprint characteristic into a preset effective voice recognition model so as to obtain the voice effective weight corresponding to each target voiceprint characteristic;
and the third determining module is used for determining effective voice information from the voice information corresponding to the target voiceprint feature according to the obtained voice effective weight.
Optionally, the voice characteristic parameters include at least one of the voice frequency corresponding to the voiceprint feature, the duration of the current utterance, and the speaking order of the current utterance.
Optionally, the preset effective voice recognition model includes at least one of the following voice effective weight calculation rules: the higher the voice frequency, the larger the corresponding weight; the longer the duration of the current utterance, the larger the corresponding weight; and the earlier the speaking order of the current utterance, the larger the corresponding weight.
Optionally, the apparatus further includes a fourth determining module, configured to:
and if the at least two voiceprint features do not belong to the preset voiceprint feature set, determining effective voice information from the voice information corresponding to the at least two voiceprint features according to an additional determination rule.
Optionally, the fourth determining module is configured to:
determining the voice information with the longest speaking time in the current session as the effective voice information; or,
determining the position of each user according to the received signal strength of each user's voice information, and determining the voice information of the user closest to the voice acquisition device as the effective voice information.
Optionally, the fourth determining module is configured to:
respectively carrying out voice recognition on the voice information of each user to obtain corresponding voice content;
and determining the voice information comprising the preset keywords as the effective voice information.
Optionally, the apparatus further includes a fifth determining module, configured to:
after the effective voice information is determined, determining that a user corresponding to the effective voice information has the authority of a speaker, and/or executing corresponding voice control operation according to the effective voice.
In a third aspect, a voice collecting apparatus is provided, which includes:
a memory for storing program instructions;
and the processor is used for calling the program instructions stored in the memory and executing the steps included in any method in the first aspect according to the obtained program instructions.
In a fourth aspect, there is provided a storage medium having stored thereon computer-executable instructions for causing a computer to perform the steps included in any one of the methods of the first aspect.
In the embodiments of the invention, in a multi-user voice scenario, the obtained mixed voice of several users can be processed to separate out the voice information corresponding to each user. The voiceprint features of each user are then extracted, and all extracted voiceprint features are matched against the predetermined voiceprint feature set to determine the target voiceprint features that belong to it, which amounts to a preliminary screening of voice control authority by voiceprint. Further, the voice characteristic parameters corresponding to each target voiceprint feature are input into the preset effective voice recognition model for weight calculation. Once the voice effective weight corresponding to each target voiceprint feature is obtained, the effective voice information can be determined from these weights; for example, the voice information corresponding to the target voiceprint feature with the largest voice effective weight can be determined as the effective voice information. In other words, voice control authority is screened a second time through the voice characteristic parameters of each target voiceprint feature, and this two-layer screening ensures, as far as possible, accurate recognition of the effective voice, improving the accuracy with which voice control authority is determined. Meanwhile, by using the voice characteristic parameters together with a preset effective voice recognition model obtained through machine learning, the voice effective weight of each target voiceprint feature can be calculated quickly, and differences in voice control authority are reflected by differences in the voice characteristic parameters of each voiceprint feature, so the accuracy and effectiveness of the effective voice information are ensured and voice control authority is allocated to users accurately and effectively.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a speech acquisition method in an embodiment of the present invention;
FIG. 2 is another flow chart of a voice capture method in an embodiment of the present invention;
FIG. 3 is a block diagram of a voice capture device in an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a voice acquisition device in the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention. The embodiments and the features of the embodiments may be combined with each other arbitrarily provided there is no conflict. Also, although a logical order is shown in the flow diagrams, in some cases the steps shown or described may be performed in an order different from that shown here.
The terms "first" and "second" in the description, claims and drawings of the present invention are used to distinguish different objects, not to describe a particular order. Furthermore, the term "comprises" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the steps or elements listed, but may include other steps or elements not listed or inherent to such a process, method, article, or apparatus.
In the embodiments of the present invention, "plurality" may mean at least two, for example two, three, or more; the embodiments of the present application are not limited in this respect.
In addition, the term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" herein generally indicates an "or" relationship between the preceding and following objects unless otherwise specified.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
The voice acquisition method provided by the embodiments of the present invention may be executed by a device having a voice acquisition function, referred to here as a voice acquisition device. The voice acquisition device may be a terminal device such as a mobile phone, a tablet computer, a personal digital assistant (PDA), a notebook computer, an intelligent wearable device (e.g., a smart watch or a smart helmet), or a personal computer, or it may be a smart home device such as a television, an air conditioner, or a refrigerator.
The technical scheme provided by the embodiment of the invention is described in the following with the accompanying drawings of the specification.
Referring to fig. 1, a flow of a speech acquisition method according to an embodiment of the present invention is described as follows.
Step 101: and acquiring mixed voice information of at least two users.
As described above, in a multi-user scenario, voice information of multiple users may be obtained, and the voice information of the multiple users may be referred to as mixed voice information.
Step 102: and processing the mixed voice information to obtain the voice information corresponding to each user.
Independent component analysis may be performed on the obtained mixed voice information, that is, the mixture is separated into individual voice signals. During separation the mixture can be distinguished, for example, according to the differing voiceprint characteristics of each person, and speech recognition is then performed on the separated signals to obtain the voice information corresponding to each user.
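The embodiment names independent component analysis but fixes neither an algorithm nor a library. Below is a minimal sketch using scikit-learn's FastICA, assuming a multi-microphone capture whose channel count is at least the number of speakers; the function name and the peak normalization step are illustrative, not from the patent.

```python
import numpy as np
from sklearn.decomposition import FastICA

def separate_speakers(mixture: np.ndarray, n_speakers: int) -> np.ndarray:
    """Split a multi-channel recording of shape (n_samples, n_channels)
    into n_speakers statistically independent source signals."""
    ica = FastICA(n_components=n_speakers, whiten="unit-variance", random_state=0)
    sources = ica.fit_transform(mixture)  # shape (n_samples, n_speakers)
    # Rescale each recovered source to unit peak amplitude for further processing.
    return sources / np.abs(sources).max(axis=0, keepdims=True)
```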
Step 103: and extracting the voiceprint characteristics of the voice information of each user to obtain at least two corresponding voiceprint characteristics.
Since the voice separation may itself be based on voiceprint features, the voiceprint feature of each user can be recognized from the differences between users' voiceprints, yielding one voiceprint feature per user: for example, the voiceprint feature of user A is the first voiceprint feature, that of user B the second, that of user C the third, and so on.
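The patent does not specify how a voiceprint feature is computed. A common stand-in is a fixed-length statistics vector over mel-frequency cepstral coefficients; the sketch below uses librosa, and the 20-coefficient choice and mean/std pooling are assumptions made for illustration.

```python
import numpy as np
import librosa

def voiceprint_embedding(wave: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Summarize one separated voice signal as a fixed-length voiceprint
    vector: per-coefficient mean and standard deviation of 20 MFCCs."""
    mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=20)  # shape (20, n_frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```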
Step 104: it is determined which voiceprint features belong to a predetermined set of voiceprint features.
That is, the voiceprint features belonging to the predetermined voiceprint feature set are screened from at least two voiceprint features, and for convenience of description, the voiceprint features belonging to the predetermined voiceprint feature set are referred to as target voiceprint features.
In the embodiment of the present invention, the predetermined voiceprint feature set is a set of predetermined voiceprint features, each of which may be the voiceprint feature of a user already configured with a certain voice control authority (for example, the speaker authority). Users who hold voice control authority in advance can thus be preliminarily screened out by comparison against this set.
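As a sketch of this preliminary screening, the target voiceprint features can be selected by similarity against the predetermined set; the cosine measure and the 0.75 threshold below are assumptions, not values from the patent.

```python
import numpy as np

def screen_target_voiceprints(extracted, predetermined, threshold=0.75):
    """Return indices of extracted voiceprint vectors whose best cosine
    similarity to any predetermined voiceprint reaches the threshold;
    these are the target voiceprint features."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return [i for i, e in enumerate(extracted)
            if max(cosine(e, p) for p in predetermined) >= threshold]
```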
Step 104 may produce two kinds of determination results. In the first, at least one of the voiceprint features belongs to the predetermined voiceprint feature set; for convenience of description these are called target voiceprint features. In the second, none of the at least two voiceprint features belongs to the predetermined voiceprint feature set. Depending on the result, different methods are used to determine the effective voice information from the voice information of the at least two users; the two determination methods are described below.
First mode of determination
Step 105: a target voiceprint feature belonging to a predetermined set of voiceprint features is obtained.
Step 106: and inputting the voice characteristic parameters corresponding to each target voiceprint characteristic into a preset effective voice recognition model so as to obtain the voice effective weight corresponding to each target voiceprint characteristic.
Step 107: and determining effective voice information from the voice information corresponding to the target voiceprint characteristics according to the obtained voice effective weight.
That is, the first determination method covers the case in which target voiceprint features belonging to the predetermined voiceprint feature set exist. The voice characteristic parameters corresponding to a voiceprint feature are attribute parameters describing the voice expressed under that voiceprint; through them, the speaking history associated with the voiceprint can be roughly known. The voice characteristic parameters may include at least one of the voice frequency corresponding to the voiceprint feature, the duration of the current utterance, and the speaking order of the current utterance. Specifically, the voice frequency corresponding to a voiceprint feature is, for example, the total historical number of utterances or the total number within a recent period (say, one week); it counts speaking occurrences, the recorded count increasing by one each time voice is input through the voice acquisition device. The duration of the current utterance is the total speaking time of each voiceprint feature in the present voice session, and the speaking order is the chronological order in which each voiceprint feature began to speak in that session. For example, in a multi-user session among users A, B and C, the total speaking duration and the speaking order of each user can be counted: the speaking durations of A, B and C are 2 minutes, 3.5 minutes and 4 minutes respectively, and the speaking order is user B first, user C second and user A third.
The preset effective voice recognition model in the embodiments of the invention is a mathematical model built and trained in advance. During training, the values of each user's voice characteristic parameters under many different multi-user conversation scenarios can serve as training inputs, so the trained model can compute and recognize the effective voice in most situations; specifically, it does so by calculating the voice effective weight of each voiceprint feature. One possible training rule, for example, is: compare the speaking frequency of each voiceprint over the last 3 days and assign a higher weight to the higher frequency; if no voiceprint's frequency exceeds 20, further assign a higher weight to the longer current utterance and to the voiceprint that spoke first; and if some voiceprint's frequency does exceed 20, assign a higher weight to the higher frequency, and so on. In other words, the preset effective voice recognition model may include at least one of the rules that the higher the voice frequency, the larger the corresponding weight; the longer the current utterance, the larger the corresponding weight; and the earlier the speaking order, the larger the corresponding weight. In an actual conversation these three voice characteristic parameters largely reflect each user's claim on voice control of the current session: the user who speaks longest, or who speaks first, may be the one most eager to obtain voice control, while the user who has historically spoken most often most likely actually holds it. By training on these voice characteristic parameters, the resulting model reflects real multi-person conversation scenarios to a greater extent, ensuring accurate recognition of the effective voice information and improving the effectiveness of recognition.
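The trained model itself is not disclosed, so the sketch below is a hand-written stand-in that encodes only the three monotonic rules; the 0.5/0.3/0.2 mixing coefficients, the saturation constants, and the frequency values in the example (which otherwise reuses the A/B/C session above) are assumptions.

```python
def voice_effective_weight(frequency, duration_s, order, n_speakers):
    """Toy voice effective weight: rises with speaking frequency, with the
    duration of the current utterance, and with earlier speaking order."""
    f = frequency / (frequency + 20.0)         # saturating frequency score
    d = min(duration_s / 240.0, 1.0)           # duration score, capped at 4 minutes
    o = (n_speakers - order + 1) / n_speakers  # speaking first scores highest
    return 0.5 * f + 0.3 * d + 0.2 * o

# Example: (frequency, duration in seconds, speaking order) per user.
params = {"A": (25, 120, 3), "B": (12, 210, 1), "C": (5, 240, 2)}
weights = {u: voice_effective_weight(fr, du, o, len(params))
           for u, (fr, du, o) in params.items()}
effective_user = max(weights, key=weights.get)  # largest weight wins: "B"
```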
After the calculation, the voice effective weight corresponding to each voiceprint feature is obtained, and the effective voice information can then be determined from the voice information corresponding to the target voiceprint features according to the calculated weights; for example, the voice information corresponding to the largest voice effective weight can be determined as the effective voice information, since a larger voice effective weight indicates a higher likelihood of holding voice control, which improves accuracy.
Second mode of determination
Step 108: if the determined voiceprint features do not belong to the preset voiceprint feature set, the effective voice information can be determined from the voice information corresponding to the at least two voiceprint features according to the additional determination rule.
That is, the second determination method covers the case in which none of the voiceprint features belongs to the predetermined voiceprint feature set, meaning that none of the current speakers has been granted voice control authority in advance. A dynamic additional determination rule can then be used to determine the effective voice information, preserving the accuracy of effective voice recognition.
In one possible implementation, the voice information with the longest speaking time in the current session can be determined directly as the effective voice information: the longest speaker is probably the one who most wants to obtain voice control, so this approach meets that user's voice control demand to a certain extent.
In another possible implementation, the position of each user can be determined from the received signal strength of that user's voice information, and the voice information of the user closest to the voice acquisition device is determined as the effective voice information: the user nearest the device is probably the one who wants to use it, so judging by distance also meets the users' voice control demands to a certain extent.
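A sketch of the distance judgment, assuming a received level in dB and a log-distance path-loss inversion; the -40 dB reference level at 1 m and the exponent 2.0 are assumed values, not from the patent.

```python
def estimated_distance_m(level_db, level_at_1m_db=-40.0, path_loss_exp=2.0):
    """Invert the log-distance path-loss model to estimate distance in meters."""
    return 10 ** ((level_at_1m_db - level_db) / (10 * path_loss_exp))

def nearest_speaker(received_levels):
    """received_levels: {user: received signal level in dB}; the user
    estimated to be closest to the voice acquisition device wins."""
    return min(received_levels, key=lambda u: estimated_distance_m(received_levels[u]))
```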
In yet another possible implementation, some keywords may be stored in advance as preset keywords. The preset keywords may be a single word or a set of words (that is, multiple preset keywords); for example, in a home scenario they may be "cook", "cook porridge", "turn on the air conditioner" and the like, and in a meeting scenario they may be "total", "performance assessment" and so on. Speech recognition is then performed on each user's voice information to obtain the corresponding voice content, and the voice information that includes a preset keyword is determined as the effective voice information. In this way the effective voice is screened dynamically according to the users' real-time voice content, voice control authority is allocated dynamically, and the applicability of the scheme is enhanced.
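A minimal sketch of the keyword rule; the keyword set reuses the patent's examples, and the plain substring test is an illustrative simplification.

```python
# Illustrative preset keywords drawn from the examples above.
PRESET_KEYWORDS = {"cook", "turn on the air conditioner", "performance assessment"}

def keyword_screen(transcripts):
    """transcripts: {user: recognized text}. Voice information whose text
    contains any preset keyword is treated as the effective voice."""
    return [u for u, text in transcripts.items()
            if any(k in text.lower() for k in PRESET_KEYWORDS)]
```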
Further, after the effective voice information is determined, the corresponding user can be determined as the user holding the speaker authority, which realizes the allocation of voice control. In addition, the corresponding voice control operation can be executed according to the effective voice, for example turning on the air conditioner and setting cooling to 25 degrees, thereby achieving accurate voice control of the air conditioner.
To aid an overall understanding of the scheme of the embodiment of the present invention, it is briefly described with reference to fig. 2. In a multi-person conversation scenario where several people speak at once, the voice information of each user is obtained and speech recognition is performed to judge whether the semantics of each piece of voice information are clear, so that unclear voice signals are eliminated. For the clear voice information, it is judged whether any user's voiceprint feature matches a preset voiceprint; if so, it is further judged whether more than one preset voiceprint was detected. If more than one was detected, the speaking frequency of each voiceprint over the last 3 days is counted and the pairwise differences are compared against 20: where the difference between two voiceprints exceeds 20, the voice command of the higher-frequency voiceprint is executed directly; where no pairwise difference exceeds 20, the voice command of the voiceprint with the longest most recent speaking time (or longest total speaking time) is executed. In this way the effective voice is recognized and voice authority is controlled.
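The fig. 2 flow, reduced to code: a sketch under the figure's stated constants (3-day window, frequency gap of 20); the data structures and the fallback to the additional determination rules are assumptions.

```python
def decide_effective_voice(voiceprints, preset_prints, freq_3d, durations):
    """voiceprints: {user: voiceprint id}; preset_prints: enrolled ids;
    freq_3d: {user: utterances in the last 3 days}; durations:
    {user: speaking time in seconds}. Returns the effective user or None."""
    known = [u for u, vp in voiceprints.items() if vp in preset_prints]
    if not known:
        return None                    # fall back to the additional determination rules
    if len(known) == 1:
        return known[0]
    ranked = sorted(known, key=lambda u: freq_3d[u], reverse=True)
    if freq_3d[ranked[0]] - freq_3d[ranked[1]] > 20:
        return ranked[0]               # clear frequency winner executes directly
    return max(known, key=lambda u: durations[u])  # otherwise the longest speaker
```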
In the embodiments of the invention, in a multi-user voice scenario, the obtained mixed voice of several users can be processed to separate out the voice information corresponding to each user. The voiceprint features of each user are then extracted, and all extracted voiceprint features are matched against the predetermined voiceprint feature set to determine the target voiceprint features that belong to it, which amounts to a preliminary screening of voice control authority by voiceprint. Further, the voice characteristic parameters corresponding to each target voiceprint feature are input into the preset effective voice recognition model for weight calculation. Once the voice effective weight corresponding to each target voiceprint feature is obtained, the effective voice information can be determined from these weights; for example, the voice information corresponding to the target voiceprint feature with the largest voice effective weight can be determined as the effective voice information. In other words, voice control authority is screened a second time through the voice characteristic parameters of each target voiceprint feature, and this two-layer screening ensures, as far as possible, accurate recognition of the effective voice, improving the accuracy with which voice control authority is determined. Meanwhile, by using the voice characteristic parameters together with a preset effective voice recognition model obtained through machine learning, the voice effective weight of each target voiceprint feature can be calculated quickly, and differences in voice control authority are reflected by differences in the voice characteristic parameters of each voiceprint feature, so the accuracy and effectiveness of the effective voice information are ensured and voice control authority is allocated to users accurately and effectively.
Based on the same inventive concept, an embodiment of the present invention provides a voice acquisition apparatus that can realize the functions of the voice acquisition method described above. The voice acquisition apparatus may be a hardware structure, a software module, or a combination of the two, and may be realized by a chip system, which may consist of a chip or include a chip together with other discrete devices. Referring to fig. 3, the voice acquisition apparatus includes a first determining module 301, a first obtaining module 302, a second determining module 303, a second obtaining module 304, and a third determining module 305. Wherein:
a first determining module 301, configured to determine, from the obtained mixed voice information of at least two users, voice information corresponding to each user;
a first obtaining module 302, configured to extract voiceprint features of voice information of each user to obtain at least two corresponding voiceprint features;
a second determining module 303, configured to determine a target voiceprint feature belonging to a predetermined voiceprint feature set;
a second obtaining module 304, configured to input the speech feature parameters corresponding to each target voiceprint feature into a predetermined valid speech recognition model, so as to obtain a speech valid weight corresponding to each target voiceprint feature;
and a third determining module 305, configured to determine valid voice information from the voice information corresponding to the target voiceprint feature according to the obtained voice valid weight.
In a possible implementation manner, the voice characteristic parameters include at least one of the voice frequency corresponding to the voiceprint feature, the duration of the current utterance, and the speaking order of the current utterance.
In one possible implementation, the preset effective voice recognition model includes at least one of the following voice effective weight calculation rules: the higher the voice frequency, the larger the corresponding weight; the longer the duration of the current utterance, the larger the corresponding weight; and the earlier the speaking order of the current utterance, the larger the corresponding weight.
In a possible implementation manner, referring to fig. 3, the speech acquisition apparatus in the embodiment of the present invention further includes a fourth determining module 306, configured to:
and if the at least two voiceprint features do not belong to the preset voiceprint feature set, determining effective voice information from the voice information corresponding to the at least two voiceprint features according to an additional determination rule.
In a possible implementation manner, the fourth determining module 306 is configured to determine the voice information with the longest speaking time in the current session as the effective voice information; or to determine the position of each user according to the received signal strength of each user's voice information and determine the voice information of the user closest to the voice acquisition device as the effective voice information.
In a possible implementation manner, the fourth determining module 306 is configured to perform speech recognition on the speech information of each user respectively to obtain corresponding speech content; and determining the voice information including the preset keyword as effective voice information.
In a possible implementation manner, the voice collecting apparatus in the embodiment of the present invention further includes a fifth determining module, configured to determine, after determining the valid voice information, that a user corresponding to the valid voice information has the right of a speaker, and/or perform a corresponding voice control operation according to the valid voice.
All relevant contents of each step related to the embodiment of the voice acquisition method may be referred to the functional description of the functional module corresponding to the voice acquisition device in the embodiment of the present invention, and are not described herein again. The division of the modules in the embodiments of the present invention is schematic, and only one logical function division is provided, and in actual implementation, there may be another division manner, and in addition, each functional module in each embodiment of the present invention may be integrated in one processor, or may exist alone physically, or two or more modules are integrated in one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
Based on the same inventive concept, an embodiment of the present invention provides a voice collecting device, which may be, for example, a mobile phone, a tablet computer, a PDA, a notebook computer, an intelligent wearable device (e.g., an intelligent watch and an intelligent helmet), a terminal device such as a personal computer, or may also be an intelligent home device such as a television, an air conditioner, a refrigerator, and the like. Referring to fig. 4, the voice capturing apparatus includes at least one processor 401 and a memory 402 connected to the at least one processor, a specific connection medium between the processor 401 and the memory 402 is not limited in the embodiment of the present invention, in fig. 4, the processor 401 and the memory 402 are connected by a bus 400 as an example, the bus 400 is represented by a thick line in fig. 4, and connection manners between other components are only schematically illustrated and are not limited. The bus 400 may be divided into an address bus, a data bus, a control bus, etc., and is shown with only one thick line in fig. 4 for ease of illustration, but does not represent only one bus or type of bus. In addition, the voice collecting device may further include a voice collecting module 403, where the voice collecting module 403 may also be connected to the processor 401 and the memory 402 through the bus 400, and may perform voice collection according to the control of the processor 401, for example, where the voice collecting module 403 is a microphone or a microphone array.
In the embodiment of the present invention, the memory 402 stores instructions executable by the at least one processor 401, and by executing the instructions stored in the memory 402, the at least one processor 401 can perform the steps included in the aforementioned voice acquisition method.
The processor 401 is the control center of the voice acquisition device; it can connect the various parts of the whole device through various interfaces and lines, and, by running or executing the instructions stored in the memory 402 and calling the data stored in the memory 402, it performs the various functions of the device and processes its data, thereby monitoring the device as a whole. Optionally, the processor 401 may include one or more processing units, and may integrate an application processor, which mainly handles the operating system, user interface, and application programs, with a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor need not be integrated into the processor 401. In some embodiments, the processor 401 and the memory 402 may be implemented on the same chip, or they may be implemented separately on their own chips.
The processor 401 may be a general-purpose processor, such as a Central Processing Unit (CPU), digital signal processor, application specific integrated circuit, field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like, that may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present invention. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in the processor.
By programming the processor 401, the code corresponding to the voice acquisition method described in the foregoing embodiment may be solidified in the chip, so that the chip can execute the steps of the voice acquisition method when running, and how to program the processor 401 is a technique known by those skilled in the art, and is not described herein again.
Based on the same inventive concept, embodiments of the present invention further provide a storage medium storing computer instructions, which, when executed on a computer, cause the computer to perform the steps of the foregoing voice collecting method.
In some possible embodiments, the various aspects of the voice capturing method provided by the present invention may also be implemented in the form of a program product, which includes program code for causing a voice capturing device to perform the steps of the voice capturing method according to various exemplary embodiments of the present invention described above in this specification, when the program product runs on the voice capturing device.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (10)
1. A method for speech acquisition, the method comprising:
determining voice information corresponding to each user from the obtained mixed voice information of at least two users;
extracting the voiceprint characteristics of the voice information of each user to obtain at least two corresponding voiceprint characteristics;
determining a target voiceprint feature belonging to a predetermined voiceprint feature set;
inputting the voice characteristic parameters corresponding to each target voiceprint characteristic into a preset effective voice recognition model to obtain a voice effective weight corresponding to each target voiceprint characteristic;
and determining effective voice information from the voice information corresponding to the target voiceprint characteristics according to the obtained voice effective weight.
2. The method of claim 1, wherein the voice characteristic parameters include at least one of a voice frequency corresponding to the voiceprint characteristic, a duration of the current utterance, and a speaking order of the current utterance.
3. The method according to claim 2, wherein the preset effective voice recognition model includes at least one of the following voice effective weight calculation rules: the higher the voice frequency, the larger the corresponding weight; the longer the duration of the current utterance, the larger the corresponding weight; and the earlier the speaking order of the current utterance, the larger the corresponding weight.
4. The method of claim 1, wherein the method further comprises:
and if the at least two voiceprint features do not belong to the preset voiceprint feature set, determining effective voice information from the voice information corresponding to the at least two voiceprint features according to an additional determination rule.
5. The method of claim 4, wherein determining valid speech information from the speech information corresponding to the at least two voiceprint features according to additional determination rules comprises:
determining the voice information with the longest speaking time in the current session as the effective voice information; or,
determining the position of each user according to the received signal strength of each user's voice information, and determining the voice information of the user closest to the voice acquisition device as the effective voice information.
6. The method of claim 4, wherein determining valid speech information from the speech information corresponding to the at least two voiceprint features according to additional determination rules comprises:
respectively carrying out voice recognition on the voice information of each user to obtain corresponding voice content;
and determining the voice information comprising the preset keywords as the effective voice information.
7. The method of any of claims 1-6, further comprising:
after the valid voice information is determined, determining that a user corresponding to the valid voice information has the speaker authority, and/or executing corresponding voice control operation according to the valid voice.
8. A speech acquisition device, the device comprising:
the first determining module is used for determining voice information corresponding to each user from the obtained mixed voice information of at least two users;
the first obtaining module is used for extracting the voiceprint features of the voice information of each user so as to obtain at least two corresponding voiceprint features;
a second determining module for determining a target voiceprint feature belonging to a predetermined voiceprint feature set;
the second obtaining module is used for inputting the voice characteristic parameters corresponding to each target voiceprint characteristic into a preset effective voice recognition model so as to obtain the voice effective weight corresponding to each target voiceprint characteristic;
and the third determining module is used for determining effective voice information from the voice information corresponding to the target voiceprint feature according to the obtained voice effective weight.
9. A voice collecting apparatus, characterized in that the voice collecting apparatus comprises:
a memory for storing program instructions;
a processor for calling program instructions stored in said memory and for executing the steps comprised in the method of any one of claims 1 to 7 in accordance with the obtained program instructions.
10. A storage medium storing computer-executable instructions for causing a computer to perform the steps comprising the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811223872.9A CN111081257A (en) | 2018-10-19 | 2018-10-19 | Voice acquisition method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811223872.9A CN111081257A (en) | 2018-10-19 | 2018-10-19 | Voice acquisition method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111081257A (en) | 2020-04-28
Family
ID=70309437
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811223872.9A Pending CN111081257A (en) | 2018-10-19 | 2018-10-19 | Voice acquisition method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111081257A (en) |
- 2018-10-19: CN application CN201811223872.9A filed; published as CN111081257A (en), status Pending
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120245941A1 (en) * | 2011-03-21 | 2012-09-27 | Cheyer Adam J | Device Access Using Voice Authentication |
CN102890936A (en) * | 2011-07-19 | 2013-01-23 | 联想(北京)有限公司 | Audio processing method and terminal device and system |
CN103106390A (en) * | 2011-11-11 | 2013-05-15 | 索尼公司 | Information processing apparatus, information processing method, and program |
CN102945669A (en) * | 2012-11-14 | 2013-02-27 | 四川长虹电器股份有限公司 | Household appliance voice control method |
CN106782522A (en) * | 2015-11-23 | 2017-05-31 | 宏碁股份有限公司 | Sound control method and speech control system |
CN107808668A (en) * | 2017-09-14 | 2018-03-16 | 成都晓懋科技有限公司 | A kind of control method of smart home |
CN107767875A (en) * | 2017-10-17 | 2018-03-06 | 深圳市沃特沃德股份有限公司 | Sound control method, device and terminal device |
CN107918726A (en) * | 2017-10-18 | 2018-04-17 | 深圳市汉普电子技术开发有限公司 | Apart from inducing method, equipment and storage medium |
CN107885818A (en) * | 2017-11-06 | 2018-04-06 | 深圳市沃特沃德股份有限公司 | Robot and its method of servicing and device |
CN108108006A (en) * | 2017-12-19 | 2018-06-01 | 广东小天才科技有限公司 | Microphone remote control method and system |
CN108399923A (en) * | 2018-02-01 | 2018-08-14 | 深圳市鹰硕技术有限公司 | More human hairs call the turn spokesman's recognition methods and device |
CN108305632A (en) * | 2018-02-02 | 2018-07-20 | 深圳市鹰硕技术有限公司 | A kind of the voice abstract forming method and system of meeting |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111583934A (en) * | 2020-04-30 | 2020-08-25 | 联想(北京)有限公司 | Data processing method and device |
CN111583956A (en) * | 2020-04-30 | 2020-08-25 | 联想(北京)有限公司 | Voice processing method and device |
CN111583956B (en) * | 2020-04-30 | 2024-03-26 | 联想(北京)有限公司 | Voice processing method and device |
CN111653283A (en) * | 2020-06-28 | 2020-09-11 | 讯飞智元信息科技有限公司 | Cross-scene voiceprint comparison method, device, equipment and storage medium |
CN111653283B (en) * | 2020-06-28 | 2024-03-01 | 讯飞智元信息科技有限公司 | Cross-scene voiceprint comparison method, device, equipment and storage medium |
CN111833876A (en) * | 2020-07-14 | 2020-10-27 | 科大讯飞股份有限公司 | Conference speech control method, system, electronic device and storage medium |
CN112259097A (en) * | 2020-10-27 | 2021-01-22 | 深圳康佳电子科技有限公司 | Control method for voice recognition and computer equipment |
CN112511877A (en) * | 2020-12-07 | 2021-03-16 | 四川长虹电器股份有限公司 | Intelligent television voice continuous conversation and interaction method |
CN112511877B (en) * | 2020-12-07 | 2021-08-27 | 四川长虹电器股份有限公司 | Intelligent television voice continuous conversation and interaction method |
CN112700781A (en) * | 2020-12-24 | 2021-04-23 | 江西台德智慧科技有限公司 | Voice interaction system based on artificial intelligence |
CN112700781B (en) * | 2020-12-24 | 2022-11-11 | 江西台德智慧科技有限公司 | Voice interaction system based on artificial intelligence |
CN113113044A (en) * | 2021-03-23 | 2021-07-13 | 北京小米移动软件有限公司 | Audio processing method and device, terminal and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111081257A (en) | Voice acquisition method, device, equipment and storage medium | |
CN111081234B (en) | Voice acquisition method, device, equipment and storage medium | |
US11055516B2 (en) | Behavior prediction method, behavior prediction system, and non-transitory recording medium | |
CN110019744A (en) | Auxiliary generates method, apparatus, equipment and the computer storage medium of meeting summary | |
CN107562403A (en) | A kind of volume adjusting method, smart machine and storage medium | |
CN109961214B (en) | Complaint butt joint processing person distribution method, complaint butt joint processing person distribution device, computer equipment and storage medium | |
CN110321863A (en) | Age recognition methods and device, storage medium | |
CN109215638B (en) | Voice learning method and device, voice equipment and storage medium | |
CN106601257A (en) | Sound identification method and device and first electronic device | |
CN106356077B (en) | A kind of laugh detection method and device | |
CN106096519A (en) | Live body discrimination method and device | |
CN108989581A (en) | A kind of consumer's risk recognition methods, apparatus and system | |
CN111062221A (en) | Data processing method, data processing device, electronic equipment and storage medium | |
CN109740530A (en) | Extracting method, device, equipment and the computer readable storage medium of video-frequency band | |
CN111651454B (en) | Data processing method and device and computer equipment | |
CN108053023A (en) | A kind of self-action intent classifier method and device | |
CN113129893B (en) | Voice recognition method, device, equipment and storage medium | |
CN109658776A (en) | Recitation fluency detection method and electronic equipment | |
CN109800675A (en) | A kind of method and device of the identification image of determining face object | |
CN109274562B (en) | Voice instruction execution method and device, intelligent household appliance and medium | |
CN105551504A (en) | Method and device for triggering function application of intelligent mobile terminal based on crying sound | |
CN112738344B (en) | Method and device for identifying user identity, storage medium and electronic equipment | |
CN112148864B (en) | Voice interaction method and device, computer equipment and storage medium | |
CN114390133A (en) | Recording method and device and electronic equipment | |
CN108563709A (en) | Exercise pushing method and device and terminal equipment |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200428 |