CN111276155A - Voice separation method, device and storage medium

Voice separation method, device and storage medium

Info

Publication number
CN111276155A
Authority
CN
China
Prior art keywords
voice
determining
sound source
source object
condition
Prior art date
Legal status
Granted
Application number
CN202010286958.7A
Other languages
Chinese (zh)
Other versions
CN111276155B (en)
Inventor
吴梅
梁志婷
徐浩
徐世超
Current Assignee
Shanghai Minglue Artificial Intelligence Group Co Ltd
Original Assignee
Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Minglue Artificial Intelligence Group Co Ltd
Publication of CN111276155A
Application granted
Publication of CN111276155B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating
    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01K - MEASURING TEMPERATURE; MEASURING QUANTITY OF HEAT; THERMALLY-SENSITIVE ELEMENTS NOT OTHERWISE PROVIDED FOR
    • G01K1/00 - Details of thermometers not specially adapted for particular types of thermometer
    • G01K1/02 - Means for indicating or recording specially adapted for thermometers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1822 - Parsing for meaning understanding
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems

Abstract

The invention provides a voice separation method, a voice separation device and a storage medium. The method includes: when a voice is detected by a voice input device, acquiring the voice and the time interval corresponding to the voice, where the distance between the position of the voice input device and the position of a first sound source object is smaller than a first threshold, the voice input device includes an environment data acquisition device, and the environment data acquisition device faces the mouth of the first sound source object; determining the environment data acquired by the environment data acquisition device in the time interval; and when the environment data meets a preset data condition, determining the voice as a first voice uttered by the first sound source object. The invention can reduce the complexity of voice separation.

Description

Voice separation method, device and storage medium
Technical Field
The invention relates to the field of computers, in particular to a voice separation method, a voice separation device and a storage medium.
Background
At present, machine-based speech recognition brings great convenience to daily life. In a conversation scene, however, it is necessary not only to recognize the semantics corresponding to the speech but also to identify which participant in the conversation uttered it.
For example, in a scenario where an attendant and a customer are talking, both of them speak, and the machine needs to recognize whether the speaker of a given utterance is the attendant or the customer; by analyzing speech that has been separated by speaker, the attendant's speech can be verified and the customer's needs can be acquired. In this scenario, the speech received by the machine must be separated into the attendant's speech and the customer's speech. At present, such speech separation is usually achieved through speech feature extraction, but feature extraction requires a large amount of computation, which makes speech separation highly complex.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a voice separation method, a voice separation device and a storage medium, which at least solve the technical problem of high complexity of voice separation.
According to an aspect of an embodiment of the present invention, there is provided a voice separation method, including: when a voice is detected by a voice input device, acquiring the voice and the time interval corresponding to the voice, where the distance between the position of the voice input device and the position of a first sound source object is smaller than a first threshold, the voice input device includes an environment data acquisition device, and the environment data acquisition device faces the mouth of the first sound source object; determining the environment data acquired by the environment data acquisition device in the time interval; and when the environment data meets a preset data condition, determining the voice as a first voice uttered by the first sound source object.
According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to execute the above-mentioned voice separation method when running.
According to another aspect of the embodiments of the present invention, there is also provided an electronic apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the voice separation method through the computer program.
In the embodiment of the present invention, when a voice is detected by a voice input device, the voice and the time interval corresponding to the voice are obtained, where the distance between the position of the voice input device and the position of a first sound source object is smaller than a first threshold, the voice input device includes an environment data acquisition device, and the environment data acquisition device faces the mouth of the first sound source object; the environment data acquired by the environment data acquisition device in the time interval is determined; and when the environment data meets a preset data condition, the voice is determined as a first voice uttered by the first sound source object. Through this process, the first voice uttered by the first sound source object can be separated out by means of the environment data acquired by the environment data acquisition device, without complex operations such as the extraction and calculation of voice features, so the complexity of voice separation can be reduced.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow diagram of an alternative voice separation method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of an alternative voice separation method according to an embodiment of the present invention;
FIG. 3 is a flow diagram of an alternative voice separation method according to an embodiment of the present invention;
FIG. 4 is a flow diagram of an alternative voice separation method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an alternative voice separation apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an alternative electronic device according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiment of the invention provides a voice separation method which, as shown in FIG. 1, includes the following steps:
S101, when a voice is detected by a voice input device, acquiring the voice and the time interval corresponding to the voice, where the distance between the position of the voice input device and the position of a first sound source object is smaller than a first threshold, the voice input device includes an environment data acquisition device, and the environment data acquisition device faces the mouth of the first sound source object;
S102, determining the environment data acquired by the environment data acquisition device in the time interval;
S103, when the environment data meets a preset data condition, determining the voice as a first voice uttered by the first sound source object.
In this embodiment of the present invention, the voice input device is an electronic device capable of receiving voice input, and may include, but is not limited to, a microphone device, a recording pen, a mobile phone with a voice input function, or a tablet with a voice input function. When the voice input device detects a voice, that voice may be a first voice uttered by a first sound source object or a second voice uttered by a second sound source object, and the voice and the time interval corresponding to the voice may be acquired, where the time interval corresponding to the voice is the period between the time point at which the voice starts and the time point at which it stops. In addition, the voice input device includes an environment data acquisition device, which may be any one of a wind speed sensor, a humidity sensor, or a temperature sensor.
Optionally, the environment data acquisition device may be placed on the side of the voice input device close to the first sound source object. For example, when the voice input device is a microphone device worn on the speaker's chest, the environment data acquisition device may be located on the upper portion of the microphone device, close to the speaker's mouth, where the environment data can be detected better and the accuracy of environment data detection is improved. Further, when the environment data acquisition device is a wind speed sensor, the preset data condition may be that the wind speed acquired by the wind speed sensor in the time interval is greater than a preset wind speed: the voice is determined as a first voice uttered by the first sound source object when the wind speed acquired in the time interval is greater than the preset wind speed, and as a second voice uttered by the second sound source object when it is less than or equal to the preset wind speed. Similarly, when the environment data acquisition device is a humidity sensor, the preset data condition may be that the humidity acquired by the humidity sensor in the time interval is greater than a preset humidity: the voice is determined as the first voice when the humidity acquired in the time interval is greater than the preset humidity, and as the second voice when it is less than or equal to the preset humidity. And when the environment data acquisition device is a temperature sensor, the preset data condition may be that the temperature acquired by the temperature sensor in the time interval is higher than a preset temperature: the voice is determined as the first voice when the temperature acquired in the time interval is higher than the preset temperature, and as the second voice when it is lower than or equal to the preset temperature.
It can be understood that the wind speed, humidity, or temperature acquired in the time interval changes under the influence of the speaker: for example, the wind speed or humidity produced when the user wearing the microphone device speaks is greater than that produced when a user farther from the microphone device speaks, and likewise for temperature. Therefore, from environment data such as wind speed, humidity, or temperature it can be determined whether the speaker of the voice received by the microphone device is a user close to the microphone device or a user far from it, and hence whether the speaker is the first sound source object or the second sound source object. The preset wind speed, preset humidity, and preset temperature are preset thresholds: if the wind speed is higher than the preset wind speed, the voice may be determined as a first voice close to the voice input device, and if it is lower than or equal to the preset wind speed, as a second voice far from the voice input device; the same rule applies to humidity compared against the preset humidity and to temperature compared against the preset temperature.
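To make the comparison rule concrete, the following minimal Python sketch illustrates it for the wind speed case. It is an illustration only: the function name, the threshold value, and the returned labels are hypothetical and are not specified by the embodiment.

```python
# Illustrative sketch of the threshold rule; the preset value and all names are hypothetical.
PRESET_WIND_SPEED = 0.5  # units depend on the sensor; the text does not specify a value

def attribute_voice(wind_speed):
    """Attribute a detected voice by the wind speed acquired over its time interval.

    A reading above the preset wind speed suggests the breath airflow of the nearby
    wearer (the first sound source object); otherwise the voice is attributed to the
    farther second sound source object.
    """
    if wind_speed > PRESET_WIND_SPEED:
        return "first voice"
    return "second voice"

print(attribute_voice(0.8))  # -> first voice
print(attribute_voice(0.2))  # -> second voice
```

The same rule applies with humidity compared against the preset humidity, or temperature against the preset temperature.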
As an optional implementation manner, before acquiring the voice and the time interval corresponding to the voice, the following steps may also be performed:
the voice input device is arranged at a position where the distance between the voice input device and the position of the first sound source object is less than the first threshold.
In the embodiment of the present invention, the first threshold is a preset distance between the voice input device and the position of the first sound source object. For example, the first threshold may be 10 centimeters, in which case the voice input device should be arranged less than 10 centimeters from the position of the first sound source object. This ensures that the voice input device is close to the first sound source object, so that the environment data acquired by the environment data acquisition device is more strongly influenced by the first sound source object, which makes it easier to identify the voice uttered by the first sound source object and improves the accuracy of voice separation.
As an optional implementation manner, after determining the environmental data acquired by the environmental data acquisition device in the time interval, the following steps may be further performed:
determining the voice as a second voice uttered by a second sound source object when the environment data does not meet the preset data condition and a second sound source object exists in the environment where the voice input device is located.
In the embodiment of the invention, the environment data is wind speed under the condition that the environment data acquisition device is a wind speed sensor, the environment data is humidity under the condition that the environment data acquisition device is a humidity sensor, and the environment data is temperature under the condition that the environment data acquisition device is a temperature sensor. Under the condition that the environmental data is the wind speed, if the wind speed is less than or equal to the preset wind speed, the environmental data is considered to be not in accordance with the preset data condition; under the condition that the environmental data is humidity, if the humidity is less than or equal to the preset humidity, the environmental data is considered not to conform to the preset data condition; and under the condition that the environmental data is the temperature, if the temperature is less than or equal to the preset temperature, the environmental data is considered not to meet the preset data condition.
Further, if a second sound source object exists in the environment where the voice input device is located, the voice may be determined as a second voice uttered by the second sound source object when the environment data does not meet the preset data condition. Optionally, an acquisition request may be sent to a camera apparatus, where the request is used to obtain face image information in the environment where the voice input device is located; after obtaining the face image information, the camera apparatus may recognize it and thereby determine whether a second sound source object exists in that environment. For example, in a shop shopping-guide scene, the first sound source object may be a shopping guide and the second sound source object may be a customer. The voice input device may be connected to the camera apparatus to learn whether a second sound source object exists in its environment; if one exists and the environment data does not meet the preset data condition, the voice may be determined as a second voice uttered by the second sound source object, thereby achieving the separation of the voices of the first sound source object and the second sound source object.
As an alternative implementation, after determining the voice as the first voice uttered by the first sound source object, the following steps may be further performed:
a first identity token is added to the first voice, the first identity token indicating an identity of the first audio source object.
As an alternative implementation, after adding the first identity token to the first voice, the following steps may be further performed:
s1, acquiring the matching degree between the first voice and a preset voice, wherein the preset voice is a voice stored corresponding to the first identity tag;
and S2, determining the first voice as the standard voice under the condition that the matching degree is higher than the second threshold value.
In the embodiment of the present invention, a first identity tag may be added to the first voice, where the first identity tag indicates the identity of the first character object close to the voice input device, that is, the identity of the first sound source object. Specifically, the first identity tag may include, but is not limited to, an identity code, a name, and the like; in the dialog scenario between the attendant and the customer described above, the first identity tag may include the attendant's identity code, name, and so on. Further, the matching degree between the first voice and a preset voice can be obtained, where the preset voice corresponds to the attendant's preset standard expression: the higher the matching degree between the first voice and the preset voice, the better the attendant's speech conforms to the standard, and the first voice can be determined as standard voice when the matching degree is higher than or equal to the second threshold. The second threshold is a preset matching degree for distinguishing standard voice from non-standard voice; when the matching degree is lower than the second threshold, the first voice can be determined as non-standard voice.
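As an illustration only (the embodiment does not disclose how the matching degree is computed), the check against the second threshold could look like the sketch below, where difflib's similarity ratio stands in for the actual matching degree between the recognized first voice and the preset voice:

```python
import difflib

SECOND_THRESHOLD = 0.8  # hypothetical cutoff separating standard from non-standard voice

def is_standard_voice(first_voice_text, preset_text):
    """Return True when the matching degree between the first voice and the preset
    voice stored for the first identity tag reaches the second threshold."""
    matching_degree = difflib.SequenceMatcher(None, first_voice_text, preset_text).ratio()
    return matching_degree >= SECOND_THRESHOLD

print(is_standard_voice("welcome, how can I help you",
                        "welcome, how can I help you today"))  # -> True
```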
As an alternative implementation, the following steps may also be performed:
s1, calling a plurality of first voices corresponding to the first identity marks and the matching degree corresponding to each first voice;
s2, determining the number of the non-standard voices, the matching degree of which is lower than a second threshold value, in the first voice and the number of the standard voices, the matching degree of which is higher than or equal to the second threshold value, in the first voice;
s3, calculating a first ratio of the number of the non-standard speeches to the total speeches and a second ratio of the number of the standard speeches to the total speeches;
s4, when the first ratio reaches a third threshold or the second ratio reaches a fourth threshold, marking an abnormal flag on the first identity flag, where the abnormal flag is used to indicate that the attendant corresponding to the first identity flag is unqualified, so as to manage the corresponding attendant according to the abnormal flag in the following.
By implementing this optional implementation, attendants whose speech is unqualified can be identified from the matching degrees of the first voices corresponding to their first identity tags, and the corresponding first identity tags can be marked with abnormal flags, which facilitates subsequent management. This strengthens the management of attendants and helps improve service quality.
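Steps S1 to S4 can be pictured as the short sketch below; the threshold values are hypothetical, and since the text leaves the handling of the second ratio open, only the first-ratio condition is implemented:

```python
SECOND_THRESHOLD = 0.8  # hypothetical matching-degree cutoff
THIRD_THRESHOLD = 0.3   # hypothetical maximum tolerated share of non-standard voices

def should_flag_abnormal(matching_degrees):
    """Given the matching degrees of all first voices under one identity tag,
    compute the first ratio (non-standard voices / total voices) and decide
    whether to mark the abnormal flag on that identity tag."""
    total = len(matching_degrees)
    non_standard = sum(1 for m in matching_degrees if m < SECOND_THRESHOLD)
    first_ratio = non_standard / total
    return first_ratio >= THIRD_THRESHOLD

print(should_flag_abnormal([0.9, 0.5, 0.95, 0.4]))  # 2/4 = 0.5 >= 0.3 -> True
```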
In the embodiment of the invention, by means of the environment data acquired by the environment data acquisition device, the detected voice can be separated into a first voice close to the voice input device and a second voice far from the voice input device. In a scene where voices close to and far from the voice input device need to be separated, no complex operations such as the extraction and calculation of voice features are required; voice separation can be achieved simply by comparing the acquired environment data against thresholds, which solves the technical problem of the high complexity of voice separation. In addition, a first identity tag can be added to the first voice, so that the identity attribution of the first voice is visible, voices belonging to different identities can be managed by category, and the efficiency of voice management is improved. Furthermore, whether the first voice is standard voice can be determined from the matching degree between the first voice and the preset voice: in a scene where an attendant wearing the voice input device converses with a customer, voice separation is achieved, the attendant's voice is matched against the standard service script, and the attendant's service quality is monitored, improving both the degree and the intelligence of attendant supervision.
As an optional implementation manner, when the environment data does not meet the preset data condition and the second sound source object exists in the environment where the voice input device is located, after determining the voice as the second voice uttered by the second sound source object, the following steps may be further performed:
performing voice recognition and semantic analysis on the second voice to acquire the requirement information of the second sound source object.
As an optional implementation manner, when the environment data does not meet the preset data condition and the second sound source object exists in the environment where the voice input device is located, after determining the voice as the second voice uttered by the second sound source object, the following steps may be further performed:
adding a second identity tag to the second voice, where the second identity tag indicates the identity of the second sound source object.
As an optional implementation, performing speech recognition and semantic analysis on the second speech to obtain the requirement information of the second sound source object may include the following steps:
s1, performing voice recognition and semantic analysis on the second voice to obtain semantic content and obtain a target text;
and S2, extracting requirement information according to the target text, wherein the requirement information at least comprises a requirement commodity name and a requirement commodity quantity.
In the embodiment of the present invention, a second identity tag may also be added to the second voice, where the second identity tag indicates the identity of the second character object far from the voice input device; in the attendant-customer dialog scenario above, the second identity tag may include, but is not limited to, a customer number, a customer name, and the like. Furthermore, semantic analysis can be performed on the second voice to obtain its semantic content, and the requirement information, which at least includes the required commodity name and the required commodity quantity, can be extracted from it.
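Purely as an illustration of the extraction step (the embodiment's semantic analysis is not disclosed), a naive pattern-based extraction of the required commodity name and quantity from the target text might look like this:

```python
import re

def extract_demand(target_text):
    """Naively pull a (quantity, commodity name) pair out of the recognized text.
    A real system would rely on the semantic-analysis step described above."""
    match = re.search(r"(\d+)\s+(.+)", target_text)
    if not match:
        return None
    return {"quantity": int(match.group(1)), "name": match.group(2).strip()}

print(extract_demand("we would like 2 mapo tofu"))
# -> {'quantity': 2, 'name': 'mapo tofu'}
```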
As an optional implementation, after extracting the requirement information according to the semantic content, the following steps may be further performed:
sending the required commodity name and the required commodity quantity included in the requirement information to a designated terminal, so that a user of the designated terminal can stock up according to the required commodity name and the required commodity quantity.
By implementing this optional implementation, the voice input device can be connected to the designated terminal in advance, and the requirement information received by the voice input device can then be sent to the designated terminal to meet the customer's needs. For example, when the dialog between the customer and the attendant is an ordering scene, the required commodity name included in the requirement information may be a dish name, the required commodity quantity may be the corresponding number of servings, and the designated terminal may be a terminal used in the kitchen. In this way the kitchen can prepare dishes efficiently, and the customer's experience can be improved.
As another optional implementation, after extracting the requirement information according to the semantic content, the following steps may be further performed:
s1, querying and acquiring historical demand information corresponding to the second identity tag in a preset database;
s2, extracting requirement characteristics corresponding to the second identity mark from the historical requirement information;
and S3, adjusting the demand information according to the demand characteristics to obtain the adjusted demand information, and sending the demand commodity name and the demand commodity quantity included in the adjusted demand information to the appointed terminal, so that a user of the appointed terminal can stock according to the demand commodity name and the demand commodity quantity.
By implementing the optional implementation manner, the historical demand information corresponding to the second identity tag can be stored, and the demand information can be adjusted according to the historical demand information, thereby perfecting the demand information and sending the adjusted demand information to the designated terminal so as to enable the user of the designated terminal to stock, for example, when the conversation scene of the waiter and the customer is the ordering scene, the historical demand information corresponding to the second identity mark comprises the historical ordering record of the customer, can extract the demand characteristics from the historical demand information, the demand characteristics can be the taste of the dishes, the demand information is adjusted according to the demand characteristics, the taste of the dishes can be supplemented to the demand information which does not contain the taste of the dishes, and then the supplemented information is sent to kitchen spare dishes, so that the intelligent degree of the spare dishes is improved, and the user experience is better.
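The adjustment in step S3 can be pictured as merging missing fields from the stored requirement features into the current requirement information; the field names below are hypothetical:

```python
def adjust_demand(demand, demand_features):
    """Supplement the current requirement information with features (e.g. dish taste)
    extracted from the historical requirement information for the second identity tag.
    Only fields absent from the current requirement information are filled in."""
    adjusted = dict(demand)
    for key, value in demand_features.items():
        adjusted.setdefault(key, value)
    return adjusted

order = {"name": "mapo tofu", "quantity": 1}  # current requirement information
history_features = {"taste": "mild"}          # features extracted from past orders
print(adjust_demand(order, history_features))
# -> {'name': 'mapo tofu', 'quantity': 1, 'taste': 'mild'}
```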
As an alternative embodiment, determining the wind speed acquired by the wind speed sensor in the time interval may include:
s1, determining a plurality of time points in a time interval, and acquiring the wind speed corresponding to each time point in the plurality of time points;
and S2, calculating the average value of the wind speed corresponding to each time point, and determining the average value as the wind speed acquired by the wind speed sensor in the time interval.
As an alternative embodiment, determining the humidity acquired by the humidity sensor in the time interval may include:
s1, determining a plurality of time points in a time interval, and acquiring the humidity corresponding to each time point in the plurality of time points;
and S2, calculating the average value of the humidity corresponding to each time point, and determining the average value as the humidity acquired by the humidity sensor in the time interval.
As an alternative embodiment, determining the temperature acquired by the temperature sensor in the time interval may include:
s1, determining a plurality of time points in a time interval, and acquiring the temperature corresponding to each time point in the plurality of time points;
s2, calculating an average value of the temperatures corresponding to each time point, and determining the average value as the temperature acquired by the temperature sensor in the time interval.
By implementing this optional implementation, the average of the readings sampled at a plurality of time points within the time interval is taken as the value acquired by the sensor in the time interval, which improves the reliability of the determined wind speed, humidity, or temperature.
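The averaging step is the same for all three sensors; a minimal sketch, in which the sampling helper is a hypothetical stand-in for reading the sensor:

```python
def interval_average(sample, time_points):
    """Sample the sensor at each time point within the interval and take the mean
    as the value 'acquired by the sensor in the time interval'."""
    readings = [sample(t) for t in time_points]
    return sum(readings) / len(readings)

# Usage with a stand-in sampling function (a real one would query the sensor):
fake_wind_sensor = {0.0: 1.0, 0.5: 2.0, 1.0: 3.0}.get
print(interval_average(fake_wind_sensor, [0.0, 0.5, 1.0]))  # -> 2.0
```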
For example, the voice separation method described in the embodiment of the present invention may be applied to a dialog scene between an attendant and a customer, where the attendant wears the voice input device. While the attendant interacts with the customer to complete a consultation service, a voice detected by the voice input device may have been uttered by either the attendant or the customer. The voice input device may be provided with a wind speed sensor, a humidity sensor, or a temperature sensor: if the wind speed, humidity, or temperature corresponding to the detected voice is high, the voice was uttered by the attendant; if it is low, the voice was uttered by the customer. The customer's voice and the attendant's voice can thus be distinguished through wind speed, humidity, or temperature detection. The customer's voice can then be processed with technologies such as voice recognition and semantic analysis to acquire the customer's requirement information, and the attendant's voice can be compared, after voice recognition and semantic analysis, with the preset standard script, so that the attendant's work is monitored and work quality is improved.
Optionally, when the environment data acquisition device includes a wind speed sensor, a humidity sensor, and a temperature sensor at the same time, the environment data acquired by the environment data acquisition device in the time interval may include wind speed, humidity, and temperature. The preset data condition may be that the wind speed is greater than the preset wind speed, the humidity is greater than the preset humidity, and the temperature is higher than the preset temperature; when all three hold, the voice may be determined as the first voice uttered by the first sound source object. Alternatively, the preset data condition may be split into three preset data sub-conditions: the first is that the wind speed is greater than the preset wind speed, the second that the humidity is greater than the preset humidity, and the third that the temperature is higher than the preset temperature. If any two of the three sub-conditions are satisfied, the voice may be determined as the first voice uttered by the first sound source object.
Further, after acquiring the wind speed, humidity, and temperature obtained in the time interval, it may also be determined whether the wind speed lies within a preset wind speed interval, whether the humidity lies within a preset humidity interval, and whether the temperature lies within a preset temperature interval. If the wind speed is not within the preset wind speed interval, there is no need to judge whether the wind speed is greater than the preset wind speed; only whether the humidity is greater than the preset humidity and whether the temperature is higher than the preset temperature are judged, and the voice is determined as the first voice uttered by the first sound source object when both hold. Likewise, if the humidity is not within the preset humidity interval, only the wind speed and temperature conditions are judged, and if the temperature is not within the preset temperature interval, only the wind speed and humidity conditions are judged; in each case the voice is determined as the first voice when the two remaining conditions hold. In this process, abnormal data can be filtered out through the preset wind speed, humidity, and temperature intervals, reducing the influence of abnormal data on the judgment result and thereby improving the accuracy of determining the first voice.
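Combining the two paragraphs above, the multi-sensor decision with anomaly filtering can be sketched as follows. All intervals and thresholds are hypothetical, and the variant shown requires every non-anomalous reading to exceed its preset value; the two-out-of-three variant described earlier would replace all(votes) with a vote count.

```python
# Hypothetical plausibility intervals and preset thresholds; none are given in the text.
RANGES = {"wind": (0.0, 10.0), "humidity": (0.0, 100.0), "temperature": (-10.0, 50.0)}
PRESETS = {"wind": 0.5, "humidity": 60.0, "temperature": 30.0}

def is_first_voice(readings):
    """readings maps a sensor name to its value averaged over the time interval.
    Readings outside their plausibility interval are treated as abnormal data and
    skipped; every remaining reading must exceed its preset threshold."""
    votes = []
    for name, value in readings.items():
        low, high = RANGES[name]
        if not low <= value <= high:
            continue  # abnormal reading filtered out, as described above
        votes.append(value > PRESETS[name])
    return bool(votes) and all(votes)

# The wind reading is outside its interval, so only humidity and temperature are judged:
print(is_first_voice({"wind": 42.0, "humidity": 75.0, "temperature": 34.5}))  # -> True
```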
In the embodiment of the invention, the voice input device is arranged at a position where its distance from the position of the first sound source object is less than the first threshold; the voice input device includes an environment data acquisition device, and the environment data acquisition device faces the mouth of the first sound source object. When the voice input device detects a voice, the voice and the time interval corresponding to the voice are acquired; the environment data acquired by the environment data acquisition device in the time interval is determined; the voice is determined as a first voice uttered by the first sound source object when the environment data meets the preset data condition, and as a second voice uttered by a second sound source object when it does not. Through this process, the voice can be separated into the first voice uttered by the first sound source object and the second voice uttered by the second sound source object by means of the environment data acquired by the environment data acquisition device. In a scene where these two voices need to be separated, no complex operations such as the extraction and calculation of voice features are required, which solves the technical problem of the high complexity of voice separation.
Referring to FIG. 2, FIG. 2 shows another voice separation method disclosed in the embodiment of the present invention. The method shown in FIG. 2 is executed when the environment data acquisition device includes a wind speed sensor, and specifically the following steps may be executed:
S201, arranging the voice input device at a position where the distance between the voice input device and the position of the first sound source object is smaller than the first threshold;
S202, when a voice is detected by the voice input device, acquiring the voice and the time interval corresponding to the voice, where the distance between the position of the voice input device and the position of the first sound source object is smaller than the first threshold, the voice input device includes a wind speed sensor, and the wind speed sensor faces the mouth of the first sound source object;
S203, determining the wind speed acquired by the wind speed sensor in the time interval;
S204, when the wind speed is greater than the preset wind speed, determining the voice as a first voice uttered by the first sound source object;
S205, adding a first identity tag to the first voice, where the first identity tag indicates the identity of the first sound source object;
S206, acquiring the matching degree between the first voice and a preset voice, where the preset voice is the voice stored in correspondence with the first identity tag;
S207, when the matching degree is higher than the second threshold, determining the first voice as standard voice;
S208, when the wind speed is less than or equal to the preset wind speed and a second sound source object exists in the environment where the voice input device is located, determining the voice as a second voice uttered by the second sound source object.
In the embodiment of the present invention, the wind speed sensor may be disposed on the voice input device facing the mouth of the first sound source object, and the wind speed acquired by the wind speed sensor in the time interval is determined: if the wind speed is greater than the preset wind speed, the voice is determined as a first voice uttered by the first sound source object, and if it is less than or equal to the preset wind speed, as a second voice uttered by the second sound source object. Measuring the wind speed thus distinguishes the voice uttered by the first sound source object, which is closer to the voice input device, from the voice uttered by the second sound source object, which is farther away. After the two voices are distinguished, a first identity tag may be added to the first voice and a second identity tag to the second voice. Further, the matching degree between the first voice and the preset voice may be obtained, and the first voice is determined as standard voice when the matching degree is higher than the second threshold. In this way, on top of separating the first voice from the second voice, the first voice is automatically verified against the standard.
Referring to FIG. 3, FIG. 3 shows another voice separation method disclosed in the embodiment of the present invention. The method shown in FIG. 3 is executed when the environment data acquisition device includes a humidity sensor, and specifically the following steps may be executed:
S301, arranging the voice input device at a position where the distance between the voice input device and the position of the first sound source object is smaller than the first threshold;
S302, when a voice is detected by the voice input device, acquiring the voice and the time interval corresponding to the voice, where the distance between the position of the voice input device and the position of the first sound source object is smaller than the first threshold, the voice input device includes a humidity sensor, and the humidity sensor faces the mouth of the first sound source object;
S303, determining the humidity acquired by the humidity sensor in the time interval;
S304, when the humidity is greater than the preset humidity, determining the voice as a first voice uttered by the first sound source object;
S305, adding a first identity tag to the first voice, where the first identity tag indicates the identity of the first sound source object;
S306, acquiring the matching degree between the first voice and a preset voice, where the preset voice is the voice stored in correspondence with the first identity tag;
S307, when the matching degree is higher than the second threshold, determining the first voice as standard voice;
S308, when the humidity is less than or equal to the preset humidity and a second sound source object exists in the environment where the voice input device is located, determining the voice as a second voice uttered by the second sound source object.
In the embodiment of the present invention, the humidity sensor may be disposed on the voice input device facing the mouth of the first sound source object, and the humidity acquired by the humidity sensor in the time interval is determined: if the humidity is greater than the preset humidity, the voice is determined as a first voice uttered by the first sound source object, and if it is less than or equal to the preset humidity, as a second voice uttered by the second sound source object. Measuring the humidity thus distinguishes the voice uttered by the first sound source object, which is closer to the voice input device, from the voice uttered by the second sound source object, which is farther away. After the two voices are distinguished, a first identity tag may be added to the first voice and a second identity tag to the second voice. Further, the matching degree between the first voice and the preset voice may be obtained, and the first voice is determined as standard voice when the matching degree is higher than the second threshold. In this way, on top of separating the first voice from the second voice, the first voice is automatically verified against the standard.
Referring to FIG. 4, FIG. 4 shows another voice separation method disclosed in the embodiment of the present invention. The method shown in FIG. 4 is executed when the environment data acquisition device includes a temperature sensor, and specifically the following steps may be executed:
S401, arranging the voice input device at a position where the distance between the voice input device and the position of the first sound source object is smaller than the first threshold;
S402, when a voice is detected by the voice input device, acquiring the voice and the time interval corresponding to the voice, where the distance between the position of the voice input device and the position of the first sound source object is smaller than the first threshold, the voice input device includes a temperature sensor, and the temperature sensor faces the mouth of the first sound source object;
S403, determining the temperature acquired by the temperature sensor in the time interval;
S404, when the temperature is higher than the preset temperature, determining the voice as a first voice uttered by the first sound source object;
S405, adding a first identity tag to the first voice, where the first identity tag indicates the identity of the first sound source object;
S406, acquiring the matching degree between the first voice and a preset voice, where the preset voice is the voice stored in correspondence with the first identity tag;
S407, when the matching degree is higher than the second threshold, determining the first voice as standard voice;
S408, when the temperature is lower than or equal to the preset temperature and a second sound source object exists in the environment where the voice input device is located, determining the voice as a second voice uttered by the second sound source object.
In the embodiment of the present invention, the temperature sensor may be disposed on the voice input device facing the mouth of the first sound source object, and the temperature acquired by the temperature sensor in the time interval is determined: if the temperature is higher than the preset temperature, the voice is determined as a first voice uttered by the first sound source object, and if it is lower than or equal to the preset temperature, as a second voice uttered by the second sound source object. Measuring the temperature thus distinguishes the voice uttered by the first sound source object, which is closer to the voice input device, from the voice uttered by the second sound source object, which is farther away. After the two voices are distinguished, a first identity tag may be added to the first voice and a second identity tag to the second voice. Further, the matching degree between the first voice and the preset voice may be obtained, and the first voice is determined as standard voice when the matching degree is higher than the second threshold. In this way, on top of separating the first voice from the second voice, the first voice is automatically verified against the standard.
According to another aspect of the embodiment of the present invention, there is also provided a voice separation apparatus for implementing the voice separation method. As shown in FIG. 5, the apparatus includes:
an obtaining unit 501, configured to, when a voice is detected by a voice input device, obtain the voice and the time interval corresponding to the voice, where the distance between the position of the voice input device and the position of a first sound source object is smaller than a first threshold, the voice input device includes an environment data acquisition device, and the environment data acquisition device faces the mouth of the first sound source object;
a first determining unit 502, configured to determine the environment data acquired by the environment data acquisition device in the time interval;
the second determining unit 503 is configured to determine, when the environment data meets the preset data condition, the voice as a first voice uttered by the first sound source object.
As an optional implementation, when the environment data acquisition device includes a wind speed sensor, the first determining unit 502 may specifically determine the wind speed acquired by the wind speed sensor in the time interval, where the environment data at least includes the wind speed; the second determining unit 503 may specifically determine the voice as the first voice uttered by the first sound source object when the wind speed is greater than the preset wind speed.
As an optional implementation, when the environment data acquisition device includes a humidity sensor, the first determining unit 502 may specifically determine the humidity acquired by the humidity sensor in the time interval, where the environment data at least includes the humidity; the second determining unit 503 may specifically determine the voice as the first voice uttered by the first sound source object when the humidity is greater than the preset humidity.
As an optional implementation, when the environment data acquisition device includes a temperature sensor, the first determining unit 502 may specifically determine the temperature acquired by the temperature sensor in the time interval, where the environment data at least includes the temperature; the second determining unit 503 may specifically determine the voice as the first voice uttered by the first sound source object when the temperature is higher than the preset temperature.
As an optional implementation, the apparatus may further include:
a setting unit, configured to arrange the voice input device, before the voice and the time interval corresponding to the voice are acquired, at a position where the distance between the voice input device and the position of the first sound source object is less than the first threshold.
As an optional implementation, the apparatus may further include:
a third determining unit, configured to, after the environment data acquired by the environment data acquisition device in the time interval has been determined, determine the voice as a second voice uttered by a second sound source object when the environment data does not meet the preset data condition and a second sound source object exists in the environment where the voice input device is located.
As an optional implementation, the apparatus may further include:
an adding unit, configured to add a first identity tag to the first voice after the voice is determined as the first voice uttered by the first sound source object, where the first identity tag indicates the identity of the first sound source object.
As an optional implementation, the apparatus may further include:
a fourth determining unit, configured to obtain the matching degree between the first voice and a preset voice after the first identity tag is added to the first voice, where the preset voice is the voice stored in correspondence with the first identity tag, and to determine the first voice as standard voice when the matching degree is higher than the second threshold.
According to yet another aspect of the embodiments of the present invention, there is also provided an electronic device for implementing the voice separation method. As shown in FIG. 6, the electronic device includes a memory 602 and a processor 604, the memory 602 stores a computer program, and the processor 604 is configured to execute, through the computer program, the steps in any of the above method embodiments, for example:
S1, when a voice is detected by a voice input device, acquiring the voice and the time interval corresponding to the voice, where the distance between the position of the voice input device and the position of the first sound source object is smaller than the first threshold, the voice input device includes an environment data acquisition device, and the environment data acquisition device faces the mouth of the first sound source object;
S2, determining the environment data acquired by the environment data acquisition device in the time interval;
S3, when the environment data meets the preset data condition, determining the voice as a first voice uttered by the first sound source object.
Alternatively, it can be understood by those skilled in the art that the structure shown in fig. 6 is only an illustration, and the electronic device may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palm computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 6 is a diagram illustrating a structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., network interfaces, etc.) than shown in FIG. 6, or have a different configuration than shown in FIG. 6.
The memory 602 may be used to store software programs and modules, such as program instructions/modules corresponding to the virtual object control method and apparatus in the embodiments of the present invention, and the processor 604 executes various functional applications and data processing by running the software programs and modules stored in the memory 602, that is, implementing the virtual object control method described above. The memory 602 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 602 may further include memory located remotely from the processor 604, which may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. As an example, as shown in fig. 6, the memory 602 may include, but is not limited to, the obtaining unit 501, the first determining unit 502, and the second determining unit 503 in the voice separating apparatus. In addition, the present invention may further include, but is not limited to, other module units in the voice separation apparatus, which are not described in this example again.
Optionally, the transmitting device 606 is used for receiving or sending data via a network. Examples of the network may include a wired network and a wireless network. In one example, the transmission device 606 includes a Network adapter (NIC) that can be connected to a router via a Network cable and other Network devices to communicate with the internet or a local area Network. In one example, the transmitting device 606 is a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
In addition, the electronic device further includes: a display 608 for displaying environmental data; and a connection bus 614 for connecting the respective module parts in the above-described electronic apparatus.
According to a further aspect of embodiments of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above-mentioned method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
s1, acquiring a time interval corresponding to the voice and the voice under the condition that the voice input equipment detects the voice, wherein the distance between the position of the voice input equipment and the position of the first sound source object is smaller than a first threshold value, the voice input equipment comprises an environment data acquisition device, and the environment data acquisition device faces to the mouth of the first sound source object;
s2, determining the environmental data acquired by the environmental data acquisition device in a time interval;
and S3, determining the voice as a first voice emitted by the first sound source object under the condition that the environment data accord with the preset data condition.
According to a further aspect of embodiments of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above-mentioned method embodiments when executed.
Alternatively, in this embodiment, a person skilled in the art may understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be substantially or partially implemented in the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, and including instructions for causing one or more computer devices (which may be personal computers, servers, or network devices) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, a division of a unit is merely a division of a logic function, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.

Claims (10)

1. A method of speech separation, comprising:
under the condition that voice is detected by voice input equipment, acquiring time intervals corresponding to the voice and the voice, wherein the distance between the position of the voice input equipment and the position of a first sound source object is smaller than a first threshold value, the voice input equipment comprises an environment data acquisition device, and the environment data acquisition device faces to the mouth of the first sound source object;
determining the environmental data acquired by the environmental data acquisition device in the time interval;
and under the condition that the environment data accord with a preset data condition, determining the voice as a first voice sent by the first sound source object.
2. The method of claim 1, wherein, in the case where the environmental data collection device comprises a wind speed sensor,
the determining the environmental data acquired by the environmental data acquisition device in the time interval includes: determining the wind speed acquired by the wind speed sensor in the time interval, wherein the environmental data at least comprise the wind speed;
determining the voice as a first voice emitted by the first sound source object under the condition that the environment data meet a preset data condition, wherein the determining includes: and determining the voice as the first voice sent by the first sound source object under the condition that the wind speed is greater than a preset wind speed.
3. The method of claim 1, wherein, where the environmental data collection device comprises a humidity sensor,
the determining the environmental data acquired by the environmental data acquisition device in the time interval includes: determining the humidity acquired by the humidity sensor in the time interval, wherein the environmental data at least comprises the humidity;
determining the voice as a first voice emitted by the first sound source object under the condition that the environment data meet a preset data condition, wherein the determining includes: and under the condition that the humidity is greater than the preset humidity, determining the voice as the first voice sent by the first sound source object.
4. The method of claim 1, wherein, where the environmental data acquisition device comprises a temperature sensor,
the determining the environmental data acquired by the environmental data acquisition device in the time interval includes: determining the temperature acquired by the temperature sensor in the time interval, wherein the environmental data at least comprises the temperature;
determining the voice as a first voice emitted by the first sound source object under the condition that the environment data meet a preset data condition, wherein the determining includes: and under the condition that the temperature is higher than a preset temperature, determining the voice as the first voice emitted by the first sound source object.
5. The method according to any one of claims 1 to 4, wherein before the obtaining the speech and the time interval corresponding to the speech, the method further comprises:
and arranging the voice input equipment at a position where the distance between the voice input equipment and the position of the first sound source object is less than the first threshold value.
6. The method according to any one of claims 1 to 4, wherein after the determining the environmental data acquired by the environmental data acquisition device in the time interval, further comprising:
and under the condition that the environment data do not accord with the preset data condition and a second sound source object exists in the environment where the voice input equipment is located, determining the voice as a second voice sent by the second sound source object.
7. The method according to any one of claims 1 to 4, wherein after the determining the voice as the first voice uttered by the first audio source object, further comprises:
adding a first identity mark to the first voice, wherein the first identity mark is used for indicating the identity of the first sound source object.
8. The method of claim 7, after adding a first identity token for the first voice, further comprising:
acquiring the matching degree between the first voice and a preset voice, wherein the preset voice is a voice stored corresponding to the first identity mark;
and determining the first voice as standard voice under the condition that the matching degree is higher than a second threshold value.
9. A speech separation apparatus, comprising:
the voice input device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a voice and a time interval corresponding to the voice under the condition that the voice input device detects the voice, the distance between the position of the voice input device and the position of a first sound source object is smaller than a first threshold value, the voice input device comprises an environment data acquisition device, and the environment data acquisition device faces to the mouth of the first sound source object;
the first determining unit is used for determining the environmental data acquired by the environmental data acquisition device in the time interval;
and the second determining unit is used for determining the voice as the first voice sent by the first sound source object under the condition that the environment data meet the preset data condition.
10. A computer-readable storage medium comprising a stored program, wherein the program when executed performs the method of any of claims 1 to 8.
CN202010286958.7A 2019-12-20 2020-04-13 Voice separation method, device and storage medium Active CN111276155B (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CN201911330988 2019-12-20
CN201911329795 2019-12-20
CN2019113309887 2019-12-20
CN201911329785 2019-12-20
CN201911329795X 2019-12-20
CN2019113297856 2019-12-20

Publications (2)

Publication Number Publication Date
CN111276155A true CN111276155A (en) 2020-06-12
CN111276155B CN111276155B (en) 2023-05-30

Family

ID=71002784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010286958.7A Active CN111276155B (en) 2019-12-20 2020-04-13 Voice separation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111276155B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010171585A (en) * 2009-01-21 2010-08-05 Kddi Corp Method, device and program for separating sound source
US20140108011A1 (en) * 2012-10-11 2014-04-17 Fuji Xerox Co., Ltd. Sound analysis apparatus, sound analysis system, and non-transitory computer readable medium
CN104811534A (en) * 2015-03-24 2015-07-29 联想(北京)有限公司 Information processing method and electronic equipment
CN106531187A (en) * 2016-11-09 2017-03-22 上海航动科技有限公司 Call center performance assessment method and system
CN106872945A (en) * 2017-04-19 2017-06-20 北京地平线信息技术有限公司 Sound localization method, device and electronic equipment
CN207149252U (en) * 2017-08-01 2018-03-27 安徽听见科技有限公司 Speech processing system
CN108156291A (en) * 2017-12-29 2018-06-12 广东欧珀移动通信有限公司 Speech signal collection method, apparatus, electronic equipment and readable storage medium storing program for executing
CN109243441A (en) * 2018-09-26 2019-01-18 广东小天才科技有限公司 Adjust bootstrap technique, device, terminal and the storage medium of voice collecting distance
CN109754811A (en) * 2018-12-10 2019-05-14 平安科技(深圳)有限公司 Sound-source follow-up method, apparatus, equipment and storage medium based on biological characteristic
CN109951794A (en) * 2019-01-31 2019-06-28 秒针信息技术有限公司 Processing method, device, storage medium and the electronic device of voice messaging
CN110097875A (en) * 2019-06-03 2019-08-06 清华大学 Interactive voice based on microphone signal wakes up electronic equipment, method and medium
CN110288214A (en) * 2019-06-14 2019-09-27 秒针信息技术有限公司 The method and device of partition of the level

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010171585A (en) * 2009-01-21 2010-08-05 Kddi Corp Method, device and program for separating sound source
US20140108011A1 (en) * 2012-10-11 2014-04-17 Fuji Xerox Co., Ltd. Sound analysis apparatus, sound analysis system, and non-transitory computer readable medium
CN104811534A (en) * 2015-03-24 2015-07-29 联想(北京)有限公司 Information processing method and electronic equipment
CN106531187A (en) * 2016-11-09 2017-03-22 上海航动科技有限公司 Call center performance assessment method and system
CN106872945A (en) * 2017-04-19 2017-06-20 北京地平线信息技术有限公司 Sound localization method, device and electronic equipment
CN207149252U (en) * 2017-08-01 2018-03-27 安徽听见科技有限公司 Speech processing system
CN108156291A (en) * 2017-12-29 2018-06-12 广东欧珀移动通信有限公司 Speech signal collection method, apparatus, electronic equipment and readable storage medium storing program for executing
CN109243441A (en) * 2018-09-26 2019-01-18 广东小天才科技有限公司 Adjust bootstrap technique, device, terminal and the storage medium of voice collecting distance
CN109754811A (en) * 2018-12-10 2019-05-14 平安科技(深圳)有限公司 Sound-source follow-up method, apparatus, equipment and storage medium based on biological characteristic
CN109951794A (en) * 2019-01-31 2019-06-28 秒针信息技术有限公司 Processing method, device, storage medium and the electronic device of voice messaging
CN110097875A (en) * 2019-06-03 2019-08-06 清华大学 Interactive voice based on microphone signal wakes up electronic equipment, method and medium
CN110288214A (en) * 2019-06-14 2019-09-27 秒针信息技术有限公司 The method and device of partition of the level

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
S. RAJARAM: "Bayesian separation of audio-visual speech sources" *
陈瑶: "面向智能助听系统的语音处理研究" *

Also Published As

Publication number Publication date
CN111276155B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
KR102387495B1 (en) Image processing method and apparatus, electronic device and storage medium
WO2019090834A1 (en) Express cabinet pickup method and apparatus based on voiceprint
US9734828B2 (en) Method and apparatus for detecting user ID changes
CN104717360A (en) Call recording method and terminal
CN109065051B (en) Voice recognition processing method and device
CN111081257A (en) Voice acquisition method, device, equipment and storage medium
CN111343028A (en) Distribution network control method and device
CN111128212A (en) Mixed voice separation method and device
US20140270112A1 (en) Voice print tagging of interactive voice response sessions
CN109949798A (en) Commercial detection method and device based on audio
CN110598008A (en) Data quality inspection method and device for recorded data and storage medium
CN110111796B (en) Identity recognition method and device
CN111311774A (en) Sign-in method and system based on voice recognition
CN108877779A (en) Method and apparatus for detecting voice tail point
US20120330663A1 (en) Identity authentication system and method
CN110992953A (en) Voice data processing method, device, system and storage medium
CN112242135A (en) Voice data processing method and intelligent customer service device
CN111047358A (en) Member information query method and system based on face recognition
CN110808062B (en) Mixed voice separation method and device
CN112562644A (en) Customer service quality inspection method, system, equipment and medium based on human voice separation
CN111276155A (en) Voice separation method, device and storage medium
CN107393530B (en) Service guiding method and device
CN109376224A (en) Corpus filter method and device
CN114155845A (en) Service determination method and device, electronic equipment and storage medium
CN111128132A (en) Voice separation method, device and system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant