CN112365895B - Audio processing method, device, computing equipment and storage medium - Google Patents


Info

Publication number
CN112365895B
Authority
CN
China
Prior art keywords
audio data
frame
user
recording device
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011072474.9A
Other languages
Chinese (zh)
Other versions
CN112365895A (en)
Inventor
谭聪慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN202011072474.9A priority Critical patent/CN112365895B/en
Publication of CN112365895A publication Critical patent/CN112365895A/en
Application granted granted Critical
Publication of CN112365895B publication Critical patent/CN112365895B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification
    • G10L 17/06: Decision making techniques; Pattern matching strategies
    • G10L 17/08: Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application discloses an audio processing method, apparatus, computing device and storage medium. After the audio data of N recording devices are acquired, for the audio data of any recording device N_i, the voiceprint similarity of each frame of the framed audio data and a reference factor are determined, where the reference factor includes a near/far-field factor of the recording device N_i or a volume factor; the target user corresponding to each frame of audio data of the N recording devices is then determined according to the voiceprint similarity and the reference factor. Because the target user corresponding to the audio data is determined from both the similarity between the audio data of each recording device and the voiceprint of each user and the reference factor, the accuracy of audio data identification can be improved.

Description

Audio processing method, device, computing equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an audio processing method and apparatus, a computing device, and a storage medium.
Background
One-to-one recording means that each user has a corresponding recording device, and each user's recording device synchronously records the audio data of that user. For example, in a meeting, each user has a single microphone that records his or her audio data, and voice activity detection (Voice Activity Detection, VAD) is used to judge whether a microphone has picked up sound, so as to determine during which time periods the microphone recorded sound and during which it did not.
However, in practical applications, each microphone may record not only the audio data of its corresponding user but also the audio data of other users. Especially when the space is relatively small or the microphone has a long pickup range, the microphone of a certain user may record the audio data of several users, and VAD alone cannot distinguish which segment of audio data in the microphone corresponds to which user. In this case, VAD cannot determine, from the audio data recorded by the microphone, which segment of speech corresponds to which user.
Based on this, the related art introduces voiceprint recognition: the audio data recorded by one microphone is framed, and the user corresponding to each frame of audio data is determined. Although the user corresponding to the audio data in a single microphone can be determined in this manner, when multiple microphones exist in the same enclosed space and record the audio data of multiple users, using voiceprint recognition alone to determine the user corresponding to each piece of audio data is prone to error.
Disclosure of Invention
The main purpose of the present application is to provide an audio processing method, apparatus, device and storage medium, aiming at improving the accuracy of audio data identification.
To achieve the above object, in a first aspect the present application provides an audio processing method that may be executed by a server or by an intelligent device having a data processing function. When the method is executed, the audio data of N recording devices are acquired first, where N is an integer; then, for the audio data of any recording device N_i, the voiceprint similarity of each frame of the framed audio data and a reference factor are determined, where i is any integer from 1 to N, and the reference factor includes a near/far-field factor of the recording device N_i or a volume factor; finally, the target user corresponding to each frame of audio data of the N recording devices is determined according to the voiceprint similarity and the reference factor.
In other words, after the audio data of the N recording devices are obtained, the voiceprint similarity and the reference factor of each frame of the framed audio data are determined, and the target user corresponding to each frame of audio data is determined based on both the voiceprint similarity and the reference factor, rather than on the voiceprint similarity alone.
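As an informal illustration of this flow, a minimal sketch is given below; it is not the claimed implementation itself, and the similarity function, the data layout and all names are assumptions made only for this example.

```python
def identify_speakers(frames_per_device, voiceprints, reference_factor, similarity_fn):
    """frames_per_device[i][t]: frame t of recording device N_i (e.g. an embedding).
    voiceprints[k]: enrolled voiceprint of user k.
    reference_factor[i][k]: near/far-field (or volume) factor of device N_i for user k.
    similarity_fn(frame, voiceprint): voiceprint similarity score a_itk.
    Returns the target user for each frame index t."""
    num_frames = len(frames_per_device[0])
    targets = []
    for t in range(num_frames):
        scores = {}
        for user, voiceprint in voiceprints.items():
            # Weighted sum over all devices of similarity x reference factor.
            scores[user] = sum(
                similarity_fn(frames[t], voiceprint) * reference_factor[i][user]
                for i, frames in enumerate(frames_per_device)
            )
        targets.append(max(scores, key=scores.get))
    return targets
```

The detailed description below walks through the same combination with concrete values.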
In an alternative embodiment, the near/far-field factor of the recording device N_i characterizes the correspondence between the recording device N_i and the users: if the recording device N_i is configured to record the audio data of user i, the near/far-field factor between the recording device N_i and user i is a first relationship value, and the near/far-field factor between the recording device N_i and users other than user i is a second relationship value, where the first relationship value is greater than the second relationship value.
In this way, when the target user corresponding to each frame of audio data is determined, the near/far-field factor of the recording device is taken into account together with the voiceprint similarity, so that the accuracy of audio data identification can be improved.
In an alternative embodiment, the volume factor is indicated by the volume of each frame of audio data of the recording device N_i, which is obtained as follows: the ratio of the average volume value of each frame of audio data to a first volume value is calculated to determine the volume of that frame, where the first volume value is the average volume value of the audio data of the recording device to which the frame belongs.
In this way, when the target user corresponding to each frame of audio data is determined, the volume factor of the recording device is taken into account together with the voiceprint similarity, so that the accuracy of audio data identification can be improved.
In an alternative implementation, the audio data of any recording device N_i is framed at a preset time interval to obtain M frames of audio data, where M is an integer; the t-th frame of audio data of the recording device N_i is compared one by one with the voiceprint data of the k-th user, and a similarity score a_itk between the t-th frame of audio data of the recording device N_i and the voiceprint data of the k-th user is determined, where t is any integer from 1 to M and k is any integer from 1 to N.
In this way, the correspondence between each frame of audio data and each user's voiceprint data can be established accurately.
In an alternative implementation, a weighted summation of the voiceprint similarity values and the reference factor is calculated to determine the target user corresponding to each frame of audio data of the N recording devices.
Performing a weighted summation of the voiceprint similarity values and the reference factor improves the accuracy of audio data identification.
In an alternative embodiment, the reference factor includes both the near/far-field factor of the recording device N_i and the volume factor. The voiceprint similarity value of each frame of audio data is weighted and summed with the near/far-field factor of the recording device corresponding to that frame to determine a first weighted value; the first weighted value is then multiplied by the volume factor of each frame of audio data to determine a second weighted value; finally, the user with the largest second weighted value is taken as the target user of each frame of audio data of the N recording devices.
When the target user corresponding to each frame of audio data is determined, both the near/far-field factor and the volume factor of the recording device are considered together with the voiceprint similarity, so that the accuracy of audio data identification can be improved.
In an alternative implementation, the frames of audio data corresponding to a first user across the N recording devices are spliced together to determine the speaking content of the first user.
In this way, the content spoken by each user can be stitched together, and a meeting record can be compiled more easily.
In a second aspect, an embodiment of the present application provides an audio processing apparatus, including an acquisition module, an audio frame data determination module and a target user determination module.
The acquisition module is used for acquiring the audio data of N recording devices, where N is an integer.
The audio frame data determination module is used for determining, for the audio data of any recording device N_i, the voiceprint similarity of each frame of the framed audio data and a reference factor, where i is any integer from 1 to N and the reference factor includes a near/far-field factor of the recording device N_i or a volume factor.
The target user determination module is used for determining the target user corresponding to each frame of audio data of the N recording devices according to the voiceprint similarity and the reference factor.
For the beneficial effects of the audio processing apparatus, refer to the description of the audio processing method in the first aspect, which is not repeated here.
In a third aspect, the present application provides an audio processing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the audio processing method in any implementation of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the audio processing method in any implementation of the first aspect.
Drawings
Fig. 1 is a schematic view of an application scenario of an audio processing method according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of an audio processing method according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an audio processing device according to an embodiment of the present application.
The objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As described in the background, the related art processes audio data based on VAD; however, VAD can only determine which recording devices have recorded sound and which have not. When multiple users are present in an enclosed space, it is impossible to determine, based on VAD alone, which user spoke the audio data collected by each recording device.
In order to determine the user corresponding to audio data more accurately, the scheme of the present application is provided. Fig. 1 shows a schematic diagram of an application scenario of the audio processing method, which includes a recording device 101, a recording device 102, user 1, user 2, and a server 103; these are given as an example, and the number of users and recording devices is not limited in practical applications. The recording device 101, the recording device 102, user 1 and user 2 are in the same space, such as a conference room of company A or a classroom of school B. In Fig. 1, it is assumed that the recording device 101, the recording device 102, user 1 and user 2 are located in the conference room of company A, the recording device 101 is allocated to user 1, and the recording device 102 is allocated to user 2. The recording device 101 and the recording device 102 may transmit the collected audio data to the server 103 for processing in a wired or wireless manner; the transmission manner of the audio data is not specifically limited here. After the server 103 receives the audio data transmitted by the recording device 101 and the recording device 102, it may execute the audio processing method of the present application to determine the target user corresponding to each frame of audio data of the recording device 101 and the recording device 102.
Next, the audio processing method of the present application is described in detail. The scheme of the present application may be executed by a server or by an intelligent device having a data processing function, such as a mobile phone or a robot. The present application is described taking a server as the execution body. Referring to the flowchart of the audio processing method shown in Fig. 2, the server may execute:
Step 201: obtain the audio data of N recording devices, where N is an integer.
Step 202: for the audio data of any recording device N_i, determine the voiceprint similarity of each frame of the framed audio data and a reference factor, where i is any integer from 1 to N and the reference factor includes a near/far-field factor of the recording device N_i or a volume factor.
Step 203: determine the target user corresponding to each frame of audio data of the N recording devices according to the voiceprint similarity and the reference factor.
It should be noted that the recording devices in step 201 are located in the same space, and the server may obtain the audio data collected by every recording device that is recording. For example, 10 recording devices are configured in conference room A but only 7 users are present, each user corresponding to one recording device; the server then obtains the audio data collected by the recording devices corresponding to those 7 users. Alternatively, all 10 recording devices in conference room A are in use and some users correspond to several recording devices, for example: the chairman's seat of conference room A has 3 recording devices and user 1 happens to be seated there, user 2 corresponds to 2 recording devices, and the other users each correspond to 1 recording device; the server then obtains the audio data collected by all 10 recording devices.
In step 202, when the audio data of a recording device is framed, framing may be performed at a preset time interval, which may be set according to the user's requirements, for example 1 second, 2 seconds, or another duration. The time interval may also be set according to the capabilities of the device: if the recording device is a microphone whose device parameters are better suited to a frame every 1.5 seconds, the audio data is framed every 1.5 seconds; if the recording device is a voice recorder whose device parameters are better suited to a frame every 2 seconds, the audio data is framed every 2 seconds. The interval may also be set according to the duration of the audio data, for example: the server acquires the audio data of 5 recording devices, each 5 minutes long, and the audio data of each recording device is divided into 300 frames at 1 second per frame. The manner in which the time interval is determined is not specifically limited here.
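As a simple illustration of framing at a preset interval, the sketch below assumes 16 kHz mono PCM samples held in a NumPy array and a 1-second frame; both values are assumptions made only for this example.

```python
import numpy as np

def frame_audio(samples, sample_rate=16000, frame_seconds=1.0):
    """Split a 1-D array of samples into fixed-length frames."""
    frame_len = int(sample_rate * frame_seconds)
    n_frames = len(samples) // frame_len
    # Any trailing partial frame is dropped for simplicity.
    return samples[:n_frames * frame_len].reshape(n_frames, frame_len)

recording = np.zeros(5 * 60 * 16000)   # placeholder 5-minute recording
frames = frame_audio(recording)        # frames.shape == (300, 16000), matching the example above
```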
If there are 10 users in the conference room, the designated number of users is 10. The framed audio data of each recording device in step 202 is compared with the voiceprints of the 10 users, and the voiceprint similarity of each frame of the framed audio data is determined.
In addition, the server also determines the reference factor of the framed audio data, so that the target user corresponding to each frame of audio data of the N recording devices can be determined according to the voiceprint similarity and the reference factor.
It should be noted that the reference factor may include the near/far-field factor of the recording device N_i or the volume factor. The near/far-field factor of the recording device N_i characterizes the correspondence between the recording device N_i and the users: if the recording device N_i is configured to record the audio data of user i, the near/far-field factor between the recording device N_i and user i is a first relationship value, and the near/far-field factor between the recording device N_i and users other than user i is a second relationship value, where the first relationship value is greater than the second relationship value. For example, there are 3 users in a conference room, user 1, user 2 and user 3, and 4 recording devices, recording device A, recording device B, recording device C and recording device D; the first relationship value is 1 and the second relationship value is 0. Assuming that recording device A and recording device B are used to record the audio data of user 1, the near/far-field factors of recording device A and recording device B with respect to user 1 are set to 1, and their factors with respect to user 2 and user 3 are set to 0; assuming that recording device C is used to record the audio data of user 2, its near/far-field factor with respect to user 2 is set to 1 and its factors with respect to user 1 and user 3 are set to 0; assuming that recording device D is used to record the audio data of user 3, its near/far-field factor with respect to user 3 is set to 1 and its factors with respect to user 1 and user 2 are set to 0. The first relationship value and the second relationship value are set according to the user's requirements; taking the first relationship value as 1 and the second relationship value as 0 is only an example, and the present application does not specifically limit their values.
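A minimal sketch of building the near/far-field factor table for the four-device example above is given below; the dictionary layout and the device-to-user assignment are assumptions made only for illustration.

```python
users = ["user1", "user2", "user3"]
owner = {"A": "user1", "B": "user1", "C": "user2", "D": "user3"}   # device -> user it records
FIRST_VALUE, SECOND_VALUE = 1.0, 0.0                               # first/second relationship values

field_factor = {
    device: {u: (FIRST_VALUE if u == recorded_user else SECOND_VALUE) for u in users}
    for device, recorded_user in owner.items()
}
# field_factor["A"] == {"user1": 1.0, "user2": 0.0, "user3": 0.0}, and so on.
```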
The volume factor is indicated by the volume of each frame of audio data of the recording device N_i. The server calculates the ratio of the average volume value of each frame of audio data to a first volume value to determine the volume of that frame, where the first volume value is the average volume value of the audio data of the recording device to which the frame belongs.
It should be noted that, although the volume alone cannot be used to identify the speaker, it still provides a useful reference: when the i-th microphone picks up a large volume, user i is likely to be speaking. The present application therefore uses the volume as follows: the volume factor is the average volume of microphone N_i in frame t divided by the average volume of microphone N_i over the whole recording. The volume factor is determined only from the properties of microphone N_i itself and not from the volume of other microphones, mainly to eliminate the additional interference caused by hardware or configuration differences between microphones. In practical applications, the calculated volume value is attributed to the user corresponding to that microphone.
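A small sketch of such a per-frame volume factor is given below; using RMS as the measure of "volume" is an assumption made only for this illustration.

```python
import numpy as np

def volume_factor(frames):
    """frames: (num_frames, frame_len) array of samples from one recording device.
    Returns one factor per frame: frame RMS divided by the average RMS of the whole recording."""
    rms_per_frame = np.sqrt((frames.astype(float) ** 2).mean(axis=1))
    overall_rms = rms_per_frame.mean()
    return rms_per_frame / (overall_rms + 1e-12)
```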
In one embodiment, for the audio data of any recording device N_i, the server may perform framing at a preset time interval to obtain M frames of audio data, where M is an integer; the t-th frame of audio data of the recording device N_i is then compared one by one with the voiceprint data of the k-th user, and a similarity score a_itk between the t-th frame of audio data of the recording device N_i and the voiceprint data of the k-th user is determined, where t is any integer from 1 to M and k is any integer from 1 to N.
The server may compare the M frames of audio data one by one with the voiceprint data of the designated users and determine a similarity score between each frame of audio data of each recording device and the voiceprint data of each designated user; the similarity score may be computed with an i-vector algorithm. The following example illustrates this: conference room A has 3 recording devices, recording device 1, recording device 2 and recording device 3, corresponding to user 1, user 2 and user 3 respectively. The 3 recording devices collect 3 minutes of audio data each; the 3 minutes of audio data of each recording device are divided into 180 frames at a 1-second interval, and the similarity between any frame of audio data of a recording device and the voiceprint of each user is computed to obtain the similarity score.
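A minimal sketch of this per-frame scoring is shown below. The description mentions an i-vector based similarity; a cosine similarity between hypothetical frame embeddings and enrolled voiceprint embeddings stands in for it here, purely for illustration.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity stands in for the voiceprint similarity score a_itk.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def similarity_scores(frame_embeddings, voiceprints):
    """frame_embeddings: (M, D) array of embeddings for one recording device N_i.
    voiceprints: dict mapping user k -> (D,) enrolled voiceprint embedding.
    Returns a list where entry t maps user k to the score a_itk."""
    return [
        {user: cosine(emb, vp) for user, vp in voiceprints.items()}
        for emb in frame_embeddings
    ]
```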
Table 1 shows the similarity scores between the 1st frame of audio data of each recording device and each user's voiceprint. For example, the similarity between the 1st frame of audio data of recording device 1 and the voiceprint of user 1 is 0.1, and the similarity between the 1st frame of audio data of recording device 2 and the voiceprint of user 1 is 0.5; the remaining entries of Table 1 are read in the same way.
TABLE 1
1st frame of audio data User 1 User 2 User 3
Recording device 1 0.1 0.4 0.5
Recording device 2 0.5 0.8 0.6
Recording device 3 0.2 0.7 0.4
Total score 0.8 1.9 1.5
After the voiceprint similarity is determined in the above manner, a reference factor is introduced when the target user corresponding to the audio data is determined: the similarity between the audio frame data and the voiceprints is combined with the reference factor, so that the target user corresponding to each frame of audio data is judged more accurately. In practice, a weighted summation of the similarity values and the reference factor is calculated to determine the target user corresponding to each frame of audio data of the N recording devices.
In addition, the reference factor is not limited to the near/far-field factor of the recording device N_i and the volume factor; any parameter that helps determine the target user of the audio data is applicable to the present application, for example the speech rate at which a user speaks in the audio data (different users may speak at different rates, which can be used in combination to distinguish them).
In the following, the reference factor is illustrated as including the near/far-field factor of the recording device N_i, the volume factor, or both; that is, the reference factor may take the following three cases:
Case 1: the reference factor may include the near/far-field factor of the recording device N_i.
Case 2: the reference factor may include the volume factor.
Case 3: the reference factor may include both the near/far-field factor of the recording device N_i and the volume factor.
For example, in scenario 1, three people, user A, user B and user C, are in a meeting in conference room 1. User A is seated at the presenter's position, which has 3 microphones placed side by side, microphone 1, microphone 2 and microphone 3; microphone 4 corresponds to user B's position and microphone 5 corresponds to user C's position. The near/far-field factors of microphone 1, microphone 2 and microphone 3 with respect to user A may be set to the first relationship value, and the near/far-field factors of microphone 4 and microphone 5 with respect to user A to the second relationship value; the near/far-field factor of microphone 4 with respect to user B is set to the first relationship value, and the near/far-field factors of microphone 1, microphone 2, microphone 3 and microphone 5 with respect to user B to the second relationship value; the near/far-field factor of microphone 5 with respect to user C may be set to the first relationship value, and the near/far-field factors of microphone 1, microphone 2, microphone 3 and microphone 4 with respect to user C to the second relationship value, as shown in Table 2. For illustration, the first relationship value is set to 2 and the second relationship value to 1 in Table 2; in practical applications the two values may be set according to actual requirements and are not limited here.
TABLE 2
Near/far-field factor User A User B User C
Microphone 1 2 1 1
Microphone 2 2 1 1
Microphone 3 2 1 1
Microphone 4 1 2 1
Microphone 5 1 1 2
Continuing the example above, users A, B and C are in a meeting in conference room 1; user A is seated at the presenter's position, which has 3 side-by-side microphones, microphone 1, microphone 2 and microphone 3, microphone 4 corresponds to user B's position, and microphone 5 corresponds to user C's position. For each microphone, the volume of the 1st frame of audio data may be calculated. For example, if the average volume of the 1st frame of audio data in microphone 1 is 0.3 and the average volume of the entire recording of microphone 1 is 0.2, the volume of the 1st frame of audio data in microphone 1 is 0.3/0.2, i.e. 1.5. The volumes corresponding to the other microphones are obtained in the same way and are not detailed here; see Table 3.
TABLE 3
Volume factor User A User B User C
Microphone 1 1.5 0 0
Microphone 2 1.5 0 0
Microphone 3 1.5 0 0
Microphone 4 0 2 0
Microphone 5 0 0 2.5
Next, the three schemes in the present application for determining the target user corresponding to the audio data are briefly explained with reference to specific examples.
Assume that in scenario 1 above, the voiceprint similarity values between each microphone and each user determined by applying the scheme of the present application are as shown in Table 4 below.
TABLE 4
Voiceprint similarity User A User B User C
Microphone 1 0.1 0.5 0.4
Microphone 2 0.1 0.5 0.3
Microphone 3 0.1 0.5 0.3
Microphone 4 0.2 0.7 0.6
Microphone 5 0.2 0.5 0.4
Combining the reference factor of case 1, the similarity values in Table 4 may be weighted and summed with the near/far-field factors in Table 2 to determine the target user. For user A, the similarity value between each microphone's 1st frame of audio data and user A's voiceprint is multiplied by the corresponding relationship value for user A, and the products are added to obtain user A's score for the 1st frame of audio data, i.e. 1.0 (0.1×2+0.1×2+0.1×2+0.2×1+0.2×1); the scores of the other users are calculated in the same way and are not detailed here, see Table 5. From the data in Table 5, the target user corresponding to the 1st frame of audio data is user B.
TABLE 5
User A User B User C
Microphone 1 0.1*2 0.5*1 0.4*1
Microphone 2 0.1*2 0.5*1 0.3*1
Microphone 3 0.1*2 0.5*1 0.3*1
Microphone 4 0.2*1 0.7*2 0.6*1
Microphone 5 0.2*1 0.5*1 0.4*2
Total score 1.0 3.4 2.4
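The case-1 computation behind Table 5 can be reproduced with the following sketch (the values are copied from Tables 2 and 4 above; the variable names are only for this example):

```python
similarity = {   # voiceprint similarity of the 1st frame, per microphone and user (Table 4)
    "mic1": {"A": 0.1, "B": 0.5, "C": 0.4},
    "mic2": {"A": 0.1, "B": 0.5, "C": 0.3},
    "mic3": {"A": 0.1, "B": 0.5, "C": 0.3},
    "mic4": {"A": 0.2, "B": 0.7, "C": 0.6},
    "mic5": {"A": 0.2, "B": 0.5, "C": 0.4},
}
field = {        # near/far-field factors (Table 2)
    "mic1": {"A": 2, "B": 1, "C": 1},
    "mic2": {"A": 2, "B": 1, "C": 1},
    "mic3": {"A": 2, "B": 1, "C": 1},
    "mic4": {"A": 1, "B": 2, "C": 1},
    "mic5": {"A": 1, "B": 1, "C": 2},
}
scores = {
    user: sum(similarity[m][user] * field[m][user] for m in similarity)
    for user in ("A", "B", "C")
}
# scores == {"A": 1.0, "B": 3.4, "C": 2.4}; the target user for this frame is B.
target = max(scores, key=scores.get)
```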
Assume that in scenario 1 above, the reference factor of case 2 is combined: the similarity values are weighted with the volumes to determine the target user corresponding to the audio data. For each user, the similarity value between each microphone's 1st frame of audio data and that user's voiceprint is multiplied by the volume factor of the corresponding microphone for that user, and the products are added to obtain the user's score for the 1st frame of audio data; for example, the score of user B is 1.4 (0.5×0+0.5×0+0.5×0+0.7×2+0.5×0). The scores of the other users are calculated in the same way and are not detailed here, see Table 6. From the data in Table 6, the target user corresponding to the 1st frame of audio data is user B.
TABLE 6
User A User B User C
Microphone 1 0.1*1.5 0.5*0 0.4*0
Microphone 2 0.1*1.5 0.5*0 0.3*0
Microphone 3 0.1*1.5 0.5*0 0.3*0
Microphone 4 0.2*0 0.7*2 0.6*0
Microphone 5 0.2*0 0.5*0 0.4*2.5
Total score 0.45 1.4 1.0
Assume that in scenario 1 above, the scheme of the present application is applied with the reference factor of case 3: the voiceprint similarity value of each frame of audio data is weighted and summed with the near/far-field factor of the recording device corresponding to that frame to determine a first weighted value; the first weighted value is then multiplied by the volume factor of each frame of audio data to determine a second weighted value; and the user with the largest second weighted value is taken as the target user of each frame of audio data of the N recording devices. That is, on the basis of the results in Table 5, a multiplication with the volume factor is performed to determine the target user: for user C, the first weighted value 2.4 (the sum of similarity × relationship value) is multiplied by the corresponding volume value 2.5 to obtain the second weighted value 2.4×2.5=6; the second weighted values of the other users are obtained in the same way and are not detailed here, see Table 7. Since the second weighted value of user B is the largest, the target user corresponding to the 1st frame of audio data is determined to be user B.
TABLE 7
User A User B User C
Microphone 1 0.1*2 0.5*1 0.4*1
Microphone 2 0.1*2 0.5*1 0.3*1
Microphone 3 0.1*2 0.5*1 0.3*1
Microphone 4 0.2*1 0.7*2 0.6*1
Microphone 5 0.2*1 0.5*1 0.4*2
First weighted value 1.0 3.4 2.4
Second weighted value 1.5 (1.0*1.5) 6.8 (3.4*2) 6.0 (2.4*2.5)
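The case-3 combination behind Table 7 can be reproduced with the following sketch (values copied from Tables 3 and 5 above; the names are only for this example):

```python
first_weighted = {"A": 1.0, "B": 3.4, "C": 2.4}   # from Table 5: sum of similarity x field factor
volume = {"A": 1.5, "B": 2.0, "C": 2.5}           # volume factor of each user's own microphone (Table 3)

second_weighted = {user: first_weighted[user] * volume[user] for user in first_weighted}
# second_weighted == {"A": 1.5, "B": 6.8, "C": 6.0}
target_user = max(second_weighted, key=second_weighted.get)   # "B"
```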
After the target user corresponding to each frame of audio data of each recording device is determined, the frames of audio data corresponding to a first user across the N recording devices may be spliced together to determine the speaking content of the first user. In this way, the speaking content of each user can be compiled, which makes it easier to organise a conference summary and the like. For example, in scenario 1, the audio data of microphones 1-5 is divided into 5 frames, and the audio processing method determines that the 1st frame was spoken by user B, the 2nd frame by user A, the 3rd frame by user C, the 4th frame by user A, and the 5th frame by user C. The audio data of the 2nd and 4th frames can then be compiled into the speaking content of user A, the audio data of the 1st frame into the speaking content of user B, and the audio data of the 3rd and 5th frames into the speaking content of user C. The speaking content of users A, B and C can then be organised into a conference summary, so as to better record the content of the conference.
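A minimal sketch of this per-user splicing, following the scenario-1 example (the frame values are placeholders, and frame order is preserved for each user):

```python
from collections import defaultdict

frame_speakers = ["B", "A", "C", "A", "C"]                           # target user determined per frame
frames = ["frame1", "frame2", "frame3", "frame4", "frame5"]          # placeholder audio frames

speech = defaultdict(list)
for frame, user in zip(frames, frame_speakers):
    speech[user].append(frame)

# speech["A"] == ["frame2", "frame4"], speech["B"] == ["frame1"], speech["C"] == ["frame3", "frame5"]
```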
Based on the same concept, an embodiment of the present application provides an audio processing apparatus which, as shown in Fig. 3, includes an acquisition module 31, an audio frame data determination module 32 and a target user determination module 33.
The acquiring module 31 is configured to acquire audio data of N recording devices; the N is an integer; an audio frame data determining module 32, configured to determine, for audio data of any recording device N i, a voiceprint similarity of any frame of the framed audio data, and a reference factor; the i is any integer from 1 to N; the reference factors include: the sound recording equipment N i is a far-near field factor or a volume factor; and the target user determining module 33 is configured to determine target users corresponding to audio data of each frame of the N recording devices according to the voiceprint similarity and the reference factors.
After the audio data of the N recording devices are obtained, the voiceprint similarity and the reference factors of any frame of the audio data after framing are determined, and the target users corresponding to the audio frame data are determined based on the voiceprint similarity and the reference factors.
In an alternative embodiment, the near/far-field factor of the recording device N_i characterizes the correspondence between the recording device N_i and the users: if the recording device N_i is configured to record the audio data of user i, the near/far-field factor between the recording device N_i and user i is a first relationship value, and the near/far-field factor between the recording device N_i and users other than user i is a second relationship value, where the first relationship value is greater than the second relationship value.
In this way, when the target user corresponding to each frame of audio data is determined, the near/far-field factor of the recording device is taken into account together with the voiceprint similarity, so that the accuracy of audio data identification can be improved.
In an alternative embodiment, the volume factor is indicated by the volume of each frame of audio data of the recording device N_i, and the audio frame data determination module 32 is used to: calculate the ratio of the average volume value of each frame of audio data to a first volume value to determine the volume of that frame, where the first volume value is the average volume value of the audio data of the recording device to which the frame belongs.
In this way, when the target user corresponding to each frame of audio data is determined, the volume factor of the recording device is taken into account together with the voiceprint similarity, so that the accuracy of audio data identification can be improved.
In an alternative implementation, the audio data of any recording device N_i is framed at a preset time interval to obtain M frames of audio data, where M is an integer; the audio frame data determination module 32 is used to compare the t-th frame of audio data of the recording device N_i one by one with the voiceprint data of the k-th user and determine a similarity score a_itk between the t-th frame of audio data of the recording device N_i and the voiceprint data of the k-th user, where t is any integer from 1 to M and k is any integer from 1 to N.
In this way, the correspondence between each frame of audio data and each user's voiceprint data can be established accurately.
In an alternative embodiment, the audio frame data determination module 32 is used to perform a weighted summation of the voiceprint similarity values and the reference factor to determine the target user corresponding to each frame of audio data of the N recording devices.
Performing a weighted summation of the voiceprint similarity values and the reference factor improves the accuracy of audio data identification.
In an alternative embodiment, the reference factor includes both the near/far-field factor of the recording device N_i and the volume factor. The audio frame data determination module 32 is used to perform a weighted summation of the voiceprint similarity value of each frame of audio data and the near/far-field factor of the recording device corresponding to that frame to determine a first weighted value; then multiply the first weighted value by the volume factor of each frame of audio data to determine a second weighted value; and finally take the user with the largest second weighted value as the target user of each frame of audio data of the N recording devices.
When the target user corresponding to each frame of audio data is determined, both the near/far-field factor and the volume factor of the recording device are considered together with the voiceprint similarity, so that the accuracy of audio data identification can be improved.
In an optional implementation, the apparatus further includes a data splicing module, used for splicing the frames of audio data corresponding to a first user across the N recording devices and determining the speaking content of the first user.
In this way, the content spoken by each user can be stitched together, and a meeting record can be compiled more easily.
Having described the audio processing method and apparatus of exemplary embodiments of the present application, an audio processing device of another exemplary embodiment of the present application is described next.
Those skilled in the art will appreciate that the various aspects of the application may be implemented as a system, method, or program product. Accordingly, aspects of the application may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit", "module" or "system".
In some possible embodiments, an audio processing device according to the application may comprise at least one processor and at least one memory. Wherein the memory stores a computer program which, when executed by the processor, causes the processor to perform the steps in the audio processing method according to various exemplary embodiments of the application described in the present specification. For example, the processor may perform steps 201-203 as shown in fig. 2.
An audio processing device 40 according to this embodiment of the present application is described below with reference to fig. 4. The audio processing device 40 shown in fig. 4 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments of the present application. As shown in fig. 4, the audio processing device 40 is in the form of a general-purpose intelligent terminal. Components of audio processing device 40 may include, but are not limited to: the at least one processor 41, the at least one memory 42, a bus 43 connecting the different system components, including the memory 42 and the processor 41.
Bus 43 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, and a local bus using any of a variety of bus architectures. Memory 42 may include readable media in the form of volatile memory such as Random Access Memory (RAM) 421 and/or cache memory 422, and may further include Read Only Memory (ROM) 423. Memory 42 may also include a program/utility 424 having a set (at least one) of program modules 424, such program modules 424 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The audio processing device 40 may also communicate with one or more external devices 44 (e.g., keyboard, pointing device, etc.), and/or with any device (e.g., router, modem, etc.) that enables the audio processing device 40 to communicate with one or more other intelligent terminals. Such communication may occur through an input/output (I/O) interface 44. Also, the audio processing device 40 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 46. As shown, the network adapter 46 communicates with other modules for the audio processing device 40 over the bus 43. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in connection with audio processing device 40, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
In some possible embodiments, aspects of the audio processing method provided by the present application may also be implemented in the form of a program product comprising a computer program for causing a computer device to carry out the steps of the audio processing method according to the various exemplary embodiments of the application as described in the present specification when the program product is run on the computer device. For example, the processor may perform steps 201-203 as shown in fig. 2.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product for audio processing of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and comprise a computer program and may run on a smart terminal. The program product of the present application is not limited thereto, but in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave in which a readable computer program is embodied. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this is not required or suggested that these operations must be performed in this particular order or that all of the illustrated operations must be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the application, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (7)

1. An audio processing method, comprising:
acquiring audio data of N recording devices located in the same space, where N is an integer and the users correspond one to one to the recording devices;
for the audio data of any recording device N_i, determining the voiceprint similarity of each frame of the framed audio data and reference factors, where i is any integer from 1 to N; the reference factors include a near/far-field factor of the recording device N_i and a volume factor; the near/far-field factor of the recording device N_i characterizes the correspondence between the recording device N_i and the users; the recording device N_i is configured to record the audio data of user i, the near/far-field factor between the recording device N_i and user i is a first relationship value, the near/far-field factor between the recording device N_i and users other than user i is a second relationship value, and the first relationship value is greater than the second relationship value;
performing a weighted summation of the voiceprint similarity value of each frame of audio data and the near/far-field factor of the recording device corresponding to that frame to determine a first weighted value;
multiplying the first weighted value by the volume factor of each frame of audio data to determine a second weighted value; and
taking the user with the largest second weighted value as the target user of each frame of audio data of the N recording devices.
2. The method of claim 1, wherein the volume factor is indicated by the volume of each frame of audio data of the recording device N_i, and the volume of each frame of audio data of the recording device N_i is obtained as follows:
calculating the ratio of the average volume value of each frame of audio data to a first volume value to determine the volume of that frame, where the first volume value is the average volume value of the audio data of the recording device to which the frame belongs.
3. The method of claim 2, wherein determining the voiceprint similarity of each frame of the audio data for any recording device N_i comprises:
framing the audio data of any recording device N_i at a preset time interval to determine M frames of audio data, where M is an integer; and
comparing the t-th frame of audio data of the recording device N_i one by one with the voiceprint data of the k-th user, and determining a similarity score a_itk between the t-th frame of audio data of the recording device N_i and the voiceprint data of the k-th user, where t is any integer from 1 to M and k is any integer from 1 to N.
4. The method according to any one of claims 1-3, further comprising:
splicing the frames of audio data corresponding to a first user across the N recording devices, and determining the speaking content of the first user.
5. An audio processing apparatus, comprising:
an acquisition module, used for acquiring audio data of N recording devices located in the same space, where N is an integer and the users correspond one to one to the recording devices;
an audio frame data determination module, used for determining, for the audio data of any recording device N_i, the voiceprint similarity of each frame of the framed audio data and reference factors, where i is any integer from 1 to N; the reference factors include a near/far-field factor of the recording device N_i and a volume factor; the near/far-field factor of the recording device N_i characterizes the correspondence between the recording device N_i and the users; the recording device N_i is configured to record the audio data of user i, the near/far-field factor between the recording device N_i and user i is a first relationship value, the near/far-field factor between the recording device N_i and users other than user i is a second relationship value, and the first relationship value is greater than the second relationship value; and
a target user determination module, used for performing a weighted summation of the voiceprint similarity value of each frame of audio data and the near/far-field factor of the recording device corresponding to that frame to determine a first weighted value;
multiplying the first weighted value by the volume factor of each frame of audio data to determine a second weighted value; and
taking the user with the largest second weighted value as the target user of each frame of audio data of the N recording devices.
6. An audio processing apparatus, characterized in that the audio processing apparatus comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the audio processing method according to any one of claims 1-4.
7. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the audio processing method according to any one of claims 1-4.
CN202011072474.9A 2020-10-09 2020-10-09 Audio processing method, device, computing equipment and storage medium Active CN112365895B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011072474.9A CN112365895B (en) 2020-10-09 2020-10-09 Audio processing method, device, computing equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011072474.9A CN112365895B (en) 2020-10-09 2020-10-09 Audio processing method, device, computing equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112365895A CN112365895A (en) 2021-02-12
CN112365895B true CN112365895B (en) 2024-04-19

Family

ID=74507810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011072474.9A Active CN112365895B (en) 2020-10-09 2020-10-09 Audio processing method, device, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112365895B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113793615B (en) * 2021-09-15 2024-02-27 北京百度网讯科技有限公司 Speaker recognition method, model training method, device, equipment and storage medium
CN117118956B (en) * 2023-10-25 2024-01-19 腾讯科技(深圳)有限公司 Audio processing method, device, electronic equipment and computer readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012023268A1 (en) * 2010-08-16 2012-02-23 日本電気株式会社 Multi-microphone talker sorting device, method, and program
CN106599866A (en) * 2016-12-22 2017-04-26 上海百芝龙网络科技有限公司 Multidimensional user identity identification method
WO2017114307A1 (en) * 2015-12-30 2017-07-06 中国银联股份有限公司 Voiceprint authentication method capable of preventing recording attack, server, terminal, and system
CN108604449A (en) * 2015-09-30 2018-09-28 苹果公司 speaker identification
CN108962260A (en) * 2018-06-25 2018-12-07 福来宝电子(深圳)有限公司 A kind of more human lives enable audio recognition method, system and storage medium
CN110400566A (en) * 2019-06-27 2019-11-01 联想(北京)有限公司 Recognition methods and electronic equipment
CN111640437A (en) * 2020-05-25 2020-09-08 中国科学院空间应用工程与技术中心 Voiceprint recognition method and system based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020139121A1 (en) * 2018-12-28 2020-07-02 Ringcentral, Inc., (A Delaware Corporation) Systems and methods for recognizing a speech of a speaker

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012023268A1 (en) * 2010-08-16 2012-02-23 日本電気株式会社 Multi-microphone talker sorting device, method, and program
CN108604449A (en) * 2015-09-30 2018-09-28 苹果公司 speaker identification
WO2017114307A1 (en) * 2015-12-30 2017-07-06 中国银联股份有限公司 Voiceprint authentication method capable of preventing recording attack, server, terminal, and system
CN106599866A (en) * 2016-12-22 2017-04-26 上海百芝龙网络科技有限公司 Multidimensional user identity identification method
CN108962260A (en) * 2018-06-25 2018-12-07 福来宝电子(深圳)有限公司 A kind of more human lives enable audio recognition method, system and storage medium
CN110400566A (en) * 2019-06-27 2019-11-01 联想(北京)有限公司 Recognition methods and electronic equipment
CN111640437A (en) * 2020-05-25 2020-09-08 中国科学院空间应用工程与技术中心 Voiceprint recognition method and system based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Speech tracking based on cluster analysis and speaker recognition; Hao Min; Liu Hang; Li Yang; Jian Dan; Wang Junying; Computer and Modernization (04); full text *
Development status and trend outlook of in-vehicle voice interaction technology; Li Yuxian; Li Di; Zang Jinhuan; Intelligent Connected Vehicles (06); full text *

Also Published As

Publication number Publication date
CN112365895A (en) 2021-02-12

Similar Documents

Publication Publication Date Title
US11483434B2 (en) Method and apparatus for adjusting volume of user terminal, and terminal
US10424317B2 (en) Method for microphone selection and multi-talker segmentation with ambient automated speech recognition (ASR)
CN107015781B (en) Speech recognition method and system
CN105405439B (en) Speech playing method and device
US7995732B2 (en) Managing audio in a multi-source audio environment
CN112365895B (en) Audio processing method, device, computing equipment and storage medium
CN108417201B (en) Single-channel multi-speaker identity recognition method and system
CN108922553B (en) Direction-of-arrival estimation method and system for sound box equipment
CN104078045A (en) Identifying method and electronic device
CN109785846A (en) The role recognition method and device of the voice data of monophonic
CN111325082A (en) Personnel concentration degree analysis method and device
CN111223487B (en) Information processing method and electronic equipment
CN114677634A (en) Surface label identification method and device, electronic equipment and storage medium
CN114240342A (en) Conference control method and device
CN113591678A (en) Classroom attention determination method, device, equipment, storage medium and program product
CN112487246A (en) Method and device for identifying speakers in multi-person video
CN113571082A (en) Voice call control method and device, computer readable medium and electronic equipment
WO2020015546A1 (en) Far-field speech recognition method, speech recognition model training method, and server
CN112216285B (en) Multi-user session detection method, system, mobile terminal and storage medium
CN110633066B (en) Voice acquisition method, system, mobile terminal and storage medium
CN113707149A (en) Audio processing method and device
Inoue et al. Speaker diarization using eye-gaze information in multi-party conversations
CN112750448A (en) Sound scene recognition method, device, equipment and storage medium
Jie et al. Recognize the most dominant person in multi-party meetings using nontraditional features
CN112542169B (en) Voice recognition processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant