WO2022012215A1 - Method, apparatus, and device for identifying a speaking object, and readable storage medium - Google Patents

Method, apparatus, and device for identifying a speaking object, and readable storage medium (一种识别说话对象的方法、装置、设备及可读存储介质)

Info

Publication number
WO2022012215A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
frame
speaking
value
preset
Prior art date
Application number
PCT/CN2021/098659
Other languages
English (en)
French (fr)
Inventor
谭聪慧
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司
Publication of WO2022012215A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification techniques
    • G10L17/20 — Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions

Definitions

  • The present invention relates to the technical field of voiceprint recognition in financial technology (Fintech), and in particular to a method, apparatus, and device for recognizing speaking objects, and a readable storage medium.
  • In the related art, audio is recorded during a meeting to acquire an audio file, and speech recognition is performed on the audio file so as to determine which object in the meeting is speaking and what that object says.
  • However, the related art merely determines the similarity between the voice information in the audio file and the voices of multiple speakers, and identifies the speaker corresponding to the voice information directly from that similarity.
  • With such a scheme, a speaker's voice may change or resemble another speaker's, resulting in poor speaker-recognition accuracy for the audio file.
  • The main purpose of the present invention is therefore to provide a method, apparatus, and device for recognizing speaking objects, and a readable storage medium, aiming to solve the technical problem of poor speaker-recognition accuracy in the related art.
  • To this end, the present invention provides a method for recognizing speaking objects, the method including: determining the similarity values of the speaking objects in a target audio file;
  • determining, according to the determined similarity values and a preset dynamic programming model, the decision value of each speaking object for any frame file of the target audio file, wherein the preset dynamic programming model determines the decision values of a frame file's speaking objects according to the speaking objects of the frame files before and after it; and
  • determining the speaking object corresponding to the maximum decision value as the speaking object of the frame file, so as to identify the speaking objects of all frame files in the target audio file.
  • Determining the decision value of each speaking object for any frame file of the target audio file according to the determined similarity values and the preset dynamic programming model includes:
  • processing the similarities of the M speaking objects of the frame file according to a preset state transition condition to determine the correlation values of the M speaking objects for the frame file, wherein the preset state transition condition recursively derives the correlation values of the frame file from the correlation values of its preceding frame file, and M is a positive integer greater than 1; and
  • determining, according to the correlation values and the preset dynamic programming model, the decision values of the M speaking objects for the frame file.
  • Determining, according to the correlation values and the preset dynamic programming model, the decision values of the M speaking objects for the frame file includes:
  • if the frame file is the last frame file, determining a first preset dynamic programming sub-model from the preset dynamic programming model, and determining the decision values of the M speaking objects of the frame file according to their correlation values and the first preset sub-dynamic programming model; and,
  • if the frame file is not the last frame file, determining a second preset dynamic programming sub-model from the preset dynamic programming model, and determining the decision values of the M speaking objects of the frame file according to their correlation values and the second preset sub-dynamic programming model.
  • The first preset sub-dynamic programming model outputs the correlation values of the M speaking objects as the decision values of the M speaking objects of the last frame file.
  • For the second preset sub-dynamic programming model, the correlation value of any speaking object of the frame file, the similarity of the speaking object of the next frame file, and the indicator value of whether the next frame file's speaking object is the same as that speaking object, together with a preset weight value, are input into the second preset sub-dynamic programming model to obtain the decision value of that speaking object, so as to determine the decision values of the M speaking objects of the frame file.
  • The second preset sub-dynamic programming model adds the speaking object's correlation value to the similarity of the next frame file's speaking object to obtain a first processed value, determines a second processed value according to the indicator value and the preset weight value, and determines the decision value according to the first and second processed values.
  • Before determining the similarity values of the speaking objects of the target audio file, the method further includes:
  • determining attribute information of an audio file, wherein the attribute information indicates whether recording of the audio file is complete;
  • if the attribute information indicates that the audio file is a completed recording, dividing the audio file based on a preset frame length to determine the target audio file; and, if the audio file is determined to be a file being recorded, dividing the first file, i.e., the already-recorded portion of the recording, based on the preset frame length to determine a first audio file, and using the first audio file as the target audio file.
  • The method further includes:
  • if the speaking object of every frame file in the first audio file has been identified, dividing the second file that has been recorded in the recording, based on the preset frame length, to determine a second audio file, and using the second audio file as the target audio file;
  • wherein the second file represents the audio of the recording that has been recorded but whose speaking objects have not yet been identified.
  • In addition, the present invention also provides an apparatus for recognizing speaking objects, the apparatus including:
  • a first determining unit, configured to determine the similarity values of the speaking objects in a target audio file;
  • a second determining unit, configured to determine, according to the determined similarity values and a preset dynamic programming model, the decision value of each speaking object for any frame file of the target audio file, wherein the preset dynamic programming model determines the decision values of a frame file's speaking objects according to the speaking objects of the frame files before and after it; and
  • an identification unit, configured to determine the speaking object corresponding to the maximum decision value as the speaking object of the frame file, so as to identify the speaking objects of all frame files in the target audio file.
  • The present invention also provides a device for recognizing speaking objects, which includes a memory, a processor, and a program for recognizing speaking objects that is stored on the memory and runnable on the processor; when the program is executed by the processor, the steps of the method for recognizing speaking objects are implemented.
  • The present invention also provides a computer-readable storage medium on which a program for recognizing speaking objects is stored; when the program is executed by a processor, the steps of the above method for recognizing speaking objects are implemented.
  • The present invention also provides a computer program product containing instructions; the computer program product includes a computer program stored on a computer-readable storage medium, and the computer program includes program instructions that, when executed by a computer, cause the computer to implement the steps of the method for recognizing speaking objects described above.
  • In the embodiments of the present invention, the similarity values of the speaking objects in the target audio file can be determined; then, according to these similarity values and the preset dynamic programming model, the decision value of each speaking object for any frame file is determined, and the speaking object corresponding to the maximum decision value is determined as the speaking object of that frame file.
  • That is, the method provided in the embodiments of the present invention does not rely only on the similarity values of a frame file's speaking objects; it also applies the preset dynamic programming model, determining the decision values comprehensively from the speaking objects of the frame files before and after the frame file. Selecting the speaking object with the maximum decision value as the frame file's speaking object identifies the speaking objects of the target audio file more accurately.
  • FIG. 1 is a schematic flowchart of a method for identifying a speaking object in an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of a method for identifying a speaking object in an audio file that has been recorded in an embodiment of the present invention
  • FIG. 3 is a schematic flowchart of a method for identifying a speaking object in an audio file being recorded in an embodiment of the present invention
  • FIG. 4 is a functional schematic block diagram of a preferred embodiment of the apparatus for recognizing speaking objects of the present invention.
  • FIG. 5 is a schematic structural diagram of a hardware operating environment involved in the solution of an embodiment of the present invention.
  • An embodiment of the present invention provides a method for recognizing speaking objects by which the similarity values of the frame files in an audio file are adjusted, so that the speaking object of each frame file is determined from the speaking object corresponding to the adjusted maximum value; the recognition accuracy for the speaking objects of the audio file can thereby be improved.
  • the embodiments of the present invention can be applied to any scenario involving multi-person voice interaction, such as a company meeting, a group meeting, a seminar, an online meeting, a group discussion, and the like.
  • the collected voice information can be fed back to the device that recognizes the speaking object in time through the collecting device, so that the speaking object corresponding to the voice information can be recognized by the device that recognizes the speaking object.
  • The collection device may be any device that can collect speech, such as a tape recorder or a voice recording pen.
  • the acquisition device and the device for recognizing the speaking object may be connected through one or more networks for communication.
  • the network may be a wired network or a wireless network.
  • The wireless network may be a mobile cellular network, a wireless fidelity (WIreless-Fidelity, WIFI) network, or of course another possible network, which is not limited in the embodiments of the present invention.
  • The device for recognizing the speaking object may be a server or a terminal.
  • The terminal may include mobile terminals such as mobile phones, tablet computers, notebook computers, handheld computers, and personal digital assistants (Personal Digital Assistant, PDA), as well as fixed terminals such as digital TVs and desktop computers.
  • Servers may include, for example, personal computers, medium and large computers, computer clusters, and the like.
  • FIG. 1 is a flowchart of a method for recognizing a speaking object provided by an embodiment of the present invention, which specifically includes:
  • Step 101 Determine the target audio file.
  • In the embodiments of the present invention, the audio file whose speaking objects are to be identified may be determined, and the target audio file then determined from it.
  • It should be noted that the target audio file in the embodiments of the present invention represents a file recording the voice-interaction information of M speaking objects, where M is a positive integer greater than 1. The target audio file comprises multiple frame files obtained by division according to a preset frame length, the preset frame length being determined according to the collection scene of the target audio file.
  • the target audio file may be an audio file that has been recorded, or may be an audio file that is being recorded.
  • the following describes in detail how to determine the target audio file for the audio file that has been recorded and the audio file that is being recorded.
  • For an audio file whose recording is complete, the target audio file is determined as follows:
  • Step a1 Determine attribute information of the audio file, wherein the attribute information indicates whether recording of the audio file is complete.
  • Step a2 If it is determined according to the attribute information that the audio file is a recorded file, the audio file is divided based on the preset frame length to determine the target audio file.
  • An audio file collected by a collection device may be acquired; when the attribute information of the audio file indicates that its recording is complete, the audio file may be acquired directly.
  • In a specific implementation, the attribute information of the audio file can be added to the audio file as an identification field.
  • For example, the identifier "1" indicates an audio file whose recording is complete,
  • and the identifier "0" indicates an audio file that is still being recorded.
  • When the identifier "1" is parsed from an audio file, the file is determined to be a completed recording and is acquired.
  • After the audio file is acquired, it can be divided according to the preset frame length to obtain multiple frame files, and the processed audio file is determined as the target audio file; the preset frame length is determined according to the collection scene of the audio file.
  • According to the collection scene of the audio file, the average speaking rate of the speaking objects in that scene can be matched, and the preset frame length for the audio file determined accordingly.
  • If the collection scene of the audio file is a conventional conference, i.e., a scene in which multiple speaking objects communicate at a normal speaking rate, the preset frame length may be set to 1 second in view of the speaking rate typical of that scene. For example, dividing an audio file with a recording duration of 3 seconds at a frame length of 1 second yields 3 frame files.
  • If the collection scene is a brainstorming session in which multiple speaking objects communicate at a fast speaking rate, the preset frame length may be set to 0.5 seconds. For example, dividing an audio file with a recording duration of 3 seconds at a frame length of 0.5 seconds yields 6 frame files.
  • That is, the target audio file in the embodiments of the present invention is a file divided according to the preset frame length. Because the preset frame length is determined by the collection scene of the target audio file, the division granularity better fits the basic requirement of identifying the speaking objects of the current audio file, and it can largely be ensured that each frame file contains the voice of only one speaking object, providing a better basis for the subsequent recognition of the speaking objects of the target audio file. Dividing the audio file at this finer granularity keeps the voice information of each frame file relatively uniform, i.e., each frame file corresponds to one speaking object, as illustrated in the sketch below.
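A minimal runnable sketch of the division step (the 16 kHz sample rate and the use of NumPy are illustrative assumptions, not details from the specification):

```python
import numpy as np

def divide_into_frame_files(samples: np.ndarray, sample_rate: int, preset_frame_len_s: float):
    """Divide an audio signal into frame files of a preset frame length (in seconds)."""
    frame_size = int(preset_frame_len_s * sample_rate)
    # A trailing remainder shorter than the preset frame length is kept as its own frame file.
    return [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]

# A 3-second recording divided at a 1-second preset frame length gives 3 frame files
# (a conventional conference scene); at 0.5 seconds it gives 6 (a fast-paced scene).
audio = np.zeros(3 * 16000)
assert len(divide_into_frame_files(audio, 16000, 1.0)) == 3
assert len(divide_into_frame_files(audio, 16000, 0.5)) == 6
```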
  • For an audio file that is being recorded, the target audio file is determined as follows:
  • Step b1 Determine attribute information of the audio file, wherein the attribute information indicates whether recording of the audio file is complete.
  • Step b2 If the audio file is determined to be a file being recorded, divide the first file, i.e., the already-recorded portion of the recording, based on the preset frame length to determine a first audio file, and use the first audio file as the target audio file.
  • Step b3 If the speaking object of every frame file in the first audio file has been identified, divide the second file, i.e., the portion of the recording completed since, based on the preset frame length to determine a second audio file.
  • Step b4 Use the second audio file as the target audio file; the second file represents the audio that has been recorded but whose speaking objects have not yet been identified.
  • The attribute information of the audio file can be determined; when it indicates that the audio file is still being recorded, recognition of the speaking objects can be performed on the portion already recorded, without waiting for the entire recording to finish. That is, the technical solution provided by the embodiments of the present invention supports recognizing the speaking objects of an audio file while it is being recorded, expanding the range of audio files that can be processed.
  • If the currently acquired audio file is determined to be a file being recorded, the first file, i.e., the completed portion of the recording, can be divided based on the preset frame length to determine the first audio file, which is then used as the target audio file.
  • If the speaking objects of the frame files in the first audio file have been identified, the second file, i.e., the portion completed since, can be divided based on the preset frame length to determine the second audio file, which is then used as the target audio file.
  • The second file represents the audio that has been recorded but whose speaking objects have not yet been identified. In other words, the technical solution provided by the embodiments of the present invention need not wait for the recording to end before recognizing the speaking objects: one round of recognition can be run on whatever audio has been recorded so far.
  • Recognition may also be run once for every preset number of frames of recorded audio. For example, if the preset frame number is 10, the audio file can be divided, based on the preset frame length, into groups of 10 frame files; the frame files of frames 1-10 are processed to identify their speaking objects, and then frames 11-20 are processed, and so on. This approach solves the problem that the prior art can only process an audio file once it is fully recorded: recorded audio can be recognized while recording continues, improving the user experience.
  • In a specific implementation, the file being recorded can be repeatedly divided into the preset number of frames based on the preset frame length to determine successive target audio files, until the file being recorded is fully recorded, i.e., the speaking objects in the scene corresponding to the file stop their voice interaction. A sketch of this chunked variant follows.
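A minimal sketch of the chunked variant, assuming a 10-frame preset frame count; `recognize_chunk` is a hypothetical per-chunk recognition routine, not a name from the specification:

```python
def iter_target_audio_files(recorded_frames: list, preset_frame_count: int = 10):
    """Yield successive target audio files (groups of frame files) from the
    portion of a recording completed so far, e.g. frames 1-10, then 11-20."""
    for start in range(0, len(recorded_frames), preset_frame_count):
        chunk = recorded_frames[start:start + preset_frame_count]
        if len(chunk) == preset_frame_count:  # only groups long enough to divide fully
            yield chunk

# Each completed group of 10 frame files forms one target audio file:
# for target_audio_file in iter_target_audio_files(recorded_frames):
#     recognize_chunk(target_audio_file)   # hypothetical recognition step
```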
  • Step 102 Determine the similarity value of the speaking objects in the target audio file.
  • After the target audio file is divided into multiple frame files, an X-vector can be used to identify the M similarity values of the M speaking objects for each frame file, where a similarity value characterizes the similarity between the speech of the frame file and a speaking object's voice.
  • For example, suppose the collection scene of the target audio file contains five speaking objects. The X-vector may identify, for the first frame file, a similarity value of 35% for the first speaking object, 30% for the second, and 40% for the third,
  • as well as 60% for the fourth speaking object and 82% for the fifth, so that the similarity values of the five speaking objects for that frame file are determined. A sketch of how such similarity values might be computed follows.
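The text names x-vectors as the similarity mechanism but does not fix a scoring function; the sketch below assumes the common choice of cosine scoring between a frame's x-vector and each enrolled speaker's x-vector:

```python
import numpy as np

def similarity_values(frame_xvector: np.ndarray, speaker_xvectors: np.ndarray) -> np.ndarray:
    """Return the M similarity values of one frame file against M speaking objects.

    frame_xvector:    x-vector embedding of the frame file, shape (d,).
    speaker_xvectors: enrolled x-vectors of the M speaking objects, shape (M, d).
    Cosine similarity is an assumption; the text only states that x-vectors
    yield M similarity values per frame file.
    """
    dots = speaker_xvectors @ frame_xvector
    norms = np.linalg.norm(speaker_xvectors, axis=1) * np.linalg.norm(frame_xvector)
    return dots / norms
```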
  • Step 103 According to the determined similarity values and the preset dynamic programming model, determine the decision value of each speaking object for any frame file of the target audio file; wherein the preset dynamic programming model determines the decision values of a frame file's speaking objects according to the speaking objects of the frame files before and after it.
  • After the M similarity values of each frame file's M speaking objects are determined, the decision values of the M speaking objects for any frame file can be determined from those similarity values and the preset dynamic programming model, and the speaking object corresponding to the maximum decision value is determined as the speaking object of the frame file. Specifically, the decision values may be determined by, but are not limited to, the following steps:
  • Step c1 Process the similarities of the M speaking objects of the frame file according to the preset state transition condition to determine the correlation values of the M speaking objects for the frame file; wherein the preset state transition condition recursively derives the correlation values of the frame file from the correlation values of its preceding frame file, M being a positive integer greater than 1.
  • Step c2 According to the correlation value and the preset dynamic programming model, determine the decision value corresponding to the M speaking objects corresponding to any frame file.
  • When step c2 is performed, the following steps may be adopted, but are not limited thereto:
  • Step c20 Determine whether the frame file is the last of the multiple frame files, the multiple frame files being arranged in division order.
  • Step c21 If the frame file is the last frame file, determine a first preset dynamic programming sub-model from the preset dynamic programming model, and determine the decision values of the M speaking objects of the frame file according to their correlation values and the first preset sub-dynamic programming model.
  • The correlation values of the M speaking objects of the last frame file are input into the first preset sub-dynamic programming model to obtain the decision values of the M speaking objects of the last frame file; the first preset sub-dynamic programming model outputs the correlation values of the M speaking objects directly as those decision values.
  • If a frame file is the last frame file, its speaking object can be determined by combining the speaking objects of the frame files before it with the similarities of the last frame file. Because each frame file is produced by a fine division granularity, it may be part of a single utterance of one speaking object, i.e., it is correlated with the frame files before it; the speaking objects of the preceding frame files therefore provide a useful reference for identifying the speaking object of the last frame file, improving recognition accuracy for it.
  • Step c22 If the frame file is determined not to be the last frame file, determine a second preset dynamic programming sub-model from the preset dynamic programming model, and determine the decision values of the M speaking objects of the frame file according to their correlation values and the second preset sub-dynamic programming model.
  • For any speaking object of the frame file, its correlation value, the similarity of the speaking object of the next frame file, and the indicator value of whether the next frame file's speaking object is the same as that speaking object, together with a preset weight value, are input into the second preset sub-dynamic programming model to obtain the decision value of that speaking object, thereby determining the decision values of the M speaking objects of the frame file; the second preset sub-dynamic programming model adds the speaking object's correlation value to the similarity of the next frame file's speaking object to obtain a first processed value, determines a second processed value according to the indicator value and the preset weight value, and determines the decision value according to the first and second processed values.
  • If a frame file is not the last frame file, its speaking object can be determined by combining the speaking objects of the frame files before and after it with the similarities of the frame file. Because each frame file is produced by a fine division granularity, it may be part of a single utterance of one speaking object, i.e., it is correlated with the frame files around it; the speaking objects of the preceding and following frame files therefore assist in discriminating the speaking object of the frame file, improving recognition accuracy for it.
  • In a specific implementation, argmax_y f(N, y) denotes the value of y, taken over all candidates in Y, at which the function f(N, y) attains its maximum.
  • The last frame file can be recorded as the T-th frame and any speaking object as i; the decision value of a speaking object of the last frame file is then expressed as f(T, i) and is determined by the preset state transition condition
    f(t, i) = s(t, i) + max_j [ f(t−1, j) − p·1(i ≠ j) ],  with f(0, j) = 0,
    where s(t, i) is the similarity value of speaking object i for frame file t, p is a preset weight greater than 0, and 1(i ≠ j) is an indicator that equals 1 when the speaking objects i and j of adjacent frame files differ and 0 when they are the same.
  • The f(t+1, i) of the following frame file is determined from the f(t, i) of the preceding frame file, so the decision values f(T, i) of the last frame file can be determined from the f(t, i) of the frame files before it; the speaking object with the maximum of the M decision values is taken as the speaking object of the last frame file.
  • For a frame file that is not the last, the second preset sub-dynamic programming model is determined from the model above, i.e.
    π_t = argmax_i [ f(t, i) + s(t+1, π_{t+1}) − p·1(i ≠ π_{t+1}) ],
    where π_{t+1} is the speaking object already determined for the following frame file.
  • That is, the preset state transition condition determines the correlation value f(t, i) of each frame file before the last, and the decision value of each speaking object i in each frame file is then computed with the second preset sub-dynamic programming model, as sketched below.
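A minimal runnable sketch of this forward/backward computation, under the assumption that the state transition condition and the two sub-models take the forms reconstructed above (`recognize_speaking_objects` is an illustrative name, not from the specification):

```python
import numpy as np

def recognize_speaking_objects(sim: np.ndarray, p: float) -> list:
    """Return the speaking-object index for each frame file.

    sim[t, i] -- similarity value s(t+1, i) of speaking object i for frame file t+1.
    p         -- preset weight penalizing a change of speaking object between frames.
    """
    T, M = sim.shape
    # Forward pass (preset state transition condition):
    #   f(t, i) = s(t, i) + max_j [ f(t-1, j) - p * 1(i != j) ],  f(0, j) = 0
    f = np.zeros((T, M))
    prev = np.zeros(M)
    penalty = p * (1.0 - np.eye(M))          # p when i != j, 0 when i == j
    for t in range(T):
        f[t] = sim[t] + (prev[None, :] - penalty).max(axis=1)
        prev = f[t]
    # Backward pass (decision values):
    #   last frame: pi_T = argmax_i f(T, i)                       (first sub-model)
    #   t < T:      pi_t = argmax_i [ f(t, i) + s(t+1, pi_{t+1}) - p * 1(i != pi_{t+1}) ]
    labels = [int(np.argmax(f[T - 1]))]
    for t in range(T - 2, -1, -1):
        nxt = labels[0]                      # speaking object of the next frame file
        decision = f[t] + sim[t + 1, nxt] - p * (np.arange(M) != nxt)
        labels.insert(0, int(np.argmax(decision)))
    return labels
```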
  • Step 104 Determine the speaking object corresponding to the maximum value in the decision value as the speaking object of any frame file, so as to identify the speaking objects corresponding to all frame files in the target audio file.
  • In the embodiments of the present invention, the similarity values of the speaking objects in the target audio file can be determined; then, according to these similarity values and the preset dynamic programming model, the decision value of each speaking object for any frame file is determined, and the speaking object corresponding to the maximum decision value is determined as the speaking object of that frame file.
  • That is, the method provided in the embodiments of the present invention does not rely only on the similarity values of a frame file's speaking objects; it also applies the preset dynamic programming model, determining the decision values comprehensively from the speaking objects of the frame files before and after the frame file. Selecting the speaking object with the maximum decision value as the frame file's speaking object identifies the speaking objects of the target audio file more accurately.
  • Assume the divided target audio file has 5 frame files, in order frame file 1 through frame file 5, and that the collection scene of the target audio file contains two speaking objects,
  • speaking object A and speaking object B. It should be noted that, for ease of understanding, speaking object A is written as speaker A and speaking object B as speaker B. X-vectors can then be used to identify the similarity values s(t, i) of speaker A and speaker B in each of the five frame files, as shown in Table 1 below:

    s(t,i)    | frame file 1 | frame file 2 | frame file 3 | frame file 4 | frame file 5
    speaker A | 0.8          | 0.7          | 0.2          | 0.5          | 0.3
    speaker B | 0.2          | 0.4          | 0.6          | 0.4          | 0.9

  • With the preset parameter p = 0.5 and f(0, A) = f(0, B) = 0, the state transition condition yields the correlation values f(t, i) summarized in Table 2:

    f(t,i)    | frame file 1 | frame file 2 | frame file 3 | frame file 4 | frame file 5
    speaker A | 0.8          | 1.5          | 1.7          | 2.2          | 2.5
    speaker B | 0.2          | 0.7          | 1.6          | 2.0          | 2.9

  • For the last frame file, frame file 5, the decision values are given by the first sub-model, i.e., π_5 = argmax_i f(5, i): the speaking object corresponding to the maximum of f(5, i) is selected from "2.5" and "2.9", so the speaking object of frame file 5 is determined to be B.
  • For frame file 4, the decision value of speaker A is f(4, A) + s(5, B) − p = 2.2 + 0.9 − 0.5 = 2.6,
  • and the decision value of speaker B is f(4, B) + s(5, B) = 2.0 + 0.9 = 2.9, so the speaking object of frame file 4 is B.
  • For frame file 3, the decision value of speaker A is 1.7 + 0.4 − 0.5 = 1.6,
  • and that of speaker B is 1.6 + 0.4 = 2.0, so the speaking object of frame file 3 is B.
  • For frame file 2, the decision value of speaker A is 1.5 + 0.6 − 0.5 = 1.6,
  • and that of speaker B is 0.7 + 0.6 = 1.3, so the speaking object of frame file 2 is A.
  • For frame file 1, the decision value of speaker A is 0.8 + 0.7 = 1.5,
  • and that of speaker B is 0.2 + 0.7 − 0.5 = 0.4, so the speaking object of frame file 1 is A.
  • The scheme for determining the speaking objects of the frame files in the target audio file thus proceeds in order from the last frame file to the first, so that the speaking object of each frame file is determined in combination with the frame files before and after it, improving the accuracy of the determination; the sketch below reproduces this example.
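Running the sketch given earlier on the Table 1 similarities with p = 0.5 reproduces this example (indices 0 and 1 stand for speakers A and B):

```python
import numpy as np

sim = np.array([[0.8, 0.2],   # frame file 1: speaker A, speaker B
                [0.7, 0.4],   # frame file 2
                [0.2, 0.6],   # frame file 3
                [0.5, 0.4],   # frame file 4
                [0.3, 0.9]])  # frame file 5
labels = recognize_speaking_objects(sim, p=0.5)
print(["AB"[i] for i in labels])  # ['A', 'A', 'B', 'B', 'B']
```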
  • Step 201 Acquire an audio file, where the audio file is used to represent a file that records information about the voice interaction of M speaking objects, where M is a positive integer greater than 1.
  • Step 202 Divide and process the audio file according to the preset frame length, and obtain multiple frame files to determine the target audio file, wherein the preset frame length is determined correspondingly according to the collection scene corresponding to the audio file.
  • Step 203 Determine M similarity values of the M speaking objects corresponding to each frame file, wherein the similarity values are used to represent the similarity between the speech corresponding to the frame file and the speaking object's speech.
  • Step 204 According to the similarity values of the M speaking objects of any frame file and the preset dynamic programming model, determine the decision values of the M speaking objects for the frame file, and determine the speaking object corresponding to the maximum decision value as the speaking object of the frame file.
  • For the specific implementation of steps 201 and 202, refer to steps a1-a2 above; for steps 203 and 204, refer to steps 103 and 104 above. Details are not repeated here.
  • For a target audio file whose recording is complete, dividing the audio file at the finer granularity determined by its collection scene largely ensures that each frame file contains the voice of only one speaking object. Further, because the division granularity is fine, the voice information of a frame file and of the frames before and after it may be one segment of speech of the same person; the decision value can therefore be determined by combining whether adjacent frames share a speaking object with the frame file's similarities.
  • The speaking object of each frame file can then be determined more accurately from the decision values, improving recognition accuracy for the speaking objects of the recorded target audio file.
  • Step 301 Determine a target audio file, wherein the target audio file is determined from the already-recorded portion of the file being recorded.
  • Step 302 Determine M similarity values of the M speaking objects corresponding to each frame file, wherein the similarity values are used to represent the similarity between the speech corresponding to the frame file and the speaking object's speech, and M is a positive integer greater than 1.
  • Step 303 According to the similarity values of the M speaking objects of any frame file and the preset dynamic programming model, determine the decision values of the M speaking objects for the frame file, and determine the speaking object corresponding to the maximum decision value as the speaking object of the frame file.
  • Step 304 Determine whether the second file, i.e., the portion of the file being recorded that has since been recorded, is sufficient to divide based on the preset frame length; if so, perform step 301; if not, continue to perform step 304.
  • For the implementation of step 301, refer to steps b1-b4 above; steps 302 and 303 are implemented as described above and are not repeated here.
  • When the speaking objects of all frame files in the current target audio file have been identified, step 304 can be executed: a new target audio file is determined and recognition of its speaking objects is performed.
  • For the target audio file divided from the second file, the frame file preceding its first frame is the last frame file of the first file. This avoids misjudgments that could arise when the last frame file of the target audio file from the first file and the first frame file of the target audio file from the second file contain different frames of a single sentence of one speaking object, so the speaking objects of the whole file being recorded can be identified more accurately.
  • an embodiment of the present invention also provides a device for recognizing a speaking object.
  • the device for recognizing a speaking object includes:
  • the first determining unit 401 is used to determine the similarity value of the speaking object in the target audio file
  • The second determination unit 402 is configured to determine, according to the determined similarity values and the preset dynamic programming model, the decision value of each speaking object for any frame file of the target audio file; wherein the preset dynamic programming model determines the decision values of a frame file's speaking objects according to the speaking objects of the frame files before and after it.
  • the identification unit 403 is configured to determine the speaking object corresponding to the maximum value in the decision value as the speaking object of any frame file, so as to identify the speaking objects corresponding to all frame files in the target audio file.
  • the second determining unit 402 is configured to:
  • process the similarities of the M speaking objects of the frame file according to the preset state transition condition to determine the correlation values of the M speaking objects for the frame file, wherein the preset state transition condition recursively derives the correlation values of the frame file from the correlation values of its preceding frame file, and M is a positive integer greater than 1; and
  • determine, according to the correlation values and the preset dynamic programming model, the decision values of the M speaking objects for the frame file.
  • the second determining unit 402 is configured to:
  • determine whether the frame file is the last of the multiple frame files; if it is, determine a first preset dynamic programming sub-model from the preset dynamic programming model, and determine the decision values of the M speaking objects of the frame file according to their correlation values and the first preset sub-dynamic programming model; and,
  • if the frame file is not the last frame file, determine a second preset dynamic programming sub-model from the preset dynamic programming model,
  • and determine the decision values of the M speaking objects of the frame file according to their correlation values and the second preset sub-dynamic programming model.
  • the second determining unit 402 is configured to:
  • the first preset sub-dynamic programming model is used for outputting the correlation values of the M speaking objects as decision values corresponding to the M speaking objects corresponding to the last frame file.
  • the second determining unit 402 is configured to:
  • input the correlation value of any speaking object of the frame file, the similarity of the speaking object of the next frame file, and the indicator value of whether the next frame file's speaking object is the same as that speaking object, together with a preset weight value, into the second preset sub-dynamic programming model, to obtain the decision value of that speaking object and thereby determine the decision values of the M speaking objects of the frame file;
  • wherein the second preset sub-dynamic programming model adds the speaking object's correlation value to the similarity of the next frame file's speaking object to obtain a first processed value,
  • determines a second processed value according to the indicator value and the preset weight value, and determines the decision value according to the first and second processed values.
  • The apparatus further includes a processing unit configured to:
  • determine attribute information of an audio file, the attribute information indicating whether recording of the audio file is complete; if the attribute information indicates that the audio file is a completed recording, divide the audio file based on the preset frame length to determine a target audio file; and,
  • if the audio file is determined to be a file being recorded, divide the first file, i.e., the already-recorded portion of the recording, based on the preset frame length to determine a first audio file, and use the first audio file as the target audio file.
  • The processing unit is further configured to:
  • if the speaking object of every frame file in the first audio file has been identified, divide the second file that has been recorded in the recording, based on the preset frame length, to determine a second audio file, and use the second audio file as the target audio file;
  • wherein the second file represents the audio of the recording that has been recorded but whose speaking objects have not yet been identified.
  • the specific implementation manner of the apparatus for recognizing a speaking object provided by the embodiment of the present invention is basically the same as that of the above-mentioned embodiments of the method for recognizing a speaking object, and details are not described herein again.
  • FIG. 5 is a schematic structural diagram of a hardware operating environment involved in an embodiment of the present invention.
  • FIG. 5 can be a schematic structural diagram of a hardware operating environment of a device for recognizing speaking objects.
  • the device for identifying the speaking object in the embodiment of the present invention may be the aforementioned server or terminal.
  • the device for recognizing speaking objects may include: a processor 501 , such as a CPU, a memory 505 , a user interface 503 , a network interface 504 , and a communication bus 502 .
  • the communication bus 502 is used to realize the connection and communication between these components.
  • the user interface 503 may include a display screen (Display) and an input unit such as a keyboard (Keyboard).
  • the user interface 503 may also include a standard wired interface and a wireless interface.
  • the network interface 504 may include a standard wired interface and a wireless interface (eg, a WI-FI interface).
  • the memory 505 may be high-speed RAM memory, or may be non-volatile memory, such as disk memory.
  • the memory 505 may also be a storage device independent of the aforementioned processor 501 .
  • The structure of the device for recognizing speaking objects shown in FIG. 5 does not constitute a limitation on the device; it may include more or fewer components than shown, combine certain components, or arrange the components differently.
  • the memory 505 as a computer storage medium may include an operating system, a network communication module, a user interface module, and a program for recognizing speaking objects.
  • the operating system is a program that manages and controls the hardware and software resources of the device that recognizes the speaking object, and supports the operation of the program that recognizes the speaking object and other software or programs.
  • The user interface 503 is mainly used to connect the collection device, perform data communication with it, and collect and/or obtain audio files through it;
  • the network interface 504 is mainly used for data communication with a backend server;
  • the processor 501 may be configured to call the program for recognizing the speaking object stored in the memory 505, and execute the steps of the above-mentioned method for recognizing the speaking object.
  • An embodiment of the present invention also provides a computer-readable storage medium on which a program for recognizing speaking objects is stored; when the program is executed by a processor, the steps of the above method for recognizing speaking objects are implemented.
  • the specific implementation manner of the computer-readable storage medium of the present invention is basically the same as the above-mentioned embodiments of the method for recognizing a speaking object, and details are not described herein again.
  • The present invention also provides a computer program product comprising instructions; the computer program product comprises a computer program stored on a computer-readable storage medium, and the computer program comprises program instructions that, when executed by a computer, cause the computer to implement the steps of recognizing speaking objects as described above.
  • embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method, apparatus, and device for identifying a speaking object, and a readable storage medium. The method includes: determining the similarity values of the speaking objects in a target audio file (102); determining, according to the determined similarity values and a preset dynamic programming model, the decision value of each speaking object for any frame file of the target audio file, wherein the preset dynamic programming model determines the decision values of a frame file's speaking objects according to the speaking objects of the frame files before and after it (103); and determining the speaking object corresponding to the maximum decision value as the speaking object of the frame file, so as to identify the speaking objects of all frame files in the target audio file (104).

Description

Method, apparatus, and device for identifying a speaking object, and readable storage medium
Cross-reference to related applications
This application claims priority to the Chinese patent application No. 202010693717.4, filed with the China Patent Office on July 17, 2020 and entitled "Method, apparatus, and device for identifying a speaking object, and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to the technical field of voiceprint recognition in financial technology (Fintech), and in particular to a method, apparatus, and device for identifying a speaking object, and a readable storage medium.
Background
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually transforming toward financial technology (Fintech); speech recognition technology is no exception. The security and real-time requirements of the financial industry, however, place higher demands on voiceprint recognition technology.
At present, scenes in which several people speak commonly arise in company meetings, seminars, and similar settings. To record the statements made in a meeting and the corresponding speakers more accurately and efficiently, the related art records audio during the meeting to obtain an audio file and performs speech recognition on it, thereby determining which object in the meeting is speaking and what that object says.
However, the related art merely determines the similarity between the voice information in the audio file and the voices of multiple speakers and identifies the speaker corresponding to the voice information directly from that similarity. With such a scheme, a speaker's voice may change or resemble another speaker's, leading to poor accuracy in identifying the speakers in the audio file.
It can be seen that the related art suffers from the technical problem of poor speaker-identification accuracy.
Summary of the invention
The main purpose of the present invention is to provide a method, apparatus, and device for identifying a speaking object, and a readable storage medium, aiming to solve the technical problem of poor speaker-identification accuracy in the related art.
To achieve the above purpose, the present invention provides a method for identifying a speaking object, the method including:
determining the similarity values of the speaking objects in a target audio file;
determining, according to the determined similarity values and a preset dynamic programming model, the decision value of each speaking object for any frame file of the target audio file, wherein the preset dynamic programming model determines the decision values of a frame file's speaking objects according to the speaking objects of the frame files before and after it; and
determining the speaking object corresponding to the maximum decision value as the speaking object of the frame file, so as to identify the speaking objects of all frame files in the target audio file.
In a possible implementation, determining, according to the determined similarity values and the preset dynamic programming model, the decision values of the speaking objects for any frame file of the target audio file includes:
processing the similarities of the M speaking objects of the frame file according to a preset state transition condition to determine the correlation values of the M speaking objects for the frame file, wherein the preset state transition condition recursively derives the correlation values of the frame file from the correlation values of its preceding frame file, and M is a positive integer greater than 1; and
determining, according to the correlation values and the preset dynamic programming model, the decision values of the M speaking objects for the frame file.
In a possible implementation, determining, according to the correlation values and the preset dynamic programming model, the decision values of the M speaking objects for the frame file includes:
determining whether the frame file is the last of the multiple frame files, the multiple frame files being arranged in division order;
if the frame file is the last frame file, determining a first preset dynamic programming sub-model from the preset dynamic programming model, and determining the decision values of the M speaking objects of the frame file according to their correlation values and the first preset sub-dynamic programming model; and,
if the frame file is determined not to be the last frame file, determining a second preset dynamic programming sub-model from the preset dynamic programming model, and determining the decision values of the M speaking objects of the frame file according to their correlation values and the second preset sub-dynamic programming model.
In a possible implementation, determining the decision values of the M speaking objects of the frame file according to their correlation values and the first preset sub-dynamic programming model includes:
inputting the correlation values of the M speaking objects of the last frame file into the first preset sub-dynamic programming model to obtain the decision values of the M speaking objects of the last frame file;
wherein the first preset sub-dynamic programming model outputs the correlation values of the M speaking objects directly as the decision values of the M speaking objects of the last frame file.
In a possible implementation, determining the decision values of the M speaking objects of the frame file according to their correlation values and the second preset sub-dynamic programming model includes:
inputting, for any speaking object of the frame file, its correlation value, the similarity of the speaking object of the next frame file, and the indicator value of whether the next frame file's speaking object is the same as that speaking object, together with a preset weight value, into the second preset sub-dynamic programming model, to obtain the decision value of that speaking object and thereby determine the decision values of the M speaking objects of the frame file;
wherein the second preset sub-dynamic programming model adds the speaking object's correlation value to the similarity of the next frame file's speaking object to obtain a first processed value, determines a second processed value according to the indicator value and the preset weight value, and determines the decision value according to the first processed value and the second processed value.
In a possible implementation, before determining the similarity values of the speaking objects of the target audio file, the method further includes:
determining attribute information of an audio file, wherein the attribute information indicates whether recording of the audio file is complete;
if the attribute information indicates that the audio file is a completed recording, dividing the audio file based on the preset frame length to determine the target audio file; and,
if the audio file is determined to be a file being recorded, dividing the first file, i.e., the already-recorded portion of the file being recorded, based on the preset frame length to determine a first audio file, and using the first audio file as the target audio file.
In a possible implementation, the method further includes:
if the speaking object of every frame file in the first audio file has been identified, dividing the second file, i.e., the portion of the file being recorded that has been recorded since, based on the preset frame length, to determine a second audio file; and
using the second audio file as the target audio file;
wherein the second file represents the audio of the file being recorded that has been recorded but whose speaking objects have not yet been identified.
In addition, to achieve the above purpose, the present invention further provides an apparatus for identifying a speaking object, the apparatus including:
a first determining unit, configured to determine the similarity values of the speaking objects in a target audio file;
a second determining unit, configured to determine, according to the determined similarity values and a preset dynamic programming model, the decision value of each speaking object for any frame file of the target audio file, wherein the preset dynamic programming model determines the decision values of a frame file's speaking objects according to the speaking objects of the frame files before and after it; and
an identification unit, configured to determine the speaking object corresponding to the maximum decision value as the speaking object of the frame file, so as to identify the speaking objects of all frame files in the target audio file.
In addition, to achieve the above purpose, the present invention further provides a device for identifying a speaking object, the device including a memory, a processor, and a program for identifying speaking objects that is stored on the memory and runnable on the processor; when the program is executed by the processor, the steps of the method for identifying a speaking object are implemented.
In addition, to achieve the above purpose, the present invention further provides a computer-readable storage medium on which a program for identifying speaking objects is stored; when the program is executed by a processor, the steps of identifying a speaking object described above are implemented.
In addition, to achieve the above purpose, the present invention further provides a computer program product containing instructions; the computer program product includes a computer program stored on a computer-readable storage medium, and the computer program includes program instructions that, when executed by a computer, cause the computer to implement the steps of identifying a speaking object described above.
In the embodiments of the present invention, the similarity values of the speaking objects in the target audio file can be determined; then, according to these similarity values and the preset dynamic programming model, the decision value of each speaking object for any frame file is determined, and the speaking object corresponding to the maximum decision value is determined as the speaking object of the frame file. That is, the method provided in the embodiments of the present invention does not rely only on the similarity values of a frame file's speaking objects; it also applies the preset dynamic programming model, determining the decision values comprehensively from the speaking objects of the frame files before and after the frame file, and then selects the speaking object with the maximum decision value as the frame file's speaking object, so the speaking objects of the target audio file can be identified more accurately.
Brief description of the drawings
FIG. 1 is a schematic flowchart of a method for identifying a speaking object in an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a method for identifying the speaking objects of an audio file whose recording is complete, in an embodiment of the present invention;
FIG. 3 is a schematic flowchart of a method for identifying the speaking objects of an audio file being recorded, in an embodiment of the present invention;
FIG. 4 is a functional block diagram of a preferred embodiment of the apparatus for identifying a speaking object of the present invention;
FIG. 5 is a schematic structural diagram of the hardware operating environment involved in the solutions of the embodiments of the present invention.
The realization of the purpose, functional features, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description of the embodiments
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present invention, it should be understood that the present invention may be implemented in various forms and should not be limited by the embodiments set forth here; rather, these embodiments are provided so that the present invention will be understood more thoroughly and its scope conveyed completely to those skilled in the art.
As described above, the related art has the problem of low accuracy in identifying the speakers corresponding to the voice information in an audio file. In view of this, an embodiment of the present invention provides a method for identifying a speaking object, by which the similarity values of the frame files in an audio file are adjusted, so that the speaking object of a frame file is determined from the speaking object corresponding to the adjusted maximum value; the accuracy of identifying the speaking objects of the audio file can thereby be improved.
Having introduced the design idea of the embodiments of the present invention, some application scenarios to which the technical solutions of the embodiments apply are briefly introduced below. It should be noted that these scenarios are described to explain the technical solutions more clearly and do not limit them; as those of ordinary skill in the art will appreciate, the technical solutions provided by the embodiments of the present invention apply equally to similar technical problems as new application scenarios emerge.
The embodiments of the present invention can be applied to any scene involving multi-person voice interaction, such as company meetings, group meetings, seminars, online meetings, and group discussions. In such scenes, the collected voice information can be fed back in time, via a collection device, to the device that identifies the speaking objects, so that the speaking objects corresponding to the voice information can be identified. Specifically, the collection device may be any device that can collect speech, such as a tape recorder or a voice recording pen.
In a specific implementation, the collection device and the device for identifying the speaking objects may communicate over one or more networks. The network may be wired or wireless; for example, a wireless network may be a mobile cellular network or a wireless fidelity (WIreless-Fidelity, WIFI) network, or of course another possible network, which is not limited in the embodiments of the present invention.
In the embodiments of the present invention, the device for identifying the speaking objects may be a server or a terminal. The terminal may include mobile terminals such as mobile phones, tablet computers, notebook computers, handheld computers, and personal digital assistants (Personal Digital Assistant, PDA), as well as fixed terminals such as digital TVs and desktop computers. The server may include, for example, personal computers, medium and large computers, and computer clusters.
To further explain the solution of the method for identifying a speaking object provided by the embodiments of the present invention, it is described in detail below with reference to the accompanying drawings and specific implementations. Although the embodiments of the present invention provide the method operation steps shown in the following embodiments or drawings, more or fewer operation steps may be included in the method on a conventional basis or without creative effort. For steps with no necessary causal relationship in logic, the execution order of these steps is not limited to that provided in the embodiments of the present invention. In an actual process or when executed by an apparatus, the method may be executed in the order of the embodiments or drawings or in parallel (for example, in an environment with parallel processors or multi-threaded processing).
Referring to FIG. 1, FIG. 1 is a flowchart of the method for identifying a speaking object provided by an embodiment of the present invention, which specifically includes:
Step 101: Determine a target audio file.
In the embodiment of the present invention, the audio file whose speaking objects are to be identified may be determined, and the target audio file then determined from it. It should be noted that the target audio file in the embodiments of the present invention represents a file recording the voice-interaction information of M speaking objects, M being a positive integer greater than 1; the target audio file comprises multiple frame files obtained by division according to a preset frame length, and the preset frame length is determined according to the collection scene of the target audio file.
Specifically, the target audio file may be an audio file whose recording is complete or an audio file being recorded. How the target audio file is determined in each case is described in detail below.
In the embodiment of the present invention, for an audio file whose recording is complete, the target audio file is determined as follows:
Step a1: Determine attribute information of the audio file, wherein the attribute information indicates whether recording of the audio file is complete.
Step a2: If the attribute information indicates that the audio file is a completed recording, divide the audio file based on the preset frame length to determine the target audio file.
In the embodiment of the present invention, an audio file collected by a collection device may be acquired; when the attribute information of the audio file indicates that its recording is complete, the file may be acquired directly.
In a specific implementation, the attribute information of the audio file may be added to it as an identification field. For example, the identifier "1" indicates an audio file whose recording is complete and the identifier "0" an audio file still being recorded; when the identifier "1" is parsed from an audio file, the file is determined to be a completed recording and is acquired.
In the embodiment of the present invention, after the audio file is acquired, it can be divided according to the preset frame length to obtain multiple frame files, and the processed audio file is determined as the target audio file, wherein the preset frame length is determined according to the collection scene of the audio file.
In the embodiment of the present invention, according to the collection scene of the audio file, the average speaking rate of the speaking objects in that scene can be matched, so that the preset frame length for the audio file can be determined accordingly.
In the embodiment of the present invention, if the collection scene of the audio file is a conventional conference, i.e., a conference scene in which multiple speaking objects communicate at a normal speaking rate, the preset frame length may be set to 1 second in view of the speaking rate typical of that scene in practice. For example, dividing an audio file with a recording duration of 3 seconds at a frame length of 1 second yields 3 frame files.
In the embodiment of the present invention, if the collection scene of the audio file is a brainstorming multi-person voice-interaction scene, i.e., a conference scene in which multiple speaking objects communicate at a fast speaking rate, the preset frame length may be set to 0.5 seconds. For example, dividing an audio file with a recording duration of 3 seconds at a frame length of 0.5 seconds yields 6 frame files.
That is, the target audio file in the embodiment of the present invention is a file divided according to the preset frame length, and because the preset frame length is determined according to the collection scene of the target audio file, the division granularity better fits the basic requirement of identifying the speaking objects of the current audio file; it can largely be ensured that each frame file contains the voice information of only one speaking object, providing a better processing basis for the subsequent identification of the speaking objects of the target audio file. Also, dividing the audio file at this finer granularity keeps the voice information of each frame file relatively uniform, i.e., each frame file corresponds to one speaking object.
In the embodiment of the present invention, for an audio file being recorded, the target audio file is determined as follows:
Step b1: Determine attribute information of the audio file, wherein the attribute information indicates whether recording of the audio file is complete.
Step b2: If the audio file is determined to be a file being recorded, divide the first file, i.e., the already-recorded portion of the file being recorded, based on the preset frame length to determine a first audio file; use the first audio file as the target audio file.
Step b3: If the speaking object of every frame file in the first audio file has been identified, divide the second file, i.e., the portion of the file being recorded that has been recorded since, based on the preset frame length to determine a second audio file.
Step b4: Use the second audio file as the target audio file; wherein the second file represents the audio of the file being recorded that has been recorded but whose speaking objects have not yet been identified.
In the embodiment of the present invention, the attribute information of the audio file can be determined; when it indicates that the audio file is being recorded, identification of the speaking objects can be performed on the portion already recorded, without waiting for the entire recording of the audio file to finish. That is, the technical solution provided by the embodiment of the present invention supports identifying the speaking objects of an audio file being recorded, expanding the identification range for audio files.
In the embodiment of the present invention, if the currently acquired audio file is determined to be a file being recorded, the first file, i.e., the completed portion of the recording, can be divided based on the preset frame length to determine the first audio file, which is then used as the target audio file.
In the embodiment of the present invention, if the speaking objects of the frame files in the first audio file have been identified, the second file, i.e., the portion of the recording completed since, can be divided based on the preset frame length to determine the second audio file, which is then used as the target audio file; the second file represents the audio that has been recorded but whose speaking objects have not yet been identified. In other words, the technical solution provided by the embodiments of the present invention need not wait for the recording to end before identifying the speaking objects of the audio file: one round of identification can be performed on the audio already recorded.
In the embodiment of the present invention, identification of speaking objects may also be performed once for every determined number of frames of recorded audio. For example, if the preset frame number is 10, the audio file can be divided, based on the preset frame length, into frame files in groups of 10; the frame files of frames 1-10 are processed to identify their speaking objects, and then frames 11-20 are processed. This approach solves the problem in the prior art that only a fully recorded audio file can be processed: the recorded portion can be processed while recording continues, improving the user experience.
In a specific implementation, the file being recorded can be divided into the preset number of frames based on the preset frame length to determine target audio files, until the file being recorded is fully recorded, i.e., the speaking objects in the scene corresponding to the file stop their voice interaction.
Step 102: Determine the similarity values of the speaking objects in the target audio file.
In the embodiment of the present invention, after the target audio file is divided into multiple frame files, an X-vector can be used to identify the M similarity values of the M speaking objects for each frame file, where a similarity value characterizes the similarity between the speech of the frame file and a speaking object's voice.
For example, suppose there are 5 speaking objects in the collection scene of the target audio file, namely the first through fifth speaking objects. The X-vector may then identify, for the first frame file, a similarity value of 35% for the first speaking object, 30% for the second, 40% for the third, 60% for the fourth, and 82% for the fifth, thereby determining the similarity values of the five speaking objects for that frame file.
Step 103: According to the determined similarity values and a preset dynamic programming model, determine the decision value of each speaking object for any frame file of the target audio file; wherein the preset dynamic programming model determines the decision values of a frame file's speaking objects according to the speaking objects of the frame files before and after it.
In the embodiment of the present invention, after the M similarity values of each frame file's M speaking objects are determined, the decision values of the M speaking objects for any frame file can be determined according to those similarity values and the preset dynamic programming model, and the speaking object corresponding to the maximum decision value is determined as the speaking object of the frame file.
Specifically, the decision values of the speaking objects for any frame file of the target audio file may be determined by, but are not limited to, the following steps:
Step c1: Process the similarities of the M speaking objects of the frame file according to a preset state transition condition to determine the correlation values of the M speaking objects for the frame file; wherein the preset state transition condition recursively derives the correlation values of the frame file from the correlation values of its preceding frame file, M being a positive integer greater than 1.
Step c2: According to the correlation values and the preset dynamic programming model, determine the decision values of the M speaking objects for the frame file.
Specifically, when performing step c2, the following steps may be adopted, but are not limited thereto:
Step c20: Determine whether the frame file is the last of the multiple frame files, the multiple frame files being arranged in division order.
Step c21: If the frame file is the last frame file, determine a first preset dynamic programming sub-model from the preset dynamic programming model, and determine the decision values of the M speaking objects of the frame file according to their correlation values and the first preset sub-dynamic programming model.
In the embodiment of the present invention, the correlation values of the M speaking objects of the last frame file are input into the first preset sub-dynamic programming model to obtain the decision values of the M speaking objects of the last frame file; the first preset sub-dynamic programming model outputs the correlation values of the M speaking objects directly as those decision values.
In the embodiment of the present invention, if a frame file is the last frame file, its speaking object can be determined by combining the speaking objects of the frame files before it with the similarities of the last frame file. Because each frame file is produced by a fine division granularity, it may be part of a single utterance of one speaking object, i.e., it is correlated with the frame files before it; the speaking objects of the preceding frame files therefore provide a useful reference for identifying the speaking object of the last frame file, improving identification accuracy for it.
Step c22: If the frame file is determined not to be the last frame file, determine a second preset dynamic programming sub-model from the preset dynamic programming model, and determine the decision values of the M speaking objects of the frame file according to their correlation values and the second preset sub-dynamic programming model.
In the embodiment of the present invention, for any speaking object of the frame file, its correlation value, the similarity of the speaking object of the next frame file, and the indicator value of whether the next frame file's speaking object is the same as that speaking object, together with a preset weight value, are input into the second preset sub-dynamic programming model to obtain the decision value of that speaking object, thereby determining the decision values of the M speaking objects of the frame file; the second preset sub-dynamic programming model adds the speaking object's correlation value to the similarity of the next frame file's speaking object to obtain a first processed value, determines a second processed value according to the indicator value and the preset weight value, and determines the decision value according to the first and second processed values.
In the embodiment of the present invention, if a frame file is not the last frame file, its speaking object can be determined by combining the speaking objects of the frame files before and after it with the similarities of the frame file. Because each frame file is produced by a fine division granularity, it may be part of a single utterance of one speaking object, i.e., it is correlated with the frame files around it; the speaking objects of the preceding and following frame files therefore assist in discriminating the speaking object of the frame file, improving identification accuracy for it.
In the embodiment of the present invention, the model that determines the speaking object of any frame file is expressed as:

    π_t = argmax_i f(T, i), when t = T;
    π_t = argmax_i [ f(t, i) + s(t+1, π_{t+1}) − p·1(i ≠ π_{t+1}) ], when t < T;

where the correlation value f(t, i) is given by the preset state transition condition

    f(t, i) = s(t, i) + max_j [ f(t−1, j) − p·1(i ≠ j) ].

Here, 1(i ≠ j) is an indicator function characterizing whether speaking object i of frame file t and speaking object j of the preceding frame file differ: its value is 1 when they differ and 0 when they are the same. s(t, i) characterizes the similarity value of speaking object i for frame file t; p is a preset parameter whose value range is greater than 0; and f(0, j) = 0. 1(i ≠ π_{t+1}) is an indicator function whose value is 1 when speaking object i of frame file t differs from the speaking object of the following frame file t+1 and 0 when they are the same; s(t+1, π_{t+1}) characterizes the similarity value of the speaking object of the following frame file t+1. The maximized quantity characterizes the decision value of speaking object i for frame file t: when t = T, argmax_i characterizes the i for which f(T, i) attains its maximum; when t < T, argmax_i characterizes the speaking object i for which f(t, i) + s(t+1, π_{t+1}) − p·1(i ≠ π_{t+1}) attains its maximum.
It should be noted that, in a specific implementation, argmax_y f(N, y) characterizes the y at which the function f(N, y) attains its maximum, y being any value in Y.
In the embodiment of the present invention, if the frame file is the last of the multiple frame files and the model that determines the speaking object of a frame file is the one corresponding to the expression above, i.e., when t = T, the first preset sub-dynamic programming model can be understood as

    π_T = argmax_i f(T, i).

In a specific implementation, the last frame file can be recorded as the T-th frame and any speaking object as i; the decision value of a speaking object of the last frame file is then expressed as f(T, i) and is determined according to the preset state transition condition, which is specifically expressed as

    f(t, i) = s(t, i) + max_j [ f(t−1, j) − p·1(i ≠ j) ].

Initializing this equation by setting f(0, j) = 0, f(t, j) can be computed for each frame file starting from the first frame file. Specifically, the f(t+1, i) of a following frame file is determined from the f(t, i) of the preceding frame file, so the decision values f(T, i) of the last frame file can be determined from the f(t, i) of the frame files before it; the speaking object corresponding to the maximum of the M decision values is then taken as the speaking object of the last frame file.
In the embodiment of the present invention, if the frame file is not the last of the multiple frame files, the second preset dynamic programming sub-model corresponding to the frame file is determined from the aforementioned model for determining the speaking object of any frame file, and the speaking object of the frame file is determined accordingly. In the embodiment of the present invention, when the preset dynamic programming model is the model corresponding to the above expression, the second preset dynamic programming sub-model can be determined as:

π_t = argmax_i { f(t, i) + s(t+1, π_{t+1}) - p·1_{i≠π_{t+1}} }.
In a specific implementation process, the associated value f(t, i) of any frame file before the last one may be determined according to the preset state transition condition, namely:

f(t+1, i) = max_j { f(t, j) - p·1_{i≠j} } + s(t+1, i),

and the decision value of each speaking object i of each frame file is then computed according to the second preset dynamic programming sub-model.
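Putting the state transition condition and the two sub-models together, a minimal Python sketch of the forward/backward procedure might look as follows; the (T, M) matrix of similarity values is assumed to be given (e.g. by an X-vector model), and variable names are illustrative:

```python
import numpy as np

def identify_speakers(similarity, p):
    """Two-pass dynamic programming over frame files.

    similarity: (T, M) array; similarity[t - 1, i] is s(t, i), the
        similarity value of speaking object i for frame file t.
    p: preset parameter p > 0, the penalty charged whenever the speaker
        changes between adjacent frame files.
    Returns pi, a list of T speaker indices (frame files 1..T).
    """
    T, M = similarity.shape

    # Forward pass -- the preset state transition condition:
    # f(t, i) = max_j{ f(t-1, j) - p * 1[i != j] } + s(t, i), f(0, j) = 0.
    f = np.zeros((T + 1, M))
    for t in range(1, T + 1):
        for i in range(M):
            best = max(f[t - 1, j] - (p if i != j else 0.0) for j in range(M))
            f[t, i] = best + similarity[t - 1, i]

    pi = [0] * T
    # First preset dynamic programming sub-model (last frame file):
    # pi_T = argmax_i f(T, i).
    pi[T - 1] = int(np.argmax(f[T]))
    # Second preset dynamic programming sub-model (earlier frame files):
    # pi_t = argmax_i{ f(t, i) + s(t+1, pi_{t+1}) - p * 1[i != pi_{t+1}] }.
    for t in range(T - 1, 0, -1):
        nxt = pi[t]  # speaker already decided for frame file t + 1
        decision = [f[t, i] + similarity[t, nxt] - (p if i != nxt else 0.0)
                    for i in range(M)]
        pi[t - 1] = int(np.argmax(decision))
    return pi
```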
Step 104: Determine the speaking object corresponding to the maximum decision value as the speaking object of the frame file, so as to recognize the speaking objects corresponding to all frame files in the target audio file.
In the embodiment of the present invention, the similarity values of the speaking objects in the target audio file can be determined; then, according to the determined similarity values and the preset dynamic programming model, the decision values of the speaking objects corresponding to any frame file are determined, and the speaking object corresponding to the maximum decision value is determined as the speaking object of that frame file. That is, the method provided in the embodiment of the present invention does not merely determine the similarity values of the speaking objects of each frame file: it also applies the preset dynamic programming model, i.e. it determines the decision values of the speaking objects of any frame file jointly from the speaking objects of the preceding and following frame files, and then selects the speaking object with the maximum decision value as the speaking object of the frame file, so that the speaking objects of the target audio file can be recognized more accurately.
In order to better explain the solution for determining the speaking objects of the frame files of the target audio file provided in the embodiment of the present invention, a specific example is described below.
Suppose the divided target audio file has 5 frame files, in order frame file 1, frame file 2, frame file 3, frame file 4, and frame file 5, and the acquisition scene corresponding to the target audio file contains two speaking objects, speaking object A and speaking object B. It should be noted that, for ease of understanding, speaking object A is hereinafter written as speaker A, and speaking object B as speaker B. An X-vector model can then be used to determine the similarity values of speaker A and speaker B for each of the 5 frame files, as shown in Table 1 below:
Similarity   Frame file 1   Frame file 2   Frame file 3   Frame file 4   Frame file 5
Speaker A    0.8            0.7            0.2            0.5            0.3
Speaker B    0.2            0.4            0.6            0.4            0.9

Table 1
Then the calculation can be carried out. Specifically, assume the preset parameter p = 0.5 and compute according to the preset state transition condition:

f(t+1, i) = max_j { f(t, j) - p·1_{i≠j} } + s(t+1, i).

In the initialization state, f(0, A) = f(0, B) = 0, and the values for frame file 1 can then be determined as f(1, A) = 0.8 and f(1, B) = 0.2. Further, the f(t, i) values for frame file 2 can be determined as:

f(2, A) = max{f(1, A), f(1, B) - 0.5} + 0.7 = max{0.8, -0.3} + 0.7 = 1.5
f(2, B) = max{f(1, B), f(1, A) - 0.5} + 0.4 = max{0.2, 0.3} + 0.4 = 0.7

Further, the f(t, i) values for frame file 3 can be determined as:

f(3, A) = max{f(2, A), f(2, B) - 0.5} + 0.2 = max{1.5, 0.2} + 0.2 = 1.7
f(3, B) = max{f(2, B), f(2, A) - 0.5} + 0.6 = max{0.7, 1.0} + 0.6 = 1.6

Further, the f(t, i) values for frame file 4 can be determined as:

f(4, A) = max{f(3, A), f(3, B) - 0.5} + 0.5 = max{1.7, 1.1} + 0.5 = 2.2
f(4, B) = max{f(3, B), f(3, A) - 0.5} + 0.4 = max{1.6, 1.2} + 0.4 = 2.0

Further, the f(t, i) values for frame file 5 can be determined as:

f(5, A) = max{f(4, A), f(4, B) - 0.5} + 0.3 = max{2.2, 1.5} + 0.3 = 2.5
f(5, B) = max{f(4, B), f(4, A) - 0.5} + 0.9 = max{2.0, 1.7} + 0.9 = 2.9
Specifically, summarizing the f(t, i) values of the above frame files gives Table 2 below:

f(t, i)      Frame file 1   Frame file 2   Frame file 3   Frame file 4   Frame file 5
Speaker A    0.8            1.5            1.7            2.2            2.5
Speaker B    0.2            0.7            1.6            2.0            2.9

Table 2
When determining the speaking object of the last frame file, i.e. frame file 5, the decision values may be determined according to the first preset dynamic programming sub-model, i.e. π_5 = argmax_i f(5, i) is computed: the speaking object corresponding to the maximum f(5, i) is selected from 2.5 and 2.9, and the speaking object of frame file 5 is determined to be B.
Further, the decision values of the first 4 frame files may be determined according to the second preset dynamic programming sub-model. Specifically, for frame file 4, i.e. t = 4, the decision value of speaking object A corresponding to frame file 4 is:

f(4, A) + s(5, B) - p = 2.2 + 0.9 - 0.5 = 2.6

and the decision value of speaking object B corresponding to frame file 4 is:

f(4, B) + s(5, B) = 2.0 + 0.9 = 2.9

Obviously, the decision value of speaking object B is the largest, so the speaking object of frame file 4 is determined to be B, i.e. π_4 = B.
Specifically, for frame file 3, i.e. t = 3, the decision value of speaking object A corresponding to frame file 3 is:

f(3, A) + s(4, B) - p = 1.7 + 0.4 - 0.5 = 1.6

and the decision value of speaking object B corresponding to frame file 3 is:

f(3, B) + s(4, B) = 1.6 + 0.4 = 2.0

Obviously, the decision value of speaking object B is the largest, so the speaking object of frame file 3 is determined to be B, i.e. π_3 = B.
Specifically, for frame file 2, i.e. t = 2, the decision value of speaking object A corresponding to frame file 2 is:

f(2, A) + s(3, B) - p = 1.5 + 0.6 - 0.5 = 1.6

and the decision value of speaking object B corresponding to frame file 2 is:

f(2, B) + s(3, B) = 0.7 + 0.6 = 1.3

Obviously, the decision value of speaking object A is the largest, so the speaking object of frame file 2 is determined to be A, i.e. π_2 = A.
Specifically, for frame file 1, i.e. t = 1, the decision value of speaking object A corresponding to frame file 1 is:

f(1, A) + s(2, A) = 0.8 + 0.7 = 1.5

and the decision value of speaking object B corresponding to frame file 1 is:

f(1, B) + s(2, A) - p = 0.2 + 0.7 - 0.5 = 0.4

Obviously, the decision value of speaking object A is the largest, so the speaking object of frame file 1 is determined to be A, i.e. π_1 = A.
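As a check, feeding the Table 1 similarity values into the identify_speakers sketch above with p = 0.5 reproduces the decisions derived by hand:

```python
import numpy as np

# Similarity values of Table 1 (rows: frame files 1-5; columns: A, B).
similarity = np.array([[0.8, 0.2],
                       [0.7, 0.4],
                       [0.2, 0.6],
                       [0.5, 0.4],
                       [0.3, 0.9]])

pi = identify_speakers(similarity, p=0.5)
print([["A", "B"][i] for i in pi])  # -> ['A', 'A', 'B', 'B', 'B']
```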
From the above specific example it can be clearly seen that the solution for determining the speaking objects of the frame files of the target audio file provided by the embodiment of the present invention determines them in order from the last frame file to the first frame file. In this way, the speaking object of each frame file is determined in combination with its preceding and following frame files, which improves the accuracy of determining the speaking objects.
The process of recognizing the speaking objects of an audio file whose recording has been completed is described below.
Referring to FIG. 2, in the embodiment of the present invention, the specific flow of recognizing speaking objects is as follows:
Step 201: Acquire an audio file, wherein the audio file represents a file recording the voice-interaction information of M speaking objects, and M is a positive integer greater than 1.
Step 202: Divide the audio file based on the preset frame length to obtain multiple frame files, so as to determine the target audio file, wherein the preset frame length is determined according to the acquisition scene corresponding to the audio file.
Step 203: Determine the M similarity values of the M speaking objects corresponding to each frame file, wherein a similarity value represents the similarity between the speech of the frame file and the speech of a speaking object.
Step 204: Determine, according to the similarity values of the M speaking objects corresponding to any frame file and the preset dynamic programming model, the decision values of the M speaking objects corresponding to the frame file, and determine the speaking object corresponding to the maximum decision value as the speaking object of the frame file.
In the embodiment of the present invention, for the specific implementations of step 201 and step 202, reference may be made to the foregoing steps a1-a2, and for step 203 and step 204, to the foregoing steps 103 and 104; details are not repeated here.
In the embodiment of the present invention, for a target audio file whose recording has been completed, dividing the audio file at the relatively fine granularity determined by the acquisition scene corresponding to the audio file largely ensures that each frame file contains the voice information of only one speaking object. Further, because the division granularity is fine, the voice information of a frame file and that of its preceding and following frames may belong to one utterance of the same person; the decision value can therefore be determined by jointly considering whether adjacent frame files share the same speaking object and the similarity values of the frame file, so that the speaking object of each frame file can be determined from the decision values relatively accurately, which improves the accuracy of recognizing the speaking objects of the completed target audio file.
The process of recognizing the speaking objects of an audio file that is being recorded is described below.
Referring to FIG. 3, in the embodiment of the present invention, the specific flow of recognizing speaking objects is as follows:
Step 301: Determine a target audio file, wherein the target audio file is determined according to the completed part of the file being recorded.
Step 302: Determine the M similarity values of the M speaking objects corresponding to each frame file, wherein a similarity value represents the similarity between the speech of the frame file and the speech of a speaking object, and M is a positive integer greater than 1.
Step 303: Determine, according to the similarity values of the M speaking objects corresponding to any frame file and the preset dynamic programming model, the decision values of the M speaking objects corresponding to the frame file, and determine the speaking object corresponding to the maximum decision value as the speaking object of the frame file.
Step 304: Determine whether the second file, i.e. the newly completed part of the file being recorded, is sufficient to be divided based on the preset frame length; if so, go to step 301; if not, repeat step 304.
In the embodiment of the present invention, for the specific implementation of step 301, reference may be made to the foregoing steps b1-b4, and for step 302 and step 303, to the foregoing steps 102 to 104; details are not repeated here. Specifically, after the speaking objects of the first target audio file determined from the file being recorded have been recognized, step 304 may be performed, i.e. a new target audio file is determined and speaking-object recognition is performed on the new target audio file.
In the embodiment of the present invention, when speaking-object recognition is performed on any file after the first file (for example, the second file), the frame file preceding its first frame is taken to be the last frame file of the first file. This avoids misjudgments that would arise when the last frame file of the target audio file cut from the first file and the first frame file of the target audio file obtained from the second file are different frames of one sentence of one speaking object, so the speaking objects of the entire file being recorded can be recognized relatively accurately.
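One way to realize this cross-file continuity (a sketch under the assumption that the same recursion is reused, not a prescription of the disclosure) is to carry the final f-row of each target audio file into the next one as the initial state:

```python
import numpy as np

def identify_speakers_streaming(similarity, p, f_prev=None):
    """Like identify_speakers, but for one target audio file cut from a
    recording in progress.

    f_prev: the final f-row of the previous target audio file (None for
        the first one); using it as the initial state makes the first
        frame file of this chunk "see" the last frame file of the
        previous chunk instead of an all-zero f(0, .).
    Returns (pi, f_last), where f_last seeds the next call.
    """
    T, M = similarity.shape
    f = np.zeros((T + 1, M))
    if f_prev is not None:
        f[0] = f_prev
    for t in range(1, T + 1):
        for i in range(M):
            best = max(f[t - 1, j] - (p if i != j else 0.0) for j in range(M))
            f[t, i] = best + similarity[t - 1, i]

    pi = [0] * T
    pi[T - 1] = int(np.argmax(f[T]))
    for t in range(T - 1, 0, -1):
        nxt = pi[t]
        decision = [f[t, i] + similarity[t, nxt] - (p if i != nxt else 0.0)
                    for i in range(M)]
        pi[t - 1] = int(np.argmax(decision))
    return pi, f[T]
```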
In addition, an embodiment of the present invention further provides an apparatus for recognizing speaking objects. Referring to FIG. 4, the apparatus for recognizing speaking objects includes:
a first determining unit 401, configured to determine the similarity values of the speaking objects in a target audio file;
a second determining unit 402, configured to determine, according to the determined similarity values and a preset dynamic programming model, the decision value corresponding to the speaking object corresponding to any frame file of the target audio file, wherein the preset dynamic programming model is used to determine the decision value of the speaking object corresponding to the any frame file according to the speaking objects corresponding to the preceding and following frame files of the any frame file; and
a recognition unit 403, configured to determine the speaking object corresponding to the maximum decision value as the speaking object of the any frame file, so as to recognize the speaking objects corresponding to all frame files in the target audio file.
In a possible implementation, the second determining unit 402 is configured to:
process the similarities of the M speaking objects corresponding to the any frame file according to a preset state transition condition, and determine the associated values of the M speaking objects corresponding to the any frame file, wherein the preset state transition condition is used to recursively deduce the associated values of the any frame file from the associated values of the preceding frame file of the any frame file, and M is a positive integer greater than 1; and
determine, according to the associated values and the preset dynamic programming model, the decision values of the M speaking objects corresponding to the any frame file.
In a possible implementation, the second determining unit 402 is configured to:
determine whether the any frame file is the last frame file of the multiple frame files, the multiple frame files being arranged in division order;
if the any frame file is the last frame file, determine a first preset dynamic programming sub-model from the preset dynamic programming model, and determine the decision values of the M speaking objects corresponding to the any frame file according to the associated values of the M speaking objects corresponding to the any frame file and the first preset dynamic programming sub-model; and
if it is determined that the any frame file is not the last frame file, determine a second preset dynamic programming sub-model from the preset dynamic programming model, and determine the decision values of the M speaking objects corresponding to the any frame file according to the associated values of the M speaking objects corresponding to the any frame file and the second preset dynamic programming sub-model.
In a possible implementation, the second determining unit 402 is configured to:
input the associated values of the M speaking objects corresponding to the last frame file into the first preset dynamic programming sub-model to obtain the decision values of the M speaking objects corresponding to the last frame file,
wherein the first preset dynamic programming sub-model is used to output the associated values of the M speaking objects as the decision values of the M speaking objects corresponding to the last frame file.
In a possible implementation, the second determining unit 402 is configured to:
input, into the second preset dynamic programming sub-model, the associated value of any speaking object corresponding to the any frame file, the similarity of the speaking object of the following frame file of the any frame file, and the indicator value of whether the speaking object of the following frame file is the same as the any speaking object together with a preset weight value, to obtain the decision value of the any speaking object corresponding to the any frame file, so as to determine the decision values of the M speaking objects corresponding to the any frame file,
wherein the second preset dynamic programming sub-model is used to add the associated value of the any speaking object and the similarity of the speaking object of the following frame file to obtain a first processed value, to determine a second processed value according to the indicator value and the preset weight value, and to determine the decision value according to the first processed value and the second processed value.
In a possible implementation, the apparatus further includes a processing unit configured to:
determine attribute information of an audio file, wherein the attribute information represents whether the recording of the audio file has been completed;
if the audio file is determined to be a recording-completed file according to the attribute information, divide the audio file based on the preset frame length to determine the target audio file; and
if the audio file is determined to be a file being recorded, divide the first file, i.e. the completed part of the file being recorded, based on the preset frame length to determine a first audio file, and take the first audio file as the target audio file, so as to determine the target audio file.
In a possible implementation, the processing unit is further configured to:
if the speaking object corresponding to each frame file in the first audio file has been recognized, divide the second file, i.e. the newly completed part of the file being recorded, based on the preset frame length to determine a second audio file; and
take the second audio file as the target audio file, so as to determine the target audio file,
wherein the second file represents the audio file whose recording has been completed in the file being recorded and whose speaking objects have not been recognized.
The specific implementation of the apparatus for recognizing speaking objects provided by the embodiment of the present invention is substantially the same as the embodiments of the method for recognizing speaking objects described above, and details are not repeated here.
In addition, an embodiment of the present invention further provides a device for recognizing speaking objects. As shown in FIG. 5, FIG. 5 is a schematic structural diagram of the hardware operating environment involved in the solution of the embodiment of the present invention.
It should be noted that FIG. 5 is the schematic structural diagram of the hardware operating environment of the device for recognizing speaking objects. The device for recognizing speaking objects in the embodiment of the present invention may be the aforementioned server or terminal.
As shown in FIG. 5, the device for recognizing speaking objects may include: a processor 501 such as a CPU, a memory 505, a user interface 503, a network interface 504, and a communication bus 502. The communication bus 502 is used to realize connection and communication among these components. The user interface 503 may include a display and an input unit such as a keyboard, and optionally may further include standard wired and wireless interfaces. The network interface 504 may optionally include standard wired and wireless interfaces (such as a WI-FI interface). The memory 505 may be a high-speed RAM memory or a stable non-volatile memory such as a magnetic disk memory; optionally, the memory 505 may also be a storage device independent of the aforementioned processor 501.
Those skilled in the art will understand that the structure of the device for recognizing speaking objects shown in FIG. 5 does not constitute a limitation on the device; the device may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
As shown in FIG. 5, the memory 505, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a program for recognizing speaking objects. The operating system is a program that manages and controls the hardware and software resources of the device and supports the running of the program for recognizing speaking objects and other software or programs.
In the device for recognizing speaking objects shown in FIG. 5, the user interface 503 is mainly used to connect to an acquisition device and exchange data with it, so that audio files can be collected and/or obtained through the acquisition device; the network interface 504 is mainly used to connect to a background server and exchange data with it; and the processor 501 may be used to call the program for recognizing speaking objects stored in the memory 505 and perform the steps of the method for recognizing speaking objects described above.
The specific implementation of the device for recognizing speaking objects of the present invention is substantially the same as the embodiments of the method for recognizing speaking objects described above, and details are not repeated here.
In addition, an embodiment of the present invention further provides a computer-readable storage medium storing a program for recognizing speaking objects, which, when executed by a processor, implements the steps of the method for recognizing speaking objects described above.
The specific implementation of the computer-readable storage medium of the present invention is substantially the same as the embodiments of the method for recognizing speaking objects described above, and details are not repeated here.
In addition, the present invention further provides a computer program product containing instructions. The computer program product includes a computer program stored on a computer-readable storage medium; the computer program includes program instructions which, when executed by a computer, cause the computer to perform the steps of recognizing speaking objects described above.
The specific implementation of the computer program product of the present invention is substantially the same as the embodiments of the method for recognizing speaking objects described above, and details are not repeated here.
Those skilled in the art will understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk memories, CD-ROMs, and optical memories) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
It should be noted that, herein, the terms "comprise", "include", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the superiority or inferiority of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc), including several instructions for causing a terminal device or server (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit the patent scope of the present invention. Any equivalent structural or flow transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present invention.

Claims (17)

  1. A method for recognizing a speaking object, wherein the method comprises:
    determining similarity values of speaking objects in a target audio file;
    determining, according to the determined similarity values and a preset dynamic programming model, a decision value corresponding to a speaking object corresponding to any frame file of the target audio file, wherein the preset dynamic programming model is used to determine the decision value of the speaking object corresponding to the any frame file according to the speaking objects corresponding to the preceding and following frame files of the any frame file; and
    determining the speaking object corresponding to the maximum value among the decision values as the speaking object of the any frame file, so as to recognize the speaking objects corresponding to all frame files in the target audio file.
  2. The method according to claim 1, wherein the determining, according to the determined similarity values and the preset dynamic programming model, the decision value corresponding to the speaking object corresponding to any frame file of the target audio file comprises:
    processing the similarities of M speaking objects corresponding to the any frame file according to a preset state transition condition, and determining associated values of the M speaking objects corresponding to the any frame file, wherein the preset state transition condition is used to recursively deduce the associated values of the any frame file from the associated values of the preceding frame file of the any frame file, and M is a positive integer greater than 1; and
    determining, according to the associated values and the preset dynamic programming model, decision values of the M speaking objects corresponding to the any frame file.
  3. The method according to claim 2, wherein the determining, according to the associated values and the preset dynamic programming model, the decision values of the M speaking objects corresponding to the any frame file comprises:
    determining whether the any frame file is the last frame file of the multiple frame files, the multiple frame files being arranged in division order;
    if the any frame file is the last frame file, determining a first preset dynamic programming sub-model from the preset dynamic programming model, and determining the decision values of the M speaking objects corresponding to the any frame file according to the associated values of the M speaking objects corresponding to the any frame file and the first preset dynamic programming sub-model; and
    if it is determined that the any frame file is not the last frame file, determining a second preset dynamic programming sub-model from the preset dynamic programming model, and determining the decision values of the M speaking objects corresponding to the any frame file according to the associated values of the M speaking objects corresponding to the any frame file and the second preset dynamic programming sub-model.
  4. The method according to claim 3, wherein the determining the decision values of the M speaking objects corresponding to the any frame file according to the associated values of the M speaking objects corresponding to the any frame file and the first preset dynamic programming sub-model comprises:
    inputting the associated values of the M speaking objects corresponding to the last frame file into the first preset dynamic programming sub-model to obtain the decision values of the M speaking objects corresponding to the last frame file,
    wherein the first preset dynamic programming sub-model is used to output the associated values of the M speaking objects as the decision values of the M speaking objects corresponding to the last frame file.
  5. The method according to claim 3, wherein the determining the decision values of the M speaking objects corresponding to the any frame file according to the associated values of the M speaking objects corresponding to the any frame file and the second preset dynamic programming sub-model comprises:
    inputting, into the second preset dynamic programming sub-model, the associated value of any speaking object corresponding to the any frame file, the similarity of the speaking object of the following frame file of the any frame file, and the indicator value of whether the speaking object of the following frame file is the same as the any speaking object together with a preset weight value, to obtain the decision value of the any speaking object corresponding to the any frame file, so as to determine the decision values of the M speaking objects corresponding to the any frame file,
    wherein the second preset dynamic programming sub-model is used to add the associated value of the any speaking object and the similarity of the speaking object of the following frame file to obtain a first processed value, to determine a second processed value according to the indicator value and the preset weight value, and to determine the decision value according to the first processed value and the second processed value.
  6. The method according to claim 1, wherein before the determining the similarity values of the speaking objects of the target audio file, the method further comprises:
    determining attribute information of an audio file, wherein the attribute information represents whether the recording of the audio file has been completed;
    if the audio file is determined to be a recording-completed file according to the attribute information, dividing the audio file based on the preset frame length to determine the target audio file; and
    if the audio file is determined to be a file being recorded, dividing a first file, i.e. the completed part of the file being recorded, based on the preset frame length to determine a first audio file, and taking the first audio file as the target audio file, so as to determine the target audio file.
  7. The method according to claim 6, wherein the method further comprises:
    if the speaking object corresponding to each frame file in the first audio file has been recognized, dividing a second file, i.e. the newly completed part of the file being recorded, based on the preset frame length to determine a second audio file;
    taking the second audio file as the target audio file, so as to determine the target audio file,
    wherein the second file represents the audio file whose recording has been completed in the file being recorded and whose speaking objects have not been recognized.
  8. An apparatus for recognizing a speaking object, comprising:
    a first determining unit, configured to determine similarity values of speaking objects in a target audio file;
    a second determining unit, configured to determine, according to the determined similarity values and a preset dynamic programming model, a decision value corresponding to a speaking object corresponding to any frame file of the target audio file, wherein the preset dynamic programming model is used to determine the decision value of the speaking object corresponding to the any frame file according to the speaking objects corresponding to the preceding and following frame files of the any frame file; and
    a recognition unit, configured to determine the speaking object corresponding to the maximum value among the decision values as the speaking object of the any frame file, so as to recognize the speaking objects corresponding to all frame files in the target audio file.
  9. The apparatus according to claim 8, wherein the second determining unit is configured to:
    process the similarities of M speaking objects corresponding to the any frame file according to a preset state transition condition, and determine associated values of the M speaking objects corresponding to the any frame file, wherein the preset state transition condition is used to recursively deduce the associated values of the any frame file from the associated values of the preceding frame file of the any frame file, and M is a positive integer greater than 1; and
    determine, according to the associated values and the preset dynamic programming model, decision values of the M speaking objects corresponding to the any frame file.
  10. The apparatus according to claim 9, wherein the second determining unit is configured to:
    determine whether the any frame file is the last frame file of the multiple frame files, the multiple frame files being arranged in division order;
    if the any frame file is the last frame file, determine a first preset dynamic programming sub-model from the preset dynamic programming model, and determine the decision values of the M speaking objects corresponding to the any frame file according to the associated values of the M speaking objects corresponding to the any frame file and the first preset dynamic programming sub-model; and
    if it is determined that the any frame file is not the last frame file, determine a second preset dynamic programming sub-model from the preset dynamic programming model, and determine the decision values of the M speaking objects corresponding to the any frame file according to the associated values of the M speaking objects corresponding to the any frame file and the second preset dynamic programming sub-model.
  11. The apparatus according to claim 10, wherein the second determining unit is configured to:
    input the associated values of the M speaking objects corresponding to the last frame file into the first preset dynamic programming sub-model to obtain the decision values of the M speaking objects corresponding to the last frame file,
    wherein the first preset dynamic programming sub-model is used to output the associated values of the M speaking objects as the decision values of the M speaking objects corresponding to the last frame file.
  12. The apparatus according to claim 10, wherein the second determining unit is configured to:
    input, into the second preset dynamic programming sub-model, the associated value of any speaking object corresponding to the any frame file, the similarity of the speaking object of the following frame file of the any frame file, and the indicator value of whether the speaking object of the following frame file is the same as the any speaking object together with a preset weight value, to obtain the decision value of the any speaking object corresponding to the any frame file, so as to determine the decision values of the M speaking objects corresponding to the any frame file,
    wherein the second preset dynamic programming sub-model is used to add the associated value of the any speaking object and the similarity of the speaking object of the following frame file to obtain a first processed value, to determine a second processed value according to the indicator value and the preset weight value, and to determine the decision value according to the first processed value and the second processed value.
  13. The apparatus according to claim 8, wherein the apparatus further comprises a processing unit configured to:
    determine attribute information of an audio file, wherein the attribute information represents whether the recording of the audio file has been completed;
    if the audio file is determined to be a recording-completed file according to the attribute information, divide the audio file based on the preset frame length to determine the target audio file; and
    if the audio file is determined to be a file being recorded, divide a first file, i.e. the completed part of the file being recorded, based on the preset frame length to determine a first audio file, and take the first audio file as the target audio file, so as to determine the target audio file.
  14. The apparatus according to claim 13, wherein the processing unit is further configured to:
    if the speaking object corresponding to each frame file in the first audio file has been recognized, divide a second file, i.e. the newly completed part of the file being recorded, based on the preset frame length to determine a second audio file;
    take the second audio file as the target audio file, so as to determine the target audio file,
    wherein the second file represents the audio file whose recording has been completed in the file being recorded and whose speaking objects have not been recognized.
  15. A device for recognizing a speaking object, wherein the device comprises a memory, a processor, and a program for recognizing speaking objects stored on the memory and executable on the processor, wherein the program for recognizing speaking objects, when executed by the processor, implements the steps of the method for recognizing a speaking object according to any one of claims 1 to 7.
  16. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the method for recognizing a speaking object according to any one of claims 1 to 7.
  17. A computer program product, wherein the computer program product comprises a computer program stored on a computer-readable storage medium, the computer program comprises program instructions, and the program instructions, when executed by a computer, cause the computer to perform the steps of the method for recognizing a speaking object according to any one of claims 1 to 7.
PCT/CN2021/098659 2020-07-17 2021-06-07 Method, apparatus and device for recognizing speaking object, and readable storage medium WO2022012215A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010693717.4A CN111883138A (zh) 2020-07-17 2020-07-17 Method, apparatus and device for recognizing speaking object, and readable storage medium
CN202010693717.4 2020-07-17

Publications (1)

Publication Number Publication Date
WO2022012215A1 true WO2022012215A1 (zh) 2022-01-20

Family

ID=73154810

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/098659 WO2022012215A1 (zh) 2020-07-17 2021-06-07 一种识别说话对象的方法、装置、设备及可读存储介质

Country Status (2)

Country Link
CN (1) CN111883138A (zh)
WO (1) WO2022012215A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883138A (zh) * 2020-07-17 2020-11-03 深圳前海微众银行股份有限公司 一种识别说话对象的方法、装置、设备及可读存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101350196A (zh) * 2007-07-19 2009-01-21 丁玉国 Task-related speaker identity verification system-on-chip and verification method thereof
CN101436405A (zh) * 2008-12-25 2009-05-20 北京中星微电子有限公司 Speaker recognition method and system
US20110276323A1 (en) * 2010-05-06 2011-11-10 Senam Consulting, Inc. Speech-based speaker recognition systems and methods
CN108899033A (zh) * 2018-05-23 2018-11-27 出门问问信息科技有限公司 Method and device for determining speaker characteristics
CN109564759A (zh) * 2016-08-03 2019-04-02 思睿逻辑国际半导体有限公司 Speaker recognition
CN109785846A (zh) * 2019-01-07 2019-05-21 平安科技(深圳)有限公司 Role recognition method and device for monaural voice data
CN111883138A (zh) * 2020-07-17 2020-11-03 深圳前海微众银行股份有限公司 Method, apparatus and device for recognizing speaking object, and readable storage medium

Also Published As

Publication number Publication date
CN111883138A (zh) 2020-11-03

Similar Documents

Publication Publication Date Title
WO2022126971A1 (zh) Density-based text clustering method, apparatus and device, and storage medium
CN108463849B (zh) A computer-implemented method and computing system
CN108630193B (zh) Speech recognition method and device
US20220254348A1 (en) Automatically generating a meeting summary for an information handling system
US10956480B2 (en) System and method for generating dialogue graphs
US20020143532A1 (en) Method and system for collaborative speech recognition for small-area network
WO2020007129A1 (zh) Context acquisition method and device based on voice interaction
WO2020238209A1 (zh) Audio processing method and system, and related device
WO2021218069A1 (zh) Interaction processing method and apparatus based on dynamic scene configuration, and computer device
EP3115907A1 (en) Common data repository for improving transactional efficiencies of user interactions with a computing device
US10973458B2 (en) Daily cognitive monitoring of early signs of hearing loss
US20230139628A1 (en) Supporting automation of customer service
US11341959B2 (en) Conversation sentiment identifier
KR102476099B1 (ko) Method and apparatus for generating a meeting-minutes browsing document
CN105893351B (zh) Speech recognition method and device
US8868419B2 (en) Generalizing text content summary from speech content
WO2019148585A1 (zh) Conference summary generation method and apparatus
WO2022037600A1 (zh) Abstract recording method and apparatus, computer device, and storage medium
US20200403816A1 (en) Utilizing volume-based speaker attribution to associate meeting attendees with digital meeting content
US11380309B2 (en) Meeting support system, meeting support device, and meeting support program
WO2022012215A1 (zh) Method, apparatus and device for recognizing speaking object, and readable storage medium
KR20170126667A (ko) Method for automatically generating meeting minutes and apparatus therefor
KR102030551B1 (ko) Instant messenger driving apparatus and operating method thereof
WO2022178933A1 (zh) Context-based speech emotion detection method, apparatus and device, and storage medium
CN113055751B (zh) Data processing method and apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21843227

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 28.04.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21843227

Country of ref document: EP

Kind code of ref document: A1