WO2023083142A1 - Sentence segmentation method and apparatus, storage medium, and electronic device - Google Patents

Sentence segmentation method and apparatus, storage medium, and electronic device

Info

Publication number
WO2023083142A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
speaker
target audio
period
speech recognition
Application number
PCT/CN2022/130352
Other languages
French (fr)
Chinese (zh)
Inventor
孙修松
刘艺
何怡
马泽君
Original Assignee
北京有竹居网络技术有限公司
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023083142A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
    • G10L2015/227 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of the speaker; Human-factor methodology

Definitions

  • the present disclosure relates to the technical field of speech recognition, and in particular, to a sentence segmentation method, device, storage medium and electronic equipment.
  • a sentence segmentation device, including:
  • An acquisition module configured to acquire target audio data
  • An extraction module configured to extract the speech recognition text corresponding to the target audio data obtained by the acquisition module and the first period corresponding to each recognized character in the speech recognition text in the target audio data;
  • a first segmentation module configured to perform speaker segmentation on the target audio data acquired by the acquisition module, to obtain a second period corresponding to each speech segment in the target audio data
  • a second segmentation module configured to perform speaker segmentation on the speech recognition text according to the first period corresponding to each of the recognized characters extracted by the extraction module and the second period corresponding to each of the speech segments obtained by the first segmentation module, to obtain sentence segmentation results.
  • the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the method provided in the first aspect of the present disclosure are implemented.
  • an electronic device including:
  • One or more processing devices configured to execute the one or more computer programs in the storage device to implement the steps of the method provided in the first aspect of the present disclosure.
  • the target audio data is acquired; the speech recognition text corresponding to the target audio data, and the first period corresponding, in the target audio data, to each recognized character in the speech recognition text, are extracted; at the same time, speaker segmentation is performed on the target audio data to obtain the second period corresponding to each speech segment in the target audio data; then, speaker segmentation is performed on the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment, to obtain the sentence segmentation result.
  • the speaker segment information and the time period corresponding, in the target audio data, to each character in the speech recognition text can be effectively used to perform speaker segmentation on the speech recognition text, so that the text is segmented reasonably and effectively at speaker transition points, a single clause is prevented from containing the content of multiple speakers, and the sentence segmentation effect is improved.
  • Fig. 1 is a flow chart of a sentence segmentation method according to an exemplary embodiment.
  • Fig. 2 is a flow chart of a sentence segmentation method according to another exemplary embodiment.
  • Fig. 3 is a flow chart of a sentence segmentation method according to another exemplary embodiment.
  • Fig. 4 is a block diagram of a sentence segmentation device according to an exemplary embodiment.
  • Fig. 5 is a block diagram of an electronic device according to an exemplary embodiment.
  • the term “comprise” and its variations are open-ended, i.e., “including but not limited to”.
  • the term “based on” is “based at least in part on”.
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one further embodiment”; the term “some embodiments” means “at least some embodiments.” Relevant definitions of other terms will be given in the description below.
  • Fig. 1 is a flow chart of a sentence segmentation method according to an exemplary embodiment. As shown in FIG. 1, the method may include S101-S104.
  • target audio data is acquired.
  • the target audio data may include a plurality of speaker voice segments.
  • the target audio data may be a multi-speaker dialogue recording, or may be an audio segment in a multi-speaker dialogue scene video.
  • the speech recognition text corresponding to the target audio data and the first period corresponding to each recognized character in the target audio data in the speech recognition text are extracted.
  • automatic speech recognition technology can be used to perform speech recognition on the target audio data, so as to obtain the speech recognition text and the start time and end time, in the target audio data, of each character in the speech recognition text (herein referred to as a recognized character), that is, the first period.
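  • as a non-limiting illustration, the recognized characters and their first periods might be represented as below. The class name `RecognizedChar`, the millisecond unit, and the sample values are assumptions made for this sketch and are not taken from the disclosure; no particular speech recognition engine is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognizedChar:
    """One recognized character and its first period (hypothetical layout)."""
    char: str
    start_ms: int  # start time of the character in the target audio data
    end_ms: int    # end time of the character in the target audio data


# Hypothetical recognizer output for a two-character text.
recognized = [RecognizedChar("h", 0, 120), RecognizedChar("i", 120, 260)]

# The speech recognition text and the list of first periods.
text = "".join(c.char for c in recognized)
first_periods = [(c.start_ms, c.end_ms) for c in recognized]
```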
  • speaker segmentation is performed on the target audio data to obtain a second time period corresponding to each speech segment in the target audio data.
  • performing speaker segmentation on the target audio refers to detecting a speaker transition point in the target audio data, and taking the speech between two adjacent speaker transition points as a speech segment.
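  • the relationship between speaker transition points and speech segments can be sketched as follows, assuming the transition points have already been detected and are given as millisecond timestamps. The function name, the unit, and the choice to treat the start and end of the audio as segment boundaries are assumptions of this sketch.

```python
def segments_from_transition_points(transition_points_ms, audio_duration_ms):
    """Return the (start, end) period of each speech segment.

    A speech segment is the audio between two adjacent speaker transition
    points; the beginning (0) and end of the audio are also used as
    boundaries, which is an assumption of this sketch.
    """
    inner = sorted(
        t for t in set(transition_points_ms) if 0 < t < audio_duration_ms
    )
    boundaries = [0] + inner + [audio_duration_ms]
    return list(zip(boundaries[:-1], boundaries[1:]))


# Two transition points in a 5-second recording give three speech segments.
second_periods = segments_from_transition_points([1200, 3400], 5000)
```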
  • speaker segmentation can be performed through various implementations.
  • the target audio data can be segmented by speaker according to received manually input segmentation marks, so as to determine the start time and end time of each speech segment in the target audio data, that is, the second period corresponding to each speech segment in the target audio data.
  • the target audio data may be input into a pre-trained speaker recognition model to perform speaker segmentation on the target audio data to obtain the second period corresponding to each speech segment in the target audio data.
  • speech fragments of different speakers can be automatically segmented, which is convenient and quick, thereby improving the efficiency of sentence segmentation of subsequent speech recognition texts.
  • the speaker transition point characters belong to the previous clause.
  • the speech recognition text is: Have you eaten? Have you eaten yet? I haven't.
  • if the speaker transition point characters are the two "?" characters, the speech recognition text is segmented into "Have you eaten?", "Have you eaten yet?", and "I haven't."
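  • the rule that a speaker transition point character belongs to the previous clause can be sketched as follows. The function name and the sample string are hypothetical; the sketch assumes the transition point characters are given as character indices into the speech recognition text.

```python
def split_at_transition_chars(text, transition_indices):
    """Split text so that each transition point character ends its clause."""
    clauses = []
    start = 0
    for idx in sorted(set(transition_indices)):
        if not 0 <= idx < len(text):
            raise IndexError(f"transition index {idx} is outside the text")
        # The transition point character itself stays in the previous clause.
        clauses.append(text[start:idx + 1])
        start = idx + 1
    if start < len(text):
        clauses.append(text[start:])
    return clauses


# Characters at indices 2 and 4 are the (hypothetical) transition points.
clauses = split_at_transition_chars("abcdefg", [2, 4])
```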
  • the specific implementation of determining the speaker conversion point characters from the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment in the above step (1) will be described in detail below.
  • the recognized characters in the speech recognition text corresponding to the first period including the end time of the second period may be determined as speaker transition point characters.
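  • a minimal sketch of this containment test is given below. It assumes that periods are (start, end) tuples in milliseconds and that the interval is closed at both ends; neither detail is specified in the disclosure, and the function name is hypothetical.

```python
def transition_chars_by_containment(char_periods, segment_periods):
    """Return indices of characters whose first period contains the end
    time of some second period.

    char_periods: (start, end) of each recognized character (first periods).
    segment_periods: (start, end) of each speech segment (second periods).
    """
    indices = []
    for _, seg_end in segment_periods:
        for i, (c_start, c_end) in enumerate(char_periods):
            if c_start <= seg_end <= c_end:  # closed interval: an assumption
                indices.append(i)
                break  # keep only the first matching character
    return indices


char_periods = [(0, 200), (200, 400), (450, 650), (650, 900)]
segment_periods = [(0, 380), (380, 900)]
transition_indices = transition_chars_by_containment(char_periods, segment_periods)
```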
  • the speaker conversion point characters can be determined through the following steps (21) to (25):
  • (21) Input the speech recognition text into the pre-trained semantic model to obtain the probability that each recognized character in the speech recognition text belongs to a semantic break point, wherein the greater the probability that a recognized character belongs to a semantic break point, the more likely it is that a semantic sentence break falls at that recognized character.
  • the end time of each second period is extended forward and backward. Specifically, for each second period, the end time of the second period is extended forward by N ms and backward by M ms to obtain the speaker transition interval [end_time-N, end_time+M] corresponding to the second period, where end_time is the end time of the second period.
  • the start time of each second period is extended forward and backward. Specifically, for each second period, the start time of the second period is extended forward by N ms and backward by M ms to obtain the speaker transition interval [start_time-N, start_time+M] corresponding to the second period, where start_time is the start time of the second period.
  • M and N may or may not be equal, which is not specifically limited in the present disclosure.
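  • the two ways of building the speaker transition interval can be sketched as follows. The function names and the sample values of N and M are assumptions of this sketch; only the formulas [end_time-N, end_time+M] and [start_time-N, start_time+M] come from the description above.

```python
def speaker_transition_interval(boundary_ms, n_ms, m_ms):
    """Extend a boundary time forward by N ms and backward by M ms."""
    return (boundary_ms - n_ms, boundary_ms + m_ms)


def transition_intervals(second_periods, n_ms, m_ms, use_end_time=True):
    """Build one speaker transition interval per second period.

    use_end_time selects whether the end time or the start time of each
    second period is extended.
    """
    return [
        speaker_transition_interval(end if use_end_time else start, n_ms, m_ms)
        for start, end in second_periods
    ]


second_periods = [(0, 1200), (1200, 3400)]
by_end = transition_intervals(second_periods, n_ms=200, m_ms=300)
by_start = transition_intervals(second_periods, n_ms=200, m_ms=300, use_end_time=False)
```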
  • the preset time is one of the start time of the first period and the end time of the first period.
  • the character whose start time of the corresponding first period is within the speaker transition interval corresponding to the second period may be determined as a candidate character for the transition point.
  • the character whose end time of the corresponding first period is within the speaker transition interval corresponding to the second period may be determined as a candidate character for the transition point.
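  • selecting the transition point candidate characters for one speaker transition interval can be sketched as follows. The function name is hypothetical, and the inclusive comparison at both interval ends is an assumption of this sketch.

```python
def candidate_indices(char_periods, interval, use_start_time=True):
    """Return indices of characters whose preset time lies in the interval.

    The preset time is the start time of the first period when
    use_start_time is True, otherwise the end time of the first period.
    """
    low, high = interval
    result = []
    for i, (start, end) in enumerate(char_periods):
        preset = start if use_start_time else end
        if low <= preset <= high:  # inclusive bounds: an assumption
            result.append(i)
    return result


char_periods = [(800, 1000), (1050, 1250), (1300, 1480), (1600, 1800)]
by_start = candidate_indices(char_periods, (1000, 1500))
by_end = candidate_indices(char_periods, (1000, 1500), use_start_time=False)
```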
  • the pause duration of the candidate conversion point character is equal to the time interval between the start time of the recognized character after and adjacent to the candidate conversion point character in the speech recognition text and the end time of the candidate conversion point character.
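  • the pause duration defined above can be sketched as follows. The function name is hypothetical, and returning 0 for the last character (which has no following character) is an assumption of this sketch, since the description does not cover that case.

```python
def pause_duration(char_periods, index):
    """Gap between the end of a character and the start of the next one."""
    if index + 1 >= len(char_periods):
        return 0  # no following character; this fallback is an assumption
    next_start = char_periods[index + 1][0]
    this_end = char_periods[index][1]
    return next_start - this_end


char_periods = [(0, 200), (200, 400), (650, 900)]
pauses = [pause_duration(char_periods, i) for i in range(len(char_periods))]
```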
  • for each transition point candidate character, the weighted sum of the pause duration of the transition point candidate character and the probability that the transition point candidate character belongs to a semantic break point can be determined as the probability that the transition point candidate character belongs to a speaker transition point; the transition point candidate character with the highest probability of belonging to a speaker transition point is then determined as the speaker transition point character.
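  • the weighted-sum selection can be sketched as follows. The disclosure does not give the weights, the unit of the pause duration, or any normalization, so the equal weights, the use of seconds, and the treatment of the weighted sum as an unnormalized score are all assumptions of this sketch; the function name is hypothetical.

```python
def pick_transition_char(candidates, pause_s, break_prob,
                         w_pause=0.5, w_semantic=0.5):
    """Return (best_candidate, scores) for one second period.

    candidates: indices of the transition point candidate characters.
    pause_s: pause duration in seconds for each candidate index.
    break_prob: semantic break point probability for each candidate index.
    """
    if not candidates:
        raise ValueError("no transition point candidate characters")
    # Weighted sum of pause duration and semantic break point probability.
    scores = {
        i: w_pause * pause_s[i] + w_semantic * break_prob[i]
        for i in candidates
    }
    # The candidate with the largest score becomes the transition character.
    best = max(candidates, key=lambda i: scores[i])
    return best, scores


best, scores = pick_transition_char(
    [1, 2],
    pause_s={1: 0.05, 2: 0.40},
    break_prob={1: 0.30, 2: 0.70},
)
```

  • in this example, candidate 2 has both the longer pause and the higher semantic break point probability, so it is selected.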
  • Fig. 2 is a flow chart of a sentence segmentation method according to another exemplary embodiment. As shown in FIG. 2, the above method further includes the following S105.
  • the speech recognition text can be accurately divided into sentences according to the speaker and semantic information.
  • Fig. 3 is a flow chart of a sentence segmentation method according to another exemplary embodiment. As shown in FIG. 3, the above method further includes the following S106.
  • since each clause is obtained by accurately segmenting the speech recognition text according to the speaker and semantic information, speaker transitions can be segmented reasonably and effectively, which prevents a single screen of subtitles from containing the speech of different speakers at the same time and improves the user experience.
  • Fig. 4 is a block diagram of a sentence segmentation device according to an exemplary embodiment. As shown in Figure 4, the device 400 includes:
  • An acquisition module 401 configured to acquire target audio data
  • An extraction module 402 configured to extract the speech recognition text corresponding to the target audio data obtained by the acquisition module 401 and the first period corresponding to each recognized character in the speech recognition text in the target audio data ;
  • the first segmentation module 403 is configured to perform speaker segmentation on the target audio data acquired by the acquisition module 401, to obtain a second period corresponding to each speech segment in the target audio data;
  • the second segmentation module 404 is configured to perform speaker segmentation on the speech recognition text according to the first period corresponding to each of the recognized characters extracted by the extraction module 402 and the second period corresponding to each of the speech segments obtained by the first segmentation module 403, to obtain sentence segmentation results.
  • the first segmentation module 403 is configured to perform speaker segmentation on the target audio data according to received manually input segmentation marks, so as to determine the start time and end time of each speech segment in the target audio data, that is, the second period corresponding to each speech segment in the target audio data.
  • the second segmentation module 404 includes:
  • a first determining submodule configured to determine speaker transition point characters from the speech recognition text according to the first period corresponding to each of the recognized characters and the second period corresponding to each of the speech segments;
  • a segmentation submodule configured to perform speaker segmentation on the speech recognition text based on the speaker transition point characters.
  • the first determining submodule includes:
  • the second determining submodule is used to input the speech recognition text into the pre-trained semantic model to obtain the probability that each recognized character in the speech recognition text belongs to a semantic break point;
  • the extension submodule is used to extend, for each second period, the end time or start time of the second period to obtain the speaker transition interval corresponding to the second period;
  • the third determination submodule is used to determine, among the recognized characters, each character whose corresponding preset time is within the speaker transition interval corresponding to the second period as a transition point candidate character, wherein the preset time is one of the start time of the first period and the end time of the first period;
  • the fourth determination submodule is used to determine the pause duration of each of the transition point candidate characters corresponding to the second period;
  • the fifth determination submodule is used to determine the speaker transition point character from the transition point candidate characters corresponding to the second period according to the pause duration of each of the transition point candidate characters corresponding to the second period and the probability that the transition point candidate character belongs to a semantic break point.
  • the fifth determining submodule includes:
  • the sixth determination submodule is used to determine the weighted sum of the pause duration of the conversion point candidate character and the probability that the conversion point candidate character belongs to a semantic break point for each of the conversion point candidate characters corresponding to the second time period as The probability that the transition point candidate character belongs to the speaker transition point;
  • the seventh determination sub-module is used to determine the candidate character of the transition point with the highest probability of belonging to the speaker transition point among the transition point candidate characters corresponding to the second time period as the speaker transition point character.
  • the first segmentation module 403 is configured to input the target audio data into a pre-trained speaker recognition model to perform speaker segmentation on the target audio data, so as to obtain the second period corresponding to each speech segment in the target audio data.
  • the device 400 also includes:
  • the third segmentation module is configured to, for each clause in the sentence segmentation result, segment the clause according to semantics to obtain multiple clauses.
  • the device 400 also includes:
  • a generating module configured to generate subtitle text corresponding to the target audio data according to multiple clauses.
  • the present disclosure also provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the above-mentioned sentence segmentation method provided by the present disclosure are implemented.
  • referring to FIG. 5, it shows a schematic structural diagram of an electronic device (such as a terminal device or a server) 500 suitable for implementing an embodiment of the present disclosure.
  • the terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), and a vehicle-mounted terminal (such as a car navigation terminal), and fixed terminals such as a digital TV and a desktop computer.
  • the electronic device shown in FIG. 5 is only an example, and should not limit the functions and scope of use of the embodiments of the present disclosure.
  • an electronic device 500 may include a processing device (such as a central processing unit, a graphics processing unit, etc.) 501, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503.
  • in the RAM 503, various programs and data necessary for the operation of the electronic device 500 are also stored.
  • the processing device 501, ROM 502, and RAM 503 are connected to each other through a bus 504.
  • An input/output (I/O) interface 505 is also connected to the bus 504 .
  • the following devices can be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 508 including, for example, a magnetic tape, hard disk, etc.; and a communication device 509.
  • the communication means 509 may allow the electronic device 500 to perform wireless or wired communication with other devices to exchange data. While FIG. 5 shows electronic device 500 having various means, it is to be understood that implementing or having all of the means shown is not a requirement. More or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer readable medium, where the computer program includes program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from a network via communication means 509, or from storage means 508, or from ROM 502.
  • when the computer program is executed by the processing device 501, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two.
  • a computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted by any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the client and the server can communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include local area networks (“LANs”), wide area networks (“WANs”), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: acquire the target audio data; extract the speech recognition text corresponding to the target audio data and the first period corresponding, in the target audio data, to each recognized character in the speech recognition text; perform speaker segmentation on the target audio data to obtain the second period corresponding to each speech segment in the target audio data; and perform speaker segmentation on the speech recognition text according to the first period corresponding to each of the recognized characters and the second period corresponding to each of the speech segments, to obtain sentence segmentation results.
  • Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, or combinations thereof, including but not limited to object-oriented programming languages, such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as "C" or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or may be implemented by a combination of dedicated hardware and computer instructions.
  • the modules involved in the embodiments described in the present disclosure may be implemented by software or by hardware. Wherein, the name of the module does not constitute a limitation of the module itself under certain circumstances, for example, the obtaining module may also be described as "a module for obtaining target audio data".
  • exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), and Complex Programmable Logic Devices (CPLDs).
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer discs, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • Example 1 provides a sentence segmentation method, including: acquiring target audio data; extracting the speech recognition text corresponding to the target audio data and extracting each recognized The first period corresponding to the character in the target audio data; performing speaker segmentation on the target audio data to obtain the second period corresponding to each speech segment in the target audio data; according to each of the recognized characters Perform speaker segmentation on the speech recognition text corresponding to the first period of time and the second period of time corresponding to each of the utterance segments to obtain sentence segmentation results.
  • Example 2 provides the method of Example 1, according to the first period corresponding to each of the recognized characters and the second period corresponding to each of the speech segments time period, performing speaker segmentation on the voice recognition text, including: according to the first time period corresponding to each of the recognized characters and the second time period corresponding to each of the speaking segments, from the voice recognition Determining speaker transition point characters in the text; performing speaker segmentation on the speech recognition text based on the speaker transition point characters.
  • Example 3 provides the method of Example 2, according to the first period corresponding to each of the recognized characters and the second period corresponding to each of the speech segments Period, determining the speaker conversion point characters from the speech recognition text, comprising: inputting the speech recognition text into a pre-trained semantic model, and obtaining that each recognized character in the speech recognition text belongs to a semantic breakpoint probability; for each of the second time periods, the end time or start time of the second time period is extended back and forth to obtain the speaker conversion interval corresponding to the second time period; in each of the recognized characters, the corresponding A character whose preset time is within the speaker conversion interval corresponding to the second period is determined as a candidate character for a conversion point, wherein the preset time is between the start time of the first period and the end time of the first period One of; Determine the pause duration of each of the conversion point candidate characters corresponding to the second period; According to the pause duration of each of the conversion point candidate characters corresponding to the second period and the conversion point candidate characters belong to the
  • Example 4 provides the method of Example 3, wherein determining the speaker transition point character from the transition point candidate characters corresponding to the second time period according to the pause duration of each transition point candidate character and the probability that the transition point candidate character is a semantic break point includes: for each transition point candidate character corresponding to the second time period, determining the weighted sum of the pause duration of the transition point candidate character and the probability that the transition point candidate character is a semantic break point as the probability that the transition point candidate character is a speaker transition point; and determining, among the transition point candidate characters corresponding to the second time period, the transition point candidate character with the highest probability of being a speaker transition point as the speaker transition point character.
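  • Example 5 provides the method of Example 1, wherein performing speaker segmentation on the target audio data to obtain the second time period corresponding to each speech segment in the target audio data includes: inputting the target audio data into a pre-trained speaker recognition model to perform speaker segmentation on the target audio data, obtaining the second time period corresponding to each speech segment in the target audio data.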
  • Example 6 provides the method of any one of Examples 1-5, the method further including: for each sentence in the sentence segmentation result, splitting the sentence semantically to obtain multiple clauses.
  • Example 7 provides the method of Example 6, and the method further includes: generating subtitle text corresponding to the target audio data according to a plurality of clauses.
  • Example 8 provides a sentence segmentation apparatus, including: an acquisition module, configured to acquire target audio data; an extraction module, configured to extract the speech recognition text corresponding to the target audio data acquired by the acquisition module and the first time period in the target audio data corresponding to each recognized character in the speech recognition text; a first segmentation module, configured to perform speaker segmentation on the target audio data acquired by the acquisition module to obtain the second time period corresponding to each speech segment in the target audio data; and a second segmentation module, configured to perform speaker segmentation on the speech recognition text according to the first time period corresponding to each recognized character extracted by the extraction module and the second time period corresponding to each speech segment obtained by the first segmentation module, to obtain a sentence segmentation result.
  • Example 9 provides the apparatus of Example 8, wherein the second segmentation module includes: a first determination submodule, configured to determine speaker transition point characters from the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment; and a segmentation submodule, configured to perform speaker segmentation on the speech recognition text using the speaker transition point characters as the splitting basis.
  • Example 10 provides the apparatus of Example 9, wherein the first determination submodule includes: a second determination submodule, configured to input the speech recognition text into a pre-trained semantic model to obtain the probability that each recognized character in the speech recognition text is a semantic break point; an extension submodule, configured to, for each second time period, extend the end time or the start time of the second time period forward and backward to obtain the speaker transition interval corresponding to the second time period; a third determination submodule, configured to determine, among the recognized characters, each character whose preset time lies within the speaker transition interval corresponding to the second time period as a transition point candidate character, wherein the preset time is one of the start time of the first time period and the end time of the first time period; a fourth determination submodule, configured to determine the pause duration of each transition point candidate character corresponding to the second time period; and a fifth determination submodule, configured to determine the speaker transition point character from the transition point candidate characters corresponding to the second time period according to the pause duration of each transition point candidate character and the probability that the transition point candidate character is a semantic break point.
  • Example 11 provides the apparatus of Example 10, wherein the fifth determination submodule includes: a sixth determination submodule, configured to, for each transition point candidate character corresponding to the second time period, determine the weighted sum of the pause duration of the transition point candidate character and the probability that the transition point candidate character is a semantic break point as the probability that the transition point candidate character is a speaker transition point; and a seventh determination submodule, configured to determine, among the transition point candidate characters corresponding to the second time period, the transition point candidate character with the highest probability of being a speaker transition point as the speaker transition point character.
  • Example 12 provides the apparatus of Example 8, wherein the first segmentation module is configured to input the target audio data into a pre-trained speaker recognition model to perform speaker segmentation on the target audio data, obtaining the second time period corresponding to each speech segment in the target audio data.
  • Example 13 provides the apparatus of any one of Examples 8-12, the apparatus further including: a third segmentation module, configured to, for each sentence in the sentence segmentation result, split the sentence semantically to obtain multiple clauses.
  • Example 14 provides the apparatus of Example 13, the apparatus further comprising: a generation module configured to generate subtitle text corresponding to the target audio data according to multiple clauses.
  • Example 15 provides a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processing device, implements the steps of the method of any one of Examples 1-7.
  • Example 16 provides an electronic device, including: a storage device on which one or more computer programs are stored; and one or more processing devices, configured to execute the one or more computer programs in the storage device to implement the steps of the method of any one of Examples 1-7.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed are a sentence segmentation method and apparatus, a storage medium, and an electronic device. The method comprises: acquiring target audio data (S101); extracting a speech recognition text corresponding to the target audio data, and a first time period in the target audio data corresponding to each recognized character in the speech recognition text (S102); performing speaker segmentation on the target audio data to obtain a second time period corresponding to each speech segment in the target audio data (S103); and performing speaker segmentation on the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment to obtain a sentence segmentation result (S104).

Description

Sentence segmentation method and apparatus, storage medium, and electronic device
This disclosure claims priority to Chinese patent application No. 202111327536.0, filed on November 10, 2021 and entitled "Sentence segmentation method and apparatus, storage medium, and electronic device", the entire content of which is incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of speech recognition, and in particular to a sentence segmentation method and apparatus, a storage medium, and an electronic device.
Background
In speech recognition applications for video subtitling, the recognized text must be segmented into sentences so that it can be displayed screen by screen. To keep subtitles readable, a single sentence is usually required to contain only one speaker, so that no single screen of subtitles contains the speech of different speakers. Conventional sentence segmentation methods rely only on the semantic information of the speech recognition text and split at semantic turning points. This works well for single-speaker videos, but for multi-speaker dialogue videos, relying on semantic information alone gives poor splits at speaker transitions, so that a single sentence may contain the speech of several speakers.
Summary
This section is provided to introduce, in simplified form, concepts that are described in detail in the Detailed Description below. It is not intended to identify key features or essential features of the claimed technical solution, nor to limit the scope of the claimed technical solution.
In a first aspect, the present disclosure provides a sentence segmentation method, including:
acquiring target audio data;
extracting the speech recognition text corresponding to the target audio data and the first time period in the target audio data corresponding to each recognized character in the speech recognition text;
performing speaker segmentation on the target audio data to obtain the second time period corresponding to each speech segment in the target audio data;
performing speaker segmentation on the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment, to obtain a sentence segmentation result.
In a second aspect, the present disclosure provides a sentence segmentation apparatus, including:
an acquisition module, configured to acquire target audio data;
an extraction module, configured to extract the speech recognition text corresponding to the target audio data acquired by the acquisition module and the first time period in the target audio data corresponding to each recognized character in the speech recognition text;
a first segmentation module, configured to perform speaker segmentation on the target audio data acquired by the acquisition module to obtain the second time period corresponding to each speech segment in the target audio data;
a second segmentation module, configured to perform speaker segmentation on the speech recognition text according to the first time period corresponding to each recognized character extracted by the extraction module and the second time period corresponding to each speech segment obtained by the first segmentation module, to obtain a sentence segmentation result.
In a third aspect, the present disclosure provides a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processing device, implements the steps of the method provided in the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device, including:
a storage device on which one or more computer programs are stored;
one or more processing devices, configured to execute the one or more computer programs in the storage device to implement the steps of the method provided in the first aspect of the present disclosure.
In the above technical solution, target audio data is acquired; the speech recognition text corresponding to the target audio data, and the first time period in the target audio data corresponding to each recognized character in the speech recognition text, are extracted; meanwhile, speaker segmentation is performed on the target audio data to obtain the second time period corresponding to each speech segment in the target audio data; then, speaker segmentation is performed on the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment, to obtain a sentence segmentation result. In this way, the speaker time-period information and the time period in the target audio data corresponding to each character of the speech recognition text can be used effectively to perform speaker segmentation on the speech recognition text, so that the text is split reasonably and effectively at speaker transitions, a single sentence does not contain the speech of multiple speakers, and the sentence segmentation result is improved.
Other features and advantages of the present disclosure are described in detail in the Detailed Description below.
Brief Description of the Drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale. In the drawings:
Fig. 1 is a flowchart of a sentence segmentation method according to an exemplary embodiment.
Fig. 2 is a flowchart of a sentence segmentation method according to another exemplary embodiment.
Fig. 3 is a flowchart of a sentence segmentation method according to another exemplary embodiment.
Fig. 4 is a block diagram of a sentence segmentation apparatus according to an exemplary embodiment.
Fig. 5 is a block diagram of an electronic device according to an exemplary embodiment.
Detailed Description
Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the steps described in the method embodiments of the present disclosure may be performed in a different order and/or in parallel. In addition, method embodiments may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this respect.
As used herein, the term "include" and its variants are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one further embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are used only to distinguish different apparatuses, modules, or units, and are not used to limit the order of, or the interdependence between, the functions performed by these apparatuses, modules, or units.
It should be noted that the modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be read as "one or more".
The names of messages or information exchanged between multiple apparatuses in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
Fig. 1 is a flowchart of a sentence segmentation method according to an exemplary embodiment. As shown in Fig. 1, the method may include S101 to S104.
In S101, target audio data is acquired.
In the present disclosure, the target audio data may include speech segments from multiple speakers. For example, the target audio data may be a recording of a multi-speaker conversation, or an audio clip from a video of a multi-speaker dialogue scene.
In S102, the speech recognition text corresponding to the target audio data and the first time period in the target audio data corresponding to each recognized character in the speech recognition text are extracted.
In the present disclosure, automatic speech recognition (ASR) may be used to perform speech recognition on the target audio data, to obtain the speech recognition text and the start time and end time in the target audio data corresponding to each character of the speech recognition text (referred to here as a recognized character), i.e., the first time period.
In S103, speaker segmentation is performed on the target audio data to obtain the second time period corresponding to each speech segment in the target audio data.
In the present disclosure, performing speaker segmentation on the target audio data means detecting the speaker transition points in the target audio data and treating the speech between two adjacent speaker transition points as one speech segment.
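As an illustration only, the relationship between detected speaker transition points and speech segments can be sketched as follows. The function name, the use of seconds as the time unit, and the treatment of the start and end of the audio as outer boundaries are assumptions made for this sketch and are not specified in the disclosure.

```python
from typing import List, Tuple


def segments_from_transition_points(
    transition_points: List[float], audio_duration: float
) -> List[Tuple[float, float]]:
    """Return the (start, end) second time period of each speech segment.

    Assumption: the start (0.0) and end of the audio act as outer boundaries,
    so the speech before the first transition point and after the last one
    also forms a segment.
    """
    boundaries = [0.0] + sorted(transition_points) + [audio_duration]
    # Each pair of adjacent boundaries delimits one speech segment;
    # zero-length segments are dropped.
    return [(s, e) for s, e in zip(boundaries, boundaries[1:]) if e > s]
```

With transition points at 2.5 s and 6.0 s in 9 s of audio, this yields three speech segments.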
In S104, speaker segmentation is performed on the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment, to obtain a sentence segmentation result.
In the above technical solution, target audio data is acquired; the speech recognition text corresponding to the target audio data, and the first time period in the target audio data corresponding to each recognized character in the speech recognition text, are extracted; meanwhile, speaker segmentation is performed on the target audio data to obtain the second time period corresponding to each speech segment in the target audio data; then, speaker segmentation is performed on the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment, to obtain a sentence segmentation result. In this way, the speaker time-period information and the time period in the target audio data corresponding to each character of the speech recognition text can be used effectively to perform speaker segmentation on the speech recognition text, so that the text is split reasonably and effectively at speaker transitions, a single sentence does not contain the speech of multiple speakers, and the sentence segmentation result is improved.
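The flow of S101 to S104 can be sketched, under stated assumptions, as a small pipeline. The class names `RecognizedChar` and `SpeechSegment` and the injected callables `recognize`, `diarize`, and `split_by_speaker` are hypothetical placeholders for the ASR step, the speaker segmentation step, and the text-splitting step; the disclosure does not define such an interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class RecognizedChar:
    char: str     # one character of the speech recognition text
    start: float  # start of its first time period
    end: float    # end of its first time period


@dataclass
class SpeechSegment:
    start: float  # start of the second time period
    end: float    # end of the second time period


def segment_sentences(
    audio: Any,
    recognize: Callable[[Any], List[RecognizedChar]],
    diarize: Callable[[Any], List[SpeechSegment]],
    split_by_speaker: Callable[[List[RecognizedChar], List[SpeechSegment]], List[str]],
) -> List[str]:
    """Run S102 to S104 on already-acquired audio (S101)."""
    chars = recognize(audio)    # S102: text plus per-character first time periods
    segments = diarize(audio)   # S103: second time period of each speech segment
    return split_by_speaker(chars, segments)  # S104: sentence segmentation result
```

Passing the three steps in as callables keeps the sketch independent of any particular ASR or speaker segmentation model.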
The following describes in detail how, in S103, speaker segmentation is performed on the target audio data to obtain the second time period corresponding to each speech segment in the target audio data. Speaker segmentation may be performed in several ways. In one implementation, manually entered segmentation marks may be received and used to perform speaker segmentation on the target audio data, thereby determining the start time and end time of each speech segment in the target audio data, i.e., the second time period corresponding to each speech segment in the target audio data.
In another implementation, the target audio data may be input into a pre-trained speaker recognition model to perform speaker segmentation on the target audio data, obtaining the second time period corresponding to each speech segment in the target audio data. In this way, the speech segments of different speakers are separated automatically, which is convenient and fast and improves the efficiency of the subsequent sentence segmentation of the speech recognition text.
The following describes in detail how, in S104, speaker segmentation is performed on the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment. This may be done through the following steps (1) and (2):
(1) Determine speaker transition point characters from the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment.
(2) Perform speaker segmentation on the speech recognition text using the speaker transition point characters as the splitting basis.
Here, each speaker transition point character belongs to the preceding sentence.
For example, suppose the speech recognition text is "你吃饭了吗已经吃过了你呢我还没" (roughly, "Have you eaten / I already have, and you / Not yet") and the speaker transition point characters are "吗" and "呢". The resulting sentences are then "你吃饭了吗", "已经吃过了你呢", and "我还没".
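Step (2) can be sketched as below, assuming the speaker transition point characters have already been located and are given by their indices in the text. The function name and the index-based interface are assumptions made for illustration.

```python
from typing import List


def split_at_transition_chars(text: str, transition_indices: List[int]) -> List[str]:
    """Split text after each speaker transition point character."""
    sentences, start = [], 0
    for idx in sorted(set(transition_indices)):
        # The transition point character stays with the preceding sentence.
        sentences.append(text[start:idx + 1])
        start = idx + 1
    if start < len(text):
        sentences.append(text[start:])  # remaining text after the last transition
    return sentences


text = "你吃饭了吗已经吃过了你呢我还没"
# "吗" is at index 4 and "呢" is at index 11.
print(split_at_transition_chars(text, [4, 11]))
# ['你吃饭了吗', '已经吃过了你呢', '我还没']
```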
The following describes in detail how, in step (1), speaker transition point characters are determined from the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment. In one implementation, for each second time period, the recognized character in the speech recognition text whose first time period contains the end time of that second time period may be determined as a speaker transition point character.
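A minimal sketch of this simple implementation is given below. Time periods are represented as `(start, end)` tuples; the function name, the inclusive comparison at both ends, and taking the first matching character are assumptions, since the disclosure does not address boundary cases.

```python
from typing import List, Tuple

Period = Tuple[float, float]  # (start, end)


def transition_indices_by_containment(
    char_periods: List[Period], segment_periods: List[Period]
) -> List[int]:
    """For each second time period, find the recognized character whose
    first time period contains that second time period's end time."""
    indices = []
    for _, seg_end in segment_periods:
        for i, (c_start, c_end) in enumerate(char_periods):
            if c_start <= seg_end <= c_end:  # assumption: inclusive bounds
                indices.append(i)
                break  # assumption: take the first matching character
    return indices
```

If a segment ends during a pause between two characters, no character contains the end time and that segment contributes no transition point character in this sketch.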
In another implementation, the speaker transition point characters may be determined through the following steps (21) to (25):
(21) Input the speech recognition text into a pre-trained semantic model to obtain the probability that each recognized character in the speech recognition text is a semantic break point, where a higher probability indicates that the recognized character is more likely to be a semantic break point.
(22) For each second time period, extend the end time or the start time of the second time period forward and backward to obtain the speaker transition interval corresponding to the second time period.
In one implementation, the end time of each second time period is extended in both directions. Specifically, for each second time period, the end time is extended earlier by N ms and later by M ms, giving the speaker transition interval [end_time-N, end_time+M] corresponding to the second time period, where end_time is the end time of the second time period.
In another implementation, the start time of each second time period is extended in both directions. Specifically, for each second time period, the start time is extended earlier by N ms and later by M ms, giving the speaker transition interval [start_time-N, start_time+M] corresponding to the second time period, where start_time is the start time of the second time period.
It should be noted that M and N may or may not be equal; the present disclosure does not specifically limit this.
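Step (22) can be sketched as follows, with all times in milliseconds. The function names and the `use_end` flag that switches between the two implementations are assumptions; the sketch does not clamp the interval to the bounds of the audio.

```python
from typing import List, Tuple


def transition_interval(boundary_ms: int, n_ms: int, m_ms: int) -> Tuple[int, int]:
    """Return [boundary - N, boundary + M] for one segment boundary."""
    return (boundary_ms - n_ms, boundary_ms + m_ms)


def transition_intervals(
    segments: List[Tuple[int, int]], n_ms: int, m_ms: int, use_end: bool = True
) -> List[Tuple[int, int]]:
    """Speaker transition interval for each second time period.

    use_end=True extends end_time; use_end=False extends start_time.
    """
    return [
        transition_interval(end if use_end else start, n_ms, m_ms)
        for start, end in segments
    ]
```

For example, a segment ending at 5000 ms with N = 200 and M = 300 gives the interval [4800, 5300].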
(23) Among the recognized characters, determine each character whose preset time lies within the speaker transition interval corresponding to the second time period as a transition point candidate character.
In the present disclosure, the preset time is one of the start time of the first time period and the end time of the first time period. In one implementation, each recognized character whose first time period starts within the speaker transition interval corresponding to the second time period may be determined as a transition point candidate character.
In another implementation, each recognized character whose first time period ends within the speaker transition interval corresponding to the second time period may be determined as a transition point candidate character.
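A sketch of step (23), covering both choices of preset time, is shown below. The function name, the `use_start` flag, and the inclusive interval bounds (following the closed-interval notation used for the speaker transition interval) are assumptions.

```python
from typing import List, Tuple


def candidate_indices(
    char_periods: List[Tuple[int, int]],
    interval: Tuple[int, int],
    use_start: bool = False,
) -> List[int]:
    """Indices of recognized characters whose preset time lies in the interval.

    The preset time is the start of the first time period if use_start is
    True, otherwise its end.
    """
    lo, hi = interval
    return [
        i
        for i, (start, end) in enumerate(char_periods)
        if lo <= (start if use_start else end) <= hi
    ]
```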
(24) Determine the pause duration of each transition point candidate character corresponding to the second time period.
In the present disclosure, the pause duration of a transition point candidate character is the time interval between the end time of that candidate character and the start time of the recognized character that immediately follows it in the speech recognition text.
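The pause duration defined above can be computed as in this sketch. Returning 0 for the last character of the text is an assumption, since the disclosure does not say how a character with no following character is handled.

```python
from typing import List, Tuple


def pause_duration(char_periods: List[Tuple[int, int]], index: int) -> int:
    """Gap between the end of character `index` and the start of the next one."""
    if index + 1 >= len(char_periods):
        return 0  # assumption: no following character means no measurable pause
    next_start = char_periods[index + 1][0]
    current_end = char_periods[index][1]
    return next_start - current_end
```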
(25) Determine the speaker transition point character from the transition point candidate characters corresponding to the second time period according to the pause duration of each transition point candidate character and the probability that the transition point candidate character is a semantic break point.
Specifically, for each transition point candidate character corresponding to the second time period, the weighted sum of the pause duration of the candidate character and the probability that the candidate character is a semantic break point is first determined as the probability that the candidate character is a speaker transition point; then, among the transition point candidate characters corresponding to the second time period, the candidate character with the highest probability of being a speaker transition point is determined as the speaker transition point character.
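Step (25) can be sketched as a weighted score followed by an argmax. The weights `w_pause` and `w_sem`, their default values, and the use of seconds for the pause duration are assumptions; the disclosure specifies neither the weights nor any normalization of the pause duration.

```python
from typing import Dict, List, Optional


def pick_transition_char(
    candidates: List[int],
    pauses: Dict[int, float],          # pause duration per candidate index (seconds assumed)
    semantic_probs: Dict[int, float],  # semantic break point probability per candidate index
    w_pause: float = 0.5,              # hypothetical weight
    w_sem: float = 0.5,                # hypothetical weight
) -> Optional[int]:
    """Return the candidate index with the largest weighted sum, or None."""
    if not candidates:
        return None

    def score(i: int) -> float:
        # Weighted sum of pause duration and semantic break point probability.
        return w_pause * pauses[i] + w_sem * semantic_probs[i]

    return max(candidates, key=score)
```

With pauses of 0.05, 0.40, and 0.10 s and semantic probabilities of 0.2, 0.7, and 0.9 for candidates 3, 4, and 5, equal weights give scores of 0.125, 0.55, and 0.5, so candidate 4 is selected even though candidate 5 has the higher semantic probability.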
Fig. 2 is a flowchart of a sentence segmentation method according to another exemplary embodiment. As shown in Fig. 2, the above method further includes the following S105.
In S105, for each sentence in the sentence segmentation result, the sentence is split semantically to obtain multiple clauses.
In this way, the speech recognition text can be segmented precisely according to both speaker and semantic information.
Fig. 3 is a flowchart of a sentence segmentation method according to another exemplary embodiment. As shown in Fig. 3, the above method further includes the following S106.
In S106, subtitle text corresponding to the target audio data is generated according to the multiple clauses.
Because each clause is obtained by segmenting the speech recognition text precisely according to speaker and semantic information, the text is split reasonably and effectively at speaker transitions, so that no single screen of subtitles contains the speech of different speakers, which improves the user experience.
Fig. 4 is a block diagram of a sentence segmentation apparatus according to an exemplary embodiment. As shown in Fig. 4, the apparatus 400 includes:
an acquisition module 401, configured to acquire target audio data;
an extraction module 402, configured to extract the speech recognition text corresponding to the target audio data acquired by the acquisition module 401, and the first period in the target audio data corresponding to each recognized character in the speech recognition text;
a first segmentation module 403, configured to perform speaker segmentation on the target audio data acquired by the acquisition module 401, to obtain a second period corresponding to each speech segment in the target audio data;
a second segmentation module 404, configured to perform speaker segmentation on the speech recognition text according to the first period corresponding to each recognized character extracted by the extraction module 402 and the second period corresponding to each speech segment obtained by the first segmentation module 403, to obtain a sentence segmentation result.
In the above technical solution, target audio data is acquired; the speech recognition text corresponding to the target audio data is extracted, together with the first period in the target audio data corresponding to each recognized character in the speech recognition text; meanwhile, speaker segmentation is performed on the target audio data to obtain a second period corresponding to each speech segment in the target audio data; then, speaker segmentation is performed on the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment, to obtain a sentence segmentation result. The speaker period information and the period in the target audio data corresponding to each character of the speech recognition text are thus both used to perform speaker segmentation on the speech recognition text, so that speaker transitions are split reasonably and effectively, a single sentence does not contain speech from multiple speakers, and the segmentation quality is improved.
In one implementation, the first segmentation module 403 is configured to perform speaker segmentation on the target audio data by receiving manually input segmentation marks, thereby determining the start time and end time of each speech segment in the target audio data, that is, the second period corresponding to each speech segment in the target audio data.
In another implementation, the second segmentation module 404 includes:
a first determining submodule, configured to determine a speaker transition point character from the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment;
a segmentation submodule, configured to perform speaker segmentation on the speech recognition text using the speaker transition point character as the split point.
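The role of the segmentation submodule can be illustrated with a short sketch. It assumes, for illustration only, that each speaker transition point character is the last character spoken before the speaker changes and that transition points are given as character indices into the speech recognition text; the function name `split_at_transition_points` is hypothetical.

```python
def split_at_transition_points(text, transition_indices):
    """Split text after each transition point character index.

    Assumption: a transition point character closes the sentence of the
    previous speaker, so the cut is made immediately after it.
    """
    sentences = []
    start = 0
    for index in sorted(set(transition_indices)):
        if index < start or index >= len(text):
            continue  # ignore indices outside the remaining text
        sentences.append(text[start:index + 1])
        start = index + 1
    if start < len(text):
        sentences.append(text[start:])
    return sentences
```

For example, `split_at_transition_points("abcdefg", [2, 4])` yields three pieces, one per speaker turn.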
这样,可以自动分割出不同说话人的说话片段,方便快捷,从而提升了后续语音识别文本的分句效率。In this way, speech fragments of different speakers can be automatically segmented, which is convenient and quick, thereby improving the efficiency of sentence segmentation of subsequent speech recognition texts.
In some embodiments, the first determining submodule includes:
a second determining submodule, configured to input the speech recognition text into a pre-trained semantic model to obtain the probability that each recognized character in the speech recognition text is a semantic break point;
an extension submodule, configured to, for each second period, extend the end time or the start time of the second period forward and backward to obtain the speaker transition interval corresponding to the second period;
a third determining submodule, configured to determine, as transition point candidate characters, those recognized characters whose corresponding preset time lies within the speaker transition interval corresponding to the second period, where the preset time is one of the start time of the first period and the end time of the first period;
a fourth determining submodule, configured to determine the pause duration of each transition point candidate character corresponding to the second period;
a fifth determining submodule, configured to determine the speaker transition point character from the transition point candidate characters corresponding to the second period, according to the pause duration of each such candidate character and the probability that the candidate character is a semantic break point.
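The extension, candidate selection, and pause computation performed by these submodules can be sketched as below. Several details are assumptions made for the example: the symmetric `margin` used to extend the boundary, the representation of each recognized character as a `(char, start, end)` tuple, and the definition of pause duration as the silent gap between a character's end time and the next character's start time.

```python
def transition_interval(boundary_time, margin=0.5):
    """Extend a segment boundary forward and backward by `margin` seconds."""
    return (boundary_time - margin, boundary_time + margin)


def candidate_indices(chars, interval, use_end_time=True):
    """Indices of characters whose preset time lies inside the interval.

    `chars` is a list of (char, start, end) tuples; the preset time is
    either the end or the start of the character's first period.
    """
    low, high = interval
    selected = []
    for i, (_, start, end) in enumerate(chars):
        preset = end if use_end_time else start
        if low <= preset <= high:
            selected.append(i)
    return selected


def pause_after(chars, i):
    """Assumed pause duration: gap between character i and the next one."""
    if i + 1 >= len(chars):
        return 0.0
    return max(0.0, chars[i + 1][1] - chars[i][2])
```

The candidates and pause durations produced here would then be scored together with the semantic break probabilities to choose the speaker transition point character.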
In some embodiments, the fifth determining submodule includes:
a sixth determining submodule, configured to, for each transition point candidate character corresponding to the second period, determine the weighted sum of the candidate character's pause duration and its probability of being a semantic break point as the probability that the candidate character is a speaker transition point;
a seventh determining submodule, configured to determine, as the speaker transition point character, the transition point candidate character with the highest probability of being a speaker transition point among the candidates corresponding to the second period.
In some embodiments, the first segmentation module 403 is configured to input the target audio data into a pre-trained speaker recognition model to perform speaker segmentation on the target audio data, to obtain the second period corresponding to each speech segment in the target audio data.
In some embodiments, the apparatus 400 further includes:
a third segmentation module, configured to segment each sentence in the sentence segmentation result according to semantics to obtain multiple clauses.
In some embodiments, the apparatus 400 further includes:
a generating module, configured to generate subtitle text corresponding to the target audio data from the multiple clauses.
The present disclosure also provides a computer-readable medium on which a computer program is stored; when the program is executed by a processing device, the steps of the sentence segmentation method provided by the present disclosure are implemented.
下面参考图5,其示出了适于用来实现本公开实施例的电子设备(例如终端设备或服务器)500的结构示意图。本公开实施例中的终端设备可以包括但不限于诸如移动电话、笔记本电脑、数字广播接收器、PDA(个人数字助理)、PAD(平板电脑)、PMP(便携式多媒体播放器)、车载终端(例如车载导航终端)等等的移动终端以及诸如数字TV、台式计算机等等的固定终端。图5示出的电子设备仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。Referring now to FIG. 5 , it shows a schematic structural diagram of an electronic device (such as a terminal device or a server) 500 suitable for implementing an embodiment of the present disclosure. The terminal equipment in the embodiment of the present disclosure may include but not limited to such as mobile phone, notebook computer, digital broadcast receiver, PDA (personal digital assistant), PAD (tablet computer), PMP (portable multimedia player), vehicle terminal (such as mobile terminals such as car navigation terminals) and fixed terminals such as digital TVs, desktop computers and the like. The electronic device shown in FIG. 5 is only an example, and should not limit the functions and scope of use of the embodiments of the present disclosure.
As shown in Fig. 5, the electronic device 500 may include a processing device (for example, a central processing unit or a graphics processing unit) 501, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the electronic device 500. The processing device 501, the ROM 502, and the RAM 503 are connected to one another through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
Typically, the following devices may be connected to the I/O interface 505: an input device 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, or gyroscope; an output device 507 including, for example, a liquid crystal display (LCD), speaker, or vibrator; a storage device 508 including, for example, a magnetic tape or hard disk; and a communication device 509. The communication device 509 may allow the electronic device 500 to communicate wirelessly or by wire with other devices to exchange data. Although Fig. 5 shows the electronic device 500 with various devices, it should be understood that not all of the illustrated devices need to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
特别地,根据本公开的实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在非暂态计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信装置509从网络上被下载和安装,或者从存储装置508被安装,或者从ROM 502被安装。在该计算机程序被处理装置501执行时,执行本公开实施例的方法中限定的上述功能。In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer readable medium, where the computer program includes program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via communication means 509, or from storage means 508, or from ROM 502. When the computer program is executed by the processing device 501, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
It should be noted that the computer-readable medium described in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave and carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any appropriate medium, including but not limited to wire, optical cable, RF (radio frequency), or any suitable combination of the above.
In some implementations, the client and the server can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (for example, the Internet), peer-to-peer networks (for example, ad hoc peer-to-peer networks), and any currently known or future-developed network.
上述计算机可读介质可以是上述电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。The above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.
The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire target audio data; extract the speech recognition text corresponding to the target audio data and the first period in the target audio data corresponding to each recognized character in the speech recognition text; perform speaker segmentation on the target audio data to obtain a second period corresponding to each speech segment in the target audio data; and perform speaker segmentation on the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment, to obtain a sentence segmentation result.
Computer program code for carrying out the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
附图中的流程图和框图,图示了按照本公开各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,该模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more logical functions for implementing specified executable instructions. It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or operations , or may be implemented by a combination of dedicated hardware and computer instructions.
描述于本公开实施例中所涉及到的模块可以通过软件的方式实现,也可以通过硬件的方式来实现。其中,模块的名称在某种情况下并不构成对该模块本身的限定,例如,获取模块还可以被描述为“获取目标音频数据的模块”。The modules involved in the embodiments described in the present disclosure may be implemented by software or by hardware. Wherein, the name of the module does not constitute a limitation of the module itself under certain circumstances, for example, the obtaining module may also be described as "a module for obtaining target audio data".
The functions described above may be performed at least in part by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include: field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SoCs), complex programmable logic devices (CPLDs), and so on.
在本公开的上下文中,机器可读介质可以是有形的介质,其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体系统、装置或设备,或者上述内容的任何合适组合。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include one or more wire-based electrical connections, portable computer discs, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, Example 1 provides a sentence segmentation method, including: acquiring target audio data; extracting the speech recognition text corresponding to the target audio data and the first period in the target audio data corresponding to each recognized character in the speech recognition text; performing speaker segmentation on the target audio data to obtain a second period corresponding to each speech segment in the target audio data; and performing speaker segmentation on the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment, to obtain a sentence segmentation result.
According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, wherein performing speaker segmentation on the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment includes: determining a speaker transition point character from the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment; and performing speaker segmentation on the speech recognition text using the speaker transition point character as the split point.
According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 2, wherein determining the speaker transition point character from the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment includes: inputting the speech recognition text into a pre-trained semantic model to obtain the probability that each recognized character in the speech recognition text is a semantic break point; for each second period, extending the end time or the start time of the second period forward and backward to obtain the speaker transition interval corresponding to the second period; determining, as transition point candidate characters, those recognized characters whose corresponding preset time lies within the speaker transition interval corresponding to the second period, where the preset time is one of the start time of the first period and the end time of the first period; determining the pause duration of each transition point candidate character corresponding to the second period; and determining the speaker transition point character from the transition point candidate characters corresponding to the second period, according to the pause duration of each such candidate character and the probability that the candidate character is a semantic break point.
According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 3, wherein determining the speaker transition point character from the transition point candidate characters corresponding to the second period, according to the pause duration of each such candidate character and the probability that the candidate character is a semantic break point, includes: for each transition point candidate character corresponding to the second period, determining the weighted sum of the candidate character's pause duration and its probability of being a semantic break point as the probability that the candidate character is a speaker transition point; and determining, as the speaker transition point character, the transition point candidate character with the highest probability of being a speaker transition point among the candidates corresponding to the second period.
According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 1, wherein performing speaker segmentation on the target audio data to obtain the second period corresponding to each speech segment in the target audio data includes:
inputting the target audio data into a pre-trained speaker recognition model to perform speaker segmentation on the target audio data, to obtain the second period corresponding to each speech segment in the target audio data.
According to one or more embodiments of the present disclosure, Example 6 provides the method of any one of Examples 1-5, the method further including: segmenting each sentence in the sentence segmentation result according to semantics to obtain multiple clauses.
According to one or more embodiments of the present disclosure, Example 7 provides the method of Example 6, the method further including: generating subtitle text corresponding to the target audio data from the multiple clauses.
According to one or more embodiments of the present disclosure, Example 8 provides a sentence segmentation apparatus, including: an acquisition module, configured to acquire target audio data; an extraction module, configured to extract the speech recognition text corresponding to the target audio data acquired by the acquisition module, and the first period in the target audio data corresponding to each recognized character in the speech recognition text; a first segmentation module, configured to perform speaker segmentation on the target audio data acquired by the acquisition module, to obtain a second period corresponding to each speech segment in the target audio data; and a second segmentation module, configured to perform speaker segmentation on the speech recognition text according to the first period corresponding to each recognized character extracted by the extraction module and the second period corresponding to each speech segment obtained by the first segmentation module, to obtain a sentence segmentation result.
According to one or more embodiments of the present disclosure, Example 9 provides the apparatus of Example 8, wherein the second segmentation module includes: a first determining submodule, configured to determine a speaker transition point character from the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment; and a segmentation submodule, configured to perform speaker segmentation on the speech recognition text using the speaker transition point character as the split point.
根据本公开的一个或多个实施例,示例10提供了示例9的装置,所述第一确定子模块包括:第二确定子模块,用于将所述语音识别文本输入到预先训练好的语义模型中,得到所述语音识别文本中每一识别字符属于语义断句点的概率;扩展子模块,用于针对每一所述第二时段,对该第二时段的结束时间或开始时间做前后扩展,得到该第二时段对应的说话人转换区间;第三确定子模块,用于将每一所述识别字符中、所对应的预设时间位于该第二时段对应的说话人转换区间内的字符确定为转换点候选字符,其中,所述预设时间为所述第一时 段的开始时间、所述第一时段的结束时间中的一者;第四确定子模块,用于确定该第二时段对应的每一所述转换点候选字符的停顿时长;第五确定子模块,用于根据该第二时段对应的每一所述转换点候选字符的停顿时长和该转换点候选字符属于语义断句点的概率,从该第二时段对应的转换点候选字符中确定说话人转换点字符。According to one or more embodiments of the present disclosure, Example 10 provides the apparatus of Example 9, the first determination submodule includes: a second determination submodule, configured to input the speech recognition text into the pre-trained semantic In the model, the probability that each recognized character in the speech recognition text belongs to a semantic break point is obtained; the extension submodule is used to expand the end time or start time of the second period for each second period , to obtain the speaker conversion interval corresponding to the second time period; the third determination submodule is used to identify characters whose corresponding preset time is within the speaker conversion interval corresponding to the second time period among each of the recognized characters Determined as a conversion point candidate character, wherein the preset time is one of the start time of the first period and the end time of the first period; the fourth determining submodule is used to determine the second period The corresponding pause duration of each of the conversion point candidate characters; the fifth determination submodule is used for according to the pause duration of each of the conversion point candidate characters corresponding to the second period and the conversion point candidate characters belong to the semantic break point The probability of the speaker transition point is determined from the transition point candidate characters corresponding to the second time period.
According to one or more embodiments of the present disclosure, Example 11 provides the apparatus of Example 10, wherein the fifth determination submodule includes: a sixth determination submodule, configured to, for each transition point candidate character corresponding to that second period, determine the weighted sum of the pause duration of the candidate character and the probability that the candidate character is a semantic sentence-break point as the probability that the candidate character is a speaker transition point; and a seventh determination submodule, configured to determine, among the transition point candidate characters corresponding to that second period, the candidate character with the highest probability of being a speaker transition point as the speaker transition point character.
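The weighted-sum scoring and maximum selection can be sketched in a few lines of Python. The function name `pick_transition_point` and the default weights of 0.5 are hypothetical; this passage does not give weight values, nor does it say whether the pause duration is normalized before being combined with the probability.

```python
from typing import Sequence


def pick_transition_point(
    candidates: Sequence[int],
    pauses: Sequence[float],
    break_probs: Sequence[float],
    w_pause: float = 0.5,  # assumed weight, not taken from the source
    w_sem: float = 0.5,    # assumed weight, not taken from the source
) -> int:
    """Return the candidate character index with the highest weighted score.

    The score is w_pause * pause duration + w_sem * semantic break probability.
    """
    if not candidates:
        raise ValueError("no transition point candidates for this second period")
    scores = [w_pause * p + w_sem * q for p, q in zip(pauses, break_probs)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]
```

With hypothetical candidates `[3, 4, 5]`, pauses `[0.0, 0.5, 0.25]`, and break probabilities `[0.5, 0.25, 0.75]`, equal weights select character 5, whereas weighting only the pause selects character 4. This shows how the weights trade acoustic evidence against semantic evidence.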
According to one or more embodiments of the present disclosure, Example 12 provides the apparatus of Example 8, wherein the first segmentation module is configured to input the target audio data into a pre-trained speaker recognition model to perform speaker segmentation on the target audio data and obtain the second period corresponding to each speech segment in the target audio data.
According to one or more embodiments of the present disclosure, Example 13 provides the apparatus of any one of Examples 8-12, the apparatus further including: a third segmentation module, configured to, for each sentence in the sentence segmentation result, segment that sentence semantically to obtain multiple clauses.
According to one or more embodiments of the present disclosure, Example 14 provides the apparatus of Example 13, the apparatus further including: a generation module, configured to generate subtitle text corresponding to the target audio data according to the multiple clauses.
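As an illustration of the generation module, the sketch below turns timed clauses into subtitle text. The SubRip (SRT) layout is an assumption chosen for concreteness; the source only says that subtitle text is generated from the clauses and does not name a format. The helper names `format_timestamp` and `to_srt`, and the `(text, start, end)` tuple shape, are likewise hypothetical.

```python
from typing import Iterable, Tuple


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(clauses: Iterable[Tuple[str, float, float]]) -> str:
    """Render (text, start_seconds, end_seconds) clauses as SRT subtitle text.

    Assumption: each clause already carries the start time of its first
    character and the end time of its last character.
    """
    blocks = []
    for n, (text, start, end) in enumerate(clauses, start=1):
        blocks.append(
            f"{n}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"
        )
    # SRT entries are separated by a blank line.
    return "\n".join(blocks)
```

Because each clause carries its own time span, clauses produced by the semantic segmentation of Example 13 map one-to-one onto subtitle entries in this sketch.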
According to one or more embodiments of the present disclosure, Example 15 provides a computer-readable medium storing a computer program which, when executed by a processing device, implements the steps of the method of any one of Examples 1-7.
According to one or more embodiments of the present disclosure, Example 16 provides an electronic device, including: a storage device storing one or more computer programs; and one or more processing devices, configured to execute the one or more computer programs in the storage device to implement the steps of the method of any one of Examples 1-7.
The above description is merely of preferred embodiments of the present disclosure and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.
In addition, although the operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although the above discussion contains several specific implementation details, these should not be construed as limitations on the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments, separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims. Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.

Claims (10)

  1. A sentence segmentation method, comprising:
    acquiring target audio data;
    extracting the speech recognition text corresponding to the target audio data and, for each recognized character in the speech recognition text, the first period corresponding to that character in the target audio data;
    performing speaker segmentation on the target audio data to obtain a second period corresponding to each speech segment in the target audio data; and
    performing speaker segmentation on the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment, to obtain a sentence segmentation result.
  2. The method according to claim 1, wherein performing speaker segmentation on the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment comprises:
    determining speaker transition point characters from the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment; and
    performing speaker segmentation on the speech recognition text, using the speaker transition point characters as the splitting basis.
  3. The method according to claim 2, wherein determining speaker transition point characters from the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment comprises:
    inputting the speech recognition text into a pre-trained semantic model to obtain the probability that each recognized character in the speech recognition text is a semantic sentence-break point;
    for each second period, extending the end time or the start time of that second period backward and forward to obtain the speaker transition interval corresponding to that second period;
    determining, among the recognized characters, each character whose preset time falls within the speaker transition interval corresponding to that second period as a transition point candidate character, wherein the preset time is one of the start time of the first period and the end time of the first period;
    determining the pause duration of each transition point candidate character corresponding to that second period; and
    determining the speaker transition point character from the transition point candidate characters corresponding to that second period, according to the pause duration of each such candidate character and the probability that the candidate character is a semantic sentence-break point.
  4. The method according to claim 3, wherein determining the speaker transition point character from the transition point candidate characters corresponding to that second period, according to the pause duration of each such candidate character and the probability that the candidate character is a semantic sentence-break point, comprises:
    for each transition point candidate character corresponding to that second period, determining the weighted sum of the pause duration of the candidate character and the probability that the candidate character is a semantic sentence-break point as the probability that the candidate character is a speaker transition point; and
    determining, among the transition point candidate characters corresponding to that second period, the candidate character with the highest probability of being a speaker transition point as the speaker transition point character.
  5. The method according to claim 1, wherein performing speaker segmentation on the target audio data to obtain a second period corresponding to each speech segment in the target audio data comprises:
    inputting the target audio data into a pre-trained speaker recognition model to perform speaker segmentation on the target audio data and obtain the second period corresponding to each speech segment in the target audio data.
  6. The method according to any one of claims 1-5, wherein the method further comprises:
    for each sentence in the sentence segmentation result, segmenting that sentence semantically to obtain multiple clauses.
  7. The method according to claim 6, wherein the method further comprises:
    generating subtitle text corresponding to the target audio data according to the multiple clauses.
  8. A sentence segmentation apparatus, comprising:
    an acquisition module, configured to acquire target audio data;
    an extraction module, configured to extract the speech recognition text corresponding to the target audio data acquired by the acquisition module and, for each recognized character in the speech recognition text, the first period corresponding to that character in the target audio data;
    a first segmentation module, configured to perform speaker segmentation on the target audio data acquired by the acquisition module to obtain a second period corresponding to each speech segment in the target audio data; and
    a second segmentation module, configured to perform speaker segmentation on the speech recognition text according to the first period corresponding to each recognized character extracted by the extraction module and the second period corresponding to each speech segment obtained by the first segmentation module, to obtain a sentence segmentation result.
  9. A computer-readable medium storing a computer program, wherein the program, when executed by a processing device, implements the steps of the method according to any one of claims 1-7.
  10. An electronic device, comprising:
    a storage device storing one or more computer programs; and
    one or more processing devices, configured to execute the one or more computer programs in the storage device to implement the steps of the method according to any one of claims 1-7.
PCT/CN2022/130352 2021-11-10 2022-11-07 Sentence segmentation method and apparatus, storage medium, and electronic device WO2023083142A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111327536.0A CN113889113A (en) 2021-11-10 2021-11-10 Sentence dividing method and device, storage medium and electronic equipment
CN202111327536.0 2021-11-10

Publications (1)

Publication Number Publication Date
WO2023083142A1 true WO2023083142A1 (en) 2023-05-19

Family

ID=79017794

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/130352 WO2023083142A1 (en) 2021-11-10 2022-11-07 Sentence segmentation method and apparatus, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN113889113A (en)
WO (1) WO2023083142A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117975949A (en) * 2024-03-28 2024-05-03 杭州威灿科技有限公司 Event recording method, device, equipment and medium based on voice conversion
CN117975949B (en) * 2024-03-28 2024-06-07 杭州威灿科技有限公司 Event recording method, device, equipment and medium based on voice conversion

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113889113A (en) * 2021-11-10 2022-01-04 北京有竹居网络技术有限公司 Sentence dividing method and device, storage medium and electronic equipment
CN114554238B (en) * 2022-02-23 2023-08-11 北京有竹居网络技术有限公司 Live broadcast voice simultaneous transmission method, device, medium and electronic equipment
CN117201876A (en) * 2022-05-31 2023-12-08 北京字跳网络技术有限公司 Subtitle generation method, subtitle generation device, electronic device, storage medium, and program
CN117113974B (en) * 2023-04-26 2024-05-24 荣耀终端有限公司 Text segmentation method, device, chip, electronic equipment and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107578770A (en) * 2017-08-31 2018-01-12 百度在线网络技术(北京)有限公司 Networking telephone audio recognition method, device, computer equipment and storage medium
US20190156832A1 (en) * 2017-11-21 2019-05-23 International Business Machines Corporation Diarization Driven by the ASR Based Segmentation
CN111916053A (en) * 2020-08-17 2020-11-10 北京字节跳动网络技术有限公司 Voice generation method, device, equipment and computer readable medium
CN112201275A (en) * 2020-10-09 2021-01-08 深圳前海微众银行股份有限公司 Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium
CN113225612A (en) * 2021-04-14 2021-08-06 新东方教育科技集团有限公司 Subtitle generating method and device, computer readable storage medium and electronic equipment
CN113393845A (en) * 2021-06-11 2021-09-14 上海明略人工智能(集团)有限公司 Method and device for speaker recognition, electronic equipment and readable storage medium
CN113889113A (en) * 2021-11-10 2022-01-04 北京有竹居网络技术有限公司 Sentence dividing method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN113889113A (en) 2022-01-04

Similar Documents

Publication Publication Date Title
WO2023083142A1 (en) Sentence segmentation method and apparatus, storage medium, and electronic device
US11917344B2 (en) Interactive information processing method, device and medium
WO2023029904A1 (en) Text content matching method and apparatus, electronic device, and storage medium
CN112115706A (en) Text processing method and device, electronic equipment and medium
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
JP6681450B2 (en) Information processing method and device
WO2023125374A1 (en) Image processing method and apparatus, electronic device, and storage medium
CN111368559A (en) Voice translation method and device, electronic equipment and storage medium
WO2022037419A1 (en) Audio content recognition method and apparatus, and device and computer-readable medium
CN111883107B (en) Speech synthesis and feature extraction model training method, device, medium and equipment
US20240029709A1 (en) Voice generation method and apparatus, device, and computer readable medium
US20240119073A1 (en) Information processing method and apparatus, device and readable storage medium
WO2023142913A1 (en) Video processing method and apparatus, readable medium and electronic device
WO2023005729A1 (en) Speech information processing method and apparatus, and electronic device
WO2023071578A1 (en) Text-voice alignment method and apparatus, device and medium
CN110379406B (en) Voice comment conversion method, system, medium and electronic device
WO2022037383A1 (en) Voice processing method and apparatus, electronic device, and computer readable medium
WO2023143107A1 (en) Character recognition method and apparatus, device, and medium
WO2023000782A1 (en) Method and apparatus for acquiring video hotspot, readable medium, and electronic device
CN115967833A (en) Video generation method, device and equipment meter storage medium
CN111460214B (en) Classification model training method, audio classification method, device, medium and equipment
CN112699687A (en) Content cataloging method and device and electronic equipment
CN113223496A (en) Voice skill testing method, device and equipment
CN111582708A (en) Medical information detection method, system, electronic device and computer-readable storage medium
CN112542157A (en) Voice processing method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22891936

Country of ref document: EP

Kind code of ref document: A1