WO2023083142A1

WO2023083142A1 - Sentence segmentation method and apparatus, storage medium, and electronic device

Info

Publication number: WO2023083142A1
Application number: PCT/CN2022/130352
Authority: WO
Inventors: 孙修松; 刘艺; 何怡; 马泽君
Original assignee: 北京有竹居网络技术有限公司
Priority date: 2021-11-10
Filing date: 2022-11-07
Publication date: 2023-05-19
Also published as: CN113889113A

Abstract

Disclosed are a sentence segmentation method and apparatus, a storage medium, and an electronic device. The method comprises: acquiring target audio data (S101); extracting a speech recognition text corresponding to the target audio data, and a first time period in the target audio data corresponding to each recognized character in the speech recognition text (S102); performing speaker segmentation on the target audio data to obtain a second time period corresponding to each speech segment in the target audio data (S103); and performing speaker segmentation on the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment to obtain a sentence segmentation result (S104).

Description

Sentence segmentation method, apparatus, storage medium, and electronic device

This disclosure claims priority to Chinese patent application No. 202111327536.0, filed on November 10, 2021 and entitled "Sentence Segmentation Method, Apparatus, Storage Medium, and Electronic Device", the entire content of which is incorporated by reference in this disclosure.

Technical field

The present disclosure relates to the technical field of speech recognition, and in particular to a sentence segmentation method, apparatus, storage medium, and electronic device.

Background

In speech recognition applications for video subtitling, the recognized text must be segmented into clauses for screen-by-screen display. To keep subtitles readable, a single clause is usually required to contain only one speaker, so that one screen of subtitles never mixes content from different speakers. Conventional sentence segmentation methods rely only on the semantic information of the speech recognition text and split at semantic break points. This works well for single-speaker videos, but in multi-speaker dialogue videos, using semantic information alone leads to poor segmentation at speaker transitions, so that a single clause may contain the speech of several speakers.

Summary

This section is provided to introduce concepts in a simplified form that are described in detail later in the Detailed Description. This part of the content is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to be used to limit the scope of the claimed technical solution.

In a first aspect, the present disclosure provides a sentence segmentation method, including:

acquiring target audio data;

extracting the speech recognition text corresponding to the target audio data, and a first period in the target audio data corresponding to each recognized character in the speech recognition text;

performing speaker segmentation on the target audio data to obtain a second period corresponding to each speech segment in the target audio data;

performing speaker segmentation on the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment, to obtain a sentence segmentation result.

In a second aspect, the present disclosure provides a sentence segmentation apparatus, including:

an acquisition module, configured to acquire target audio data;

an extraction module, configured to extract the speech recognition text corresponding to the target audio data acquired by the acquisition module, and a first period in the target audio data corresponding to each recognized character in the speech recognition text;

a first segmentation module, configured to perform speaker segmentation on the target audio data acquired by the acquisition module, to obtain a second period corresponding to each speech segment in the target audio data;

a second segmentation module, configured to perform speaker segmentation on the speech recognition text according to the first period corresponding to each recognized character extracted by the extraction module and the second period corresponding to each speech segment obtained by the first segmentation module, to obtain a sentence segmentation result.

In a third aspect, the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the method provided in the first aspect of the present disclosure are implemented.

In a fourth aspect, the present disclosure provides an electronic device, including:

a storage device on which one or more computer programs are stored; and

one or more processing devices configured to execute the one or more computer programs in the storage device to implement the steps of the method provided in the first aspect of the present disclosure.

In the above technical solution, the target audio data is acquired; the speech recognition text corresponding to the target audio data is extracted, together with the first period in the target audio data corresponding to each recognized character in the speech recognition text; speaker segmentation is also performed on the target audio data to obtain the second period corresponding to each speech segment in the target audio data; and speaker segmentation is then performed on the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment, to obtain a sentence segmentation result. In this way, the speaker segment information and the period in the target audio data corresponding to each character of the speech recognition text are used effectively to segment the speech recognition text by speaker, so that the text is split reasonably and effectively at speaker transition points, a single clause does not contain the content of multiple speakers, and the quality of sentence segmentation improves.

Other features and advantages of the present disclosure will be described in detail in the detailed description that follows.

Description of the drawings

The above and other features, advantages and aspects of the various embodiments of the present disclosure will become more apparent with reference to the following detailed description in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale. In the drawings:

Fig. 1 is a flow chart of a sentence segmentation method according to an exemplary embodiment.

Fig. 2 is a flow chart of a sentence segmentation method according to another exemplary embodiment.

Fig. 3 is a flow chart of a sentence segmentation method according to another exemplary embodiment.

Fig. 4 is a block diagram of a sentence segmentation apparatus according to an exemplary embodiment.

Fig. 5 is a block diagram of an electronic device according to an exemplary embodiment.

Detailed description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only, and are not intended to limit the protection scope of the present disclosure.

It should be understood that the various steps described in the method implementations of the present disclosure may be executed in different orders, and/or executed in parallel. Additionally, method embodiments may include additional steps and/or omit performing illustrated steps. The scope of the present disclosure is not limited in this regard.

As used herein, the term "comprise" and its variations are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one further embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.

It should be noted that concepts such as "first" and "second" mentioned in this disclosure are only used to distinguish different devices, modules or units, and are not used to limit the order of, or the interdependence between, the functions performed by these devices, modules or units.

It should be noted that the modifiers "one" and "multiple" mentioned in the present disclosure are illustrative and not restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".

The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are used for illustrative purposes only, and are not used to limit the scope of these messages or information.

Fig. 1 is a flow chart of a sentence segmentation method according to an exemplary embodiment. As shown in FIG. 1, the method may include S101 to S104.

In S101, target audio data is acquired.

In the present disclosure, the target audio data may include speech segments from multiple speakers. For example, the target audio data may be a recording of a multi-speaker dialogue, or the audio track of a video of a multi-speaker dialogue scene.

In S102, the speech recognition text corresponding to the target audio data is extracted, together with the first period in the target audio data corresponding to each recognized character in the speech recognition text.

In the present disclosure, automatic speech recognition (ASR) can be used to perform speech recognition on the target audio data, so as to obtain the speech recognition text and, for each character in the speech recognition text (referred to herein as a recognized character), the corresponding start time and end time in the target audio data, that is, the first period.
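As an illustration of the data produced by S102, the following minimal sketch represents each recognized character together with its first period. The class name `RecognizedChar`, the field names, the millisecond unit, and the sample values are assumptions made for this example; the disclosure does not prescribe a data layout or a particular ASR engine.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RecognizedChar:
    """One recognized character and its first period (hypothetical layout)."""
    char: str
    start_ms: int  # start time of the character in the target audio data
    end_ms: int    # end time of the character in the target audio data


def recognized_text(chars: List[RecognizedChar]) -> str:
    """Rebuild the speech recognition text from its recognized characters."""
    return "".join(c.char for c in chars)


# Hypothetical ASR output, not taken from the disclosure.
asr_output = [
    RecognizedChar("h", 0, 180),
    RecognizedChar("i", 180, 400),
    RecognizedChar("!", 400, 650),
]
```

Keeping the timestamps next to each character makes the later comparison against the second periods a simple lookup.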

In S103, speaker segmentation is performed on the target audio data to obtain a second time period corresponding to each speech segment in the target audio data.

In the present disclosure, performing speaker segmentation on the target audio refers to detecting a speaker transition point in the target audio data, and taking the speech between two adjacent speaker transition points as a speech segment.
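The relationship between speaker transition points and speech segments can be sketched as follows. This is only an illustration: the function name `segments_from_transitions` is hypothetical, and treating the start and end of the audio as outer boundaries is an assumption not stated in the disclosure.

```python
from typing import List, Tuple

Period = Tuple[int, int]  # (start_ms, end_ms)


def segments_from_transitions(
    transitions_ms: List[int], audio_start_ms: int, audio_end_ms: int
) -> List[Period]:
    """Turn detected speaker transition points into speech segments.

    The speech between two adjacent boundaries is one speech segment, and
    its (start, end) pair plays the role of a second period.
    Assumption: the audio start and end also act as boundaries.
    """
    inner = sorted(t for t in transitions_ms if audio_start_ms < t < audio_end_ms)
    bounds = [audio_start_ms] + inner + [audio_end_ms]
    # Skip zero-length segments caused by duplicate transition points.
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]
```

For instance, transition points at 1200 ms and 3000 ms in a 5000 ms recording give three speech segments.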

In S104, according to the first period corresponding to each recognized character and the second period corresponding to each utterance segment, perform speaker segmentation on the speech recognition text to obtain sentence segmentation results.

The following describes in detail how S103 performs speaker segmentation on the target audio data to obtain the second period corresponding to each speech segment in the target audio data. Speaker segmentation can be implemented in several ways. In one implementation, manually entered segmentation marks are received and used to segment the target audio data by speaker, thereby determining the start time and end time of each speech segment, that is, the second period corresponding to each speech segment in the target audio data.

In another embodiment, the target audio data may be input into a pre-trained speaker recognition model to perform speaker segmentation on the target audio data to obtain the second period corresponding to each speech segment in the target audio data. In this way, speech fragments of different speakers can be automatically segmented, which is convenient and quick, thereby improving the efficiency of sentence segmentation of subsequent speech recognition texts.

The specific implementation manner of performing speaker segmentation on the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each utterance segment in S104 will be described in detail below. Specifically, it can be achieved through the following steps (1) and (2):

(1) Determine the speaker transition point characters in the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment.

(2) Perform speaker segmentation on the speech recognition text based on the speaker transition point characters.

Each speaker transition point character belongs to the preceding clause.

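For example, suppose the speech recognition text is "Have you eaten? Have you eaten yet? I haven't." and the speaker transition point characters are the two "?" characters. Splitting the text after each speaker transition point character then yields one clause per speaker turn, the last of which is "I haven't."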
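Step (2), together with the rule that a transition point character belongs to the preceding clause, can be sketched as a split performed after each transition index. The function name `split_after_transitions` and the use of character indices to identify transition point characters are assumptions made for this illustration.

```python
from typing import Iterable, List


def split_after_transitions(text: str, transition_indices: Iterable[int]) -> List[str]:
    """Split text so that each transition point character ends its clause."""
    clauses: List[str] = []
    start = 0
    for idx in sorted(set(transition_indices)):
        if start <= idx < len(text):
            # The transition point character at idx stays in the preceding clause.
            clauses.append(text[start:idx + 1])
            start = idx + 1
    if start < len(text):
        clauses.append(text[start:])  # trailing text after the last transition
    return clauses
```

With indices 1 and 3 on the text "abcdef", the clauses are "ab", "cd", and "ef".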

The following describes in detail how step (1) determines the speaker transition point characters in the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment. In one implementation, for each second period, the recognized character in the speech recognition text whose first period contains the end time of that second period is determined as a speaker transition point character.
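A minimal sketch of this first implementation is given below, assuming periods are (start, end) pairs in milliseconds and that interval bounds are inclusive; the function name `transition_indices_by_end_time` is hypothetical.

```python
from typing import List, Tuple

Period = Tuple[int, int]  # (start_ms, end_ms)


def transition_indices_by_end_time(
    first_periods: List[Period], second_periods: List[Period]
) -> List[int]:
    """Return indices of recognized characters whose first period contains
    the end time of some second period (bounds assumed inclusive)."""
    indices: List[int] = []
    for _seg_start, seg_end in second_periods:
        for i, (c_start, c_end) in enumerate(first_periods):
            if c_start <= seg_end <= c_end:
                indices.append(i)
                break  # at most one transition character per second period
    return indices
```

In this sketch, an end time that falls in a pause between two characters matches no character, so that second period contributes no transition point.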

In another embodiment, the speaker transition point characters can be determined through the following steps (21) to (25):

(21) Input the speech recognition text into a pre-trained semantic model to obtain the probability that each recognized character in the speech recognition text is a semantic break point. The greater this probability for a recognized character, the more likely that character is a semantic break point.

(22) For each second period, extend the end time or the start time of the second period both forward and backward to obtain the speaker transition interval corresponding to the second period.

In one implementation, the end time of each second period is extended forward and backward. Specifically, for each second period, the end time is extended forward by N ms and backward by M ms to obtain the speaker transition interval [end_time-N, end_time+M] corresponding to the second period, where end_time is the end time of the second period.

In another implementation, the start time of each second period is extended forward and backward. Specifically, for each second period, the start time is extended forward by N ms and backward by M ms to obtain the speaker transition interval [start_time-N, start_time+M] corresponding to the second period, where start_time is the start time of the second period.

Note that M and N may or may not be equal; the present disclosure places no specific limitation on this.
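Step (22) amounts to simple interval arithmetic, sketched below for both variants. The function name `transition_intervals` and the `use_end` flag are hypothetical; the values of N and M in the usage are arbitrary.

```python
from typing import List, Tuple

Period = Tuple[int, int]  # (start_ms, end_ms)


def transition_intervals(
    second_periods: List[Period], n_ms: int, m_ms: int, use_end: bool = True
) -> List[Period]:
    """Build the speaker transition interval for each second period.

    use_end=True  gives [end_time - N, end_time + M];
    use_end=False gives [start_time - N, start_time + M].
    """
    intervals: List[Period] = []
    for start_time, end_time in second_periods:
        anchor = end_time if use_end else start_time
        intervals.append((anchor - n_ms, anchor + m_ms))
    return intervals
```

With N = 200 and M = 300, a second period ending at 3000 ms gives the interval [2800, 3300]. The lower bound can be negative when the start-time variant is applied to a segment that begins at 0 ms; the sketch leaves it unclamped.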

(23) Among the recognized characters, determine each character whose corresponding preset time falls within the speaker transition interval corresponding to the second period as a transition point candidate character.

In the present disclosure, the preset time is one of the start time of the first period and the end time of the first period. In one implementation, among the recognized characters, each character whose first period starts within the speaker transition interval corresponding to the second period may be determined as a transition point candidate character.

In another implementation, among the recognized characters, each character whose first period ends within the speaker transition interval corresponding to the second period may be determined as a transition point candidate character.
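The candidate selection of step (23) can be sketched as a filter over the first periods. The function name `candidate_indices`, the `use_start` flag, and the inclusive interval bounds are assumptions for this illustration.

```python
from typing import List, Tuple

Period = Tuple[int, int]  # (start_ms, end_ms)


def candidate_indices(
    first_periods: List[Period], interval: Period, use_start: bool = True
) -> List[int]:
    """Indices of recognized characters whose preset time lies in the interval.

    The preset time is the start time of the first period when use_start is
    True, and the end time otherwise. Bounds are assumed inclusive.
    """
    lo, hi = interval
    result: List[int] = []
    for i, (start_ms, end_ms) in enumerate(first_periods):
        preset = start_ms if use_start else end_ms
        if lo <= preset <= hi:
            result.append(i)
    return result
```

The two choices of preset time can select different candidates for the same interval, as the checks on a four-character example would show.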

(24) Determine the pause duration of each transition point candidate character corresponding to the second period.

In the present disclosure, the pause duration of a transition point candidate character equals the time interval between the end time of that candidate character and the start time of the recognized character that immediately follows it in the speech recognition text.
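The pause duration of step (24) is a difference of two timestamps, as the sketch below shows. Returning 0 for the last character and clamping negative gaps (overlapping timestamps) to 0 are assumptions of this example; the disclosure does not address those cases.

```python
from typing import List, Tuple

Period = Tuple[int, int]  # (start_ms, end_ms)


def pause_duration_ms(first_periods: List[Period], index: int) -> int:
    """Pause after the character at `index`: the start time of the next
    recognized character minus the end time of this character."""
    if index + 1 >= len(first_periods):
        return 0  # assumption: no following character means no pause
    next_start = first_periods[index + 1][0]
    this_end = first_periods[index][1]
    return max(0, next_start - this_end)  # assumption: clamp overlaps to 0
```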

(25) According to the pause duration of each transition point candidate character corresponding to the second period and the probability that the candidate character is a semantic break point, determine the speaker transition point character from among the transition point candidate characters corresponding to the second period.

Specifically, for each transition point candidate character corresponding to the second period, the weighted sum of its pause duration and its probability of being a semantic break point can be taken as the probability that the candidate character is a speaker transition point; then, among the transition point candidate characters corresponding to the second period, the candidate with the highest probability of being a speaker transition point is determined as the speaker transition point character.
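The selection in step (25) can be sketched as a weighted sum followed by an arg-max. The weights below are arbitrary placeholders, since the disclosure gives no values, and the function name `pick_transition_char` is hypothetical. Because an unnormalized weighted sum need not lie in [0, 1], the sketch calls it a score.

```python
from typing import Dict, List, Optional


def pick_transition_char(
    candidates: List[int],
    pause_ms: Dict[int, float],
    semantic_prob: Dict[int, float],
    w_pause: float = 0.001,    # hypothetical weight per millisecond of pause
    w_semantic: float = 1.0,   # hypothetical weight of the semantic probability
) -> Optional[int]:
    """Return the candidate index with the highest weighted-sum score,
    or None when the second period has no candidates."""
    if not candidates:
        return None

    def score(i: int) -> float:
        return w_pause * pause_ms[i] + w_semantic * semantic_prob[i]

    return max(candidates, key=score)
```

A long pause can outweigh a slightly higher semantic probability, which is the point of combining the two signals; how the weights are balanced is a tuning decision outside the scope of this sketch.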

Fig. 2 is a flow chart of a sentence segmentation method according to another exemplary embodiment. As shown in FIG. 2, the above method further includes the following S105.

In S105, each clause in the sentence segmentation result is further segmented according to semantics to obtain multiple clauses.

In this way, the speech recognition text can be accurately divided into sentences according to the speaker and semantic information.
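The disclosure does not say how the semantic segmentation in S105 is carried out. One possible sketch, assuming per-character semantic break probabilities such as those of step (21) are available and that a fixed threshold (a hypothetical parameter) marks a break, is shown below; `split_by_semantics` is a hypothetical name.

```python
from typing import List


def split_by_semantics(
    clause: str, break_prob: List[float], threshold: float = 0.5
) -> List[str]:
    """Split one speaker-level clause after every character whose semantic
    break probability reaches the threshold (assumed rule, not from the source)."""
    if len(clause) != len(break_prob):
        raise ValueError("one probability is required per character")
    pieces: List[str] = []
    start = 0
    for i, p in enumerate(break_prob):
        if p >= threshold:
            pieces.append(clause[start:i + 1])
            start = i + 1
    if start < len(clause):
        pieces.append(clause[start:])
    return pieces
```

Because this step runs inside each speaker-level clause, the pieces it produces never cross a speaker transition.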

Fig. 3 is a flow chart of a sentence segmentation method according to another exemplary embodiment. As shown in FIG. 3, the above method further includes the following S106.

In S106, generate subtitle text corresponding to the target audio data according to the plurality of clauses.

Since each clause is obtained by accurately segmenting the speech recognition text according to speaker and semantic information, the text is split reasonably and effectively at speaker transitions. This prevents a single screen of subtitles from containing the speech of different speakers at the same time, and thus improves the user experience.
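The disclosure does not fix a subtitle format. As an illustration only, the sketch below assumes each clause carries the start time of its first character and the end time of its last character, and renders the clauses in the widely used SubRip (SRT) layout. The names `Cue`, `format_timestamp`, and `to_srt` are hypothetical.

```python
from typing import List, Tuple

Cue = Tuple[int, int, str]  # (start_ms, end_ms, clause text)


def format_timestamp(ms: int) -> str:
    """Format milliseconds as HH:MM:SS,mmm, the SRT timestamp layout."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(cues: List[Cue]) -> str:
    """Render one numbered subtitle block per clause, separated by blank lines."""
    blocks = []
    for number, (start_ms, end_ms, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start_ms)} --> {format_timestamp(end_ms)}"
        blocks.append(f"{number}\n{timing}\n{text}\n")
    return "\n".join(blocks)
```

Since every cue holds exactly one clause, each subtitle screen shows the speech of a single speaker.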

Fig. 4 is a block diagram of a sentence segmentation apparatus according to an exemplary embodiment. As shown in FIG. 4, the apparatus 400 includes:

an acquisition module 401, configured to acquire target audio data;

an extraction module 402, configured to extract the speech recognition text corresponding to the target audio data acquired by the acquisition module 401, and the first period in the target audio data corresponding to each recognized character in the speech recognition text;

a first segmentation module 403, configured to perform speaker segmentation on the target audio data acquired by the acquisition module 401, to obtain a second period corresponding to each speech segment in the target audio data;

a second segmentation module 404, configured to perform speaker segmentation on the speech recognition text according to the first period corresponding to each recognized character extracted by the extraction module 402 and the second period corresponding to each speech segment obtained by the first segmentation module 403, to obtain a sentence segmentation result.
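The way the four modules hand data to one another can be sketched as a small composition of callables. This is a structural illustration under assumed interfaces: the class name, the field names, and the function signatures are hypothetical, and the stubs in the usage stand in for real ASR and speaker segmentation components.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Period = Tuple[int, int]  # (start_ms, end_ms)


@dataclass
class SentenceSegmentationApparatus:
    acquire: Callable[[], bytes]                            # module 401
    extract: Callable[[bytes], Tuple[str, List[Period]]]    # module 402
    segment_audio: Callable[[bytes], List[Period]]          # module 403
    segment_text: Callable[[str, List[Period], List[Period]], List[str]]  # module 404

    def run(self) -> List[str]:
        audio = self.acquire()
        text, first_periods = self.extract(audio)
        second_periods = self.segment_audio(audio)
        return self.segment_text(text, first_periods, second_periods)


# Usage with stub modules (all values are made up for the example).
apparatus = SentenceSegmentationApparatus(
    acquire=lambda: b"",
    extract=lambda audio: ("abcd", [(0, 1), (1, 2), (2, 3), (3, 4)]),
    segment_audio=lambda audio: [(0, 2), (2, 4)],
    segment_text=lambda text, first, second: [text[:2], text[2:]],
)
```

Modules 402 and 403 both consume the output of module 401 and do not depend on each other, which is consistent with the statement that steps may run in a different order or in parallel.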

In one embodiment, the first segmentation module 403 is configured to perform speaker segmentation on the target audio data by receiving manually entered segmentation marks, so as to determine the start time and end time of each speech segment in the target audio data, that is, the second period corresponding to each speech segment in the target audio data.

In another implementation, the second segmentation module 404 includes:

a first determining submodule, configured to determine the speaker transition point characters in the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment;

a segmentation submodule, configured to perform speaker segmentation on the speech recognition text based on the speaker transition point characters.

In this way, speech fragments of different speakers can be automatically segmented, which is convenient and quick, thereby improving the efficiency of sentence segmentation of subsequent speech recognition texts.

In some embodiments, the first determining submodule includes:

a second determining submodule, configured to input the speech recognition text into the pre-trained semantic model to obtain the probability that each recognized character in the speech recognition text is a semantic break point;

an extension submodule, configured to, for each second period, extend the end time or the start time of the second period forward and backward to obtain the speaker transition interval corresponding to the second period;

a third determining submodule, configured to determine, among the recognized characters, each character whose corresponding preset time falls within the speaker transition interval corresponding to the second period as a transition point candidate character, where the preset time is one of the start time of the first period and the end time of the first period;

a fourth determining submodule, configured to determine the pause duration of each transition point candidate character corresponding to the second period;

a fifth determining submodule, configured to determine the speaker transition point character from among the transition point candidate characters corresponding to the second period, according to the pause duration of each transition point candidate character corresponding to the second period and the probability that the candidate character is a semantic break point.

In some embodiments, the fifth determining submodule includes:

a sixth determining submodule, configured to, for each transition point candidate character corresponding to the second period, take the weighted sum of the pause duration of the candidate character and the probability that the candidate character is a semantic break point as the probability that the candidate character is a speaker transition point;

a seventh determining submodule, configured to determine, among the transition point candidate characters corresponding to the second period, the candidate character with the highest probability of being a speaker transition point as the speaker transition point character.

In some embodiments, the first segmentation module 403 is configured to input the target audio data into a pre-trained speaker recognition model to perform speaker segmentation on the target audio data, to obtain the second period corresponding to each speech segment in the target audio data.

In some embodiments, the device 400 also includes:

a third segmentation module, configured to further segment each clause in the sentence segmentation result according to semantics to obtain multiple clauses.

In some embodiments, the device 400 also includes:

A generating module, configured to generate subtitle text corresponding to the target audio data according to multiple clauses.

The present disclosure also provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the above sentence segmentation method provided by the present disclosure are implemented.

Referring now to FIG. 5, it shows a schematic structural diagram of an electronic device (such as a terminal device or a server) 500 suitable for implementing an embodiment of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (such as car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 5 is only an example, and should not limit the functions and scope of use of the embodiments of the present disclosure.

As shown in FIG. 5, the electronic device 500 may include a processing device (such as a central processing unit or a graphics processing unit) 501, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data necessary for the operation of the electronic device 500. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

Typically, the following devices can be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 507 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage devices 508 including, for example, a magnetic tape and a hard disk; and a communication device 509. The communication device 509 may allow the electronic device 500 to communicate wirelessly or by wire with other devices to exchange data. While FIG. 5 shows the electronic device 500 having various devices, it is to be understood that implementing or having all of the devices shown is not a requirement. More or fewer devices may alternatively be implemented or provided.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer readable medium, where the computer program includes program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via communication means 509, or from storage means 508, or from ROM 502. When the computer program is executed by the processing device 501, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.

It should be noted that the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to, an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In the present disclosure, by contrast, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted by any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), or any suitable combination of the above.

In some embodiments, the client and the server can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include local area networks ("LANs"), wide area networks ("WANs"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.

The above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.

The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: acquire target audio data; extract the speech recognition text corresponding to the target audio data, and the first period in the target audio data corresponding to each recognized character in the speech recognition text; perform speaker segmentation on the target audio data to obtain the second period corresponding to each speech segment in the target audio data; and perform speaker segmentation on the speech recognition text according to the first period corresponding to each recognized character and the second period corresponding to each speech segment, to obtain a sentence segmentation result.

Computer program code for carrying out the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as "C" or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or may be implemented by a combination of dedicated hardware and computer instructions.

The modules involved in the embodiments described in the present disclosure may be implemented by software or by hardware. In some cases, the name of a module does not constitute a limitation on the module itself; for example, the acquisition module may also be described as "a module for acquiring target audio data".

The functions described herein above may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

According to one or more embodiments of the present disclosure, Example 1 provides a sentence segmentation method, including: acquiring target audio data; extracting the speech recognition text corresponding to the target audio data and the first time period, in the target audio data, corresponding to each recognized character in the speech recognition text; performing speaker segmentation on the target audio data to obtain the second time period corresponding to each speech segment in the target audio data; and performing speaker segmentation on the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment to obtain a sentence segmentation result.
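The data flow of Example 1 can be illustrated with a deliberately naive baseline. The sketch below is not the transition-point procedure of the later examples: it simply assigns each recognized character to the speech segment it overlaps most and joins consecutive characters that share a segment. The names `RecognizedChar`, `SpeechSegment`, and `segment_by_overlap` are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class RecognizedChar:
    char: str
    start: float  # start of the first time period, in seconds
    end: float    # end of the first time period, in seconds


@dataclass(frozen=True)
class SpeechSegment:
    start: float  # start of the second time period, in seconds
    end: float    # end of the second time period, in seconds


def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two intervals (0.0 if they are disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def segment_by_overlap(chars, segments):
    """Naive baseline (an assumption, not the disclosed method): give each
    character to the segment it overlaps most, then join consecutive
    characters that were given to the same segment."""
    def best_segment(c):
        return max(
            range(len(segments)),
            key=lambda i: overlap(c.start, c.end, segments[i].start, segments[i].end),
        )

    owners = [best_segment(c) for c in chars]
    sentences = []
    for _, run in groupby(zip(owners, chars), key=lambda pair: pair[0]):
        sentences.append("".join(c.char for _, c in run))
    return sentences
```

A baseline like this cuts wherever the audio-level boundary happens to fall, which is the kind of imprecision the candidate scoring in Examples 3 and 4 appears intended to reduce.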

According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, wherein performing speaker segmentation on the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment includes: determining speaker transition point characters from the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment; and performing speaker segmentation on the speech recognition text based on the speaker transition point characters.
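Once the speaker transition point characters are known, the segmentation step itself is a simple split. The following sketch assumes that each transition point character is the last character spoken before the speaker changes, so the cut falls immediately after it; the disclosure does not state this convention in this passage, and the function name is hypothetical.

```python
def split_at_transition_points(text, transition_indices):
    """Split `text` immediately after each character index in
    `transition_indices` (assumed convention: the transition point character
    closes the outgoing speaker's sentence)."""
    sentences, start = [], 0
    for idx in sorted(set(transition_indices)):
        if not 0 <= idx < len(text):
            raise ValueError(f"transition index {idx} is outside the text")
        sentences.append(text[start:idx + 1])
        start = idx + 1
    if start < len(text):
        sentences.append(text[start:])  # trailing text after the last cut
    return sentences
```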

According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 2, wherein determining speaker transition point characters from the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment includes: inputting the speech recognition text into a pre-trained semantic model to obtain the probability that each recognized character in the speech recognition text is a semantic break point; for each second time period, extending the end time or start time of the second time period backward and forward to obtain the speaker transition interval corresponding to the second time period; determining, among the recognized characters, each character whose preset time falls within the speaker transition interval corresponding to the second time period as a transition point candidate character, wherein the preset time is one of the start time of the first time period and the end time of the first time period; determining the pause duration of each transition point candidate character corresponding to the second time period; and determining a speaker transition point character from the transition point candidate characters corresponding to the second time period according to the pause duration of each transition point candidate character and the probability that the transition point candidate character is a semantic break point.
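The candidate-selection steps of Example 3 can be sketched as follows, under several stated assumptions: the end time of the second time period is the boundary being extended, the preset time is the end time of each character's first time period, the extension margin is a hypothetical parameter `margin`, and the pause duration is taken to be the silence between a character's end and the next character's start. None of these specifics is fixed by this passage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognizedChar:
    char: str
    start: float  # start of the first time period, in seconds
    end: float    # end of the first time period, in seconds


def transition_interval(boundary_time, margin):
    """Extend a segment boundary backward and forward by `margin` seconds."""
    return boundary_time - margin, boundary_time + margin


def candidates_with_pauses(chars, segment_end, margin=0.5):
    """Return (char_index, pause_seconds) for every character whose end time
    lies inside the transition interval around `segment_end`."""
    lo, hi = transition_interval(segment_end, margin)
    result = []
    for i, c in enumerate(chars):
        if lo <= c.end <= hi:
            # Assumed definition of pause: gap before the next character;
            # the final character is given a pause of 0.0.
            pause = chars[i + 1].start - c.end if i + 1 < len(chars) else 0.0
            result.append((i, max(0.0, pause)))
    return result
```

A wider `margin` tolerates more timing error in the speaker segmentation at the cost of more candidates to score.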

According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 3, wherein determining a speaker transition point character from the transition point candidate characters corresponding to the second time period according to the pause duration of each transition point candidate character and the probability that the transition point candidate character is a semantic break point includes: for each transition point candidate character corresponding to the second time period, determining the weighted sum of the pause duration of the transition point candidate character and the probability that the transition point candidate character is a semantic break point as the probability that the transition point candidate character is a speaker transition point; and determining, among the transition point candidate characters corresponding to the second time period, the transition point candidate character with the highest probability of being a speaker transition point as the speaker transition point character.
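The scoring rule of Example 4 is a weighted sum followed by an argmax. In the sketch below the weights `w_pause` and `w_semantic` are hypothetical, since this passage gives no values, and the pause duration is used as-is; whether it is normalized before weighting is not stated here.

```python
def transition_score(pause, break_prob, w_pause=0.5, w_semantic=0.5):
    """Weighted sum of pause duration and semantic break point probability."""
    return w_pause * pause + w_semantic * break_prob


def pick_transition_point(candidates, w_pause=0.5, w_semantic=0.5):
    """Return the character index of the best-scoring candidate.

    `candidates` is a sequence of (char_index, pause_seconds, break_prob)
    tuples for one second time period.
    """
    if not candidates:
        raise ValueError("no transition point candidates for this time period")
    best = max(
        candidates,
        key=lambda c: transition_score(c[1], c[2], w_pause, w_semantic),
    )
    return best[0]
```

With equal weights, a candidate with a long pause and a moderately high break probability can outrank one that is semantically stronger but followed by almost no silence.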

According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 1, wherein performing speaker segmentation on the target audio data to obtain the second time period corresponding to each speech segment in the target audio data includes:

inputting the target audio data into a pre-trained speaker recognition model to perform speaker segmentation on the target audio data, thereby obtaining the second time period corresponding to each speech segment in the target audio data.
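The passage does not describe the output format of the speaker recognition model. As one possible illustration, the sketch below assumes a hypothetical model that emits one speaker label per fixed-length audio frame (with `None` for silence) and shows how such labels could be merged into second time periods. The function name and the frame-label format are assumptions.

```python
from itertools import groupby


def frames_to_segments(frame_labels, frame_seconds):
    """Merge runs of identical per-frame speaker labels into
    (start_seconds, end_seconds, speaker) tuples; silence frames (None)
    are skipped but still advance the clock."""
    segments, frame_index = [], 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        if label is not None:
            segments.append(
                (frame_index * frame_seconds,
                 (frame_index + length) * frame_seconds,
                 label)
            )
        frame_index += length
    return segments
```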

According to one or more embodiments of the present disclosure, Example 6 provides the method of any one of Examples 1-5, the method further including: for each sentence in the sentence segmentation result, segmenting the sentence according to semantics to obtain a plurality of clauses.
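One simple way to segment a single-speaker sentence by semantics is to cut wherever the per-character semantic break point probability reaches a threshold. This passage does not specify how the semantic segmentation is carried out, so the thresholding rule, the `threshold` value, and the function name below are all assumptions.

```python
def split_by_semantics(sentence, break_probs, threshold=0.5):
    """Cut `sentence` after every character whose semantic break point
    probability is at least `threshold`."""
    if len(sentence) != len(break_probs):
        raise ValueError("need exactly one probability per character")
    clauses, start = [], 0
    for i, prob in enumerate(break_probs):
        if prob >= threshold:
            clauses.append(sentence[start:i + 1])
            start = i + 1
    if start < len(sentence):
        clauses.append(sentence[start:])  # text after the last break point
    return clauses
```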

According to one or more embodiments of the present disclosure, Example 7 provides the method of Example 6, the method further including: generating subtitle text corresponding to the target audio data according to the plurality of clauses.
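The disclosure does not name a subtitle format. As an illustration only, the sketch below renders clauses in the widely used SubRip (SRT) layout, assuming each clause carries a start and end time (for example, taken from the first time periods of its first and last characters). The function names are hypothetical.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(clauses):
    """Render (start_seconds, end_seconds, text) clauses as SRT cues,
    numbered from 1 and separated by blank lines."""
    blocks = []
    for number, (start, end, text) in enumerate(clauses, start=1):
        blocks.append(
            f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```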

According to one or more embodiments of the present disclosure, Example 8 provides a sentence segmentation device, including: an acquisition module, configured to acquire target audio data; an extraction module, configured to extract the speech recognition text corresponding to the target audio data acquired by the acquisition module and the first time period, in the target audio data, corresponding to each recognized character in the speech recognition text; a first segmentation module, configured to perform speaker segmentation on the target audio data acquired by the acquisition module to obtain the second time period corresponding to each speech segment in the target audio data; and a second segmentation module, configured to perform speaker segmentation on the speech recognition text according to the first time period corresponding to each recognized character extracted by the extraction module and the second time period corresponding to each speech segment obtained by the first segmentation module, to obtain a sentence segmentation result.
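The four modules of Example 8 can be pictured as pluggable components wired in sequence. The sketch below models each module as a callable so that real speech recognition and speaker segmentation components could be swapped in; the class name, field names, and type signatures are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Period = Tuple[float, float]  # (start_seconds, end_seconds)


@dataclass
class SentenceSegmentationDevice:
    # Acquisition module: returns the target audio data.
    acquire: Callable[[], bytes]
    # Extraction module: returns the text and one first time period per character.
    extract: Callable[[bytes], Tuple[str, List[Period]]]
    # First segmentation module: returns one second time period per speech segment.
    segment_speakers: Callable[[bytes], List[Period]]
    # Second segmentation module: splits the text using both sets of periods.
    segment_text: Callable[[str, Sequence[Period], Sequence[Period]], List[str]]

    def run(self) -> List[str]:
        audio = self.acquire()
        text, first_periods = self.extract(audio)
        second_periods = self.segment_speakers(audio)
        return self.segment_text(text, first_periods, second_periods)
```

Because the extraction module and the first segmentation module both depend only on the audio, an implementation could also run them concurrently.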

According to one or more embodiments of the present disclosure, Example 9 provides the device of Example 8, wherein the second segmentation module includes: a first determination submodule, configured to determine speaker transition point characters from the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment; and a segmentation submodule, configured to perform speaker segmentation on the speech recognition text based on the speaker transition point characters.

According to one or more embodiments of the present disclosure, Example 10 provides the device of Example 9, wherein the first determination submodule includes: a second determination submodule, configured to input the speech recognition text into a pre-trained semantic model to obtain the probability that each recognized character in the speech recognition text is a semantic break point; an extension submodule, configured to, for each second time period, extend the end time or start time of the second time period backward and forward to obtain the speaker transition interval corresponding to the second time period; a third determination submodule, configured to determine, among the recognized characters, each character whose preset time falls within the speaker transition interval corresponding to the second time period as a transition point candidate character, wherein the preset time is one of the start time of the first time period and the end time of the first time period; a fourth determination submodule, configured to determine the pause duration of each transition point candidate character corresponding to the second time period; and a fifth determination submodule, configured to determine a speaker transition point character from the transition point candidate characters corresponding to the second time period according to the pause duration of each transition point candidate character and the probability that the transition point candidate character is a semantic break point.

According to one or more embodiments of the present disclosure, Example 11 provides the device of Example 10, wherein the fifth determination submodule includes: a sixth determination submodule, configured to, for each transition point candidate character corresponding to the second time period, determine the weighted sum of the pause duration of the transition point candidate character and the probability that the transition point candidate character is a semantic break point as the probability that the transition point candidate character is a speaker transition point; and a seventh determination submodule, configured to determine, among the transition point candidate characters corresponding to the second time period, the transition point candidate character with the highest probability of being a speaker transition point as the speaker transition point character.

According to one or more embodiments of the present disclosure, Example 12 provides the device of Example 8, wherein the first segmentation module is configured to input the target audio data into a pre-trained speaker recognition model to perform speaker segmentation on the target audio data, thereby obtaining the second time period corresponding to each speech segment in the target audio data.

According to one or more embodiments of the present disclosure, Example 13 provides the device of any one of Examples 8-12, the device further including: a third segmentation module, configured to, for each sentence in the sentence segmentation result, segment the sentence according to semantics to obtain a plurality of clauses.

According to one or more embodiments of the present disclosure, Example 14 provides the device of Example 13, the device further including: a generation module, configured to generate subtitle text corresponding to the target audio data according to the plurality of clauses.

According to one or more embodiments of the present disclosure, Example 15 provides a computer-readable medium on which a computer program is stored, wherein when the program is executed by a processing device, the steps of the method of any one of Examples 1-7 are implemented.

According to one or more embodiments of the present disclosure, Example 16 provides an electronic device, including: a storage means on which one or more computer programs are stored; and one or more processing means for executing the one or more computer programs in the storage means to implement the steps of the method of any one of Examples 1-7.

The above description is only a preferred embodiment of the present disclosure and an illustration of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in this disclosure is not limited to technical solutions formed by the specific combination of the above-mentioned technical features, and also covers other technical solutions formed by any combination of the above-mentioned technical features or their equivalent features without departing from the above disclosed concept. An example is a technical solution formed by replacing the above-mentioned features with (but not limited to) technical features with similar functions disclosed in this disclosure.

In addition, while operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or performed in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while the above discussion contains several specific implementation details, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims. Regarding the apparatus in the foregoing embodiments, the specific manner in which each module executes operations has been described in detail in the embodiments related to the method, and will not be described in detail here.

Claims

A sentence segmentation method, comprising:

acquiring target audio data;

extracting the speech recognition text corresponding to the target audio data and a first time period, in the target audio data, corresponding to each recognized character in the speech recognition text;

performing speaker segmentation on the target audio data to obtain a second time period corresponding to each speech segment in the target audio data; and

performing speaker segmentation on the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment to obtain a sentence segmentation result.
The method according to claim 1, wherein performing speaker segmentation on the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment comprises:

determining speaker transition point characters from the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment; and

performing speaker segmentation on the speech recognition text based on the speaker transition point characters.
The method according to claim 2, wherein determining speaker transition point characters from the speech recognition text according to the first time period corresponding to each recognized character and the second time period corresponding to each speech segment comprises:

inputting the speech recognition text into a pre-trained semantic model to obtain the probability that each recognized character in the speech recognition text is a semantic break point;

for each second time period, extending the end time or start time of the second time period backward and forward to obtain a speaker transition interval corresponding to the second time period;

determining, among the recognized characters, each character whose preset time falls within the speaker transition interval corresponding to the second time period as a transition point candidate character, wherein the preset time is one of the start time of the first time period and the end time of the first time period;

determining the pause duration of each transition point candidate character corresponding to the second time period; and

determining a speaker transition point character from the transition point candidate characters corresponding to the second time period according to the pause duration of each transition point candidate character and the probability that the transition point candidate character is a semantic break point.
The method according to claim 3, wherein determining a speaker transition point character from the transition point candidate characters corresponding to the second time period according to the pause duration of each transition point candidate character and the probability that the transition point candidate character is a semantic break point comprises:

for each transition point candidate character corresponding to the second time period, determining the weighted sum of the pause duration of the transition point candidate character and the probability that the transition point candidate character is a semantic break point as the probability that the transition point candidate character is a speaker transition point; and

determining, among the transition point candidate characters corresponding to the second time period, the transition point candidate character with the highest probability of being a speaker transition point as the speaker transition point character.
The method according to claim 1, wherein performing speaker segmentation on the target audio data to obtain a second time period corresponding to each speech segment in the target audio data comprises:

inputting the target audio data into a pre-trained speaker recognition model to perform speaker segmentation on the target audio data, thereby obtaining the second time period corresponding to each speech segment in the target audio data.
The method according to any one of claims 1-5, wherein the method further comprises:

for each sentence in the sentence segmentation result, segmenting the sentence according to semantics to obtain a plurality of clauses.
The method according to claim 6, wherein the method further comprises:

generating subtitle text corresponding to the target audio data according to the plurality of clauses.
A sentence segmentation device, comprising:

an acquisition module, configured to acquire target audio data;

an extraction module, configured to extract the speech recognition text corresponding to the target audio data acquired by the acquisition module and a first time period, in the target audio data, corresponding to each recognized character in the speech recognition text;

a first segmentation module, configured to perform speaker segmentation on the target audio data acquired by the acquisition module to obtain a second time period corresponding to each speech segment in the target audio data; and

a second segmentation module, configured to perform speaker segmentation on the speech recognition text according to the first time period corresponding to each recognized character extracted by the extraction module and the second time period corresponding to each speech segment obtained by the first segmentation module, to obtain a sentence segmentation result.
A computer-readable medium, on which a computer program is stored, wherein, when the program is executed by a processing device, the steps of the method according to any one of claims 1-7 are implemented.
An electronic device, comprising:

a storage means on which one or more computer programs are stored; and

one or more processing means for executing the one or more computer programs in the storage means to implement the steps of the method according to any one of claims 1-7.