CN111832308B - Speech recognition text consistency processing method and device - Google Patents


Info

Publication number: CN111832308B (granted from application CN202010694673.7A; earlier published as CN111832308A)
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: sentence, word, sentences, word embedding, similarity
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Inventors: 缪庆亮, 吴仁守, 朱钦佩, 朱少华
Current and original assignee: Sipic Technology Co Ltd (the listed assignee may be inaccurate)
Application filed by Sipic Technology Co Ltd


Classifications

    • G06F40/295 Named entity recognition (under G06F40/00 Handling natural language data → G06F40/20 Natural language analysis → G06F40/279 Recognition of textual entities → G06F40/289 Phrasal analysis)
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars (under G06F40/205 Parsing)
    • G06F40/30 Semantic analysis
    • G10L15/26 Speech to text systems (under G10L15/00 Speech recognition)
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a method and a device for consistency processing of speech recognition text. The method comprises the following steps: identifying the starting position of at least one piece of key information in the speech recognition text; taking a plurality of sentences from the starting position, calculating a second word embedding for each sentence from the first word embeddings of the words or phrases in that sentence, and calculating a third word embedding for the text segment from the second word embeddings; calculating, based at least on the first, second, and third word embeddings, the similarity between each sentence and the other sentences, the distance decay between each sentence and the starting sentence, and the consistency between each sentence and the starting sentence; constructing a semantic graph from the similarities and calculating the importance of each sentence from the graph; and obtaining one or more cluster centers with a graph clustering algorithm, calculating for each sentence the sum of its similarity, consistency, importance, and distance decay, and taking the top-n sentences by this sum as the coherent sentence sequence.

Description

Speech recognition text consistency processing method and device
Technical Field
The application belongs to the technical field of speech recognition post-processing, and particularly relates to a method and a device for consistency processing of speech recognition text.
Background
In the related art, automatic speech recognition (ASR) systems make sentence-segmentation errors in their output, which causes problems for text analysis performed after speech transcription, such as quality inspection and meeting summarization. Text analysis systems therefore face issues such as incoherent ASR results.
Current methods for judging whether sentences are coherent mainly include:
Acoustic-feature-based methods: predicting sentence boundaries from pauses or prosody in a person's speech.
Text-feature-based methods: using language-model or sequence-labeling models to predict whether a sentence-end mark follows a word.
Disclosure of Invention
The embodiments of the application provide a method and a device for consistency processing of speech recognition text, intended to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present application provides a method for consistency processing of speech recognition text, including: identifying the starting position of at least one piece of key information in a speech recognition text through a preset classification template or a preset classification model, wherein the preset classification template or model is formed based on preset keywords, and the key information is the content corresponding to the preset keywords; taking a plurality of sentences from the starting position, calculating a second word embedding for each sentence from the first word embeddings of the words or phrases in that sentence, and calculating a third word embedding for the text segment formed by the plurality of sentences from the second word embeddings; calculating, based at least on the first, second, and third word embeddings, the similarity between each sentence and the other sentences, the distance decay between each sentence and the starting sentence, and the consistency between each sentence and the starting sentence; constructing a semantic graph based on the similarities and calculating the importance of each sentence from the semantic graph; and obtaining one or more cluster centers with a graph clustering algorithm, calculating for each sentence the sum of its similarity, consistency, importance, and distance decay, and taking the top-n sentences by this sum as the coherent sentence sequence.
In a second aspect, an embodiment of the present application provides a speech recognition text consistency processing apparatus, including: a recognition module configured to identify the starting position of at least one piece of key information in a speech recognition text through a preset classification template or a preset classification model, wherein the preset classification template or model is formed based on preset keywords, and the key information is the content corresponding to the preset keywords; an embedding module configured to take a plurality of sentences from the starting position, calculate a second word embedding for each sentence from the first word embeddings of the words or phrases in that sentence, and calculate a third word embedding for the text segment formed by the plurality of sentences from the second word embeddings; a first calculation module configured to calculate, based at least on the first, second, and third word embeddings, the similarity between each sentence and the other sentences, the distance decay between each sentence and the starting sentence, and the consistency between each sentence and the starting sentence; a construction module configured to construct a semantic graph based on the similarities and calculate the importance of each sentence from the semantic graph; and a second calculation module configured to obtain one or more cluster centers with a graph clustering algorithm, calculate for each sentence the sum of its similarity, consistency, importance, and distance decay, and take the top-n sentences by this sum as the coherent sentence sequence.
In a third aspect, there is provided a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the speech recognition text consistency processing method of the first aspect.
In a fourth aspect, an embodiment of the present application further provides an electronic device, including: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of the first aspect.
The method provided by the embodiments of the application identifies the starting position of key information through a preset classification template or classification model, takes a text segment from that starting position, computes word embeddings for each sentence and for the text segment, calculates the semantic similarity of each sentence to the segment, applies a degree of similarity decay based on the distance between sentences, and finally selects N sentences as the final result, thereby determining which sentences in the text segment are coherent.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for processing consistency of speech recognition text according to an embodiment of the present application;
FIG. 2 is a flowchart of another method for processing consistency of speech recognition text according to an embodiment of the present application;
FIG. 3 is a flowchart of another method for processing consistency of speech recognition text according to an embodiment of the present application;
FIG. 4 is a flowchart of another method for processing consistency of speech recognition text according to an embodiment of the present application;
FIG. 5 is a system flow diagram of one embodiment of a scheme for speech recognition text consistency processing in accordance with an embodiment of the present application;
FIG. 6 is a flow chart of a vector representation of sentences and documents of a specific embodiment of a scheme for speech recognition text consistency processing of an embodiment of the present application;
FIG. 7 is a flowchart of a sentence and text segment similarity output for one embodiment of a scheme for speech recognition text consistency processing in accordance with an embodiment of the present application;
FIG. 8 is a block diagram of a speech recognition text consistency processing device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, a flowchart of an embodiment of the speech recognition text consistency processing method of the present application is shown.
As shown in fig. 1, in step 101, the starting position of at least one piece of key information in a speech recognition text is identified through a preset classification template or a preset classification model, wherein the preset classification template or model is formed based on preset keywords and the key information is the content corresponding to the preset keywords;
in step 102, a plurality of sentences are taken from the starting position, the second word embedding of each sentence is calculated from the first word embeddings of the words or phrases in that sentence, and the third word embedding of the text segment formed by the plurality of sentences is calculated from the second word embeddings;
in step 103, the similarity between each sentence and the other sentences, the distance decay between each sentence and the starting sentence, and the consistency between each sentence and the starting sentence are calculated based at least on the first word embedding, the second word embedding, and the third word embedding;
in step 104, a semantic graph is constructed based on the similarities, and the importance of each sentence is calculated from the semantic graph;
in step 105, one or more cluster centers are obtained using a graph clustering algorithm, the sum of the similarity, consistency, importance, and distance decay is calculated for each sentence, and the top-n sentences by this sum are taken as the coherent sentence sequence.
In this embodiment, in step 101, the speech recognition text consistency processing apparatus identifies, using a preset classification template or classification model formed from preset keywords, the starting position of at least one piece of key information, i.e. content corresponding to a preset keyword, in the speech recognition text. For example, suppose a classification template with slots for conference topic, conference time, and conference place is preset, and the speech recognition text is: "Today we discuss the problem of project A; the time of the meeting is set to 4 pm; the place is set in the meeting room." The template can then identify the first piece of key information (project A), the second piece (4 pm), and the third piece (the meeting room), together with the starting position of each in the text.
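The template matching of step 101 might be sketched as follows. This is a minimal, hypothetical implementation: the slot names, template strings, and regular expressions below are illustrative assumptions, not the patent's actual templates.

```python
import re

# Hypothetical templates mapping a slot to a pattern whose capture group
# is the key information; positions are character offsets into the text.
TEMPLATES = {
    "topic": re.compile(r"today we discuss (.+?) problem"),
    "time":  re.compile(r"time of (?:the )?meeting is set to (.+?)(?:,|$)"),
    "place": re.compile(r"place is set in (.+?)(?:,|;|$)"),
}

def find_key_positions(text):
    """Return {slot: start_index} for each template that matches the text."""
    positions = {}
    for slot, pattern in TEMPLATES.items():
        m = pattern.search(text)
        if m:
            positions[slot] = m.start(1)  # start of the captured key content
    return positions
```

A classification model trained on such templates and keywords could replace the regular expressions while keeping the same interface.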
Then, in step 102, the apparatus takes a plurality of sentences starting from the starting position, calculates the second word embedding of each sentence from the first word embeddings of the words or phrases in that sentence, and calculates the third word embedding of the text segment formed by those sentences from the second word embeddings. Word embedding here means converting each word or phrase in a sentence into a vector representation. For example, for the text segment "Today we discuss the problem of project A; the time of the meeting is set to 4 pm; the place is set in the meeting room", vector representations are first obtained for "today", "we", "discuss", "project A", "problem", "meeting", "time", "set", "4 pm", "place", and "meeting room". From these, a vector is computed for each sentence (e.g. the vector of "Today we discuss the problem of project A" from the vectors of its words), and finally the vector of the whole text segment is computed from the sentence vectors.
Then, in step 103, the apparatus calculates, from the first, second, and third word embeddings computed above, the similarity between each sentence and the other sentences, the distance decay between each sentence and the starting sentence, and the consistency between each sentence and the starting sentence.
Then, in step 104, the apparatus constructs a semantic graph based on the similarities and calculates the importance of each sentence from the graph. The semantic graph model is a research perspective on language typology that has attracted attention in recent years; it aims to characterize the multifunctionality of grammatical forms with geometric figures and to reveal the systematicity and regularity of multifunctional patterns of grammatical forms in human language.
Finally, in step 105, one or more cluster centers are obtained using a graph clustering algorithm, the sum of the similarity, consistency, importance, and distance decay is calculated for each sentence, and the top-n sentences by this sum are taken as the coherent sentence sequence. Cluster analysis is a common machine learning technique whose purpose is to divide data points into several classes, such that data in the same class have high similarity while the similarity between different classes is low.
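Assuming the four per-sentence scores have already been computed, the final ranking of step 105 might look like the sketch below; the graph-clustering step that produces cluster centers is omitted here, and the score values used in testing are arbitrary.

```python
def select_coherent_sentences(scores, n):
    """scores: list of (sentence_index, similarity, consistency,
    importance, distance_decay) tuples. Returns the indices of the
    top-n sentences ranked by the sum of the four components."""
    ranked = sorted(scores, key=lambda t: sum(t[1:]), reverse=True)
    return [idx for idx, *_ in ranked[:n]]
```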
In the scheme of this embodiment, the starting position of the key information is identified through a preset classification template or classification model, a text segment is then taken from that starting position, word embeddings are computed for each sentence and for the segment, the semantic similarity of each sentence to the segment is calculated, a degree of similarity decay is applied based on the distance between sentences, and finally N sentences are selected as the final result, thereby determining which sentences in the text segment are coherent.
Referring to fig. 2, a flowchart of another method for consistency processing of speech recognition text according to an embodiment of the present application is shown; this flowchart mainly further defines step 104 of fig. 1, "constructing a semantic graph based on the similarity and calculating the importance of each sentence according to the semantic graph".
As shown in fig. 2, in step 201, each sentence is taken as a node of the semantic graph, and edges between the nodes represent the similarity between each sentence and other sentences;
in step 202, the importance of each sentence is calculated using the TextRank algorithm based on the similarities.
In this embodiment, in step 201, the speech recognition text consistency processing apparatus takes each sentence as a node of the semantic graph, with edges between nodes representing the similarity between sentences; for example, the semantic similarity between the i-th and j-th sentences is S(i, j). An N×N semantic graph is built over the sentences, in which the nodes are sentences and the edges represent semantic relatedness, expressed by S(i, j).
Then, in step 202, the apparatus calculates the importance of each sentence using the TextRank algorithm based on the similarities; for example, the importance of the i-th sentence may be denoted S3(i).
In the scheme of this embodiment, by building a semantic graph of the semantic similarity between sentences, the importance of each sentence can be calculated with the TextRank algorithm.
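The TextRank computation over the semantic graph can be sketched as a pure-Python power iteration on the N×N similarity matrix S(i, j). The damping factor and iteration count below are conventional TextRank defaults, not values given in the patent.

```python
def textrank(sim, d=0.85, iters=50):
    """TextRank over a semantic graph given as an N×N similarity
    matrix (sim[i][j] = S(i, j)); returns one importance score per
    sentence, playing the role of S3(i) in the text."""
    n = len(sim)
    scores = [1.0 / n] * n
    # Total edge weight leaving each node (guard against isolated nodes).
    out_weight = [sum(row) or 1.0 for row in sim]
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = sum(scores[j] * sim[j][i] / out_weight[j]
                       for j in range(n) if j != i)
            new.append((1 - d) / n + d * rank)
        scores = new
    return scores
```

A sentence connected to many similar sentences accumulates a higher score, mirroring how TextRank promotes central sentences.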
In some alternative embodiments, the preset classification template is composed of the preset keywords and a template, and the method further includes: training the preset classification model with the template and the preset keywords, so that the model can identify the key information in the speech recognition text.
Referring to fig. 3, a flowchart of another method for consistency processing of speech recognition text according to an embodiment of the present application is shown; this flowchart mainly further defines step 102 of fig. 1, "taking a plurality of sentences from the starting position, calculating the second word embedding of each sentence from the first word embeddings of the words or phrases in that sentence, and calculating the third word embedding of the text segment formed by the plurality of sentences from the second word embeddings".
As shown in fig. 3, in step 301, a plurality of sentences are taken from the starting position, and the first word embeddings of the words or phrases in each sentence are accumulated to obtain the second word embedding of that sentence;
in step 302, the second word embeddings are accumulated to obtain the third word embedding of the text segment formed by the plurality of sentences.
In this embodiment, in step 301, the speech recognition text consistency processing apparatus takes a plurality of sentences from the starting position to form a text segment and performs word embedding, converting each word or phrase in each sentence into a vector representation (the first word embedding); the vectors of the words or phrases are then accumulated directly to obtain the vector representation of each sentence (the second word embedding).
Finally, in step 302, the vector representations of the sentences are accumulated directly to obtain the vector representation of the text segment formed by the plurality of sentences (the third word embedding).
For example, for the text segment "Today we discuss the problem of project A; the time of the meeting is set to 4 pm; the place is set in the meeting room", the vectors of "today", "we", "discuss", "project A", and "problem" are summed directly to obtain the vector of the sentence "Today we discuss the problem of project A", and the sentence vectors are then summed directly to obtain the vector of the text segment.
In the scheme of this embodiment, the word embeddings of each sentence and of the text segment provide vector representations of each sentence and of the text segment.
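The direct accumulation of steps 301 and 302 can be sketched in a few lines of pure Python. This is a toy illustration: the word vectors used in testing are made up, and a real system would use pretrained embeddings.

```python
def sum_vectors(vectors):
    """Element-wise sum of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) for i in range(dim)]

def sentence_embedding(words, word_vecs):
    """Second word embedding: sum of the first word embeddings of the
    words (or phrases) in one sentence."""
    return sum_vectors([word_vecs[w] for w in words])

def segment_embedding(sentences, word_vecs):
    """Third word embedding: sum of the sentence embeddings of the
    text segment's sentences."""
    return sum_vectors([sentence_embedding(s, word_vecs) for s in sentences])
```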
Referring to fig. 4, a flowchart of another method for consistency processing of speech recognition text according to an embodiment of the present application is shown; this flowchart mainly further defines step 103 of fig. 1, "calculating the similarity between each sentence and the other sentences, the distance decay between each sentence and the starting sentence, and the consistency between each sentence and the starting sentence based at least on the first word embedding, the second word embedding, and the third word embedding".
As shown in fig. 4, in step 401, the similarity between the second word embedding and the third word embedding is calculated by cosine distance;
in step 402, the distance decay from each sentence to the starting sentence is calculated based on a preset distance decay formula;
in step 403, it is determined whether a conjunction links each sentence to the other sentences, or whether each sentence and the other sentences contain a common entity word, so as to calculate the consistency between each sentence and the starting sentence.
In this embodiment, in step 401, the speech recognition text consistency processing apparatus calculates the similarity between the second word embedding and the third word embedding by the cosine distance between them. Then, in step 402, the apparatus calculates the distance decay from each sentence to the starting sentence based on a preset distance decay formula. For example, the distance between each sentence's semantic vector and the overall semantic vector of the text segment is calculated by cosine distance; the larger the score, the better the sentence matches the overall semantics of the segment. Finally, in step 403, the apparatus determines whether each sentence is linked to the other sentences by a conjunction, or whether each sentence and the other sentences contain a common entity word, in order to calculate the consistency between each sentence and the starting sentence; for example, whether there are conjunctions such as "and" or "still" between sentences, or shared named-entity words such as a common personal name.
In the scheme of this embodiment, with the word embeddings of each sentence and of the text segment, the semantic similarity of each sentence to the segment can be calculated by cosine distance together with distance decay, and the consistency of each sentence can be assessed by judging whether it contains a conjunction or a common entity word.
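The conjunction/shared-entity test of step 403 might be sketched as follows. The conjunction list and the entity set are simplified stand-ins for whatever lexicon and named-entity recognizer a real system would use.

```python
# Hypothetical conjunction lexicon; the patent only gives "and" and
# "still" as examples.
CONJUNCTIONS = {"and", "also", "still", "moreover", "therefore"}

def is_coherent(sentence_words, other_words, entities):
    """True if the sentence contains a linking conjunction or shares a
    named-entity word (e.g. a personal name) with the other sentence.
    entities: set of known named-entity words."""
    has_conjunction = bool(set(sentence_words) & CONJUNCTIONS)
    shared_entity = bool(set(sentence_words) & set(other_words) & entities)
    return has_conjunction or shared_entity
```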
In the method of the foregoing embodiment, the preset distance decay formula is:
θ(l) = N0·e^(−λl)
where N0 = 1.0, λ is a preset threshold, and l is the distance from the current sentence to the starting sentence.
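The decay formula translates directly into code; λ and l are as defined above, while the λ value used in testing is an arbitrary example.

```python
import math

def distance_decay(l, lam, n0=1.0):
    """θ(l) = N0 · e^(−λl): decay applied to a sentence at distance l
    from the starting sentence, with preset threshold λ and N0 = 1.0."""
    return n0 * math.exp(-lam * l)
```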
In some alternative embodiments, the method further comprises: in response to an audio input or recording by a user, the audio input or recording is converted into speech recognition text.
To aid understanding of the solution of the present application, the following describes some problems the inventors encountered while implementing the application and one specific embodiment of the finally determined solution.
Drawbacks of these similar techniques:
method based on acoustic features: predicting the whole sentence according to the pause or prosody (prosody) of the person when speaking; because the end of the sentence and the pause time in the voice information are not necessarily related, the speaking speed of each person in each context is different, and the pause time threshold is difficult to set, so that the accuracy of the method is low. Also, sentence end punctuation marks such as periods, question marks, exclamation marks, and the like cannot be distinguished.
Text-feature-based methods: using language-model or sequence-labeling models to predict whether a sentence-end mark follows a word. Models trained this way generalize poorly, and punctuation usage habits differ across contexts, so the final effect of the model is limited.
Why the solution is not obvious:
current methods for judging sentence consistency are often based on acoustic and language models, and common practices include:
method based on acoustic features: i.e. predicting the whole sentence according to the pause or prosody (prosody) of the person speaking; the method for solving the defects of the method is to dynamically adjust the threshold value of the pause interval, and set different threshold values according to acoustic characteristics such as speech speed of each person.
Predicting whether a sentence-end mark follows a word based on language-model or sequence-labeling modeling. The problem with this class of methods is poor generalization, and extending to other domains also requires retraining or tuning the model. A remedy is to use large-scale pre-trained models such as BERT.
The technical problems existing in the prior art are solved through the following scheme:
the scheme provided by the patent not only considers language characteristics, but also considers semantic information among sentences more, and focuses on semantic association among sentences more, and the consistency of text fragments is recognized instead of the consistency inside the sentences. Meanwhile, the method can realize the identification of discontinuous sentences by using a graph ordering algorithm. Such as the sentence ABCDEF, where the ABDF may be the conclusion of the meeting and the CE is the meeting to be handled. The identification of non-consecutive sentences as identical semantically consecutive fragments may be achieved. This is not possible with conventional methods.
Taking as an example summary generation after converting a conference recording to text: first, the starting positions of key information such as the conference topic, conference conclusion, and conference to-dos are identified through rule templates, keywords, or a classification model. Then a text segment of a certain length, containing several sentences, is selected with the first sentence as the starting point. Word embeddings are computed for each sentence and for the text segment, and the semantic similarity of each sentence to the segment is calculated. Combined with the distance information between sentences, a certain semantic-similarity decay is applied, and finally n sentences are selected as the final result. The application combines keyword information, position information, semantic similarity, and other signals, and determines the coherent sentences of the segment through a ranking algorithm.
The technical innovations of this application are:
Important information pre-positioning method based on keywords
Method for calculating semantic similarity between sentences
Sentence sorting and selecting method
The flow of the method is shown in fig. 1. First, the starting position of the key information is identified, using keywords, templates, or a classification model. Some example templates and keywords follow. The keywords and templates can also be used to train a classification model that identifies key information such as topic, time, and place.
Conference theme:
1. Today we discuss the problem of xxxx
2. Let's chat about xxxxx
3. Today we are here to (talk/chat/discuss/explore) the matter of xxxx
Meeting time:
breakfast meeting
Meeting at the afternoon of today
The meeting time is afternoon
Meeting place:
We are holding the meeting at xxx
Our meeting place is xxxx
The meeting place is xxxxxx
Meeting person information:
1. Those at today's meeting are xxx, xxx, xxx and xxx
2. xxxx, xxx, and xxx are holding the meeting together
3. The participants and attendees include xxxx
Meeting to be done:
backlog 1 is/has xxxx
To-do matter is/has xxx
To-do is/has xxx
Meeting to-be-handled responsible person:
xxxx is responsible; xxxx assists and reports to xxxx
Responsible person is xxx
Responsible persons are xxx and xxxx
Conference conclusion:
conference conclusion is xxx
In summary, xxxxx
Overall, xxxxx
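A minimal Python sketch of this keyword/template pre-positioning step. The patterns and sentences below are illustrative assumptions (the patent's actual templates are Chinese-language phrases like those listed above), not part of the disclosure:

```python
import re

# Hypothetical English stand-ins for the patent's templates; each key
# information type maps to a list of regex patterns marking its start.
TEMPLATES = {
    "theme": [r"today we discuss", r"let's chat about"],
    "conclusion": [r"conference conclusion is", r"in summary"],
}

def find_start_positions(sentences):
    """Return {info_type: index of the first sentence matching a template}."""
    starts = {}
    for i, sent in enumerate(sentences):
        for info_type, patterns in TEMPLATES.items():
            if info_type in starts:
                continue  # keep only the first match per type
            if any(re.search(p, sent, re.IGNORECASE) for p in patterns):
                starts[info_type] = i
    return starts
```

In practice the same keyword lists can be used to label training data for a classifier, as the description notes; the regex matcher above is only the template branch of that step.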
And secondly, N sentences are taken starting from the initial position determined in the first step. The Word Embedding of each word or phrase in a sentence is directly accumulated (summed) to obtain the Word Embedding representation of the sentence; the Word Embeddings of the N sentences are in turn directly accumulated to obtain the Word Embedding of the text fragment composed of the N sentences.
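The accumulation described above is plain vector summation. A minimal sketch (function names are illustrative; how the word vectors themselves are produced, e.g. by word2vec, is left outside the patent's description of this step):

```python
import numpy as np

def sentence_embedding(word_embeddings):
    """Second word embedding: sum the word/phrase vectors of one sentence."""
    return np.sum(word_embeddings, axis=0)

def fragment_embedding(sentence_embeddings):
    """Third word embedding: sum the sentence vectors of the N-sentence fragment."""
    return np.sum(sentence_embeddings, axis=0)
```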
The distance between the semantic vector of each sentence and the overall semantic vector of the text fragment is computed as a cosine similarity; the larger the score, the better the sentence matches the overall semantics of the fragment. The semantic similarity score between each sentence and the whole fragment is computed, with S1(i) denoting the semantic similarity of the i-th sentence.
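S1(i) is then just the cosine of the angle between the sentence vector and the fragment vector, sketched as:

```python
import numpy as np

def cosine_similarity(a, b):
    """S1(i): cosine similarity between a sentence vector a and the
    fragment vector b; 1.0 means identical direction, 0.0 orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```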
Third, the distance decay is computed by equation (1), where N0 = 1.0, λ is adjusted as needed, and l is the distance from the current sentence to the starting sentence:

θ(l) = N0 · e^(-λl)    (1)
The attenuated semantic similarity is computed between the n-th sentence and the preceding n-1 sentences, where n takes values from 2 to N.
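A minimal sketch of the decay in equation (1). The patent fixes N0 = 1.0 but leaves λ tunable; the default of 0.1 below is an illustrative assumption:

```python
import math

def distance_decay(l, n0=1.0, lam=0.1):
    """theta(l) = N0 * e^(-lambda * l): attenuation applied to a sentence
    that is l sentences away from the starting sentence."""
    return n0 * math.exp(-lam * l)
```

Larger λ makes the similarity of distant sentences fall off faster, so λ controls how strongly the method prefers sentences close to the starting sentence.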
Step four, continuity is computed from two cues: 1. whether a sentence contains a conjunction such as "and", in which case its continuity is 1.0; 2. whether sentences share a named-entity word (e.g., a company or person name), in which case the continuity is also 1.0. S2(i) denotes the continuity of the i-th sentence with the starting sentence.
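A sketch of the continuity check. The conjunction list and the entity lookup are illustrative stand-ins: the patent targets Chinese conjunctions and presumes a named-entity recognizer, neither of which is specified here:

```python
def continuity_score(sentence, start_sentence,
                     conjunctions=("and", "also", "moreover"),
                     entities=None):
    """S2(i): 1.0 if the sentence begins with a conjunction, or shares a
    named-entity word with the starting sentence; otherwise 0.0.
    `entities` maps a sentence to the entity strings found in it
    (hypothetical interface standing in for an NER component)."""
    words = sentence.lower().split()
    if words and words[0] in conjunctions:
        return 1.0
    if entities:
        shared = set(entities.get(sentence, ())) & set(entities.get(start_sentence, ()))
        if shared:
            return 1.0
    return 0.0
```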
Fifthly, a semantic graph is constructed. Semantic similarity between sentences is computed as in step 2; for example, the similarity between the i-th and j-th sentences is S(i, j). An N×N semantic graph is built among the sentences, in which the nodes are sentences and the edges represent semantic relatedness, weighted by S(i, j). The importance of each node (sentence) is then computed with the TextRank algorithm, with S3(i) denoting the importance of the i-th sentence.
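A compact sketch of TextRank over the N×N similarity matrix, implemented as damped power iteration (the damping factor 0.85 and iteration count are conventional defaults, not values given in the patent):

```python
import numpy as np

def textrank(sim_matrix, d=0.85, iters=50):
    """S3(i): importance of each sentence on the semantic graph, where
    sim_matrix[i, j] = S(i, j). Rows are normalized into transition
    weights, then PageRank-style power iteration is applied."""
    n = sim_matrix.shape[0]
    w = sim_matrix.copy().astype(float)
    np.fill_diagonal(w, 0.0)             # no self-loops
    row_sums = w.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0        # guard isolated nodes
    w = w / row_sums
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - d) / n + d * w.T.dot(scores)
    return scores
```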
And sixthly, one or more clustering centers are found with a graph clustering algorithm; then, for each clustering center, S = S1(i) + S2(i) + S3(i) + θ(l) is computed for each sentence, and the sentences ranked in the top n by S are taken as the coherent sentence sequence.
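The final combination step can be sketched as below. The clustering-center selection is omitted for brevity, and the unweighted sum of the four scores follows the formula above; whether the patent intends any weighting is not stated:

```python
def rank_coherent_sentences(s1, s2, s3, theta, n):
    """Combine S = S1(i) + S2(i) + S3(i) + theta(l_i) per sentence and
    return the indices of the top-n sentences, restored to original
    document order, as the coherent sentence sequence."""
    totals = [a + b + c + t for a, b, c, t in zip(s1, s2, s3, theta)]
    order = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
    return sorted(order[:n])
```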
Referring to fig. 8, a block diagram of a speech recognition text consistency processing device according to an embodiment of the application is shown.
As shown in fig. 8, the speech recognition text continuity processing device 800 includes: the identification module 810, the embedding module 820, the first calculation module 830, the construction module 840, and the second calculation module 850.
The recognition module 810 is configured to recognize a starting position of at least one piece of key information in the speech recognition text through a preset classification template or a preset classification model, wherein the preset classification template or the preset classification model is formed based on preset keywords, and the key information is the content corresponding to the preset keywords; the embedding module 820 is configured to take a plurality of sentences from the starting position, calculate a second word embedding corresponding to each sentence from a first word embedding of each word or phrase in the sentence, and calculate a third word embedding corresponding to the text segment composed of the plurality of sentences from the second word embeddings; the first calculation module 830 is configured to calculate a similarity between each sentence and the other sentences, a distance decay between each sentence and the starting sentence, and a consistency between each sentence and the starting sentence, based at least on the first word embedding, the second word embedding, and the third word embedding; the construction module 840 is configured to construct a semantic graph based on the similarity and calculate the importance degree of each sentence from the semantic graph; and the second calculation module 850 is configured to obtain one or more clustering centers by using a graph clustering algorithm, calculate for each clustering center the sum of the similarity, the consistency, the importance, and the distance attenuation, and take the sentences whose sums rank in the top n as a coherent sentence sequence.
It should be understood that the modules depicted in fig. 8 correspond to the various steps in the method described with reference to fig. 1, 2, 3, and 4. Thus, the operations and features described above for the method and the corresponding technical effects are equally applicable to the modules in fig. 8, and are not described here again.
It should be noted that the modules in the embodiments of the present application do not limit the solution of the present application. For example, the recognition module may be described as a module that recognizes the starting position of at least one piece of key information in the speech recognition text through a preset classification template or a preset classification model, where the preset classification template or model is formed based on preset keywords and the key information is the content corresponding to those keywords. In addition, the related functional modules may be implemented by a hardware processor; for example, the recognition module may also be implemented by a processor, which is not described here again.
In other embodiments, embodiments of the present application further provide a non-volatile computer storage medium storing computer executable instructions that are capable of performing the method for processing consistency of speech recognition text in any of the above-described method embodiments;
as one embodiment, the non-volatile computer storage medium of the present application stores computer-executable instructions configured to:
identifying the initial position of at least one piece of key information in a voice recognition text through a preset classification template or a preset classification model, wherein the preset classification template or the preset classification model is formed based on preset keywords, and the key information is content corresponding to the preset keywords;
taking a plurality of sentences from the initial position, calculating second word embedding corresponding to each sentence according to first word embedding of each word or each phrase in each sentence, and calculating third word embedding corresponding to a text segment formed by the plurality of sentences according to the second word embedding;
calculating a similarity between each sentence and the other sentences, a distance decay between each sentence and the starting sentence, and a consistency between each sentence and the starting sentence, based at least on the first word embedding, the second word embedding, and the third word embedding;
constructing a semantic graph based on the similarity, and calculating the importance degree of each sentence according to the semantic graph;
and obtaining one or more clustering centers by using a graph clustering algorithm, calculating, for each clustering center, the sum of the similarity, the consistency, the importance degree, and the distance attenuation, and taking the sentences whose sums rank in the top n as a coherent sentence sequence.
The non-transitory computer readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from the use of the speech recognition text continuity processing means, etc. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes a memory remotely located with respect to the processor, the remote memory being connectable to the speech recognition text continuity processing means through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present application also provide a computer program product comprising a computer program stored on a non-volatile computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any of the above-described speech recognition text consistency processing methods.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 9, the device includes one or more processors 910 and a memory 920; one processor 910 is taken as an example in fig. 9. The apparatus for the speech recognition text consistency processing method may further include an input device 930 and an output device 940. The processor 910, the memory 920, the input device 930, and the output device 940 may be connected by a bus, as exemplified in fig. 9, or by other means. The memory 920 is the non-volatile computer-readable storage medium described above. The processor 910 executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions, and modules stored in the memory 920, thereby implementing the method of the above-described method embodiments. The input device 930 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the speech recognition text continuity processing device. The output device 940 may include a display device such as a display screen.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present application.
As an embodiment, the electronic device is applied to a speech recognition text consistency processing device, and includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to:
identifying the initial position of at least one piece of key information in a voice recognition text through a preset classification template or a preset classification model, wherein the preset classification template or the preset classification model is formed based on preset keywords, and the key information is content corresponding to the preset keywords;
taking a plurality of sentences from the initial position, calculating second word embedding corresponding to each sentence according to first word embedding of each word or each phrase in each sentence, and calculating third word embedding corresponding to a text segment formed by the plurality of sentences according to the second word embedding;
calculating a similarity between each sentence and the other sentences, a distance decay between each sentence and the starting sentence, and a consistency between each sentence and the starting sentence, based at least on the first word embedding, the second word embedding, and the third word embedding;
constructing a semantic graph based on the similarity, and calculating the importance degree of each sentence according to the semantic graph;
and obtaining one or more clustering centers by using a graph clustering algorithm, calculating, for each clustering center, the sum of the similarity, the consistency, the importance degree, and the distance attenuation, and taking the sentences whose sums rank in the top n as a coherent sentence sequence.
The electronic device of the embodiments of the present application exists in a variety of forms including, but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice, data communications. Such terminals include smart phones, multimedia phones, functional phones, low-end phones, and the like.
(2) Ultra mobile personal computer device: such devices are in the category of personal computers, having computing and processing functions, and generally also having mobile internet access characteristics. Such terminals include: PDA, MID, and UMPC devices, etc.
(3) Portable entertainment device: such devices may display and play multimedia content. The device comprises an audio player, a video player, a palm game machine, an electronic book, an intelligent toy and a portable vehicle navigation device.
(4) Server: a device that provides computing services with high reliability. Its composition is similar to a general computer architecture, but it has higher requirements on processing capacity, stability, reliability, security, scalability, manageability, and the like.
(5) Other electronic devices with data interaction function.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without undue burden.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (8)

1. A method of speech recognition text consistency processing, comprising:
identifying the initial position of at least one piece of key information in a voice recognition text through a preset classification template or a preset classification model, wherein the preset classification template or the preset classification model is formed based on preset keywords, and the key information is content corresponding to the preset keywords;
taking a plurality of sentences from the starting position, calculating a second word embedding corresponding to each sentence according to a first word embedding of each word or each phrase in each sentence, and calculating a third word embedding corresponding to a text segment formed by the plurality of sentences according to the second word embedding, wherein the first word embedding is to convert each word or each phrase in each sentence into a vector representation of each word or each phrase, the second word embedding is to directly accumulate the vector representation of each word or each phrase to obtain a vector representation corresponding to each sentence, and the third word embedding is to directly accumulate the vector representation of each sentence to obtain a vector representation corresponding to the text segment formed by the plurality of sentences;
calculating a similarity between each sentence and the other sentences, a distance decay between each sentence and the starting sentence, and a consistency between each sentence and the starting sentence, based on the first word embedding, the second word embedding, and the third word embedding;
calculating the similarity of the second word embedding and the third word embedding through cosine distance;
calculating the distance attenuation from each sentence to the starting sentence based on a preset distance attenuation formula, wherein the preset distance attenuation formula is:

θ(l) = N0 · e^(-λl)

wherein N0 = 1.0, λ is a preset threshold, and l is the distance from the current sentence to the starting sentence;
judging whether each sentence and the other sentences contain continuation words or shared named-entity words, so as to calculate the continuity of each sentence with the starting sentence;
constructing a semantic graph based on the similarity, and calculating the importance degree of each sentence according to the semantic graph;
and obtaining one or more clustering centers by using a graph clustering algorithm, calculating, for each clustering center, the sum of the similarity, the consistency, the importance degree, and the distance attenuation, and taking the sentences whose sums rank in the top n as a coherent sentence sequence.
2. The method of claim 1, wherein the constructing a semantic graph based on the similarity and calculating the importance of each sentence from the semantic graph comprises:
taking each sentence as a node of a semantic graph, and representing the similarity between each sentence and other sentences by edges between the nodes;
and calculating the importance degree of each sentence by using the TextRank algorithm based on the similarity.
3. The method of claim 1, wherein the preset classification template consists of the preset keywords and templates, the method further comprising:
training the preset classification model by using the template and the preset keywords, so that the preset classification model can identify the key information in the voice recognition text.
4. The method of claim 1, wherein the taking a plurality of sentences from the starting position, calculating a second word embedding corresponding to each sentence from a first word embedding of each word or each phrase in each sentence, and calculating a third word embedding corresponding to a text segment composed of the plurality of sentences from the second word embedding comprises:
taking a plurality of sentences from the initial position, and embedding and accumulating the first word of each word or each phrase in each sentence to obtain a second word embedding corresponding to each sentence;
and embedding and accumulating the second words to obtain third word embedding corresponding to the text fragments formed by the sentences.
5. The method of any of claims 1-4, wherein the method further comprises:
in response to an audio input or recording of a user, the audio input or recording is converted into speech recognition text.
6. A speech recognition text continuity processing device comprising:
the recognition module is configured to recognize the starting position of at least one piece of key information in the voice recognition text through a preset classification template or a preset classification model, wherein the preset classification template or the preset classification model is formed based on preset keywords, and the key information is content corresponding to the preset keywords;
an embedding module configured to take a plurality of sentences from the starting position, calculate a second word embedding corresponding to each sentence according to a first word embedding of each word or each phrase in each sentence, calculate a third word embedding corresponding to a text segment composed of the plurality of sentences according to the second word embedding, wherein the first word embedding is to convert each word or each phrase in each sentence into a vector representation of each word or each phrase, the second word embedding is to directly accumulate the vector representation of each word or each phrase to obtain a vector representation corresponding to each sentence, and the third word embedding is to directly accumulate the vector representation of each sentence to obtain a vector representation corresponding to the text segment composed of the plurality of sentences;
a first calculation module configured to calculate a similarity between each sentence and the other sentences, a distance decay between each sentence and the starting sentence, and a continuity between each sentence and the starting sentence based on the first word embedding, the second word embedding, and the third word embedding; the first calculation module further: calculates the similarity of the second word embedding and the third word embedding through cosine distance; calculates the distance attenuation from each sentence to the starting sentence based on a preset distance attenuation formula:

θ(l) = N0 · e^(-λl)

wherein N0 = 1.0, λ is a preset threshold, and l is the distance from the current sentence to the starting sentence; and judges whether each sentence and the other sentences contain continuation words or shared named-entity words, so as to calculate the continuity of each sentence with the starting sentence;
the construction module is configured to construct a semantic graph based on the similarity and calculate the importance degree of each sentence according to the semantic graph;
and a second calculation module configured to acquire one or more clustering centers by using a graph clustering algorithm, calculate, for each clustering center, the sum of the similarity, the consistency, the importance, and the distance attenuation, and take the sentences whose sums rank in the top n as a coherent sentence sequence.
7. A computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the method of any of claims 1-5.
8. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 5.
CN202010694673.7A 2020-07-17 2020-07-17 Speech recognition text consistency processing method and device Active CN111832308B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010694673.7A CN111832308B (en) 2020-07-17 2020-07-17 Speech recognition text consistency processing method and device


Publications (2)

Publication Number Publication Date
CN111832308A CN111832308A (en) 2020-10-27
CN111832308B true CN111832308B (en) 2023-09-08

Family

ID=72923612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010694673.7A Active CN111832308B (en) 2020-07-17 2020-07-17 Speech recognition text consistency processing method and device

Country Status (1)

Country Link
CN (1) CN111832308B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597278A (en) * 2020-12-25 2021-04-02 北京知因智慧科技有限公司 Semantic information fusion method and device, electronic equipment and storage medium
CN113011169B (en) * 2021-01-27 2022-11-11 北京字跳网络技术有限公司 Method, device, equipment and medium for processing conference summary
CN113705232B (en) * 2021-03-03 2024-05-07 腾讯科技(深圳)有限公司 Text processing method and device
CN113743125A (en) * 2021-09-07 2021-12-03 广州晓阳智能科技有限公司 Text continuity analysis method and device
CN114611524B (en) * 2022-02-08 2023-11-17 马上消费金融股份有限公司 Text error correction method and device, electronic equipment and storage medium
CN115526173A (en) * 2022-10-12 2022-12-27 湖北大学 Feature word extraction method and system based on computer information technology

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391942A (en) * 2014-11-25 2015-03-04 中国科学院自动化研究所 Short text characteristic expanding method based on semantic atlas
CN107967257A (en) * 2017-11-20 2018-04-27 哈尔滨工业大学 A kind of tandem type composition generation method
CN108197111A (en) * 2018-01-10 2018-06-22 华南理工大学 A kind of text automatic abstracting method based on fusion Semantic Clustering
CN108549637A (en) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 Method for recognizing semantics, device based on phonetic and interactive system
CN110287309A (en) * 2019-06-21 2019-09-27 深圳大学 The method of rapidly extracting text snippet
CN110457466A (en) * 2019-06-28 2019-11-15 谭浩 Generate method, computer readable storage medium and the terminal device of interview report


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Zhihong; Guo Yi. Research on automatic keyword extraction from Chinese patents based on word and sentence importance. Information Studies: Theory & Application (情报理论与实践), 2018, Vol. 41, No. 9, pp. 123-129. *

Also Published As

Publication number Publication date
CN111832308A (en) 2020-10-27

Similar Documents

Publication Publication Date Title
CN111832308B (en) Speech recognition text consistency processing method and device
EP3652733B1 (en) Contextual spoken language understanding in a spoken dialogue system
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
US20210142794A1 (en) Speech processing dialog management
CN109509470B (en) Voice interaction method and device, computer readable storage medium and terminal equipment
CN107016994B (en) Voice recognition method and device
CN108255934B (en) Voice control method and device
CN110516253B (en) Chinese spoken language semantic understanding method and system
US11823678B2 (en) Proactive command framework
US11574637B1 (en) Spoken language understanding models
JP7300435B2 (en) Methods, apparatus, electronics, and computer-readable storage media for voice interaction
CN112530408A (en) Method, apparatus, electronic device, and medium for recognizing speech
WO2020238045A1 (en) Intelligent speech recognition method and apparatus, and computer-readable storage medium
US11398226B1 (en) Complex natural language processing
US11132994B1 (en) Multi-domain dialog state tracking
CN112017643B (en) Speech recognition model training method, speech recognition method and related device
US11043215B2 (en) Method and system for generating textual representation of user spoken utterance
CN113674742B (en) Man-machine interaction method, device, equipment and storage medium
US11990122B2 (en) User-system dialog expansion
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
CN112669842A (en) Man-machine conversation control method, device, computer equipment and storage medium
CN110851650A (en) Comment output method and device and computer storage medium
CN113761268A (en) Playing control method, device, equipment and storage medium of audio program content
US11626107B1 (en) Natural language processing
CN112446219A (en) Chinese request text intention analysis method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Applicant before: AI SPEECH Co.,Ltd.

GR01 Patent grant