WO2022121859A1 - Spoken language information processing method, apparatus and electronic device - Google Patents

Spoken language information processing method, apparatus and electronic device

Info

Publication number
WO2022121859A1
Authority
WO
WIPO (PCT)
Prior art keywords
spoken language
word
information
smooth
sample
Prior art date
Application number
PCT/CN2021/135834
Other languages
English (en)
French (fr)
Inventor
林雨
蒙嘉颖
吴培昊
Original Assignee
北京有竹居网络技术有限公司
Priority date
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2022121859A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F40/253 Grammatical analysis; Style critique
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Definitions

  • the present disclosure relates to the field of Internet technologies, and in particular, to a method, apparatus and electronic device for processing spoken language information.
  • When English is used as a communication language, it is often necessary to process the speaker's spoken language information, either to translate it into text information in another language or to convert it into relatively standardized text information for circulation. In this process, the spoken language information can be deduplicated, after which further processing operations can be carried out by downstream tasks (e.g., grammatical error correction of the spoken language information, extracting phrases for analysis, etc.).
  • Embodiments of the present disclosure provide a method, an apparatus, and an electronic device for processing spoken language information.
  • an embodiment of the present disclosure provides a method for processing spoken language information.
  • the method includes: determining a stem corresponding to each word in the initial spoken language information, and obtaining an initial spoken language stem vector corresponding to the initial spoken language information based on the stems corresponding to the words; determining, according to the initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector, the label corresponding to each word in the initial spoken language information, the label at least including: smooth and non-smooth; and processing the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information.
  • an embodiment of the present disclosure provides an apparatus for processing spoken language information, the apparatus comprising: a determination module, configured to determine a stem corresponding to each word in the initial spoken language information and obtain, based on the stems corresponding to the words, the initial spoken language stem vector corresponding to the initial spoken language information; a labeling module, configured to determine, according to the initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector, the label corresponding to each word in the initial spoken language information, the label at least including: smooth and non-smooth; and a processing module, configured to process the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information.
  • an embodiment of the present disclosure provides an electronic device, including: one or more processors; and a storage device on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the spoken language information processing method described in the first aspect.
  • an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, which, when executed by a processor, implements the steps of the method for processing spoken language information in the first aspect.
  • FIG. 1 is a flowchart of an embodiment of a method for processing spoken language information according to the present disclosure
  • FIG. 2 is a schematic flowchart of an embodiment of training a spoken language processing model according to the present disclosure
  • FIG. 3 is a schematic structural diagram of an embodiment of a spoken language information processing apparatus according to the present disclosure.
  • FIG. 4 is an exemplary system architecture to which the spoken language information processing method according to an embodiment of the present disclosure may be applied;
  • FIG. 5 is a schematic diagram of a basic structure of an electronic device provided according to an embodiment of the present disclosure.
  • the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to".
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • FIG. 1 shows a flowchart of an embodiment of a spoken language information processing method according to the present disclosure.
  • the spoken language information processing method includes the following steps 101 to 103 .
  • Step 101: determine the stem corresponding to each word in the initial spoken language information, and obtain an initial spoken language stem vector corresponding to the initial spoken language information based on the stems corresponding to the words.
  • obtaining the initial spoken language stem vector corresponding to the initial spoken language information based on the stems corresponding to the words includes: obtaining the initial spoken language stem vector corresponding to the initial spoken language information based on the stem corresponding to at least one of the words in the above spoken language information.
  • an initial spoken word stem vector corresponding to the initial spoken word information is obtained based on the stem corresponding to each word.
  • the above-mentioned initial spoken language information may include spoken language text information converted from corresponding spoken language speech information.
  • word segmentation processing may be performed on the spoken voice information to obtain each word contained in the spoken voice information, and then the above spoken text information may be obtained.
  • the technology of converting the spoken voice information into the spoken text information is the prior art, which will not be repeated here.
  • stemming processing may be performed on the initial spoken language information; that is, the stem corresponding to each word in the initial spoken language information can be determined to obtain the corresponding stem information. For example, when the initial spoken language information is "they are workers", the stems corresponding to the words can be "they", "are", and "worker", respectively.
  • the corresponding initial spoken word stem vector can be determined.
  • the vector corresponding to each word can be found in the pre-designed word-vector comparison table A, which can simplify the input operation of the word, so that the spoken language processing model can quickly identify the corresponding word information.
  • the vector corresponding to the word "I" can be the number "1"; the vector corresponding to the word "love" can be the number "2"; the vector corresponding to the word "reading" can be the number "3"; the vector corresponding to the word "read" can be the number "4"; and the vector corresponding to the word "books" can be the number "5".
  • the corresponding initial spoken word stem information can be "I love read read book"
  • the corresponding initial spoken word stem vector can be "12445".
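As a sketch of the two steps above, the toy Python snippet below stems each word with a naive suffix-stripping rule (the disclosure does not specify a stemming algorithm) and looks the stems up in a hypothetical word-vector comparison table A built from the example:

```python
# Toy suffix-stripping stemmer; the disclosure does not name a stemming
# algorithm, so this rule only illustrates the idea.
def stem(word):
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Hypothetical word-vector comparison table A, following the example above.
TABLE_A = {"I": 1, "love": 2, "reading": 3, "read": 4, "books": 5, "book": 5}

def stem_vector(sentence):
    # Stem each word, then look the stem up in table A.
    return [TABLE_A.get(stem(w), 0) for w in sentence.split()]

print(stem_vector("I love reading read books"))  # [1, 2, 4, 4, 5]
```

The stem vector "12445" from the example corresponds to the list [1, 2, 4, 4, 5] here.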
  • Step 102: determine, according to the initial spoken language vector and the initial spoken language stem vector corresponding to the initial spoken language information, the label corresponding to each word in the initial spoken language information; the label at least includes: smooth and non-smooth.
  • the above-mentioned initial spoken language vector may be vector information corresponding to the initial spoken language information.
  • the initial spoken language vector corresponding to the initial spoken language information "I love reading read books” may be "12345".
  • the tags corresponding to each word in the initial spoken language information can be determined by using the above-mentioned initial spoken language vector and the initial spoken language stem vector.
  • the labels here can be used to characterize the state of a word in the initial spoken language information, for example, number vs. non-number, person name vs. non-person name, and so on.
  • the above labels include at least: smooth and non-smooth. That is, through the above-mentioned initial spoken language vector and initial spoken word stem vector, it can be determined whether each word in the initial spoken language information is smooth or not.
  • the above step 102 may include: inputting the initial spoken language vector and the initial spoken word stem vector corresponding to the initial spoken language information into a pre-trained spoken language processing model, and obtaining the corresponding words in the initial spoken language information. Label.
  • the above-mentioned spoken language processing model can be used to determine whether each word is smooth (or repeated), and label each word according to the determination result to obtain a corresponding label.
  • the above-mentioned spoken language processing model may include a sequence labeling model. For example, after inputting the above-mentioned initial spoken language vector "12345" and the above-mentioned initial spoken word stem vector "12445" into the spoken language processing model, if the model's non-smooth label is "1" and its smooth label is "0", then the output labels corresponding to the words in the initial spoken language information "I love reading read books" can be "0", "0", "1", "0", "0".
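The labeling step can be imitated with a simple heuristic standing in for the trained sequence labeling model: mark a word non-smooth ("1") when its stem duplicates the stem of the following word. This is only an illustration; the real model learns the labeling from data and would use both input vectors:

```python
def label_words(spoken_vec, stem_vec):
    # Stand-in for the trained sequence labeling model: a word is labeled
    # non-smooth ("1") when its stem repeats the stem of the next word.
    # The trained model would use both input vectors; this toy rule only
    # inspects the stem vector.
    labels = []
    for i in range(len(stem_vec)):
        repeated = i + 1 < len(stem_vec) and stem_vec[i] == stem_vec[i + 1]
        labels.append("1" if repeated else "0")
    return labels

# "I love reading read books": spoken vector 12345, stem vector 12445.
print(label_words([1, 2, 3, 4, 5], [1, 2, 4, 4, 5]))  # ['0', '0', '1', '0', '0']
```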
  • Step 103: process the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information.
  • After the label corresponding to each word is obtained, whether the corresponding word is smooth can be determined based on the label, and the smooth target spoken language information can then be determined.
  • the above step 103 includes: deleting the words corresponding to the tags marked as non-smooth to obtain the target spoken language information.
  • the initial spoken language information can be post-processed based on the labels corresponding to each word, the words corresponding to the non-smooth labels can be deleted, and then smooth target spoken language information can be obtained.
  • For example, the word "reading" corresponding to the non-smooth label "1" can be deleted, yielding the smooth target spoken language information "I love read books".
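Step 103's post-processing then amounts to dropping every word tagged non-smooth:

```python
def smooth_text(words, labels):
    # Keep only the words whose label is smooth ("0").
    return " ".join(w for w, tag in zip(words, labels) if tag == "0")

words = "I love reading read books".split()
labels = ["0", "0", "1", "0", "0"]
print(smooth_text(words, labels))  # I love read books
```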
  • Typically, the initial spoken language information is input directly, and the corresponding smooth target spoken language information is output through the spoken language processing model.
  • Such non-smooth initial spoken language information mainly comes from people with strong spoken English ability (such as people whose native language is English).
  • The non-smooth initial spoken language information they provide has fewer non-smooth parts, and does not require a spoken language processing model with high recognition accuracy (for example, a bidirectional encoder representation model from transformers) to process it into smooth target spoken language information.
  • In the embodiments, the stems corresponding to the words in the initial spoken language information are determined, and an initial spoken language stem vector corresponding to the initial spoken language information is obtained based on those stems; according to the initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector, the label corresponding to each word in the initial spoken language information is determined, the label at least including: smooth and non-smooth; and the initial spoken language information is processed according to the labels corresponding to the words to obtain smooth target spoken language information.
  • the spoken language processing model includes a first preset spoken language processing model, a second preset spoken language processing model, and a third preset spoken language processing model, and the spoken language processing model is pre-trained based on the following steps:
  • Step 201: construct a training sample set; the training sample set includes multiple pieces of non-smooth sample information.
  • the above non-smooth sample information may be non-smooth spoken English information collected in advance. For example, “Uh so he goes to find go to find the boys”, “but i don't think it's it's a good it's a good idea for you”, “so does uh so does does uh does changsha government”, etc.
  • the collected non-smooth spoken English information can be organized into a data set to obtain the above training sample set.
  • Step 202: for each piece of non-smooth sample information, determine the sample stem corresponding to each sample word in the non-smooth sample information, and obtain the non-smooth sample stem vector corresponding to the non-smooth sample information based on the sample stems corresponding to the sample words.
  • Each non-smooth sample information in the training sample set can be stemmed to obtain the non-smooth sample stem vector corresponding to each non-smooth sample information respectively.
  • the stemming process for obtaining the stem vector in the above step 202 may be the same as or similar to the stemming process described in the step 101 in the embodiment shown in FIG. 1 , and details are not described here.
  • Step 203: use the non-smooth sample vector and the non-smooth sample stem vector corresponding to the non-smooth sample information to train the first preset spoken language processing model and the second preset spoken language processing model, respectively, to convergence.
  • the above-mentioned non-smooth sample vector can be used to train the first preset spoken language processing model, and the above-mentioned non-smooth sample stem vector can be used to train the second preset spoken language processing model, so that both the first and the second preset spoken language processing models converge.
  • Specifically, the non-smooth sample vector can be input into the first preset spoken language processing model, which outputs a predicted label vector of the dimension corresponding to the non-smooth sample vector. The predicted label vector is then compared with the standard label vector corresponding to the non-smooth sample vector to determine the training result of the first preset spoken language processing model; if the two are inconsistent, the standard label vector can be used to improve the first preset spoken language processing model until it converges. For example, the non-smooth sample vector "13578" corresponding to the non-smooth sample information "I like eating eat apples" can be input into the first preset spoken language processing model, which outputs the predicted label vector of the corresponding dimension.
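The comparison between predicted and standard label vectors can be sketched as a simple mismatch count (the actual model would be updated from a differentiable loss; this only illustrates the consistency check described above):

```python
def label_mismatch(predicted, standard):
    # Count positions where the predicted label differs from the standard
    # label; zero mismatches across the samples indicates convergence.
    return sum(p != s for p, s in zip(predicted, standard))

predicted = ["0", "0", "0", "0", "0"]
standard = ["0", "0", "1", "0", "0"]  # one word is labeled non-smooth
print(label_mismatch(predicted, standard))  # 1
```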
  • the first preset spoken language processing model and the second preset spoken language processing model may include, for example, a bidirectional encoder representation model from transformers (Bidirectional Encoder Representations from Transformers, BERT for short).
  • its model encoding layer can output an encoding vector of a preset dimension (for example, an encoding vector of dimension B*L*D1, where B can be regarded as the number of pieces of sample information used to train the first preset spoken language processing model, L can be regarded as the number of words in the sample information, and D1 can be regarded as a hyperparameter set in advance based on experience); the model prediction layer can then output, for the sample vector, a prediction vector of a preset dimension (for example, a prediction label of dimension B*L*K, where K can be regarded as the number of label types, and the probability that a word belongs to each type can be predicted based on the prediction label of this dimension).
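The prediction layer's B*L*K output can be mapped to per-word label probabilities with a softmax over the K label types; a minimal pure-Python sketch (shapes and values are illustrative, not the patent's implementation):

```python
import math

def label_probs(logits):
    # Softmax over the K label types for every word: B*L*K in, B*L*K out.
    result = []
    for seq in logits:               # batch dimension B
        seq_probs = []
        for word_logits in seq:      # word dimension L
            exps = [math.exp(x) for x in word_logits]
            total = sum(exps)
            seq_probs.append([e / total for e in exps])
        result.append(seq_probs)
    return result

# B=1 sample, L=1 word, K=2 label types (smooth / non-smooth).
probs = label_probs([[[2.0, 0.0]]])
print(round(probs[0][0][0], 3))  # 0.881
```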
  • Step 204: splice the output label vectors of the converged first and second preset spoken language processing models according to preset rules, and use the spliced combined vector as the input of the third preset spoken language processing model; the third preset spoken language processing model is trained to convergence to obtain the spoken language processing model.
  • the two preset spoken language processing models may output encoding vectors.
  • the two encoding vectors for the same piece of non-smooth sample information may be spliced along the corresponding dimensions according to preset rules to obtain a combined vector.
  • the combination vector can be used to train the third preset spoken language processing model to convergence.
  • the first preset spoken language processing model can output "1*5*512" latitude
  • the second preset spoken language processing model can output the second encoding vector of "1*5*1024" latitude.
  • the first coding vector and the second coding vector can be spliced to obtain a combined vector of "1*5*(512+1024)" latitude.
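The splicing can be sketched in pure Python as concatenation along the last (feature) axis, with shapes taken from the example above (a tensor library's concatenate along the last axis would do the same):

```python
def splice(first_enc, second_enc):
    # Concatenate the two encoders' outputs along the last (feature) axis.
    return [
        [f_word + s_word for f_word, s_word in zip(f_seq, s_seq)]
        for f_seq, s_seq in zip(first_enc, second_enc)
    ]

# Shapes from the example: 1*5*512 and 1*5*1024 -> 1*5*(512+1024).
first_enc = [[[0.0] * 512 for _ in range(5)]]
second_enc = [[[0.0] * 1024 for _ in range(5)]]
combined = splice(first_enc, second_enc)
print(len(combined), len(combined[0]), len(combined[0][0]))  # 1 5 1536
```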
  • the third preset spoken language processing model here may be, for example, a convolutional neural network (Convolutional Neural Networks, CNN for short), a long short-term memory network (Long Short-Term Memory, LSTM for short), a transformer block, or the like. It should be noted that the working principles of the above-mentioned CNN, LSTM, and transformer block are known in the art and will not be repeated here.
  • the first preset spoken language processing model and the second preset spoken language processing model can be fused, which can reduce the spoken language processing model's dependence on exactly repeated words and helps identify repeated words with morphological changes (for example, the words "interesting" and "interested").
  • constructing a training sample set includes the following sub-steps:
  • the smooth spoken sample information can be collected in advance, and by adding noise to the spoken sample information, more complex non-smooth sample information covering more scenarios can be obtained.
  • the original morphological word corresponding to each sample word in the smooth sample information is searched in a preset lexicon; the preset lexicon includes the original morphological word corresponding to each sample word.
  • the above-mentioned original form words may include, for example, the initial-form words corresponding to sample words in adverb form, noun form, adjective form, and the like.
  • the original morphological words corresponding to the morphological words "would", "does", and "did" may all be "do".
  • the above preset thesaurus stores the original form words corresponding to each sample word respectively.
  • When the original form word is to be obtained, it can be searched for in the preset thesaurus. For example, for the above morphological words "would", "does", and "did", the corresponding original form word "do" can be found in the preset thesaurus.
  • Sub-step 2013: determine the position, in the smooth sample information, of the sample word corresponding to the found original form word.
  • the position where repeated words can be inserted can be determined according to the position of the sample word in the smooth sample information. For example, for the smooth sample information "would you pass me a cup of tea", the original form word "do" corresponding to the sample word "would" can be found in the preset thesaurus, and it can then be determined that the position corresponding to the sample word "would" is the first position of the sample information.
  • Sub-step 2014: taking the position as the starting position and the sample word as the starting word, insert sample words of a preset repetition length a preset number of times.
  • That is, the sample word can be used as the starting word, a span of sample words of the preset repetition length can be selected from the starting position, and the span can be inserted repeatedly the preset number of times. For example, when the preset repetition length is 3 and the preset number of repetitions is 1, the position of the sample word "would" in the sample information "would you pass me a cup of tea" can be used as the starting position and the sample word "would" as the starting word, obtaining the non-smooth sample information "would you pass would you pass me a cup of tea".
  • the preset repetition length and the preset number of repetitions can be set randomly, so as to increase the authenticity of the constructed spoken language information.
  • non-smooth parts based on grammatical features and part-of-speech features can thus be added to the smooth sample information, and non-smooth sample information can be constructed, providing more realistic and diverse training sample information for training the spoken language processing model.
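Sub-step 2014 can be sketched as inserting extra copies of a fixed-length word span at the determined position, reproducing the "would you pass" example:

```python
def insert_repetition(words, start, rep_len, rep_times):
    # Insert `rep_times` extra copies of the `rep_len`-word span that
    # begins at `start`, so the span appears repeated before the original.
    span = words[start:start + rep_len]
    return words[:start] + span * rep_times + words[start:]

sample = "would you pass me a cup of tea".split()
noisy = insert_repetition(sample, start=0, rep_len=3, rep_times=1)
print(" ".join(noisy))  # would you pass would you pass me a cup of tea
```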
  • constructing a training sample set includes: acquiring smooth sample information; and randomly inserting at least one repeated word into the smooth sample information to obtain non-smooth sample information, the repeated word being the initial sample word at the insertion position.
  • the initial sample words can be randomly inserted into the smooth sample information as repeated words to obtain non-smooth sample information.
  • the initial sample word "you” can be inserted into the position corresponding to the sample word “you” of the smooth sample information "would you pass me a cup of tea” to obtain the corresponding non-smooth sample "would you you pass me a cup of tea” of tea”; two initial sample words “a” can also be inserted at the position corresponding to the sample word "a” to obtain the corresponding non-smooth sample "would you pass me a a a cup of tea”.
  • the number of repetitions of the repeated word may be 1, or may be 2 or 3, which is not limited here.
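A sketch of the random repeated-word insertion, assuming the word at a randomly chosen position is simply duplicated in place (the insertion policy here is an illustrative assumption):

```python
import random

def insert_repeated_word(words, times=1, rng=random):
    # Duplicate a randomly chosen word `times` times right after itself.
    i = rng.randrange(len(words))
    return words[:i + 1] + [words[i]] * times + words[i + 1:]

rng = random.Random(42)  # seeded for repeatability
sample = "would you pass me a cup of tea".split()
noisy = insert_repeated_word(sample, times=2, rng=rng)
print(" ".join(noisy))
```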
  • constructing a training sample set includes: obtaining smooth sample information; randomly inserting tone words into the smooth sample information to obtain non-smooth sample information.
  • mood words can be randomly inserted into the smooth sample information to obtain non-smooth sample information.
  • the mood word "uh" can be randomly added to the smooth sample information "so he go to find the boys" to obtain "uh so he go to find the boys", "so he uh go to find the boys", "so he go to find uh the boys", and other non-smooth sample information.
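The mood-word insertion can be sketched similarly; the filler list below is an illustrative assumption, not an exhaustive set from the disclosure:

```python
import random

FILLERS = ["uh", "um", "er"]  # hypothetical mood-word list

def insert_filler(words, rng=random):
    # Insert one randomly chosen mood word at a random position.
    i = rng.randrange(len(words) + 1)
    return words[:i] + [rng.choice(FILLERS)] + words[i:]

rng = random.Random(7)  # seeded for repeatability
sample = "so he go to find the boys".split()
print(" ".join(insert_filler(sample, rng=rng)))
```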
  • FIG. 3 shows a schematic structural diagram of an embodiment of a spoken language information processing apparatus according to the present disclosure.
  • the determination module 301 is configured to determine the stem corresponding to each word in the initial spoken language information and obtain, based on the stems corresponding to the words, the initial spoken language stem vector corresponding to the initial spoken language information; the labeling module 302 is configured to determine, according to the initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector, the label corresponding to each word in the initial spoken language information, the label at least including: smooth and non-smooth; the processing module 303 is configured to process the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information.
  • the labeling module 302 is further configured to: input the initial spoken language vector and the initial spoken word stem vector corresponding to the initial spoken language information into the pre-trained spoken language processing model, and obtain the corresponding initial spoken language information The label corresponding to each word in .
  • the spoken language processing model includes a first preset spoken language processing model, a second preset spoken language processing model, and a third preset spoken language processing model
  • the spoken language processing model is pre-trained based on the following steps: construct a training sample set, the training sample set including multiple pieces of non-smooth sample information; for each piece of non-smooth sample information, determine the sample stem corresponding to each sample word in the non-smooth sample information, and obtain the non-smooth sample stem vector corresponding to the non-smooth sample information based on the sample stems corresponding to the sample words; use the non-smooth sample vector and the non-smooth sample stem vector corresponding to the non-smooth sample information to train the first preset spoken language processing model and the second preset spoken language processing model, respectively, to convergence; and splice the output label vectors of the converged first and second preset spoken language processing models according to preset rules, using the spliced combined vector as the input of the third preset spoken language processing model, which is trained to convergence to obtain the spoken language processing model.
  • constructing a training sample set includes: acquiring smooth sample information; searching, in a preset thesaurus, for the original form word corresponding to each sample word in the smooth sample information, the preset thesaurus including the original form words corresponding to the sample words; determining the position of the sample word corresponding to the found original form word in the smooth sample information; and, taking the position as the starting position and the sample word as the starting word, inserting sample words of a preset repetition length a preset number of times.
  • constructing a training sample set includes: acquiring smooth sample information; and randomly inserting at least one repeated word into the smooth sample information to obtain non-smooth sample information, the repeated word being the initial sample word at the insertion position.
  • constructing a training sample set includes: acquiring smooth sample information; randomly inserting tone words into the smooth sample information to obtain non-smooth sample information.
  • the processing module 303 is further configured to: delete the words corresponding to the tags marked as non-smooth to obtain the target spoken language information.
  • FIG. 4 shows an exemplary system architecture to which the spoken language information processing method according to an embodiment of the present disclosure may be applied.
  • the system architecture may include terminal devices 401 , 402 , and 403 , a network 404 , and a server 405 .
  • the network 404 is a medium used to provide a communication link between the terminal devices 401 , 402 , 403 and the server 405 .
  • the network 404 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the above-mentioned terminal devices and server can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
  • the terminal devices 401, 402, and 403 can interact with the server 405 through the network 404 to receive or send messages and the like.
  • Various client applications may be installed on the terminal devices 401 , 402 and 403 , such as video publishing applications, search applications, and news information applications.
  • the terminal devices 401, 402, and 403 may be hardware or software.
  • the terminal devices 401, 402, and 403 can be various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
  • When the terminal devices 401, 402, and 403 are software, they can be installed in the electronic devices listed above, and can be implemented as multiple pieces of software or software modules (e.g., software or software modules for providing distributed services) or as a single piece of software or software module, which is not specifically limited here.
  • the server 405 may be a server capable of providing various services, for example, receiving a processing request sent by the terminal devices 401, 402, and 403 for determining the stem corresponding to each word in the initial spoken language information, analyzing and processing the request, and sending the processing result (for example, the stem corresponding to each word for the above processing request) to the terminal devices 401, 402, and 403.
  • the spoken language information processing method provided by the embodiments of the present disclosure may be executed by a server or a terminal device, and correspondingly, the spoken language information processing apparatus may be set in the server or in the terminal device.
  • The numbers of terminal devices, networks, and servers in FIG. 4 are merely illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.
  • Referring now to FIG. 5, which shows a schematic structural diagram of an electronic device (e.g., the server or terminal device in FIG. 4) suitable for implementing an embodiment of the present disclosure.
  • the electronic device shown in FIG. 5 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • The electronic device may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 501, which can execute various appropriate operations and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503.
  • The RAM 503 also stores various programs and data required for the operation of the electronic device.
  • the processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504.
  • An input/output (I/O) interface 505 is also connected to bus 504 .
  • The following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 507 including, for example, a liquid crystal display (LCD), speakers, and vibrators; and storage devices 508 including, for example, magnetic tape and hard disks.
  • Communication means 509 may allow the electronic device to communicate wirelessly or by wire with other devices to exchange data. While Figure 5 shows an electronic device having various means, it should be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 509, or from the storage device 508, or from the ROM 502.
  • When the computer program is executed by the processing device 501, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • The computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, they cause the electronic device to: determine the stem corresponding to each word in initial spoken language information, and obtain, based on the stems corresponding to the words, an initial spoken language stem vector corresponding to the initial spoken language information; determine, according to an initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector, labels corresponding to the words in the initial spoken language information, the labels including at least: smooth and non-smooth; and process the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information.
  • Computer program code for performing operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • The remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • Each block in the flowcharts or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments of the present disclosure may be implemented in a software manner, and may also be implemented in a hardware manner.
  • The name of a module does not in some cases constitute a limitation of the unit itself; for example, the determination module 301 can also be described as "a module that determines the stem corresponding to each word in initial spoken language information and obtains, based on the stems corresponding to the words, an initial spoken language stem vector corresponding to the initial spoken language information".
  • Exemplary types of hardware logic components that can be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), fiber optics, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • The spoken language information processing method includes: determining the stem corresponding to each word in initial spoken language information, and obtaining, based on the stems corresponding to the words, an initial spoken language stem vector corresponding to the initial spoken language information; determining, according to an initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector, labels corresponding to the words in the initial spoken language information, the labels including at least: smooth and non-smooth; and processing the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information.
  • Determining the labels corresponding to the words in the initial spoken language information according to the initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector includes: inputting the initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector into a pre-trained spoken language processing model to obtain the labels corresponding to the words in the initial spoken language information.
  • The spoken language processing model includes a first preset spoken language processing model, a second preset spoken language processing model, and a third preset spoken language processing model, and the spoken language processing model is pre-trained based on the following steps: constructing a training sample set, the training sample set including multiple pieces of non-smooth sample information; for each piece of non-smooth sample information, determining the sample stem corresponding to each sample word in the non-smooth sample information, and obtaining, based on the sample stems corresponding to the sample words, the non-smooth sample stem vector corresponding to the non-smooth sample information; training the first and second preset spoken language processing models to convergence using, respectively, the non-smooth sample vector and the non-smooth sample stem vector corresponding to the non-smooth sample information; and splicing the output label vectors of the converged first and second preset spoken language processing models according to a preset rule, using the spliced combined vector as the input of the third preset spoken language processing model, and training the third preset spoken language processing model to convergence to obtain the spoken language processing model.
  • Constructing the training sample set includes: acquiring smooth sample information; searching a preset lexicon for the original-form word corresponding to each sample word in the smooth sample information, the preset lexicon including the original-form words corresponding to the sample words; determining the position, in the smooth sample information, of the sample word corresponding to the found original-form word; and, taking the position as the starting position and the sample word as the starting word, inserting multiple sample words of a preset repetition length and a preset number of repetitions.
  • Constructing the training sample set includes: acquiring smooth sample information; and randomly inserting at least one repeated word into the smooth sample information to obtain the non-smooth sample information, where the repeated word includes the original sample word at the insertion position.
  • Constructing the training sample set includes: acquiring smooth sample information; and randomly inserting filler words into the smooth sample information to obtain the non-smooth sample information.
  • Processing the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information includes: deleting the words corresponding to labels marked as non-smooth to obtain the target spoken language information.
  • The spoken language information processing apparatus includes: a determination module configured to determine the stem corresponding to each word in initial spoken language information and obtain, based on the stems corresponding to the words, an initial spoken language stem vector corresponding to the initial spoken language information; a labeling module configured to determine, according to an initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector, labels corresponding to the words in the initial spoken language information, the labels including at least: smooth and non-smooth; and a processing module configured to process the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information.
  • The labeling module 302 is further configured to: input the initial spoken language vector and the initial spoken language stem vector corresponding to the initial spoken language information into the pre-trained spoken language processing model to obtain the labels corresponding to the words in the initial spoken language information.
  • The spoken language processing model includes a first preset spoken language processing model, a second preset spoken language processing model, and a third preset spoken language processing model, and the spoken language processing model is pre-trained based on the following steps: constructing a training sample set, the training sample set including multiple pieces of non-smooth sample information; for each piece of non-smooth sample information, determining the sample stem corresponding to each sample word and obtaining, based on the sample stems, the non-smooth sample stem vector corresponding to the non-smooth sample information; training the first and second preset spoken language processing models to convergence using, respectively, the non-smooth sample vector and the non-smooth sample stem vector corresponding to the non-smooth sample information; and splicing the output label vectors of the converged first and second preset spoken language processing models according to a preset rule, using the spliced combined vector as the input of the third preset spoken language processing model, and training it to convergence to obtain the spoken language processing model.
  • Constructing the training sample set includes: acquiring smooth sample information; searching a preset lexicon for the original-form word corresponding to each sample word in the smooth sample information, the preset lexicon including the original-form words corresponding to the sample words; determining the position, in the smooth sample information, of the sample word corresponding to the found original-form word; and, taking the position as the starting position and the sample word as the starting word, inserting multiple sample words of a preset repetition length and a preset number of repetitions.
  • Constructing the training sample set includes: acquiring smooth sample information; and randomly inserting at least one repeated word into the smooth sample information to obtain non-smooth sample information, where the repeated word includes the original sample word at the insertion position.
  • Constructing the training sample set includes: acquiring smooth sample information; and randomly inserting filler words into the smooth sample information to obtain non-smooth sample information.
  • The processing module 303 is further configured to: delete the words corresponding to labels marked as non-smooth to obtain the target spoken language information.

Abstract

Embodiments of the present disclosure disclose a spoken language information processing method and apparatus, and an electronic device. A specific embodiment of the method includes: determining the stem corresponding to each word in initial spoken language information, and obtaining, based on the stems corresponding to the words, an initial spoken language stem vector corresponding to the initial spoken language information; determining, according to an initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector, labels corresponding to the words in the initial spoken language information, the labels including at least: smooth and non-smooth; and processing the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information. The initial spoken language information can be processed on the basis of the initial spoken language vector and the initial spoken language stem vector, which facilitates deduplication of the initial spoken language information and yields smooth target spoken language information.

Description

Spoken Language Information Processing Method and Apparatus, and Electronic Device
Cross-Reference to Related Application
This application claims priority to Chinese Patent Application No. 202011461385.3, filed on December 8, 2020 and entitled "Spoken Language Information Processing Method and Apparatus, and Electronic Device", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of Internet technologies, and in particular, to a spoken language information processing method and apparatus, and an electronic device.
Background
When English is used as the language of communication, a speaker's spoken language information often needs to be processed, for example to be translated into text in another language or converted into relatively standard text for circulation. In this process, the spoken language information can be deduplicated, so that downstream tasks (e.g., grammatical error correction of the spoken language information, extraction of short sentences for analysis, etc.) can further process it.
Summary
This summary is provided to introduce concepts in a brief form that are described in detail in the detailed description that follows. This summary is not intended to identify key or essential features of the claimed technical solutions, nor is it intended to limit the scope of the claimed technical solutions.
Embodiments of the present disclosure provide a spoken language information processing method and apparatus, and an electronic device.
In a first aspect, an embodiment of the present disclosure provides a spoken language information processing method, including: determining the stem corresponding to each word in initial spoken language information, and obtaining, based on the stems corresponding to the words, an initial spoken language stem vector corresponding to the initial spoken language information; determining, according to an initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector, labels corresponding to the words in the initial spoken language information, the labels including at least: smooth and non-smooth; and processing the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information.
In a second aspect, an embodiment of the present disclosure provides a spoken language information processing apparatus, including: a determination module configured to determine the stem corresponding to each word in initial spoken language information and obtain, based on the stems corresponding to the words, an initial spoken language stem vector corresponding to the initial spoken language information; a labeling module configured to determine, according to an initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector, labels corresponding to the words in the initial spoken language information, the labels including at least: smooth and non-smooth; and a processing module configured to process the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; and a storage device storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the spoken language information processing method according to the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable medium storing a computer program that, when executed by a processor, implements the steps of the spoken language information processing method according to the first aspect.
Brief Description of the Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.
FIG. 1 is a flowchart of an embodiment of a spoken language information processing method according to the present disclosure;
FIG. 2 is a schematic flowchart of an embodiment of training a spoken language processing model according to the present disclosure;
FIG. 3 is a schematic structural diagram of an embodiment of a spoken language information processing apparatus according to the present disclosure;
FIG. 4 is an exemplary system architecture to which the spoken language information processing method of an embodiment of the present disclosure can be applied;
FIG. 5 is a schematic diagram of the basic structure of an electronic device provided according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although some embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the steps recited in the method embodiments of the present disclosure may be performed in different orders and/or in parallel. In addition, the method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
As used herein, the term "include" and its variants are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different apparatuses, modules, or units, and are not used to limit the order of the functions performed by these apparatuses, modules, or units or their interdependence.
It should be noted that the modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of messages or information exchanged between multiple apparatuses in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
It should be noted that, in the case of no conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other.
Referring to FIG. 1, which shows a flowchart of an embodiment of a spoken language information processing method according to the present disclosure. As shown in FIG. 1, the spoken language information processing method includes the following steps 101 to 103.
Step 101: determine the stem corresponding to each word in initial spoken language information, and obtain, based on the stems corresponding to the words, an initial spoken language stem vector corresponding to the initial spoken language information.
Here, obtaining the initial spoken language stem vector corresponding to the initial spoken language information based on the stems corresponding to the words includes: obtaining the initial spoken language stem vector corresponding to the initial spoken language information based on the stem corresponding to at least one of the words in the spoken language information. Preferably, the initial spoken language stem vector corresponding to the initial spoken language information is obtained based on the stems corresponding to all the words.
The initial spoken language information may include spoken text information converted from corresponding spoken speech information. In some application scenarios, after the spoken speech information is collected, it can be segmented into the individual words it contains, from which the spoken text information can be obtained. The technique of converting spoken speech information into spoken text information is known in the art and is not described here.
After the initial spoken language information is obtained, it can be stemmed. That is, the stem corresponding to each word in the initial spoken language information can be determined to obtain the corresponding stem information. For example, when the initial spoken language information is "they are workers", the stem information corresponding to the words may be "they", "are", and "worker", respectively.
After the stems corresponding to the words are obtained, the corresponding initial spoken language stem vector can be determined. In some application scenarios, the vector corresponding to each word can be looked up in a pre-designed word-vector lookup table A, which simplifies word input and allows the spoken language processing model to recognize the corresponding word information more quickly. For example, the vector corresponding to the word "I" may be the number "1"; the word "love" may correspond to "2"; "reading" to "3"; "read" to "4"; and "books" to "5". Thus, when the initial spoken language information is "I love reading read books", the corresponding initial spoken language stem information may be "I love read read book", and the corresponding initial spoken language stem vector may be "12445".
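The stemming and lookup-table mapping described above can be sketched as follows. The toy suffix rules, the table `TABLE_A`, and the fallback for stems missing from the table are illustrative assumptions, not the disclosure's actual stemmer or vocabulary:

```python
# Word-to-id lookup table A from the running example (an assumed mapping).
TABLE_A = {"I": 1, "love": 2, "reading": 3, "read": 4, "books": 5}

def stem(word):
    # Toy stemmer: strip a couple of common English suffixes.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def stem_vector(sentence, table):
    # Map each word's stem to its id; if the stem is absent from the
    # table, fall back to the surface word's id (an assumption).
    return [table.get(stem(w), table[w]) for w in sentence.split()]

print(stem_vector("I love reading read books", TABLE_A))  # [1, 2, 4, 4, 5]
```

This reproduces the example's stem vector "12445" alongside the word vector "12345".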
Step 102: determine, according to an initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector, labels corresponding to the words in the initial spoken language information; the labels include at least: smooth and non-smooth.
The initial spoken language vector may be the vector information corresponding to the initial spoken language information. For example, based on the word-vector lookup table A above, the initial spoken language vector corresponding to the initial spoken language information "I love reading read books" may be "12345".
The initial spoken language vector and the initial spoken language stem vector can be used to determine the label corresponding to each word in the initial spoken language information. A label here can characterize the state of a word in the initial spoken language information, e.g., numeric vs. non-numeric, person name vs. non-person-name, etc. In the present disclosure, the labels include at least: smooth and non-smooth. That is, the initial spoken language vector and the initial spoken language stem vector can be used to determine whether each word in the initial spoken language information is smooth.
In some optional implementations, step 102 may include: inputting the initial spoken language vector and the initial spoken language stem vector corresponding to the initial spoken language information into a pre-trained spoken language processing model to obtain the labels corresponding to the words in the initial spoken language information.
That is, the spoken language processing model can be used to determine whether each word is smooth (or whether it is repeated) and to annotate each word according to the determination result to obtain the corresponding label. Accordingly, the spoken language processing model may include a sequence labeling model. For example, after the initial spoken language vector "12345" and the initial spoken language stem vector "12445" are input into the spoken language processing model, if the model's non-smooth label is "1" and its smooth label is "0", the output labels corresponding to the words in the initial spoken language information "I love reading read books" may be "0", "0", "1", "0", "0".
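As a rough illustration of the labeling step, the rule below tags a word as non-smooth ("1") when its stem duplicates the stem of the word that follows. This hand-written rule is only a stand-in that reproduces the running example; the disclosure's actual labeler is a trained sequence labeling model, not a rule:

```python
def label_disfluencies(stems):
    # "1" = non-smooth, "0" = smooth, as in the example above.
    labels = []
    for i, s in enumerate(stems):
        repeated = i + 1 < len(stems) and stems[i + 1] == s
        labels.append(1 if repeated else 0)
    return labels

# Stems of "I love reading read books":
print(label_disfluencies(["I", "love", "read", "read", "book"]))  # [0, 0, 1, 0, 0]
```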
Step 103: process the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information.
After the labels corresponding to the words are obtained, whether each corresponding word is smooth can be judged based on the labels, and the smooth target spoken language information can then be determined.
In some optional implementations, step 103 includes: deleting the words corresponding to labels marked as non-smooth to obtain the target spoken language information.
That is, the initial spoken language information can be post-processed based on the labels corresponding to the words: the words corresponding to non-smooth labels are deleted, yielding the smooth target spoken language information. For example, for the output labels "0", "0", "1", "0", "0" corresponding to the words in the initial spoken language information "I love reading read books" above, the word "reading" corresponding to the non-smooth label "1" can be deleted, yielding the smooth target spoken language information "I love read books".
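The post-processing in step 103 amounts to a simple filter over the label sequence; a minimal sketch:

```python
def remove_nonsmooth(words, labels):
    # Keep only the words whose label is smooth (0).
    return " ".join(w for w, label in zip(words, labels) if label == 0)

words = "I love reading read books".split()
print(remove_nonsmooth(words, [0, 0, 1, 0, 0]))  # I love read books
```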
In the related art, in order to process non-smooth initial spoken language information into a smooth state, the initial spoken language information is usually input directly, and a spoken language processing model outputs the corresponding smooth target spoken language information. However, such non-smooth initial spoken language information mainly comes from people with strong spoken-English skills (e.g., native English speakers). The non-smooth portions in the initial spoken language information they provide are few, and a spoken language processing model of high recognition accuracy (e.g., a Bidirectional Encoder Representations from Transformers model) is not needed to process it into smooth target spoken language information. For the initial spoken language information provided by people whose spoken-English skills are weaker, however, it is difficult to obtain smooth target spoken language information if the recognition accuracy is low.
In this embodiment, the stem corresponding to each word in the initial spoken language information is determined, and the initial spoken language stem vector corresponding to the initial spoken language information is obtained based on the stems corresponding to the words; the labels corresponding to the words in the initial spoken language information are determined according to the initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector, the labels including at least: smooth and non-smooth; and the initial spoken language information is processed according to the labels corresponding to the words to obtain smooth target spoken language information.
In some optional implementations, the spoken language processing model includes a first preset spoken language processing model, a second preset spoken language processing model, and a third preset spoken language processing model, and the spoken language processing model is pre-trained based on the following steps.
Step 201: construct a training sample set; the training sample set includes multiple pieces of non-smooth sample information.
The non-smooth sample information may be non-smooth spoken English collected in advance, e.g., "Uh so he goes to find go to find the boys", "but i don't think it's it's a good it's a good idea for you", "so does uh so does does uh does changsha government", etc.
In practice, the collected non-smooth spoken English can be organized into a data set to obtain the training sample set.
Step 202: for each piece of non-smooth sample information, determine the sample stem corresponding to each sample word in the non-smooth sample information, and obtain, based on the sample stems corresponding to the sample words, the non-smooth sample stem vector corresponding to the non-smooth sample information.
Each piece of non-smooth sample information in the training sample set can be stemmed to obtain the corresponding non-smooth sample stem vector. In this embodiment, the stemming used in step 202 to obtain the stem vector may be the same as or similar to the stemming described in step 101 of the embodiment shown in FIG. 1, and is not repeated here.
Step 203: train the first preset spoken language processing model and the second preset spoken language processing model to convergence using, respectively, the non-smooth sample vector and the non-smooth sample stem vector corresponding to the non-smooth sample information.
That is, the non-smooth sample vector can be used to train the first preset spoken language processing model, and the non-smooth sample stem vector can be used to train the second preset model, so that both converge. In some application scenarios, the non-smooth sample vector can be input into the first preset spoken language processing model, which outputs a predicted label vector of the dimension corresponding to the non-smooth sample vector; the predicted label vector can then be compared with the standard label vector corresponding to the non-smooth sample vector to assess the training result of the first preset spoken language processing model. If the two are inconsistent, the standard label vector can be used to refine the first preset spoken language processing model. Training on multiple non-smooth sample vectors allows the first preset spoken language processing model to converge. For example, the non-smooth sample vector "13578" corresponding to the non-smooth sample information "I like eating eat apples" can be input into the first preset spoken language processing model, which outputs a predicted label vector of the corresponding dimension; when the predicted label vector is consistent with the standard label vector "00100", the first preset spoken language processing model can be regarded as converged. Similarly, the training of the second preset spoken language processing model can follow the training process of the first and is not repeated here. It should be noted that the first and second preset spoken language processing models may, for example, include Bidirectional Encoder Representations from Transformers (BERT) models. When a BERT model processes the corresponding sample vector, its encoding layer can output an encoding vector of a preset dimension (e.g., an encoding of dimension B*L*D1, where B can be regarded as the number of samples used to train the first or second preset spoken language processing model, L as the number of words in the sample information, and D1 as a hyperparameter set in advance based on experience), and its prediction layer can make a prediction on the sample vector and output a prediction vector of a preset dimension (e.g., a predicted label of dimension B*L*K, where K can be regarded as the number of label classes; based on this predicted label, the probability that a word belongs to each class can be predicted). The working principle of the BERT model is known in the art and is not described here.
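The convergence criterion in step 203 can be sketched as a comparison between predicted and standard label vectors. The `predict` function below is a placeholder assumption for the trained model; the sample vector/label pair is the example above:

```python
def is_converged(predict, samples):
    # The model counts as converged once its predicted label vector
    # matches the standard (gold) label vector for every sample.
    return all(predict(vec) == gold for vec, gold in samples)

# "I like eating eat apples" -> sample vector "13578", standard labels "00100".
samples = [([1, 3, 5, 7, 8], [0, 0, 1, 0, 0])]

def predict(vec):
    # Stand-in for the trained model's prediction layer.
    return [0, 0, 1, 0, 0]

print(is_converged(predict, samples))  # True
```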
Step 204: splice the output label vectors of the converged first and second preset spoken language processing models according to a preset rule, use the spliced combined vector as the input of the third preset spoken language processing model, and train the third preset spoken language processing model to convergence to obtain the spoken language processing model.
After both the first and second preset spoken language processing models have converged, the two models can output encoding vectors. In some application scenarios, the two models' encoding vectors for the same piece of non-smooth sample information can be spliced along the corresponding dimension according to a preset rule to obtain a combined vector, and the combined vector can be used to train the third preset spoken language processing model to convergence. For example, for a first preset spoken language processing model with a sample count of 1, if the non-smooth sample information is "I like eating eat apples", the first preset spoken language processing model can output a first encoding vector of dimension "1*5*512", and the second can output a second encoding vector of dimension "1*5*1024". The first and second encoding vectors can then be spliced to obtain a combined vector of dimension "1*5*(512+1024)". The combined vector can be used to train the third preset spoken language processing model, obtaining a corresponding predicted label vector; when this predicted label vector is consistent with the true label vector "00100", the third preset spoken language processing model can be regarded as converged, and the target spoken language processing model is obtained. The third preset spoken language processing model may be, for example, a Convolutional Neural Network (CNN), a Long Short-Term Memory (LSTM) network, or a transformer block. The working principles of CNNs, LSTMs, and transformer blocks are known in the art and are not described here.
Through steps 201 to 204, the first and second preset spoken language processing models can be fused, which reduces the spoken language processing model's reliance on exact word repetition and helps it recognize repeated words with morphological variation (e.g., the words "interesting" and "interested").
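The splicing in step 204 can be sketched at the shape level with plain lists; the widths 512 and 1024 are the example encoder dimensions, and the zero-filled encodings are placeholders for real model outputs:

```python
B, L, D1, D2 = 1, 5, 512, 1024  # batch, sequence length, encoder widths

# Placeholder encodings of shape B*L*D1 and B*L*D2 from the two encoders.
enc_words = [[[0.0] * D1 for _ in range(L)] for _ in range(B)]
enc_stems = [[[0.0] * D2 for _ in range(L)] for _ in range(B)]

# Concatenate along the feature dimension: B*L*(D1+D2) = 1*5*1536,
# the input to the third preset model.
combined = [[enc_words[b][t] + enc_stems[b][t] for t in range(L)] for b in range(B)]

print(len(combined), len(combined[0]), len(combined[0][0]))  # 1 5 1536
```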
In some optional implementations, constructing the training sample set includes the following sub-steps.
Sub-step 2011: acquire smooth sample information.
Smooth spoken sample information can be collected in advance, and noise can be added to it to obtain more complex non-smooth sample information covering more scenarios.
Sub-step 2012: search a preset lexicon for the original-form word corresponding to each sample word in the smooth sample information; the preset lexicon includes the original-form words corresponding to the sample words.
The original-form words may include, for example, the words corresponding to the original form of sample words in adverb form, noun form, adjective form, etc. For example, the original-form word corresponding to each of the inflected words "would", "does", and "did" may be "do". The preset lexicon stores the original-form word corresponding to each sample word.
Each sample word can then be looked up in the preset lexicon. For example, for the inflected words "would", "does", and "did" above, the corresponding original-form word "do" can be found in the preset lexicon.
Sub-step 2013: determine the position, in the smooth sample information, of the sample word corresponding to the found original-form word.
After the original-form word corresponding to a sample word is found, the position at which that sample word can be inserted as a repeated word can be determined from its position in the smooth sample information. For example, for the smooth sample information "would you pass me a cup of tea", the original-form word "do" corresponding to the sample word "would" can be found in the preset lexicon, and it can then be determined that the position corresponding to the sample word "would" is the beginning of the sample information.
Sub-step 2014: taking the position as the starting position and the sample word as the starting word, insert multiple sample words of a preset repetition length and a preset number of repetitions.
After the position of the sample word corresponding to the original-form word is determined, sample words of the preset repetition length can be selected, with that sample word as the starting word, and repeated in sequence the preset number of times from the starting position. For example, when the preset repetition length is 3 and the preset number of repetitions is 1, the position of the sample word "would" in the sample information "would you pass me a cup of tea" can be taken as the starting position and "would" as the starting word, yielding the non-smooth sample information "Would you pass Would you pass me a cup of tea". Here, the preset repetition length and the preset number of repetitions can be set randomly to increase the realism of the initial spoken language information.
Through sub-steps 2011 to 2014, non-smooth portions based on grammatical and part-of-speech features can be added to the smooth sample information to construct non-smooth sample information, providing more realistic and diverse training samples for training the spoken language processing model.
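Sub-steps 2011 to 2014 can be sketched as follows. The tiny lemma lexicon, the fixed repetition length of 3, and the single repetition are illustrative choices; the disclosure sets the length and count randomly:

```python
LEMMAS = {"would": "do", "does": "do", "did": "do"}  # stand-in preset lexicon

def insert_span_repetition(words, span_len=3, repeats=1):
    # Find the first word with an entry in the lexicon and insert the
    # span that starts there, repeated, in front of its original position.
    for i, w in enumerate(words):
        if w.lower() in LEMMAS:
            span = words[i:i + span_len]
            return words[:i] + span * repeats + words[i:]
    return words

sentence = "Would you pass me a cup of tea".split()
print(" ".join(insert_span_repetition(sentence)))
# Would you pass Would you pass me a cup of tea
```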
In some optional implementations, constructing the training sample set includes: acquiring smooth sample information; and randomly inserting at least one repeated word into the smooth sample information to obtain non-smooth sample information, where the repeated word includes the original sample word at the insertion position.
That is, the original sample word can be randomly inserted into the smooth sample information as a repeated word to obtain non-smooth sample information. For example, the original sample word "you" can be inserted at the position corresponding to the sample word "you" in the smooth sample information "would you pass me a cup of tea" to obtain the corresponding non-smooth sample "would you you pass me a cup of tea"; or two original sample words "a" can be inserted at the position corresponding to the sample word "a" to obtain the corresponding non-smooth sample "would you pass me a a a cup of tea". Here, the number of repetitions of the repeated word may be 1, 2, or 3, and is not limited.
Randomly inserting repeated words in this way brings the samples closer to spoken language information in real spoken scenarios, further enhancing the realism and complexity of the sample information and improving the recognition accuracy of the trained spoken language processing model.
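The random repeated-word noising can be sketched as below; the repeat count is fixed at 1 here, though the disclosure allows up to 3:

```python
import random

def insert_repeated_word(words, rng):
    # Duplicate one randomly chosen word immediately after itself.
    i = rng.randrange(len(words))
    return words[:i + 1] + [words[i]] + words[i + 1:]

rng = random.Random(42)  # seeded only for reproducibility of the sketch
sentence = "would you pass me a cup of tea".split()
print(" ".join(insert_repeated_word(sentence, rng)))
```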
In some optional implementations, constructing the training sample set includes: acquiring smooth sample information; and randomly inserting filler words into the smooth sample information to obtain non-smooth sample information.
When speaking English, filler words are often mixed into the spoken language information. Filler words can therefore be randomly inserted into the smooth sample information to obtain non-smooth sample information. For example, the filler word "uh" can be randomly inserted into the smooth sample information "so he go to find the boys" to obtain non-smooth sample information such as "uh so he go to find the boys", "so he uh go to find the boys", and "so he go to find uh the boys".
Randomly inserting filler words in this way brings the samples closer to spoken language information in real spoken scenarios, further enhancing the realism and complexity of the sample information and improving the recognition accuracy of the trained spoken language processing model.
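The filler-word noising can be sketched in the same way; the filler inventory below is an assumed example:

```python
import random

FILLERS = ["uh", "um"]  # assumed filler inventory

def insert_filler(words, rng):
    # Insert one filler word at a random slot (including either end).
    i = rng.randrange(len(words) + 1)
    return words[:i] + [rng.choice(FILLERS)] + words[i:]

rng = random.Random(7)  # seeded only for reproducibility of the sketch
sentence = "so he go to find the boys".split()
print(" ".join(insert_filler(sentence, rng)))
```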
Referring to FIG. 3, which shows a schematic structural diagram of an embodiment of a spoken language information processing apparatus according to the present disclosure. As shown in FIG. 3, the spoken language information processing apparatus includes a determination module 301, a labeling module 302, and a processing module 303. The determination module 301 is configured to determine the stem corresponding to each word in initial spoken language information and obtain, based on the stems corresponding to the words, an initial spoken language stem vector corresponding to the initial spoken language information; the labeling module 302 is configured to determine, according to an initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector, labels corresponding to the words in the initial spoken language information, the labels including at least: smooth and non-smooth; the processing module 303 is configured to process the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information.
It should be noted that, for the specific processing of the determination module 301, the labeling module 302, and the processing module 303 of the spoken language information processing apparatus and the technical effects thereof, reference may be made to the descriptions of steps 101 to 103 in the embodiment corresponding to FIG. 1, which are not repeated here.
In some optional implementations of this embodiment, the labeling module 302 is further configured to: input the initial spoken language vector and the initial spoken language stem vector corresponding to the initial spoken language information into the pre-trained spoken language processing model to obtain the labels corresponding to the words in the initial spoken language information.
In some optional implementations of this embodiment, the spoken language processing model includes a first preset spoken language processing model, a second preset spoken language processing model, and a third preset spoken language processing model, and the spoken language processing model is pre-trained based on the following steps: constructing a training sample set, the training sample set including multiple pieces of non-smooth sample information; for each piece of non-smooth sample information, determining the sample stem corresponding to each sample word in the non-smooth sample information, and obtaining, based on the sample stems corresponding to the sample words, the non-smooth sample stem vector corresponding to the non-smooth sample information; training the first and second preset spoken language processing models to convergence using, respectively, the non-smooth sample vector and the non-smooth sample stem vector corresponding to the non-smooth sample information; and splicing the output label vectors of the converged first and second preset spoken language processing models according to a preset rule, using the spliced combined vector as the input of the third preset spoken language processing model, and training the third preset spoken language processing model to convergence to obtain the spoken language processing model.
In some optional implementations of this embodiment, constructing the training sample set includes: acquiring smooth sample information; searching a preset lexicon for the original-form word corresponding to each sample word in the smooth sample information, the preset lexicon including the original-form words corresponding to the sample words; determining the position, in the smooth sample information, of the sample word corresponding to the found original-form word; and, taking the position as the starting position and the sample word as the starting word, inserting multiple sample words of a preset repetition length and a preset number of repetitions.
In some optional implementations of this embodiment, constructing the training sample set includes: acquiring smooth sample information; and randomly inserting at least one repeated word into the smooth sample information to obtain non-smooth sample information, where the repeated word includes the original sample word at the insertion position.
In some optional implementations of this embodiment, constructing the training sample set includes: acquiring smooth sample information; and randomly inserting filler words into the smooth sample information to obtain non-smooth sample information.
In some optional implementations of this embodiment, the processing module 303 is further configured to: delete the words corresponding to labels marked as non-smooth to obtain the target spoken language information.
Referring to FIG. 4, which shows an exemplary system architecture to which the spoken language information processing method of an embodiment of the present disclosure can be applied.
As shown in FIG. 4, the system architecture may include terminal devices 401, 402, and 403, a network 404, and a server 405. The network 404 is the medium that provides communication links between the terminal devices 401, 402, and 403 and the server 405. The network 404 may include various connection types, such as wired or wireless communication links, or fiber-optic cables. The terminal devices and the server may communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include local area networks ("LANs"), wide area networks ("WANs"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
The terminal devices 401, 402, and 403 can interact with the server 405 through the network 404 to receive or send messages and the like. Various client applications may be installed on the terminal devices 401, 402, and 403, such as video publishing applications, search applications, and news information applications.
The terminal devices 401, 402, and 403 may be hardware or software. When they are hardware, they can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, and desktop computers. When the terminal devices 401, 402, and 403 are software, they can be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (e.g., software or software modules for providing distributed services), or as a single piece of software or software module. No specific limitation is imposed here.
The server 405 may be a server that can provide various services, for example receiving a processing request sent by the terminal devices 401, 402, and 403 for determining the stem corresponding to each word in initial spoken language information, analyzing and processing the request, and sending the processing result (e.g., the stem corresponding to each word for the processing request) to the terminal devices 401, 402, and 403.
It should be noted that the spoken language information processing method provided by the embodiments of the present disclosure may be executed by the server or by a terminal device; correspondingly, the spoken language information processing apparatus may be set in the server or in a terminal device.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 4 are merely illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.
Referring now to FIG. 5, which shows a schematic structural diagram of an electronic device (e.g., the server or terminal device in FIG. 4) suitable for implementing an embodiment of the present disclosure. The electronic device shown in FIG. 5 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 5, the electronic device may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 501, which can execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the electronic device. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
Typically, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 507 including, for example, a liquid crystal display (LCD), speakers, and vibrators; storage devices 508 including, for example, magnetic tape and hard disks; and a communication device 509. The communication device 509 may allow the electronic device to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 5 shows an electronic device having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 509, or installed from the storage device 508, or installed from the ROM 502. When the computer program is executed by the processing device 501, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
It should be noted that the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to an electrical wire, optical cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer-readable medium may be included in the electronic device, or may exist alone without being assembled into the electronic device.
The computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, they cause the electronic device to: determine the stem corresponding to each word in initial spoken language information, and obtain, based on the stems corresponding to the words, an initial spoken language stem vector corresponding to the initial spoken language information; determine, according to an initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector, labels corresponding to the words in the initial spoken language information, the labels including at least: smooth and non-smooth; and process the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information.
Computer program code for performing operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a module does not in some cases constitute a limitation of the unit itself; for example, the determination module 301 can also be described as "a module that determines the stem corresponding to each word in initial spoken language information and obtains, based on the stems corresponding to the words, an initial spoken language stem vector corresponding to the initial spoken language information".
The functions described above herein may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that can be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
根据本公开的一个或多个实施例所提供的口语信息处理方法,包括:确定初始口语信息中各个单词对应的词干,并基于所述单词对应的词干得到与所述初始口语信息对应的初始口语词干向量;根据所述初始口语信息对应的初始口语向量和所述初始口语词干向量,确定与所述初始口语信息中的单词对应的标签;所述标签至少包括:顺滑、非顺滑;根据所述单词对应的标签处理所述初始口语信息,得到顺滑的目标口语信息。
根据本公开的一个或多个实施例,所述根据所述初始口语信息对应的初始口语向量和所述初始口语词干向量,确定与所述初始口语信息中所述单词对应的标签,包括:将所述初始口语信息对应的初始口语向量和所述初始口语词干向量输入预先训练好的口语处理模型中,得到与所述初始口语信息中各个单词对应的标签。
根据本公开的一个或多个实施例,所述口语处理模型包括第一预 设口语处理模型、第二预设口语处理模型和第三预设口语处理模型,以及所述口语处理模型预先基于如下步骤训练:构建训练样本集;所述训练样本集中包括多个非顺滑样本信息;针对每一个所述非顺滑样本信息,确定该非顺滑样本信息中各个样本单词对应的样本词干,并基于所述各个样本单词对应的样本词干得到该非顺滑样本信息对应的非顺滑样本词干向量;利用所述非顺滑样本信息对应的非顺滑样本向量、所述非顺滑样本词干向量分别训练所述第一预设口语处理模型和所述第二预设口语处理模型至收敛;将收敛的所述第一预设口语处理模型和所述第二预设口语处理模型的输出标签向量按照预设规则进行拼接,并将拼接后的组合向量作为所述第三预设口语处理模型的输入,训练所述第三预设口语处理模型至收敛,得到所述口语处理模型。
根据本公开的一个或多个实施例,所述构建训练样本集,包括:获取顺滑样本信息;在预设词库中查找所述顺滑样本信息中的各个样本单词对应的原始形态单词;所述预设词库中包括所述各个样本单词对应的原始形态单词;确定查找到的所述原始形态单词对应的样本单词在所述顺滑样本信息中的位置;以所述位置为起始位置,并以所述样本单词为起始单词,插入预设重复长度、预设重复次数的多个样本单词。
根据本公开的一个或多个实施例,所述构建训练样本集,包括:获取顺滑样本信息;在所述顺滑样本信息中随机插入至少一个重复单词,得到所述非顺滑样本信息;所述重复单词包括在插入位置处的初始样本单词。
According to one or more embodiments of the present disclosure, constructing the training sample set includes: obtaining smooth sample information; and randomly inserting a filler word into the smooth sample information to obtain the non-smooth sample information.
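The filler-word variant might look like the sketch below; the filler list and seed are illustrative assumptions, since the disclosure does not enumerate the filler vocabulary.

```python
import random

# Illustrative filler ("uh"-type) vocabulary.
FILLERS = ["uh", "um", "you know"]

def insert_filler(words, seed=42):
    """Insert one random filler word at a random position, producing a
    non-smooth sample from a smooth one."""
    rng = random.Random(seed)                 # seeded for reproducibility
    i = rng.randrange(len(words) + 1)         # any gap, incl. start/end
    return words[:i] + [rng.choice(FILLERS)] + words[i:]

out = insert_filler(["I", "think", "so"])
print(out)
```

Mixing all three corruption strategies (lexicon repeats, random repeats, fillers) would give the sample set coverage of the main disfluency types the labeler must detect.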
According to one or more embodiments of the present disclosure, processing the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information includes: deleting the words whose labels are marked non-smooth to obtain the target spoken language information.
A spoken language information processing apparatus provided according to one or more embodiments of the present disclosure includes: a determination module configured to determine the stem corresponding to each word in initial spoken language information and to obtain, based on the stems corresponding to the words, an initial spoken language stem vector corresponding to the initial spoken language information; a labeling module configured to determine, from the initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector, a label corresponding to each word in the initial spoken language information, the labels including at least: smooth and non-smooth; and a processing module configured to process the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information.
According to one or more embodiments of the present disclosure, the labeling module 302 is further configured to: input the initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector into a pre-trained spoken language processing model to obtain the label corresponding to each word in the initial spoken language information.
According to one or more embodiments of the present disclosure, the spoken language processing model includes a first preset spoken language processing model, a second preset spoken language processing model, and a third preset spoken language processing model, and the spoken language processing model is trained in advance by the following steps: constructing a training sample set, the training sample set including a plurality of pieces of non-smooth sample information; for each piece of non-smooth sample information, determining the sample stem corresponding to each sample word in that piece of non-smooth sample information, and obtaining, based on the sample stems corresponding to the sample words, a non-smooth sample stem vector corresponding to that piece of non-smooth sample information; training the first preset spoken language processing model and the second preset spoken language processing model to convergence using, respectively, the non-smooth sample vector corresponding to the non-smooth sample information and the non-smooth sample stem vector; and concatenating the output label vectors of the converged first and second preset spoken language processing models according to a preset rule, using the concatenated combined vector as the input of the third preset spoken language processing model, and training the third preset spoken language processing model to convergence to obtain the spoken language processing model.
According to one or more embodiments of the present disclosure, constructing the training sample set includes: obtaining smooth sample information; looking up, in a preset lexicon, the original-form word corresponding to each sample word in the smooth sample information, the preset lexicon containing the original-form words corresponding to the sample words; determining the position, in the smooth sample information, of the sample word corresponding to a found original-form word; and, taking that position as a starting position and that sample word as a starting word, inserting a plurality of sample words of a preset repetition length and a preset repetition count.
According to one or more embodiments of the present disclosure, constructing the training sample set includes: obtaining smooth sample information; and randomly inserting at least one repeated word into the smooth sample information to obtain the non-smooth sample information, the repeated word including the initial sample word at the insertion position.
According to one or more embodiments of the present disclosure, constructing the training sample set includes: obtaining smooth sample information; and randomly inserting a filler word into the smooth sample information to obtain the non-smooth sample information.
According to one or more embodiments of the present disclosure, the processing module 303 is further configured to: delete the words whose labels are marked non-smooth to obtain the target spoken language information.
The foregoing description is merely a description of preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art should understand that the scope of the disclosure involved herein is not limited to technical solutions formed by the particular combination of the technical features described above, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Furthermore, although operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments, separately or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims (10)

  1. A spoken language information processing method, comprising:
    determining the stem corresponding to each word in initial spoken language information, and obtaining, based on the stems corresponding to the words, an initial spoken language stem vector corresponding to the initial spoken language information;
    determining, from the initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector, a label corresponding to each word in the initial spoken language information, the labels comprising at least: smooth and non-smooth; and
    processing the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information.
  2. The method according to claim 1, wherein determining, from the initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector, the label corresponding to each word in the initial spoken language information comprises:
    inputting the initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector into a pre-trained spoken language processing model to obtain the label corresponding to each word in the initial spoken language information.
  3. The method according to claim 2, wherein the spoken language processing model comprises a first preset spoken language processing model, a second preset spoken language processing model, and a third preset spoken language processing model, and the spoken language processing model is trained in advance by the following steps:
    constructing a training sample set, the training sample set comprising a plurality of pieces of non-smooth sample information;
    for each piece of non-smooth sample information, determining the sample stem corresponding to each sample word in that piece of non-smooth sample information, and obtaining, based on the sample stems corresponding to the sample words, a non-smooth sample stem vector corresponding to that piece of non-smooth sample information;
    training the first preset spoken language processing model and the second preset spoken language processing model to convergence using, respectively, the non-smooth sample vector corresponding to the non-smooth sample information and the non-smooth sample stem vector; and
    concatenating the output label vectors of the converged first and second preset spoken language processing models according to a preset rule, using the concatenated combined vector as the input of the third preset spoken language processing model, and training the third preset spoken language processing model to convergence to obtain the spoken language processing model.
  4. The method according to claim 3, wherein constructing the training sample set comprises:
    obtaining smooth sample information;
    looking up, in a preset lexicon, the original-form word corresponding to each sample word in the smooth sample information, the preset lexicon containing the original-form words corresponding to the sample words;
    determining the position, in the smooth sample information, of the sample word corresponding to a found original-form word; and
    taking the position as a starting position and the sample word as a starting word, inserting a plurality of sample words of a preset repetition length and a preset repetition count.
  5. The method according to claim 3, wherein constructing the training sample set comprises:
    obtaining smooth sample information; and
    randomly inserting at least one repeated word into the smooth sample information to obtain the non-smooth sample information, the repeated word comprising the initial sample word at the insertion position.
  6. The method according to claim 3, wherein constructing the training sample set comprises:
    obtaining smooth sample information; and
    randomly inserting a filler word into the smooth sample information to obtain the non-smooth sample information.
  7. The method according to claim 1, wherein processing the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information comprises:
    deleting the words corresponding to labels marked non-smooth to obtain the target spoken language information.
  8. A spoken language information processing apparatus, comprising:
    a determination module configured to determine the stem corresponding to each word in initial spoken language information, and to obtain, based on the stems corresponding to the words, an initial spoken language stem vector corresponding to the initial spoken language information;
    a labeling module configured to determine, from the initial spoken language vector corresponding to the initial spoken language information and the initial spoken language stem vector, a label corresponding to each word in the initial spoken language information, the labels comprising at least: smooth and non-smooth; and
    a processing module configured to process the initial spoken language information according to the labels corresponding to the words to obtain smooth target spoken language information.
  9. An electronic device, comprising:
    at least one processor; and
    a storage device having at least one program stored thereon which, when executed by the at least one processor, causes the at least one processor to implement the method according to any one of claims 1-7.
  10. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1-7.
PCT/CN2021/135834 2020-12-08 2021-12-06 Spoken language information processing method, apparatus and electronic device WO2022121859A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011461385.3 2020-12-08
CN202011461385.3A CN112651231B (zh) 2020-12-08 Spoken language information processing method, apparatus and electronic device

Publications (1)

Publication Number Publication Date
WO2022121859A1 true WO2022121859A1 (zh) 2022-06-16

Family

ID=75353745

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/135834 WO2022121859A1 (zh) 2020-12-08 2021-12-06 Spoken language information processing method, apparatus and electronic device

Country Status (2)

Country Link
CN (1) CN112651231B (zh)
WO (1) WO2022121859A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651231B (zh) * 2020-12-08 2023-10-27 北京有竹居网络技术有限公司 Spoken language information processing method, apparatus and electronic device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6125341A (en) * 1997-12-19 2000-09-26 Nortel Networks Corporation Speech recognition system and method
CN108829894A (zh) * 2018-06-29 2018-11-16 北京百度网讯科技有限公司 Spoken word recognition and semantic recognition method and apparatus
CN110782885A (zh) * 2019-09-29 2020-02-11 深圳和而泰家居在线网络科技有限公司 Speech text correction method and apparatus, computer device, and computer storage medium
CN110853621A (zh) * 2019-10-09 2020-02-28 科大讯飞股份有限公司 Speech smoothing method and apparatus, electronic device, and computer storage medium
CN112651231A (zh) * 2020-12-08 2021-04-13 北京有竹居网络技术有限公司 Spoken language information processing method, apparatus and electronic device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107562760B (zh) * 2016-06-30 2020-11-17 科大讯飞股份有限公司 Speech data processing method and apparatus
CN107293296B (zh) * 2017-06-28 2020-11-20 百度在线网络技术(北京)有限公司 Speech recognition result correction method, apparatus, device, and storage medium
CN110349564B (zh) * 2019-07-22 2021-09-24 思必驰科技股份有限公司 Cross-lingual speech recognition method and apparatus
CN111145732B (zh) * 2019-12-27 2022-05-10 思必驰科技股份有限公司 Post-processing method and system for multi-task speech recognition

Also Published As

Publication number Publication date
CN112651231A (zh) 2021-04-13
CN112651231B (zh) 2023-10-27

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21902564; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 21902564; Country of ref document: EP; Kind code of ref document: A1)