CN116564283A - Far-field voice generation method and device, electronic equipment and storage medium


Info

Publication number
CN116564283A
CN116564283A (application number CN202211141695.6A)
Authority
CN
China
Prior art keywords
far field, voice, voice data, preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211141695.6A
Other languages
Chinese (zh)
Inventor
张超
王乐
滕勇
丁希剑
李健
Current Assignee
Xiaovo Technology Co ltd
Original Assignee
Xiaovo Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Xiaovo Technology Co ltd filed Critical Xiaovo Technology Co ltd
Priority claimed from application CN202211141695.6A
Publication of CN116564283A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48: Techniques specially adapted for particular use
    • G10L 25/51: Techniques specially adapted for comparison or discrimination
    • Y02D 30/70: Reducing energy consumption in wireless communication networks


Abstract

The embodiment of the invention discloses a far-field voice generation method and device, electronic equipment, and a storage medium. The method comprises: determining voice data to be processed and a data tag of the voice data to be processed, where the voice data to be processed comprises near-field voice data to be processed and far-field voice data to be processed; inputting the voice data to be processed into a preset far-field voice generation model, and training the preset far-field voice generation model according to a far-field voice discrimination result within the model to obtain a target far-field voice generation model; and converting near-field voice into far-field voice corresponding to a target far-field voice data tag by means of the target far-field voice generation model. This technical scheme avoids the waste of financial and material resources caused by collecting large amounts of voice data with dedicated equipment, and allows any near-field voice data to be converted into any far-field voice data, thereby enriching far-field voice data.

Description

Far-field voice generation method and device, electronic equipment and storage medium
Technical Field
Embodiments of the invention relate to the technical field of voice processing, and in particular to a far-field voice generation method and device, electronic equipment, and a storage medium.
Background
Existing artificial intelligence technology, and in particular the various deep-learning algorithms used in the speech field, relies on large amounts of data, most of which is collected under near-field conditions. For some algorithms, such as automatic speech recognition (ASR) and voiceprint recognition, a model trained on near-field data performs noticeably worse when used to recognize far-field speech. The common remedy is to collect far-field speech with dedicated equipment in each scenario where the algorithm is deployed and add it to the training data; however, not all scenarios permit this, the cost of collecting data separately for each scenario is high, and the collected data does not generalize across devices.
Therefore, how to effectively generate far-field speech is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
Embodiments of the invention provide a far-field voice generation method and device, electronic equipment, and a storage medium, so as to convert any near-field voice into far-field voice and thereby enrich far-field voice data.
In a first aspect, an embodiment of the present invention provides a far-field speech generating method, including:
determining voice data to be processed, and determining a data tag of the voice data to be processed; the voice data to be processed comprises near-field voice data to be processed and far-field voice data to be processed;
inputting the voice data to be processed into a preset far-field voice generation model, and training the preset far-field voice generation model according to a far-field voice discrimination result in the preset far-field voice generation model to obtain a target far-field voice generation model;
and converting the near-field voice into far-field voice corresponding to the target far-field voice data tag by adopting the target far-field voice generation model.
In a second aspect, an embodiment of the present invention further provides a far-field speech generating apparatus, including:
the to-be-processed voice data determining module is used for determining voice data to be processed and determining a data tag of the voice data to be processed; the voice data to be processed comprises near-field voice data to be processed and far-field voice data to be processed;
the target far-field speech generation model training module is used for inputting the speech data to be processed into a preset far-field speech generation model, and training the preset far-field speech generation model according to a far-field speech discrimination result in the preset far-field speech generation model to obtain a target far-field speech generation model;
and the far-field voice generation module is used for converting near-field voice into far-field voice corresponding to the target far-field voice data tag by adopting the target far-field voice generation model.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
one or more processors;
a storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the far-field speech generating method of any of the embodiments of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the far-field speech generating method according to any embodiment of the present invention.
The embodiments of the invention provide a far-field voice generation method and device, electronic equipment, and a storage medium. Voice data to be processed and a data tag of the voice data to be processed are determined, where the voice data to be processed comprises near-field voice data to be processed and far-field voice data to be processed; the voice data to be processed is input into a preset far-field voice generation model, and the preset far-field voice generation model is trained according to a far-field voice discrimination result within the model to obtain a target far-field voice generation model; and near-field voice is converted into far-field voice corresponding to a target far-field voice data tag by means of the target far-field voice generation model. This technical scheme avoids the waste of financial and material resources caused by collecting large amounts of voice data with dedicated equipment, and allows any near-field voice data to be converted into any far-field voice data, thereby enriching far-field voice data.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
fig. 1 is a schematic flow chart of a far-field speech generation method provided in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a structure for acquiring voice data to be processed according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of training a target far-field speech generation model provided in an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a far-field speech generating device according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Before discussing the exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts operations (or steps) as a sequential process, many of the operations (or steps) can be performed in parallel, concurrently, or at the same time. Furthermore, the order of the operations may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like.
The technical scheme of the application is that the acquisition, storage, use, processing and the like of the data meet the relevant regulations of national laws and regulations.
Fig. 1 is a flow chart of a far-field speech generating method according to an embodiment of the present invention. The embodiment is applicable to converting near-field speech into far-field speech in different scenes. The method may be performed by a far-field speech generating device, which may be implemented in hardware and/or software and may be configured in a far-field speech generation server. The method specifically comprises the following steps:
s110, determining voice data to be processed, and determining a data tag of the voice data to be processed.
The voice data to be processed comprises near-field voice data to be processed and far-field voice data to be processed. Near-field voice data may refer to voice data acquired at a relatively short distance from the sound source, while far-field voice data may refer to voice data acquired at a greater distance from the sound source. For example, referring to fig. 2, data acquired within 0.5 m of the sound source is taken as near-field voice data, and data acquired at more than 0.5 m is taken as far-field voice data.
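As a non-limiting sketch, the 0.5 m boundary of this example can be expressed as a simple classification rule (the function name and default threshold are illustrative assumptions, not terms from the patent):

```python
def field_type(distance_m, near_limit_m=0.5):
    """Classify a recording as near- or far-field by its acquisition
    distance from the sound source (0.5 m boundary, as in Fig. 2)."""
    return "near-field" if distance_m <= near_limit_m else "far-field"
```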
As an alternative but non-limiting implementation, the determining the voice data to be processed includes steps A1-A2:
step A1: when a sound source emits voice, at least two voice data acquisition devices are adopted to acquire voice data at different distances simultaneously;
step A2: and calibrating and aligning the voice data to determine the voice data to be processed with uniform voice length.
Voice data are acquired by voice data acquisition devices placed at different distances from the sound source. For example, voice data acquisition devices may be placed at 0.5 m, 1 m, 1.5 m, and 2 m from the sound source to acquire both near-field and far-field voice data. The far-field voice data includes near-far-field voice data acquired at 1 m from the sound source, mid-far-field voice data acquired at 1.5 m, and ultra-far-field voice data acquired at 2 m.
The acquired near-field voice data and far-field voice data are calibrated and aligned to obtain to-be-processed near-field voice data and to-be-processed far-field voice data of uniform voice length. Calibration alignment may refer to keeping the start and stop frames of the near-field voice data and the far-field voice data consistent. For example, when the sound recorded during the 2-8 second interval after the sound source starts speaking is selected as the near-field voice data, the sound recorded during the same interval is selected as the far-field voice data, so that the content of the acquired voice data to be processed is consistent. The to-be-processed near-field voice data and to-be-processed far-field voice data of uniform length are taken as one data pair; the collection, calibration, and alignment are repeated to obtain at least two data pairs, which are then used for model training.
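The calibration and alignment step above can be sketched as follows. The 2-8 s window, the sample-rate handling, and the function names are illustrative assumptions rather than the patent's exact procedure; recordings are modeled as plain lists of samples:

```python
def align_pair(near, far, sample_rate, start_s=2.0, end_s=8.0):
    """Cut the same time window out of a near-field and a far-field
    recording so both members of a data pair have uniform length and
    consistent start/stop frames."""
    lo, hi = int(start_s * sample_rate), int(end_s * sample_rate)
    if len(near) < hi or len(far) < hi:
        raise ValueError("recording shorter than the selected window")
    return near[lo:hi], far[lo:hi]

def build_training_pairs(recordings, sample_rate):
    """`recordings` holds (near, far) recordings captured simultaneously;
    returns the aligned data pairs used for model training."""
    return [align_pair(n, f, sample_rate) for n, f in recordings]
```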
The data tag of the voice data to be processed may refer to a tag that distinguishes different kinds of voice data to be processed. For example, the to-be-processed near-field voice data is marked with data tag a, the to-be-processed near-far-field voice data with data tag b, the to-be-processed mid-far-field voice data with data tag c, and the to-be-processed ultra-far-field voice data with data tag d.
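The tagging scheme in this example can be written as a simple lookup. The distance-to-tag table mirrors the a/b/c/d example above and is purely illustrative:

```python
# Assumed mapping from acquisition distance (metres) to data tag,
# following the example: near field (a), near-far (b), mid-far (c),
# ultra-far (d).
DISTANCE_TO_TAG = {0.5: "a", 1.0: "b", 1.5: "c", 2.0: "d"}

def data_tag(distance_m):
    """Return the data tag distinguishing voice data captured at `distance_m`."""
    if distance_m not in DISTANCE_TO_TAG:
        raise ValueError(f"no tag defined for {distance_m} m")
    return DISTANCE_TO_TAG[distance_m]
```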
S120, inputting the voice data to be processed into a preset far-field voice generation model, and training the preset far-field voice generation model according to a far-field voice discrimination result in the preset far-field voice generation model to obtain a target far-field voice generation model.
Model training is performed using the acquired voice data to be processed and the data tags of the far-field voice data to be processed, so as to determine the target far-field voice generation model.
As an optional but non-limiting implementation manner, the inputting the to-be-processed voice data into a preset far-field voice generation model, training the preset far-field voice generation model according to a far-field voice discrimination result in the preset far-field voice generation model to obtain a target far-field voice generation model, and the method includes steps B1-B3:
step B1: inputting near-field voice data to be processed and a far-field voice data tag to be processed into a preset voice generator to obtain first far-field voice data; the first far-field voice data are the same as the tags of the far-field voice data to be processed;
step B2: inputting the first far-field voice data and the far-field voice data to be processed into a preset voice discriminator, and determining a far-field voice discrimination result;
step B3: according to the far-field voice discrimination result, adjusting the preset far-field voice generation model parameters to determine a target far-field voice generation model; the preset far-field voice generation model comprises a preset voice generator and a preset voice discriminator.
Referring to fig. 3, near-field voice data to be processed and a far-field voice data tag to be processed are input into a preset voice generator, so as to obtain first far-field voice data identical to the far-field voice data tag to be processed. Inputting the first far-field voice data and the far-field voice data to be processed into a preset voice discriminator to determine whether the first far-field voice data is consistent with the far-field voice data to be processed; if the preset voice discriminator can effectively judge that the generated far-field voice data is inconsistent with the input far-field voice data to be processed, parameters of the preset far-field voice generation model are required to be adjusted so as to obtain a target far-field voice generation model.
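A toy sketch of steps B1 and B2, under the assumption that Fig. 3 describes a GAN-style structure. A real implementation would use neural networks; the per-tag gain table and the linear generator here are placeholders, and the discriminator simply scores similarity in [0, 1]:

```python
# Assumed attenuation per far-field tag (b: near-far, c: mid-far,
# d: ultra-far); illustrative values, not learned parameters.
TAG_GAIN = {"b": 0.7, "c": 0.5, "d": 0.3}

def generate_far_field(near, tag):
    """Step B1: produce first far-field data carrying the requested tag
    from near-field samples (toy linear generator)."""
    return [s * TAG_GAIN[tag] for s in near]

def discriminate(generated, real_far):
    """Step B2: far-field discrimination result in [0, 1]; 1.0 means the
    generated data is indistinguishable from the real far-field data."""
    diff = sum(abs(g - r) for g, r in zip(generated, real_far))
    scale = sum(abs(r) for r in real_far) or 1.0
    return max(0.0, 1.0 - diff / scale)
```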
As an optional but non-limiting implementation manner, the adjusting the preset far-field speech generation model parameter according to the far-field speech discrimination result includes steps C1-C2:
step C1: if the far-field voice discrimination result is smaller than a preset discrimination result threshold, adjusting parameters of the preset far-field voice generation model to obtain an adjusted preset far-field voice generation model;
step C2: and adopting the adjusted preset far-field voice generation model to continuously convert the voice data to be processed into far-field voice data to obtain an adjusted far-field voice judgment result until the adjusted far-field voice judgment result is greater than or equal to a preset judgment result threshold value.
The preset discrimination result threshold may be used to determine whether the generated far-field voice data is consistent with the far-field voice data to be processed. If the far-field voice discrimination result is smaller than the preset discrimination result threshold, the far-field voice generation model is not yet fully trained, and the parameters of the preset voice generator and the preset voice discriminator are adjusted until the adjusted far-field voice discrimination result is greater than or equal to the preset discrimination result threshold.
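The adjust-until-threshold loop of steps C1-C2 can be sketched schematically. Here `evaluate` is a stand-in for one round of generating far-field data and scoring it with the discriminator; the update rule and all numeric values are illustrative assumptions, not the patent's training algorithm:

```python
def evaluate(param, target=2.0):
    """Placeholder discrimination result: the closer `param` is to the
    (unknown) ideal value, the higher the score in [0, 1]."""
    return max(0.0, 1.0 - abs(param - target) / target)

def adjust_until_threshold(threshold=0.95, lr=0.5, max_steps=100):
    """C1: while the discrimination result is below the preset threshold,
    adjust the model parameter; C2: re-convert and re-score until the
    adjusted result reaches the threshold."""
    param = 0.0
    score = evaluate(param)
    steps = 0
    while score < threshold and steps < max_steps:
        param += lr * (1.0 - score)  # crude parameter adjustment
        score = evaluate(param)      # re-run conversion and discrimination
        steps += 1
    return param, score
```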
S130, converting the near-field voice into far-field voice corresponding to the target far-field voice data tag by adopting the target far-field voice generation model.
The near-field voice data and a far-field voice data tag are input into the trained target far-field voice generation model, which generates far-field voice corresponding to that far-field voice data tag. For example, near-field voice data a0 and far-field voice data tag b0 are input into the target far-field voice generation model, which outputs the far-field voice data b01 corresponding to tag b0.
As an alternative but non-limiting implementation, the target far-field voice data tag includes at least one of the following: the first far-field speech, the second far-field speech, and the third far-field speech; the far-field voice data tag is related to the distance between the voice data collection device and the sound source.
As an alternative but non-limiting implementation, the distance between the at least two voice data acquisition devices and the sound source satisfies a preset condition.
As an optional but non-limiting implementation, the first far-field speech includes far-field speech acquired by the speech data acquisition device at a first preset distance; the second far-field voice comprises far-field voice acquired by voice data acquisition equipment at a second preset distance; the third far-field voice comprises far-field voice acquired by voice data acquisition equipment at a third preset distance;
the first preset distance is smaller than the second preset distance, and the second preset distance is smaller than the third preset distance; the first preset distance, the second preset distance and the third preset distance satisfy an arithmetic series relationship.
The distances between the sound source and the at least two voice data collection devices form an arithmetic series. For example, as shown in fig. 2, the first far-field speech may refer to far-field speech acquired at 1 m from the sound source, the second far-field speech to far-field speech acquired at 1.5 m, and the third far-field speech to far-field speech acquired at 2 m.
In an alternative solution of the embodiment of the present invention, the distances between the at least two voice data acquisition devices and the sound source may instead satisfy a geometric series or an exponential relationship. The embodiment of the present invention does not specifically limit the distances between the at least two voice data acquisition devices and the sound source.
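The distance schedules above can be generated programmatically; the concrete numbers in the assertions mirror the Fig. 2 layout and are illustrative only:

```python
def arithmetic_distances(first, step, n):
    """Arithmetic series of device distances, e.g. 0.5, 1.0, 1.5, 2.0 m."""
    return [first + step * i for i in range(n)]

def geometric_distances(first, ratio, n):
    """Geometric-series alternative for placing the acquisition devices."""
    return [first * ratio ** i for i in range(n)]
```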
The embodiment of the invention provides a far-field voice generation method comprising: determining voice data to be processed and a data tag of the voice data to be processed, where the voice data to be processed comprises near-field voice data to be processed and far-field voice data to be processed; inputting the voice data to be processed into a preset far-field voice generation model, and training the preset far-field voice generation model according to a far-field voice discrimination result within the model to obtain a target far-field voice generation model; and converting near-field voice into far-field voice corresponding to a target far-field voice data tag by means of the target far-field voice generation model. This technical scheme avoids the waste of financial and material resources caused by collecting large amounts of voice data with dedicated equipment, and allows any near-field voice data to be converted into any far-field voice data, thereby enriching far-field voice data.
Fig. 4 is a schematic structural diagram of a far-field speech generating device according to an embodiment of the present invention, where the device includes: a to-be-processed speech data determination module 410, a target far-field speech generation model training module 420, and a far-field speech generation module 430. Specifically:
a to-be-processed voice data determining module 410, configured to determine to-be-processed voice data, and determine a data tag of the to-be-processed voice data; the voice data to be processed comprises near-field voice data to be processed and far-field voice data to be processed;
the target far-field speech generation model training module 420 is configured to input the speech data to be processed into a preset far-field speech generation model, and train the preset far-field speech generation model according to a far-field speech discrimination result in the preset far-field speech generation model to obtain a target far-field speech generation model;
the far-field speech generation module 430 is configured to convert near-field speech into far-field speech corresponding to the target far-field speech data tag by using the target far-field speech generation model.
On the basis of the foregoing embodiment, optionally, the to-be-processed voice data determining module includes:
when a sound source emits voice, at least two voice data acquisition devices are adopted to acquire voice data at different distances simultaneously;
and calibrating and aligning the voice data to determine the voice data to be processed with uniform voice length.
On the basis of the foregoing embodiment, optionally, the target far-field speech generation model training module includes:
inputting near-field voice data to be processed and a far-field voice data tag to be processed into a preset voice generator to obtain first far-field voice data; the first far-field voice data are the same as the tags of the far-field voice data to be processed;
inputting the first far-field voice data and the far-field voice data to be processed into a preset voice discriminator, and determining a far-field voice discrimination result;
according to the far-field voice discrimination result, adjusting the preset far-field voice generation model parameters to determine a target far-field voice generation model; the preset far-field voice generation model comprises a preset voice generator and a preset voice discriminator.
On the basis of the foregoing embodiment, optionally, the target far-field speech generation model training module further includes:
if the far-field voice discrimination result is smaller than a preset discrimination result threshold, adjusting parameters of the preset far-field voice generation model to obtain an adjusted preset far-field voice generation model;
and adopting the adjusted preset far-field voice generation model to continuously convert the voice data to be processed into far-field voice data to obtain an adjusted far-field voice judgment result until the adjusted far-field voice judgment result is greater than or equal to a preset judgment result threshold value.
On the basis of the foregoing embodiment, optionally, the far-field speech generating module includes:
the target far-field voice data tag comprises at least one of the following: the first far-field speech, the second far-field speech, and the third far-field speech; the far-field voice data tag is related to the distance between the voice data collection device and the sound source.
On the basis of the foregoing embodiment, optionally, the far-field speech generating module further includes:
the distance between the at least two voice data acquisition devices and the sound source meets the preset condition.
On the basis of the foregoing embodiment, optionally, the far-field speech generating module further includes:
the first far-field voice comprises far-field voice acquired by voice data acquisition equipment at a first preset distance; the second far-field voice comprises far-field voice acquired by voice data acquisition equipment at a second preset distance; the third far-field voice comprises far-field voice acquired by voice data acquisition equipment at a third preset distance;
the first preset distance is smaller than the second preset distance, and the second preset distance is smaller than the third preset distance; the first preset distance, the second preset distance and the third preset distance satisfy an arithmetic series relationship.
The far-field voice generating device provided by the embodiment of the invention can execute the far-field voice generating method provided by any embodiment of the invention, has the corresponding functions and beneficial effects of executing the far-field voice generating method, and the detailed process refers to the related operation of the far-field voice generating method in the embodiment.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device 10 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in fig. 5, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, or microcontroller. The processor 11 performs the various methods and processes described above, such as the far-field speech generation method.
In some embodiments, the far-field speech generating method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the far-field speech generating method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the far-field speech generation method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer-readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer-readable storage medium may be a machine-readable signal medium. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server (also called a cloud computing server or cloud host), a host product in a cloud computing service system, which overcomes the drawbacks of high management difficulty and poor service scalability found in traditional physical hosts and VPS services.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A far-field speech generation method, the method comprising:
determining voice data to be processed, and determining a data tag of the voice data to be processed; the voice data to be processed comprises near-field voice data to be processed and far-field voice data to be processed;
inputting the voice data to be processed into a preset far-field voice generation model, and training the preset far-field voice generation model according to a far-field voice discrimination result in the preset far-field voice generation model to obtain a target far-field voice generation model;
and converting the near-field voice into far-field voice corresponding to the target far-field voice data tag by adopting the target far-field voice generation model.
2. The method of claim 1, wherein the determining the voice data to be processed comprises:
collecting voice data simultaneously at different distances by using at least two voice data acquisition devices while a sound source emits voice;
and calibrating and aligning the voice data to determine the voice data to be processed with a uniform voice length.
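The simultaneous-capture and alignment step above can be sketched as follows. This is an illustrative sketch only, not part of the claims: the patent does not specify the calibration procedure, so cross-correlation delay estimation is assumed here, and all function names are hypothetical.

```python
import numpy as np

def align_recordings(near, far):
    """Estimate the delay of `far` relative to `near` via cross-correlation,
    then trim both signals to a common (uniform) voice length."""
    # Full cross-correlation; the peak index gives the relative delay.
    corr = np.correlate(far, near, mode="full")
    lag = int(np.argmax(corr)) - (len(near) - 1)
    if lag > 0:       # far-field copy starts later: drop its leading samples
        far = far[lag:]
    elif lag < 0:     # far-field copy starts earlier: drop near-field samples
        near = near[-lag:]
    n = min(len(near), len(far))  # uniform voice length
    return near[:n], far[:n], lag

# Toy example: the "far-field" signal is a delayed, attenuated copy.
rng = np.random.default_rng(0)
clean = rng.standard_normal(1000)
delayed = np.concatenate([np.zeros(5), 0.5 * clean])
a, b, lag = align_recordings(clean, delayed)
```

After alignment the two recordings have the same length and sample-to-sample correspondence, which is what the subsequent generator/discriminator training requires.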
3. The method according to claim 1, wherein the inputting the to-be-processed speech data into a preset far-field speech generation model, training the preset far-field speech generation model according to a far-field speech discrimination result in the preset far-field speech generation model, and obtaining a target far-field speech generation model, includes:
inputting near-field voice data to be processed and a far-field voice data tag to be processed into a preset voice generator to obtain first far-field voice data; the first far-field voice data are the same as the tags of the far-field voice data to be processed;
inputting the first far-field voice data and the far-field voice data to be processed into a preset voice discriminator, and determining a far-field voice discrimination result;
according to the far-field voice discrimination result, adjusting the preset far-field voice generation model parameters to determine a target far-field voice generation model; the preset far-field voice generation model comprises a preset voice generator and a preset voice discriminator.
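The generator/discriminator structure recited above resembles a conditional adversarial model. A minimal numerical sketch of the forward flow follows; it is not the patent's actual model, and the feature sizes, linear generator, and sigmoid discriminator are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, LABELS = 16, 3  # assumed feature size and number of far-field tags

# Preset voice generator: near-field features + one-hot tag -> fake far-field features.
G = rng.standard_normal((FEAT + LABELS, FEAT)) * 0.1
# Preset voice discriminator: far-field features -> probability of being real.
D = rng.standard_normal(FEAT) * 0.1

def generate(near, tag_onehot):
    # Conditioning on the far-field voice data tag, as in the claim.
    return np.concatenate([near, tag_onehot]) @ G

def discriminate(far):
    # Sigmoid score in (0, 1): the far-field voice discrimination result.
    return 1.0 / (1.0 + np.exp(-(far @ D)))

near = rng.standard_normal(FEAT)      # near-field voice data to be processed
real_far = rng.standard_normal(FEAT)  # far-field voice data to be processed
tag = np.eye(LABELS)[1]               # e.g. the "second far-field" tag

fake_far = generate(near, tag)        # first far-field voice data
score_fake = discriminate(fake_far)
score_real = discriminate(real_far)
```

The discrimination results on real and generated far-field data are what drive the parameter adjustment described in claim 4.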
4. The method of claim 3, wherein adjusting the preset far-field speech generation model parameters according to the far-field speech discrimination result comprises:
if the far-field voice discrimination result is smaller than a preset discrimination result threshold, adjusting parameters of the preset far-field voice generation model to obtain an adjusted preset far-field voice generation model;
and adopting the adjusted preset far-field voice generation model to continuously convert the voice data to be processed into far-field voice data to obtain an adjusted far-field voice judgment result until the adjusted far-field voice judgment result is greater than or equal to a preset judgment result threshold value.
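The threshold-driven adjustment loop above can be paraphrased in code. This is an illustrative sketch: `adjust` and `evaluate` stand in for the unspecified parameter update and discrimination steps, and the threshold value is hypothetical.

```python
THRESHOLD = 0.9  # preset discrimination-result threshold (hypothetical value)

def train_until_threshold(score, adjust, evaluate, max_rounds=100):
    """Adjust the preset model until the far-field discrimination result
    is greater than or equal to the preset threshold."""
    rounds = 0
    while score < THRESHOLD and rounds < max_rounds:
        adjust()            # tune the preset far-field generation model's parameters
        score = evaluate()  # convert the data again and obtain the adjusted result
        rounds += 1
    return score, rounds

# Toy stand-ins: each parameter adjustment raises the score by 0.25.
state = {"score": 0.5}
final_score, n_rounds = train_until_threshold(
    state["score"],
    adjust=lambda: state.update(score=state["score"] + 0.25),
    evaluate=lambda: state["score"],
)
```

The `max_rounds` guard is an added safety measure not recited in the claim; the claim itself only requires iterating until the threshold is met.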
5. The method of claim 1, wherein the target far-field voice data tag comprises at least one of: a first far-field voice, a second far-field voice, and a third far-field voice; the far-field voice data tag is related to the distance between the voice data acquisition device and the sound source.
6. The method of claim 5, wherein the distance between the at least two voice data acquisition devices and the sound source satisfies a preset condition.
7. The method of claim 6, wherein the first far-field voice comprises far-field voice collected by a voice data acquisition device at a first preset distance; the second far-field voice comprises far-field voice collected by a voice data acquisition device at a second preset distance; and the third far-field voice comprises far-field voice collected by a voice data acquisition device at a third preset distance;
the first preset distance is smaller than the second preset distance, the second preset distance is smaller than the third preset distance, and the first, second, and third preset distances satisfy an arithmetic series relationship.
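The distance constraint in claim 7 reduces to a simple check. The concrete distance values below are hypothetical, chosen only to illustrate the strictly increasing, evenly spaced (arithmetic-series) arrangement of the three acquisition devices:

```python
# Hypothetical preset distances (in metres) for the three acquisition devices;
# the values are illustrative and not taken from the patent.
first, second, third = 1.0, 3.0, 5.0

assert first < second < third            # strictly increasing, as claimed
assert second - first == third - second  # arithmetic-series relationship
common_difference = second - first
```

Even spacing means each successive tag corresponds to the same increment in source-to-device distance, which keeps the three far-field conditions uniformly distributed over the capture range.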
8. A far-field speech generating device, the device comprising:
the voice data processing module is used for determining voice data to be processed and determining a data tag of the voice data to be processed; the voice data to be processed comprises near-field voice data to be processed and far-field voice data to be processed;
the target far-field speech generation model training module is used for inputting the speech data to be processed into a preset far-field speech generation model, and training the preset far-field speech generation model according to a far-field speech discrimination result in the preset far-field speech generation model to obtain a target far-field speech generation model;
and the far-field voice generation module is used for converting near-field voice into far-field voice corresponding to the target far-field voice data tag by adopting the target far-field voice generation model.
9. An electronic device, comprising:
one or more processors;
a storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the far-field speech generating method of any of claims 1-7.
10. A storage medium containing computer executable instructions which, when executed by a computer processor, are for performing the far-field speech generating method according to any of claims 1-7.
CN202211141695.6A 2022-09-20 2022-09-20 Far-field voice generation method and device, electronic equipment and storage medium Pending CN116564283A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211141695.6A CN116564283A (en) 2022-09-20 2022-09-20 Far-field voice generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116564283A true CN116564283A (en) 2023-08-08

Family

ID=87492083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211141695.6A Pending CN116564283A (en) 2022-09-20 2022-09-20 Far-field voice generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116564283A (en)

Similar Documents

Publication Publication Date Title
US20220147822A1 (en) Training method and apparatus for target detection model, device and storage medium
CN112597754B (en) Text error correction method, apparatus, electronic device and readable storage medium
EP3852008A2 (en) Image detection method and apparatus, device, storage medium and computer program product
US20220351398A1 (en) Depth detection method, method for training depth estimation branch network, electronic device, and storage medium
CN113205041A (en) Structured information extraction method, device, equipment and storage medium
CN113627361B (en) Training method and device for face recognition model and computer program product
CN114186681A (en) Method, apparatus and computer program product for generating model clusters
CN114399513B (en) Method and device for training image segmentation model and image segmentation
CN116363444A (en) Fuzzy classification model training method, fuzzy image recognition method and device
CN113360672B (en) Method, apparatus, device, medium and product for generating knowledge graph
CN113554062B (en) Training method, device and storage medium for multi-classification model
CN116564283A (en) Far-field voice generation method and device, electronic equipment and storage medium
CN114724144A (en) Text recognition method, model training method, device, equipment and medium
CN114093006A (en) Training method, device and equipment of living human face detection model and storage medium
CN114119990A (en) Method, apparatus and computer program product for image feature point matching
CN115312042A (en) Method, apparatus, device and storage medium for processing audio
CN113903071A (en) Face recognition method and device, electronic equipment and storage medium
CN115641481A (en) Method and device for training image processing model and image processing
CN113379750A (en) Semi-supervised learning method of semantic segmentation model, related device and product
CN114218166A (en) Data processing method and device, electronic equipment and readable storage medium
CN115131709B (en) Video category prediction method, training method and device for video category prediction model
CN113963433B (en) Motion search method, motion search device, electronic equipment and storage medium
CN116257611B (en) Question-answering model training method, question-answering processing device and storage medium
CN116416500B (en) Image recognition model training method, image recognition device and electronic equipment
US11669672B1 (en) Regression test method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination