Human voices are dynamic expressions of identity. The mere presence of a voice can lend a sense of personhood to machines (Abercrombie et al. 2023), while hearing the familiar voices of other humans can soothe and reassure (Seltzer et al. 2010, 2012). Voice conversion and cloning, which use artificial intelligence to synthesise a human talker’s vocal identity, can now be done at no or low cost, and on the basis of a small amount of audio input data. Thus, where self-voice synthesis was once mainly used in quite narrow contexts, a person’s partially or fully synthesised voice identity can today be readily applied in a wide range of settings, from virtual assistants to audiobook readers and chatbots. Given the deeply personal nature of voices (Sidtis & Kreiman 2012), we need a better understanding of the implications of personalised voice synthesis technologies for human listeners and their relationships with voices in everyday life.
Voice conversion and cloning
Voice conversion and cloning are terms used to describe the application of AI to create novel audio that is recognisable as the speech of a specific person—in other words, an ‘audio deepfake’ of a person’s vocal identity. In some respects, voice identity replicas have been an everyday reality since the advent of voice recording and editing technologies first allowed individual voices to be reproduced by machines. However, to flexibly generate novel utterances in a speaker’s voice requires much greater sophistication than, for example, concatenating individual recordings of words and phrases. Primarily with the aim of providing individualised synthetic voices to patients whose physical voice was compromised or lost, developments in voice technology during the early 21st century found ways to computationally learn aspects of a speaker’s voice quality (the acoustic properties that made them uniquely recognisable) and use these to create a text-to-speech (TTS) synthesiser bearing a likeness to that speaker. Such approaches were typically implemented in augmentative and alternative communication (AAC) devices for use in everyday communication by patients who have lost the use of their physical voice due to illness (for example, motor neurone disease (MND); also known as ALS) or injury (for retrospective accounts, see Mills et al. 2014, Veaux et al. 2013, Yamagishi et al. 2012). These early models, while potentially transformative for users, often required hours of donated audio recordings from the original speaker (a process known as ‘voice banking’) for the computer model to adequately learn and reproduce the identity-specific characteristics of the voice. The resulting synthetic voices typically also bore some limitations, both in their likeness to the original talker, and in their perceived naturalness (Mckelvey et al. 2012).
With advances in deep learning and generative artificial intelligence (AI), the 2010s and 2020s have seen dramatic and accelerating changes in voice synthesis capabilities, with implications for AAC (Judge & Hayton 2022) and other use cases. In one approach, known as voice conversion, models can be employed to learn the acoustics of one speaker and transfer these to recordings from another speaker, thus ‘grafting’ a new identity onto pre-existing speech or vocal audio. This form of ‘speech-to-speech’ synthesis can be applied to human speech in (almost) real time (for example, via online conferencing or streaming platforms). Other methods, more commonly known as voice cloning, instead typically use ‘text-to-speech’ methods to generate audio bearing the learned acoustic features of a target voice identity. Here, the output speech does not need to be present in existing recordings, or be produced by a donor speaker, but rather is determined by the user’s text input (with output latencies dependent on the input length; for a general introduction to these methods, see Hutiri et al. 2024).
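To make the distinction concrete, the sketch below shows how the two pipelines differ at the point of use. It is a minimal illustration only, assuming the open-source Coqui TTS Python package and its publicly released FreeVC (conversion) and YourTTS (cloning) checkpoints; the model identifiers and file names are illustrative rather than prescriptive.

```python
# Minimal sketch contrasting voice conversion (speech-to-speech) with voice
# cloning (text-to-speech), assuming the open-source Coqui TTS package
# (`pip install TTS`). Model names and file paths are illustrative.
from TTS.api import TTS

# Voice conversion: a donor recording supplies the words and prosody, while a
# short recording of the target talker supplies the identity to 'graft' on.
vc = TTS("voice_conversion_models/multilingual/vctk/freevc24")
vc.voice_conversion_to_file(
    source_wav="donor_utterance.wav",    # pre-existing human speech
    target_wav="target_identity.wav",    # sample of the voice to be replicated
    file_path="converted.wav",
)

# Voice cloning: novel speech is generated from a text prompt, conditioned on
# a short reference clip of the target identity, so no donor utterance is needed.
cloner = TTS("tts_models/multilingual/multi-dataset/your_tts")
cloner.tts_to_file(
    text="This sentence was never actually spoken by the target talker.",
    speaker_wav="target_identity.wav",
    language="en",
    file_path="cloned.wav",
)
```

In both cases the identity information comes from the same short target recording; what differs is whether the linguistic content is taken from existing human speech or generated from text.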
There are some immediate implications of the broad differences between conversion and cloning. On the one hand, the fact that conversion is applied to pre-existing human speech audio allows for aspects of naturalness and context-appropriateness (for example, emotion or speech rate) to be retained in the synthesised speech output, while these may not be guaranteed in fully generative cloning approaches. On the other hand, the outputs of conversion are limited to, and constrained by, the existing audio to which conversion is applied, and therefore by the labour of the human speaker(s) producing the donor speech materials—in contrast, voice cloning offers theoretically limitless amounts of possible speech output generated from text prompts. In this paper, we will focus on what both methods share, namely the replication of voice identity characteristics, and consider the psychological impacts of this capability within different use cases.
What is very powerful about some of the latest voice identity synthesis approaches is that they no longer require bespoke model training or fine-tuning to generate a specific voice identity—voice conversion or cloning can be achieved by mapping a small amount of voice input audio (sometimes as little as 3 seconds; Microsoft Research (n.d.)) to an existing learned speaker space (Arik et al. 2018; Jia et al. 2018). Utterances generated from such a ‘zero-shot learning’ approach can bear startling perceptual similarity to the original talker, with even fully generative models producing fluent and naturalistic speech intonation in the talker’s own and other languages. In the case of purely generative speech synthesis, synthesised pauses and breaths further aid in creating the illusion of hearing an authentic human speech recording (Abercrombie et al. 2023). Some of the most cutting-edge voice conversion and cloning models are commercially available via user-friendly web-based interfaces and application programming interfaces (APIs) (for example, from ElevenLabs, Microsoft (VALL-E), Speechify, Meta (Audiobox)), while other open-source voice cloning methods are accessible to users with appropriate expertise and computing hardware (for example, YourTTS, Tortoise TTS, OpenVoice).
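The ‘learned speaker space’ underlying these zero-shot approaches can itself be probed directly. The sketch below uses a pretrained speaker encoder to map a few seconds of audio to a fixed-length identity embedding and then compares a reference recording with a synthesised clone; it assumes the open-source resemblyzer package, which is not part of the systems named above and is used here purely for illustration, and the file names are hypothetical.

```python
# Illustrative sketch of a 'learned speaker space': a pretrained speaker encoder
# maps a few seconds of audio to a fixed-length identity embedding, and distance
# in that space serves as a rough proxy for perceived identity similarity.
# Assumes the open-source resemblyzer package; file names are hypothetical.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# Embed a short (~3 s) reference clip of the target talker and a synthesised clone.
reference = encoder.embed_utterance(preprocess_wav("target_identity.wav"))
clone = encoder.embed_utterance(preprocess_wav("cloned.wav"))

# The embeddings are L2-normalised, so their dot product is the cosine similarity;
# values approaching 1 place the clone close to the target in speaker space.
print("speaker similarity:", float(np.dot(reference, clone)))
```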
Whether using voice conversion or cloning techniques, the possibility of synthesising a person’s voice identity has thus rapidly opened to a global user base of professionals and general publics (Federal Trade Commission 2019; Hutiri et al. 2024). For example:
• A social media influencer might synthesise their own voice for ease of generating new online content (Zhang et al. 2021).
• A university lecturer might decide to generate personalised voiceovers for their teaching materials in multiple languages for a diverse student audience (Dao et al. 2021; Pérez et al. 2021).
• A parent may wish to apply their own voice identity to audiobooks to be played as bedtime stories for their children (Epp et al. 2017).
• A voiceover artist may be interested in using a synthesised version of their voice to maximise their commercial reach (given appropriate legal protections (Cieslak 2024)).
As we think about these intentional uses of voice conversion or cloning, some key questions begin to arise for the voice owner.
First, who gets to hear and use my synthesised voice? At a voice owner’s discretion, their voice identity could be made available to audiences ranging from intimate (for example, used only by the self, and/or selected close relatives) to more general but context-defined (for example, a lecturer’s voice used only by their university), to completely widespread (for example, a voice donated to an open-access database for use by anyone).
Second, how will my voice be used? A person may be happy for their voice to be used as an audiobook reader of children’s books, but not to read books describing graphic violence; they may be happy to be the voice of Amazon Echo’s weather reports, but not to voice a personal chatbot within the same device.
Third, for how long will my voice be used? A university lecturer may prefer that their employer ceases to create new content in their voice once they have resigned or retired; a person making their will might prefer to place limits on how their inheritors use their voice data after they are deceased.
In purely practical terms, the state of the art is still not without limitations. Most widely available voice cloning models generate speech in the style of reading text aloud and, despite good overall naturalness, there are still shortcomings in the appropriate synthesis of context-relevant emotional prosody and speaking styles (Kolekar et al. 2024). Similarly, the success of cloning and conversion in terms of perceived accuracy depends on the composition of the training dataset underpinning the model’s functionality, which may lack sufficient representation of minority languages, non-standard accents, identities, and speaking styles (Barnett 2023)—larger amounts of speaker-specific input data are required to fine-tune models for more personalised results (thus placing an added burden on minoritised talkers to provide suitable data). Finally, as mentioned above, the relative ease of implementing and using voice conversion or cloning technology currently comes at a financial cost—the more technically accessible web-based models are typically made available on a subscription basis, with limits on the amount of material that can be generated and downloaded per unit time. When the subscription is paused, so is access to the clone, and thus it is not always possible to ‘own’ one’s cloned voice for use in a truly permanent and flexible way (for example, integration into word processing software, social media apps, or AAC hardware). However, there have been very recent initiatives to widen accessibility even to commercial tools—for example, ElevenLabs’ Impact Program, which offers free licences to individuals with MND/ALS and ‘social good’ partners working in sectors such as education (ElevenLabs 2024a).
Given the rapid changes in the sophistication of voice identity synthesis outlined above, we conclude that it is reasonable to expect that the technological capacity for high-quality, low-latency, and inclusive voice identity conversion and cloning will continue to advance rapidly, with accessibility following suit. Thus, it is already time to consider a world in which the synthesised vocal identities of ourselves and other people are available for use in our everyday lives. Here, therefore, we will use findings from psychology, neuroscience, and allied literatures as a scaffold to address questions (such as those around who, how, and for how long, as outlined above) and form predictions about the possible impacts of voice conversion and cloning on human perception and experience. Where available, we will integrate evidence from existing empirical research on personalised voice technologies, including our own preliminary findings, although we note that this literature is still in its infancy and thus our overall perspective must necessarily be more speculative than evaluative or conclusive.
Here, we focus on the speaking voice as the predominant modality of human vocal expression and the main mechanism for human sociality and social organisation. While acknowledging that synthesis of the sung voice as it relates to personal identity is also of great importance, particularly for artists (Josan 2024), there may be important distinctions in the treatment of our questions in the context of the singing voice that go beyond the scope of the current paper—for example, the personal singing voice is strongly linked to creative expression, and singers may have higher-stakes involvement in economies that commodify voice and vocal identity.
Later parts of the discussion will broaden the focus to consider multidisciplinary insights on the moral, ethical, and legal issues associated with voice conversion and cloning, culminating in speculative consideration of the cloned voice as a part of digital afterlives. Throughout, our focus will be on intentional and legal uses of these technologies, rather than on issues around deepfaking and identity misrepresentation, which are actively discussed elsewhere (for example, Barnett 2023, Hutiri et al. 2024). Our discussion is nonetheless relevant to existing and emerging legislation, in terms of how this might be shaped to minimise certain ethical risks to both the human owners/donors of voice data, and to the other stakeholders implicated in applications of voice identity synthesis technologies.
The human voice
The human voice is a dynamic audio signal that can be used flexibly to express thoughts, emotions, and mental states via both verbal (that is, speech) and non-verbal (for example, laughter or sighs) vocal behaviours (Belin et al. 2004; Lavan et al. 2019; Scott & McGettigan 2016). The physical sound of the voice arises from the vibration of air molecules passing through the human vocal tract. The vibration of the vocal folds within the larynx (the ‘source’) generates a largely periodic signal (that is, one with pitch) that is modulated by both the morphology of the static structures of the vocal tract (for example, hard palate) and the dynamic positioning of the articulators including the lips, jaw, and soft palate (the ‘filter’ (Fant 1971)). As a vocal behaviour, human speech requires exquisite precision and coordination of the source and filter to execute a rapid stream of vowels and consonants that are recognisable and comprehensible to human listeners; within this, changes to the velocity of airflow, the rate and quality of vocal fold vibrations, and the rate and extent of articulations, can add variation in the perceived energy, linguistic focus, and emotional content of the spoken signal (broadly termed ‘prosody’). However, while acoustic correlates of voice quality—or timbre—within speech and other vocal behaviours can be measured and quantified, it has remained challenging to adequately relate these physical properties of vocal sounds to our complex perceptual experiences of voices [that is, to cross the ‘timbral abyss’ (Kreiman 2024)].
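In signal-processing terms, this source-filter account is often summarised as a simple spectral product. The formulation below is the standard textbook version (not taken from any of the works cited here): the radiated speech spectrum is the glottal source spectrum shaped by the vocal-tract transfer function and the lip-radiation characteristic.

```latex
% Standard textbook statement of the source-filter model (frequency domain):
% S(f): radiated speech spectrum      E(f): glottal source spectrum
% V(f): vocal-tract transfer function ('filter')
% R(f): lip-radiation characteristic
S(f) = E(f)\, V(f)\, R(f)
```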
The overall shape of the vocal apparatus, as well as the ways in which its moveable parts are engaged, varies widely across and within speakers. Between-speaker variations are strongly influenced by sexual maturity: children have shorter vocal tracts and vocal folds than adults, yielding voices that sound smaller and higher-pitched than adult voices. In adults, the effects of testosterone during male puberty mean that post-pubertal males have on average longer vocal tracts as well as longer and thicker vocal folds than adult females, with the perceptual consequence that males tend to sound larger and lower-pitched than females (Cartei & Reby 2013; Fitch & Giedd 1999). The language(s) and accent(s) spoken by talkers will add a host of additional differences in the physical dynamics of speech behaviours and their acoustic and perceptual correlates across groups of people. At the level of the individual, we can also see variation in the shape and the dynamics of vocal behaviours that underpin each speaker’s unique vocal character and repertoire (Lim et al. 2021). Moreover, the idiosyncrasies of a person’s speech are not fixed—speakers can volitionally modulate their speech in a variety of ways, including learning to speak additional languages, disguising their vocal identity (including, but not limited to, expert voice artistry), and dynamically adjusting speech depending on the acoustic, communicative, and social contexts (for example, Aziz-Zadeh et al. 2010, Cartei et al. 2012, 2019, Guldner et al. 2020, 2024, Hazan & Baker 2011, Hughes et al. 2014, Pisanski & Reby 2021, Sorokowski et al. 2019).
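One of the simplest acoustic correlates of these between-speaker differences, fundamental frequency (F0, perceived as pitch), can be estimated directly from a recording. The sketch below is illustrative only: it assumes the open-source librosa package and a hypothetical audio file, and a single median value of course flattens exactly the within-speaker variation discussed above.

```python
# Rough sketch of one acoustic correlate of between-speaker variation: the
# median fundamental frequency (F0) of a talker, estimated with the pYIN
# algorithm. Assumes the librosa package; the file name is hypothetical.
import numpy as np
import librosa

y, sr = librosa.load("talker_sample.wav", sr=None)

# pYIN returns a frame-wise F0 track, with NaN for unvoiced frames.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Adult male speakers typically centre near 100-130 Hz and adult females near
# 180-220 Hz, with children higher still, though overlap and within-speaker
# variation are substantial.
print("median F0 (Hz):", float(np.nanmedian(f0)))
```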
What are the implications of between-speaker and within-speaker variations for personalised voice synthesis? In terms of replicating voice qualities that can map onto a given speaker, the difficulty of the task increases with the degree of individuation required—thus, while it has for some time been very straightforward to convert or synthesise a voice to sound broadly like an adult male human, it becomes more complex to make that voice sound specifically like our first author’s oldest male friend from Derry in Northern Ireland when he’s in a bad mood. In the context of the state of the art, this largely reflects how models are trained and fine-tuned: in order to be able to replicate the diversity of speakers in the human population, the datasets upon which models are trained will need to include a fair amount of that diversity, and a biased model will produce biased outcomes (for example, the friend from Northern Ireland becomes cloned as a male-sounding speaker with a General American accent). The related issue, noted above, is that a perfect reproduction of a given speaker under all possible situations requires additional data about how that person might sound under different acoustic/communicative/social pressures, including both their linguistic habits (vocabulary, word choice, and syntactic and pragmatic preferences) and how these become manifest in their speech acoustics. For that, personalised voice synthesis models need more person-specific data, which requires substantially greater input from both speakers and the model developers.