Language data in Australia - Mapping a conceptual landscape

The data which is being made accessible through the Language Data Commons of Australia will contribute to the task of documenting language use and language behaviour in Australia. But what does this include? We are aware of many data sources and, in the short term, the important questions are about priorities:

  • What will be immediately useful?
  • What data is stored precariously?
  • What relationships can be leveraged?

But in thinking for the long term, there is value in asking the question from a more conceptual point of view:

  • What were the possibilities for making records of language use over our history and what has resulted (and not resulted) from those possibilities?

I will try to at least start answering that question by providing a conceptual map of a metaphorical landscape, structured around two themes: demography and technology. Demography is important because the languages in use at any particular time depend on who was living in Australia at that time. On this dimension, the major point of articulation is the arrival of non-Indigenous people. The places of origin of those non-Indigenous people have changed over time, and that has had linguistic consequences, but not on the same scale as the initial change. Technology is important because the kinds of records which might exist of language use depend on the means which were available to make such records at a given time. On this dimension, I think there are three important points of articulation: the possibility of making written records, the possibility of recording sound and vision, and the possibility of digital records.

Before European contact, Australia had a very diverse linguistic ecology. Credible estimates of the number of distinct languages range from around 250 up to as many as 490, with many more dialects. Contact between language groups was common and therefore multilingualism and multidialectalism were also common. We know that there was contact between Indigenous Australians and fishermen from the Indonesian archipelago (especially Makassarese people) from the evidence of loan words. But the only record which existed of this period in the linguistic history of the continent was what was passed from one generation to another by oral transmission.

The presence of Europeans on the Australian continent brought a huge change to both dimensions of the map I am developing. Europeans landed on parts of Australia from some time in the 17th century, but I will take two dates in the 18th century to be crucial. Although earlier visitors may have recorded a few words which they heard from Indigenous Australians, written records of the languages only start properly (and even then to a limited extent) with Cook’s expedition in 1770, reflecting perhaps the scientific orientation of Joseph Banks. This is the first point of technological articulation in the language data landscape which introduced the possibility of making language records independent of human memory. And then from 1788, there has been a continuous non-Indigenous presence in Australia representing a huge demographic articulation point.

During the 19th century (and into the 20th century), written records of Australian languages were produced by a variety of people such as explorers, missionaries and administrators (in fact, pretty much anyone who could be bothered). These records are scattered and new material continues to be found (for example, Des Crump’s ongoing work in the State Library of Queensland and the Queensland State Archives). If you would like to get a flavour of some records of this kind, the Nyingarn project is making many of them available online. (Nyingarn builds on earlier work which presented the Indigenous language materials collected by Daisy Bates as an online resource.) However, the efforts of these early recorders were sporadic, uncoordinated and poorly focused. In 1945, Sidney Baker wrote: “Records of their languages are extremely deficient for instance, no exhaustive grammar of an aboriginal language has been published. There is no comprehensive or even partially comprehensive dictionary of reference to aboriginal dialects” (p218). And this record did not reflect the richness of the oral tradition or of language use. For example, it is very difficult to make an accurate record of conversation when contemporaneous writing is your only technological resource.

The non-Indigenous group who arrived in 1788 were the First Fleet, the first group of convicts transported from the United Kingdom with their jailers. English has had a permanent presence in Australia since that date. Written records are all that we have until the end of the 19th century, but they are extensive and a sample of them is available in the COrpus of Oz Early English (COOEE, Fritz 2007). This collection, and indeed the overall record, has the problems we expect to be associated with written sources. The authors are not a representative group, and the material is biased towards non-vernacular styles. Material such as Corbyn’s more-or-less verbatim accounts of court scenes in mid-century Sydney is uncommon (Corbyn 1854), and it would be of great value to those researching the development of Australian English if more informal material (personal letters, diaries) could be made accessible.

From 1788 on, speakers of non-Indigenous languages other than English have been present in Australia. Within the convict population (and then also amongst free settlers), a significant minority of the European population in Australia knew Irish. Speakers of other languages were occasionally present in the first part of the 19th century, and then, after the discovery of gold in the middle of the century, speakers of many languages, including major European languages and Sinitic languages, were present in Australia. The written records of these languages, represented by periodical publications, are diverse. But before turning to those records, a few words about the relative absence of Irish in the written record.

As mentioned, a large proportion, probably around one third, of those who arrived in Australia from the United Kingdom were Irish. Many had some knowledge of the language, even up to half of the Irish immigrants to Victoria according to Noone (2012). But some Irish transportees were political prisoners and use of the language was viewed with suspicion; speaking Irish could even be construed as a subversive act. There is evidence that the language continued to be spoken: Irish-speaking priests were needed to hear confessions, and interpreters were used in court occasionally. The written record, however, is limited. As O’Farrell (1988) points out, Irish was not even used on tombstones, one place where Irish could be used without consequences. A bilingual magazine was published in Melbourne in the 1920s, but, rather than continuing a tradition, this is a manifestation of the Gaelic revival.

Other non-Indigenous languages were also present from quite early in the post-invasion period, but increasingly so after the gold rushes of the mid-19th century. The crucial evidence is the record of newspapers published in such languages, and here I rely on the research of Tim Sherratt using the resources of the National Library of Australia. German and Chinese were both well represented by published material from at least the middle of the 19th century, and French and Italian were also present. Most of us probably think of Greek migration to Australia as a post-WW2 phenomenon, but publication in Greek started in 1931. Greek and Italian had important newspaper publications in the second half of the 20th century, as did Chinese, but rather later.

I have suggested that technology is an important factor in mapping out this data landscape and in considering this it is important to track not just the availability of technologies but also who controlled them and to what ends. The interplay of these considerations is evident in the informal and unsystematic approach to making records of Indigenous languages described previously, and it is also evident in the almost complete lack of data on contact varieties which have existed at various times in Australia. Early contact between Europeans and Indigenous people led to the use of pidgins, but there are only minimal accounts of these in early records. Aboriginal Englishes are a range of contact varieties which are in use and still developing today, while Kriol is a contact language spoken in northern Australia. Good records of any of these varieties only began to be made in the 1980s. At least two contact varieties developed in specific social settings. Queensland Kanaka English originated with the presence of approximately 60,000 Melanesian agricultural workers in Queensland between 1860 and 1906, and Broome Pidgin developed when pearl divers from Asia, for whom Malay was a lingua franca, worked in Broome between about 1900 and 1930. In both cases, we know almost nothing except that these varieties existed and were used. As in the case of Indigenous languages, these varieties were of little interest to those who controlled the technology used to make records.

The second point of technological articulation is the possibility of making records of sounds and images. These technologies make it possible to document the sounds of speech, and then movement including gesture. Audio recording technologies were developed in the latter part of the 19th century and the earliest sound recording in the National Film and Sound Archive (NFSA) catalogue is from 1888. The first recordings of speech (‘comic monologue’) are catalogued as c1896. The first NFSA catalogue entry for an Australian language is for 1899; the material is a selection from the three cylinder recordings of 1899 of Fanny Cochrane Smith, who claimed to be ‘the last of the [Aboriginal] Tasmanians’. The earliest catalogue entry at the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS) is for 1898; these are recordings from the Cambridge Anthropological Expedition to Torres Straits (1898). There is a continuing audio record for Australian languages from this point on. Much of this material is curated by AIATSIS, but there is also material elsewhere.

Although audio recordings of English in Australia begin in the 19th century, the first systematic set of materials recorded with the intention of documenting Australian English was not made until 1959-60 when A.G. Mitchell and Arthur Delbridge carried out a survey of the speech habits of young Australians. Of course, recordings of Australians speaking exist from earlier, but it is not obvious where they might be. Again, the questions about who controls technology are relevant: who would have an interest in a) making recordings and b) preserving them? Two answers to these questions seem interesting, and they are also relevant in the case of other non-Indigenous languages. Firstly, media organisations make recordings and may preserve them, and secondly, preservation of recorded material is an important aim for oral historians. For English, the ABC has a substantial archive including radio recordings starting in 1932 and television material starting in 1956. The ABC treats most of this material as a commercial resource and therefore access and use conditions are not simple - but there is an enormous amount of material there. And since at least the 1980s, considerable amounts of material have been collected by oral historians, an example which shows the benefit of looking for language data beyond what linguists have collected. For other non-Indigenous languages, SBS Radio has existed since 1975 and broadcasts in 68 languages today. The extent of their archive and how it might be accessed are questions we are exploring. Oral history is also likely to be an important source of data for these languages. A range of material from informal recordings to oral history interviews exists, mostly held by individuals but community associations may be a gateway.

Technology to record moving images developed a little later than audio recording. The NFSA catalogue does not specify whether material is silent or has sound, and I have yet to establish the earliest video record of Indigenous language stored by that institution. At AIATSIS, the earliest materials are recordings made at Ernabella by Norman Tindale in 1933, consisting of silent film with an accompanying wax cylinder audio recording. The first clear instance of film of Indigenous language use which I have traced to date is a film called ‘Aborigines of the sea coast’, produced by the Australian Commonwealth Film Unit in 1948. This film is a record of a 1948 expedition to Arnhem Land led by anthropologist Charles Mountford. It depicts the ancestral fishing, hunting, building and boatmaking techniques used by the communities of the region. For English, the NFSA has a substantial collection of filmed material created in Australia and again the ABC archives are potentially a valuable source, especially for non-scripted material (such as interviews and the like).

The digital revolution of the last few decades is the third point of technological articulation and it has fundamentally altered our relationship to language data. This is true both for how we acquire and handle data and for what data is available. Pre-digital audio and video content required expensive equipment, the recorded media had to be stored very carefully, and using the recordings caused them to deteriorate. Today, cheap (and very portable) equipment can produce excellent results for multimodal data, the resulting data can be replicated, edited and disseminated easily and without degradation, and any problems relating to storage of data are (largely) general ones. How we collect and view written data has also changed in various ways. Optical Character Recognition (OCR) can transform the printed record into machine-readable text (and can increasingly handle handwritten material), and new genres of writing have come into existence. These developments mean that the amount of data available is enormous and continues to expand. These large bodies of data would be intractable using traditional methods; fortunately, new tools have also been developed which allow us to analyse large collections of data.

As a result of these changes, we face new and different problems in how we approach language data. Finding data is simple, but choosing which part of the data, and how much of it, we should acquire may not be. Control of data has become complex now that the definition of Public Domain has to accommodate new modes of dissemination and large collections of data are considered commercial assets.

Even the question of what should be considered ‘Australian language data’ does not have an obvious answer in this digital world. Does the language of an Australian resident from a South Asian background contributing to social media in Sri Lanka come under the term? What about the tweets of someone who was born in Australia and grew up here but has lived in Europe for a number of years? The landscape of this latest phase in language data is still emerging, and we will need good answers to these questions (and many others) before we can map the new landscape clearly.

This post is based on presentations given to the La Trobe University Linguistics Program (27 May 2021) and to the Monash University Linguistics Program (10 May 2022). I am grateful for helpful comments from both those audiences, and for comments on the draft of this post from Leah Gustafson, Sara King and Harriet Sheppard.