[ki’parla] – Corpus of Spoken Italian

What is [ki’parla]?

[ki’parla] is a new resource for the study of spoken Italian, currently being developed by the LEAdHoC staff. The name of the corpus is the IPA transcription of Italian chi parla (“who’s speaking”); at the same time, its first three sounds can be read as the acronym of Corpus di Italiano Parlato (Corpus of Spoken Italian), since both <c> and <ch> are Italian spellings of the voiceless velar plosive [k]. The name points to the fact that the corpus offers a rich set of metadata, making it possible to identify every speaker and trace his or her features across turns. The corpus is a collection of different types of recordings made in academic settings, involving students and professors.

What are the aims of [ki’parla]?

Since LEAdHoC relies heavily on the observation of phenomena occurring in spoken interaction, the main aim of the corpus is to provide a considerable amount of data for studying the different categorisation strategies that occur in spoken varieties. The construction of [ki’parla] also pursues a long-term goal: to create an open access resource for the study of contemporary spoken Italian, one that both interacts with existing corpora and provides a state-of-the-art methodology for future data collection.

How to cite [ki’parla]?

[ki’parla] is made available under the Open Database License. Any rights in individual contents of the corpus are licensed under the Database Contents License. To cite [ki’parla], please use the following references:

Goria, E. and Mauri, C. (2017) Corpus [ki’parla]: Corpus of Spoken Italian. http://www.leadhoc.org/index.php/data-access/corpus-of-spoken-italian/

Goria, E. and Mauri, C. (2017) Corpus [ki’parla]: verso la costruzione di un corpus di italiano parlato. MS. University of Bologna.

Corpus design

The data for [ki’parla] is being collected in two Italian cities: Torino and Bologna.

They were chosen to complement the cities represented in the LIP corpus, which includes some of the biggest administrative centres in Italy. Torino and Bologna are also consistent with each other from a sociolinguistic perspective, especially as regards the relationship between local dialects, regional varieties of Italian and standard Italian.

[ki’parla] collects several types of interaction recorded at university. This makes it possible to define the most relevant features of the corpus with considerable accuracy:

  • Insofar as the level of education represents the main indicator of social class, [ki’parla] is extremely homogeneous with respect to this feature, in that it involves only speakers of higher social status, such as undergraduate students, graduate students and academic professors. Hence its chief characterisation as a corpus of educated speakers.
  • Other social variables are also represented, such as the age, gender and region of origin of the speakers.
  • As for the types of interaction, maximum heterogeneity has been sought. Several parameters have been regarded as crucial, in particular: the level of formality between the speakers, the planned or unplanned nature of an interaction, and the presence of a moderator or of any conventions regulating turn-taking.

The expected size of the corpus is 70 hours, roughly corresponding to 700k words.

The types of interaction considered for [ki’parla] are listed in detail here; they are based on the existing grid used for the LIP corpus:

A1 Professor-student interaction during office hours
A2 Guided group interaction, both in spontaneous contexts (student organisations, internship meetings, …) and in controlled contexts (guided focus groups organised on purpose by the research group, concerning material aspects of the students’ lives such as Erasmus destinations, housing in Torino and Bologna, …)
A3 Random conversations recorded by in-group members (students and professors) without the direct involvement of the researcher
C1 Professor-student interaction in oral examinations
D1 Academic lessons
D2 Semi-structured interviews collected by students within the peer-group and aimed at the elicitation of oral narratives

Situation / task       Length (approx.)   Number   Total
Office hours           30′                6        180′
Group interaction      90′                5        450′
Random conversation    30′                6        180′
Examinations           60′                3        180′
Lessons                90′                8        720′
Interviews             30′                13       390′
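
The totals in the table can be checked with a few lines of Python. This is only an arithmetic sketch: the names PLAN and total_minutes are illustrative and not part of the project. The rows add up to 2100 minutes, that is, 35 hours. The 70-hour target would follow if the same plan were carried out in each of the two cities, but that reading is an assumption and is not stated in the corpus design.

```python
# Planned recordings, copied from the table above:
# (situation, approximate length in minutes, number of recordings)
PLAN = [
    ("Office hours", 30, 6),
    ("Group interaction", 90, 5),
    ("Random conversation", 30, 6),
    ("Examinations", 60, 3),
    ("Lessons", 90, 8),
    ("Interviews", 30, 13),
]


def total_minutes(plan):
    """Sum length * count over all rows of the plan."""
    return sum(length * count for _, length, count in plan)


minutes = total_minutes(PLAN)
hours = minutes / 60
print(f"{minutes} minutes = {hours:g} hours")  # prints "2100 minutes = 35 hours"
```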

Conversations broadcast by university-based web radio stations are also being monitored.

Ethical code

All our recordings are made only after the researchers taking part in the data collection have received specific training.

In all the recordings, the microphone is visible to all the participants, who are explicitly asked for permission to record. Recording never starts without consent.

All the participants are handed an information sheet which briefly explains the aims of the research and tells them how to contact the project staff directly.

The participants are asked to sign a two-step consent form, in which they can authorise the researchers:

  1. to use the recorded material for scientific purposes
  2. to make it available for open access search

Metadata

[ki’parla] is being collected so as to be suitable for sociolinguistic studies. This entails specific attention to different types of metadata, which are systematically collected for every participant. We ask for:

  • Age
  • Place of birth
  • Place where they attended high school (students) / where they last worked (professors)
  • Main occupation
  • Other occupations
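
As an illustration of how such a record could be stored, here is a minimal Python sketch. It is not the project's actual schema: the class name, the field names, the speaker_id code and the role field are all hypothetical, chosen only to mirror the items in the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SpeakerMetadata:
    """One participant's metadata (hypothetical field names)."""
    speaker_id: str                           # hypothetical anonymised code
    role: str                                 # "student" or "professor"
    age: Optional[int] = None
    place_of_birth: Optional[str] = None
    high_school_place: Optional[str] = None   # asked of students
    last_workplace: Optional[str] = None      # asked of professors
    main_occupation: Optional[str] = None
    other_occupations: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the requested items that have not been filled in yet."""
        required = ["age", "place_of_birth", "main_occupation"]
        if self.role == "student":
            required.append("high_school_place")
        else:
            required.append("last_workplace")
        return [name for name in required if getattr(self, name) is None]


# Invented example values, for illustration only.
speaker = SpeakerMetadata("S001", "student", age=22,
                          place_of_birth="Torino", main_occupation="student")
print(speaker.missing_fields())  # prints "['high_school_place']"
```

Keeping the student-only and professor-only items as separate optional fields makes it easy to spot incomplete records before a recording is added to the corpus.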

Data transcription and annotation

The recordings will be transcribed with ELAN, applying the Jefferson (2004) transcription conventions, as is common in Conversation Analysis.
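
ELAN saves transcriptions as XML files (.eaf), with one tier per layer of annotation and time slots expressed in milliseconds. The sketch below shows how speaker turns could be read back from such a file with Python's standard library. It assumes one tier per speaker and fully time-aligned annotations; the speaker codes and the sample content are invented, and the sample is trimmed to the elements the function needs rather than being a complete EAF file.

```python
import xml.etree.ElementTree as ET

# Invented, trimmed-down EAF content (not real corpus data).
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1200"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="2600"/>
  </TIME_ORDER>
  <TIER TIER_ID="S001" PARTICIPANT="S001">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>chi parla?</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="S002" PARTICIPANT="S002">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts3" TIME_SLOT_REF2="ts4">
        <ANNOTATION_VALUE>io</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_turns(eaf_text):
    """Return (start_ms, end_ms, speaker, text) tuples sorted by start time.

    Assumes every time slot carries a TIME_VALUE (fully aligned annotations).
    """
    root = ET.fromstring(eaf_text)
    slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iter("TIME_SLOT")}
    turns = []
    for tier in root.iter("TIER"):
        # Fall back to the tier name when no participant is recorded.
        speaker = tier.get("PARTICIPANT") or tier.get("TIER_ID")
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots[ann.get("TIME_SLOT_REF1")]
            end = slots[ann.get("TIME_SLOT_REF2")]
            text = ann.findtext("ANNOTATION_VALUE", default="")
            turns.append((start, end, speaker, text))
    return sorted(turns)


for start, end, speaker, text in read_turns(SAMPLE_EAF):
    print(f"{start:>6}-{end:<6} {speaker}: {text}")
```

Because each turn keeps its speaker code, the turns can later be joined with the participant metadata to follow a given speaker across an interaction.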

An annotation grid is being developed in order to systematically encode all the constructions that are relevant for the aims of LEAdHoC (lists, connectives, general extenders, reformulation …). This will provide the basis for the project database.