KIParla – Corpus of Spoken Italian

What is KIParla?

KIParla is a new resource for the study of spoken Italian that is being developed by the LEAdHoC staff. The name of the corpus is the IPA transcription of Italian chi parla (“who’s speaking”); at the same time, its first three sounds can be read as the acronym of Corpus di Italiano Parlato (Corpus of Spoken Italian), since both <c> and <ch> are Italian spellings of the voiceless velar plosive [k]. The name points to the fact that the corpus offers a rich set of metadata, making it possible to identify every speaker and trace their features across turns. It is a collection of different types of recordings made in an academic setting, involving students and professors.

Link to the corpus: http://kiparla.it/

What are the aims of KIParla? – Since LEAdHoC is firmly grounded in the observation of phenomena occurring in spoken interaction, the main aim of the corpus is to provide a considerable amount of data for studying the different categorisation strategies that occur in spoken varieties. The construction of KIParla also pursues a long-term goal: to create an open access resource for the study of contemporary spoken Italian, one that both interacts with existing corpora and provides a state-of-the-art methodology for future data collection.

How to cite KIParla? – KIParla is made available under the Open Database License. Any rights in individual contents of the corpus are licensed under the Database Contents License. To cite KIParla, please use the following references:

Goria, E. and Mauri, C. (2017) Corpus KIParla: Corpus of Spoken Italian. http://www.leadhoc.org/index.php/data-access/corpus-of-spoken-italian/

Mauri, Caterina, Silvia Ballarè, Eugenio Goria, Massimo Cerruti & Francesco Suriano (2019) “KIParla corpus: a new resource for spoken Italian”. In: Bernardi, Raffaella, Roberto Navigli & Giovanni Semeraro (eds.), Proceedings of the 6th Italian Conference on Computational Linguistics CLiC-it.

Corpus design

The data for KIParla is being collected in two Italian cities: Torino and Bologna.

They were chosen to complement the cities represented in the LIP corpus, which includes some of the biggest administrative centres in Italy. Torino and Bologna are also consistent with each other from a sociolinguistic perspective, especially as regards the relationship between local dialects, regional varieties of Italian and standard Italian.

KIParla collects several types of interaction recorded at university. This makes it possible to define the most relevant features of the corpus with considerable accuracy:

Insofar as level of education is the main indicator of social class, KIParla is extremely homogeneous in this respect: it involves only speakers of higher social status, such as undergraduate students, graduate students and academic professors. Hence its chief characterisation as a corpus of educated speakers.
Other social variables are also represented, such as the age, gender and region of origin of the speakers.
As for the types of interaction, maximum heterogeneity has been sought. Several parameters have been regarded as crucial, in particular the level of formality between the speakers, the planned or unplanned nature of an interaction, and the presence of a moderator or of any conventions regulating turn-taking.

The expected size of the corpus is 70 hours, roughly corresponding to 700k words.

The types of interaction considered for KIParla are listed in detail below, based on the existing grid used for the LIP corpus:

– A1 Professor-student interaction during office hours
– A2 Guided group interaction, both in spontaneous contexts (student organisations, internship meetings, …) and in controlled contexts (guided focus groups organised for the purpose by the research group, concerning practical aspects of student life such as Erasmus destinations, housing in Torino and Bologna, …)
– A3 Random conversations recorded by in-group members (students and professors) without the direct involvement of the researcher
– C1 Professor-student interaction in oral examinations
– D1 Academic lessons
– D2 Semi-structured interviews collected by students within the peer-group and aimed at the elicitation of oral narratives

SITUATION / TASK	LENGTH (approx.)	NUMBER	TOTAL
Office hours	30′	6	180′
Group interaction	90′	5	450′
Random conversation	30′	6	180′
Examinations	60′	3	180′
Lessons	90′	8	720′
Interviews	30′	13	390′
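The row totals in the table can be checked with a short script. This is only a sketch: the figures are copied from the table, while the names SESSIONS and total_minutes are illustrative. The rows add up to 2100 minutes, i.e. 35 hours; the table does not state whether these figures refer to each city or to the corpus as a whole.

```python
# Session plan copied from the table:
# (task, approximate minutes per recording, number of recordings).
SESSIONS = [
    ("Office hours", 30, 6),
    ("Group interaction", 90, 5),
    ("Random conversation", 30, 6),
    ("Examinations", 60, 3),
    ("Lessons", 90, 8),
    ("Interviews", 30, 13),
]


def total_minutes(sessions):
    """Sum length x number of recordings over all rows."""
    return sum(length * count for _, length, count in sessions)


# Per-task totals, matching the last column of the table.
row_totals = {task: length * count for task, length, count in SESSIONS}
grand_total = total_minutes(SESSIONS)

for task, minutes in row_totals.items():
    print(f"{task}: {minutes} min")
print(f"Total: {grand_total} min = {grand_total / 60} h")
```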

Radio conversations on university-based web radio stations are also being monitored.

Ethical code

All our recordings are made after the researchers participating in the data collection have received specific training.

In all the recordings, the microphone is visible to all the participants and they are explicitly asked for permission to record. No button is pushed without consent.

All the participants are given an information sheet which briefly explains the aims of the research and tells them how to contact the project staff directly.

The participants are asked to sign a two-step consent form, in which they can authorise the researchers:

to use the recorded material for scientific purposes
to make it available for open access search

Metadata

KIParla is being collected with sociolinguistic studies in mind. This entails specific attention to different types of metadata, which are systematically collected for every participant. We ask for:

Age
Place of birth
Place where they attended high school (students) / where they last worked (professors)
Main occupation
Other occupations
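The fields listed above could be held in a simple record type. The following is a minimal sketch, not the project's actual data model: the class Participant, the field names, the speaker code and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Participant:
    """One speaker's metadata, following the list above (names are hypothetical)."""
    code: str                 # hypothetical anonymised speaker ID
    role: str                 # "student" or "professor"
    age: int
    place_of_birth: str
    main_occupation: str
    place_of_high_school: Optional[str] = None   # asked of students
    place_of_last_work: Optional[str] = None     # asked of professors
    other_occupations: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the role-specific fields that have not been filled in."""
        missing = []
        if self.role == "student" and not self.place_of_high_school:
            missing.append("place_of_high_school")
        if self.role == "professor" and not self.place_of_last_work:
            missing.append("place_of_last_work")
        return missing


# Invented example values, for illustration only.
speaker = Participant(
    code="S001",
    role="student",
    age=22,
    place_of_birth="Torino",
    main_occupation="student",
    place_of_high_school="Cuneo",
)
print(speaker.missing_fields())
```

Keeping the two role-specific places as separate optional fields mirrors the student/professor split in the list; a single "place" field with a type flag would work equally well.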

Data transcription and annotation

The recordings will be transcribed with ELAN, applying the Jefferson (2004) transcription conventions, as is common in Conversation Analysis.
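ELAN saves transcripts as XML files with the .eaf extension, so transcribed turns can be read with standard tools. The following is a minimal sketch, assuming the transcripts are available in that format (the source does not say how they are distributed). The sample document, the tier names and the text are invented for illustration, and read_turns is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified .eaf content: two speaker tiers, one turn each.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="1400"/>
    <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="2600"/>
  </TIME_ORDER>
  <TIER TIER_ID="SPK2">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts3" TIME_SLOT_REF2="ts4">
        <ANNOTATION_VALUE>sono io</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="SPK1">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>chi parla</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_turns(eaf_xml):
    """Return (start_ms, end_ms, tier_id, text) tuples sorted by start time."""
    root = ET.fromstring(eaf_xml)
    # Map each time-slot ID to its value in milliseconds.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    turns = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue  # skip annotations without time alignment
            text = ann.findtext("ANNOTATION_VALUE", default="")
            turns.append((start, end, tier.get("TIER_ID"), text))
    turns.sort(key=lambda turn: turn[0])
    return turns


turns = read_turns(SAMPLE_EAF)
for start, end, speaker_tier, text in turns:
    print(f"{start}-{end} ms  {speaker_tier}: {text}")
```

Sorting by start time puts the turns of different tiers into one chronological sequence, which also makes overlaps visible (here the second turn starts before the first one ends).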

An annotation grid is being developed in order to systematically encode all the constructions that are relevant for the aims of LEAdHoC (lists, connectives, general extenders, reformulation, …). This will provide the basis for the project database.