Corpus linguistics represents the main focus area of our research group.

We aim to build and process electronic POS-tagged corpora (using the CQP format of the IMS Stuttgart Corpus Workbench), thereby providing freely available textual data and fine-grained querying at the morphosyntactic, lexical, and textual levels.
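As a rough illustration of what POS-tagged data of this kind looks like, the sketch below parses a tiny sample in the one-token-per-line "vertical" layout that the Corpus Workbench takes as input, then filters tokens by tag. The sample sentence, the column order (word, POS, lemma), and the tag names are assumptions made for illustration only; they are not taken from our corpora or tagsets.

```python
from typing import Iterator, List, NamedTuple

# Hypothetical sample: one token per line, tab-separated columns
# (word, POS tag, lemma). Lines starting with "<" are structural markup.
SAMPLE = """<s>
il\tDET\til
gatto\tNOUN\tgatto
dorme\tVERB\tdormire
</s>
"""


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def read_vertical(text: str) -> Iterator[Token]:
    """Yield tokens from vertical text, skipping blank and markup lines.

    Simplification: any line starting with "<" is treated as markup,
    so a literal "<" token would be skipped too.
    """
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("<"):
            continue
        word, pos, lemma = line.split("\t")
        yield Token(word, pos, lemma)


def by_pos(tokens: List[Token], pos: str) -> List[Token]:
    """Return the tokens whose POS tag equals `pos`."""
    return [t for t in tokens if t.pos == pos]


tokens = list(read_vertical(SAMPLE))
nouns = by_pos(tokens, "NOUN")
print([t.word for t in nouns])
```

In CQP itself, a comparable morphosyntactic query would be written as `[pos="NOUN"]`, assuming the corpus declares a positional attribute named `pos`.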

The first working phase was launched in the framework of a FIRB national project devoted to “Italian language in the variety of texts” (2001). The group was then involved in many projects supported by Italian funding bodies:

  • CNR Promozione Ricerca Giovani 2005, an initiative by the National Research Council promoting the construction of our text-structure oriented legal corpus;
  • FIRB 2006 “RIDIRE - Risorsa Italiana Dinamica di Rete”, aimed at compiling, through web crawling, a repository of the Italian language drawn from content on the Internet;
  • “VALERE - Formal Varieties in Newsgroups of European Languages: Structural Features, Interlinguistic Comparison and Teaching Applications”, funded by Regione Piemonte in 2009.


The main resources we make available are:

  • Corpus Taurinense (CT): a corpus of 13th-century Florentine texts, POS-tagged according to an expressly devised EAGLES tagset;
  • NewsgroupsUseNet Corpora (NUNC): a multilingual suite of corpora based on the language of newsgroups and consisting of different specialised subcorpora;
  • Athenaeum: a corpus of academic texts produced by the University of Turin, classified by topic and text genre;
  • Corpus Segusinum: a corpus of Italian regional newspaper texts;
  • Jus Jurium: a text-structure oriented legal corpus that aims to cover the entire legal universe of contemporary Italy, with textual and diplomatic markup (in progress).

The group cooperates internationally with the Universities of Erlangen and Stuttgart.


Last updated: 10/10/2017 09:30