PaWaC - Public Administration Web as Corpus

The PaWaC corpus was designed and developed by the University of Pisa within the Tuscan regional project SEMPLICE (SEMantic Instruments for PubLIc Administrators and CitizEns), which involved several local SMEs and the University of Pisa. It is composed of 4,172 documents, corresponding to 3,043,842 sentences and 25,218,385 tokens. The corpus gathers a wide range of administrative acts (resolutions, circular letters, etc.) representative of the Italian used in public administration, and is freely available for research purposes. It is distributed in two versions: 147 MB of raw text, and 648 MB of automatically annotated text (morpho-syntactic and lemma information) in CoNLL format.
The corpus was collected by crawling the web sites of 277 Tuscan municipalities.
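
The annotated version can be read with a few lines of Python. The sketch below is only an illustration and rests on assumptions not stated in this description: that the files follow the common CoNLL-X layout (one token per line, tab-separated columns, blank line between sentences) with the word form, lemma and coarse part-of-speech tag in the second, third and fourth columns. The names `Token`, `read_conll` and `SAMPLE`, as well as the sample tags, are hypothetical; check the column order against the actual files before relying on it.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    form: str   # surface word form (assumed column 2)
    lemma: str  # lemma (assumed column 3)
    pos: str    # coarse part-of-speech tag (assumed column 4)


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Token per sentence from CoNLL-style lines."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) < 4:
            # Skip lines that do not match the assumed layout.
            continue
        sentence.append(Token(form=cols[1], lemma=cols[2], pos=cols[3]))
    if sentence:
        # Emit the last sentence if the input lacks a trailing blank line.
        yield sentence


# Made-up sample in the assumed layout, not taken from the corpus.
SAMPLE = (
    "1\tIl\til\tR\n"
    "2\tcomune\tcomune\tS\n"
    "\n"
    "1\tDelibera\tdeliberare\tV\n"
)

sentences = list(read_conll(SAMPLE.splitlines()))
for sent in sentences:
    print(" ".join(f"{tok.form}/{tok.lemma}/{tok.pos}" for tok in sent))
```

For the full 648 MB file, passing an open file object to `read_conll` instead of a list keeps memory use low, since sentences are produced one at a time.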