Word Formation Latin
Welcome to the Word Formation Latin (WFL) website
WFL received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 658332-WFL. The project was based at the Centro Interdisciplinare di Ricerche per la Computerizzazione dei Segni dell’Espressione (CIRCSE) at the Università Cattolica del Sacro Cuore, Milan, Italy. It ran from November 2015 to the end of October 2017 and resulted in the publication of a word-formation-based lexicon, which is accessible digitally through its own website (http://wfl.marginalia.it) and in connection with the newest version of Lemlat, the morphological analyser and lemmatiser for Latin (http://www.lemlat3.eu).
The project
In the past two decades there has been a considerable increase in the creation of computational linguistic resources for the investigation of classical languages, which has brought the state of the art almost to the same level as that of the resources currently available for modern languages. These resources include annotated corpora, treebanks, computational lexica, and digital libraries. Alongside these language resources there are NLP tools, such as morphological analysers, part-of-speech taggers, and syntactic parsers.
The WFL project consists in the compilation of a derivational morphological dictionary of the Latin language, which connects lexical elements on the basis of word-formation rules. Lemmas are segmented and analysed into their derivational morphological components, so as to establish relationships between them on the basis of word formation. For example, the verbal noun amator can be connected back to the verb amo through suffixation with -a-tor.
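The amator example can be made concrete with a small sketch. The Python below is illustrative only and does not reproduce WFL's actual data model: the class names `WordFormationRule` and `Derivation`, the field names, the part-of-speech labels, and the choice of `am` as the base stem are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordFormationRule:
    """A hypothetical word-formation rule (not WFL's real schema)."""
    kind: str        # e.g. "suffixation"
    affix: str       # written with segmentation hyphens, e.g. "-a-tor"
    input_pos: str   # part of speech of the base, e.g. "V"
    output_pos: str  # part of speech of the derived lemma, e.g. "N"


@dataclass(frozen=True)
class Derivation:
    """Links a derived lemma back to its base lemma through one rule."""
    base: str
    derived: str
    rule: WordFormationRule


def apply_suffix(stem: str, rule: WordFormationRule) -> str:
    """Attach the rule's suffix to a stem, dropping the segmentation hyphens."""
    if rule.kind != "suffixation":
        raise ValueError(f"unsupported rule kind: {rule.kind}")
    return stem + rule.affix.replace("-", "")


# Assumption: the stem of amo is taken to be "am" for this illustration.
a_tor = WordFormationRule(kind="suffixation", affix="-a-tor",
                          input_pos="V", output_pos="N")
amator = Derivation(base="amo", derived=apply_suffix("am", a_tor), rule=a_tor)

print(amator.base, "->", amator.derived)
```

Storing the rule on the link, rather than on either lemma, is one way to keep the same rule reusable across many base and derived pairs.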
A first attempt at constructing a lexicon based on word-formation for Latin was made by Marco Passarotti and Francesco Mambrini in 2012 [M. Passarotti & F. Mambrini, First Steps towards the Semi-automatic Development of a wordformation-based Lexicon of Latin, in Proceedings of LREC 2012, Istanbul, Turkey, 852-859], when they published a paper proposing a model for the semi-automatic extraction of word formation rules and the subsequent pairing of lemmas to their morphologically simplest lemma (i.e. non-derived). WFL is expanding on this first attempt and will result in a definitive linguistic resource.
The WFL project has three main aims:
1. the enrichment of an existing morphological analyser for the Latin language, LEMLAT, [ Passarotti, M. (2004). “Development and perspectives of the Latin morphological analyser LEMLAT”. In A. Bozzi, L. Cignoni & J.L. Lebrave (Eds.), Digital Technology and Philological Disciplines. Linguistica Computazionale, XX-XXI, pp. 397-414.] with word-formation information, and the integration of the data within an interface similar to Word Manager [Domenig, M. & ten Hacken, P. (1992). Word Manager: A system for morphological dictionaries. Hildesheim: Olms.], which has already been applied to modern languages (English, German, Italian);
2. the integration of the information extracted from the resulting derivational morphological dictionary into the morphological layer of annotation of the Index Thomisticus Treebank (IT-TB). The Index Thomisticus (IT), started by Padre Roberto Busa in 1949, is considered a pathfinder in digital humanities. It is a database containing the opera omnia of Thomas Aquinas (118 texts), plus works by 61 other authors related to Thomas (61 texts). The corpus comprises around 11 million tokens (150,000 types; 20,000 lemmas) and is fully lemmatised and morphologically tagged. The IT-TB, based at CIRCSE, is the syntactically annotated portion of the IT, and it contains around 300,000 tokens in 15,000 syntactically parsed sentences. The morphological layer records the lemmatisation and the morphological features (PoS, gender, number, tense, etc.) of each word in the base text;
3. the publication of the project results on a user-friendly project website, which will display the derivational morphological dictionary through a web-based search interface. This will allow the lexicon to be accessed:
- by single lexical entry, which will show both its ancestors and the words derived from it;
- by morphological family, i.e. the set of lemmas morphologically derived from one common ancestor-lemma;
- by word-formation rule (WFR).
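The three access modes listed above can be sketched over a toy lexicon. This is a minimal illustration under stated assumptions, not the WFL search interface: the function names (`entry`, `family`, `pairs_for_rule`), the rule labels, and the three sample derivations are hypothetical and are not taken from the WFL data.

```python
from collections import defaultdict

# Illustrative data only: (derived lemma, base lemma, rule label).
# The rule labels are hypothetical and are not taken from the WFL lexicon.
DERIVATIONS = [
    ("amator", "amo", "V-to-N -a-tor"),
    ("amabilis", "amo", "V-to-A -bilis"),
    ("amatorius", "amator", "N-to-A -ius"),
]

parent = {derived: base for derived, base, _ in DERIVATIONS}
children = defaultdict(list)
by_rule = defaultdict(list)
for derived, base, rule in DERIVATIONS:
    children[base].append(derived)
    by_rule[rule].append((base, derived))


def ancestors(lemma):
    """Walk up from a lemma to its non-derived ancestor, nearest first."""
    chain = []
    while lemma in parent:
        lemma = parent[lemma]
        chain.append(lemma)
    return chain


def entry(lemma):
    """Access by single lexical entry: ancestors plus directly derived words."""
    return {
        "lemma": lemma,
        "ancestors": ancestors(lemma),
        "derived": list(children.get(lemma, [])),
    }


def family(lemma):
    """Access by morphological family: every lemma under the common ancestor."""
    chain = ancestors(lemma)
    root = chain[-1] if chain else lemma
    members, stack = [], [root]
    while stack:
        current = stack.pop()
        members.append(current)
        stack.extend(children.get(current, []))
    return sorted(members)


def pairs_for_rule(rule):
    """Access by WFR: every (base, derived) pair linked by the given rule."""
    return list(by_rule.get(rule, []))


print(entry("amator"))
print(family("amatorius"))
print(pairs_for_rule("V-to-N -a-tor"))
```

Indexing the same list of derivations three ways keeps each lookup a dictionary access or a short graph walk, at the cost of holding the indexes in memory.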
The project relies on building the linguistic resource automatically, both in the creation of the WFRs and in their application to the lexical items included in the morphological analyser LEMLAT.
The final resource will be both a standalone dictionary accessible through its own website, and interconnected with the Index Thomisticus Treebank (IT-TB).
The integration with the IT-TB will be carried out by embedding the dictionary data within the morphological layer of annotation of the treebank, using TEI (Text Encoding Initiative) P5-conformant XML encoding to favour data exchange and linking to other lexical resources. The data resulting from the dictionary, once encoded in XML, will be applied to the IT-TB data.
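To give a rough idea of what a TEI-style dictionary entry carrying derivational information might look like, the sketch below builds one with Python's standard `xml.etree.ElementTree`. It is an assumption-laden illustration, not the project's actual encoding: the element names (`entry`, `form`, `orth`, `gramGrp`, `pos`, `etym`, `mentioned`, `lbl`) come from the TEI P5 vocabulary, but their arrangement, the `type` values, the helper name `tei_entry`, and the part-of-speech label are hypothetical, and the output has not been validated against the TEI schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialise TEI elements in the default namespace (no prefix).
ET.register_namespace("", TEI_NS)


def tei(tag):
    """Return the namespace-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def tei_entry(lemma, pos, base, rule_label):
    """Build one hypothetical TEI-style entry linking a lemma to its base."""
    entry = ET.Element(tei("entry"))
    form = ET.SubElement(entry, tei("form"), {"type": "lemma"})
    ET.SubElement(form, tei("orth")).text = lemma
    gram = ET.SubElement(entry, tei("gramGrp"))
    ET.SubElement(gram, tei("pos")).text = pos
    # Assumption: derivational information is placed in <etym>.
    etym = ET.SubElement(entry, tei("etym"), {"type": "derivation"})
    ET.SubElement(etym, tei("mentioned")).text = base
    ET.SubElement(etym, tei("lbl")).text = rule_label
    return entry


xml_text = ET.tostring(
    tei_entry("amator", "N", "amo", "suffixation -a-tor"),
    encoding="unicode",
)
print(xml_text)
```

Whatever the final layout, keeping the base lemma and the rule label in separate elements makes it straightforward to link each token in the treebank to both.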
The results of the project work will be offered via a user-friendly website, which will display the derivational morphological dictionary through a web-based search interface.
The WFL team consisted of:
- Eleonora Litta Modignani Picozzi, MSCA Research Fellow;
- Marco Passarotti, Project Supervisor.
Documentation:
The WFL documentation is kept and updated at the WFL GitHub: https://github.com/CIRCSE/WFL.
Publications:
The present documentation on WFL is partly covered by the following publications, which result from the work done during the MSCA Fellowship:
- Budassi, Marco, Eleonora Litta, and Marco Passarotti. 2017. ‘-io Nouns through the Ages. Analysing Latin Morphological Productivity with Lemlat’. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), 65–70. aAccademia University Press, Roma. http://www.aaccademia.it/component/search/?searchword=clic-it&searchphrase=all&Itemid=118
- Budassi, Marco, and Eleonora Litta. 2017. ‘In Trouble with the Rules. Theoretical Issues Raised by the Insertion of -sc- verbs into Word Formation Latin’. In Proceedings of the Workshop on Resources and Tools for Derivational Morphology (DeriMo), 15–26. Milan: Educatt. http://itreebank.marginalia.it/doc/2017_Litta-Passarotti_Proceedings-DeriMo.pdf
- Culy, Chris, Eleonora Litta, and Marco Passarotti. 2017. ‘Visual Exploration of Latin Derivational Morphology’. In Proceedings of the Thirtieth International Florida Artificial Intelligence Research Society Conference. Marco Island, Florida. May 22–24, 2017, 601–6. Palo Alto, California - USA: The AAAI Press. https://www.aaai.org/Library/FLAIRS/flairs17contents.php
- Litta, Eleonora, and Marco Passarotti. 2017. ‘Preface’. In Proceedings of the Workshop on Resources and Tools for Derivational Morphology (DeriMo). Milan: Educatt. http://itreebank.marginalia.it/doc/2017_Litta-Passarotti_Proceedings-DeriMo.pdf
- Litta, Eleonora, Marco Passarotti, and Paolo Ruffolo. 2017. ‘Node Formation: Using Networks to Inspect Productivity in Affixal Derivation in Classical Latin’. In Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage, 103–8. DATeCH2017. New York, NY, USA: ACM. doi:10.1145/3078081.3078092.
- Litta, Eleonora, Marco Passarotti, and Chris Culy. 2016. ‘Formatio Formosa Est. Building a Word Formation Lexicon for Latin’. In Third Italian Conference on Computational Linguistics (CLiC–it 2016), 185–89. Naples: aAccademia University Press. http://www.aaccademia.it/component/search/?searchword=CliC-it 2016&searchphrase=all&Itemid=118.
- Micheli, Silvia, and Eleonora Litta. 2017. ‘E pluribus unum. Representing compounding in a derivational lexicon of Latin’. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), 65–70. aAccademia University Press, Roma. http://www.aaccademia.it/component/search/?searchword=clic-it&searchphrase=all&Itemid=118.
- Passarotti, Marco, Marco Budassi, Eleonora Litta, and Paolo Ruffolo. 2017. ‘The Lemlat 3.0 Package for Morphological Analysis of Latin’. In Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language, 24–31. Linköping University Electronic Press. http://www.ep.liu.se/ecp/article.asp?issue=133&article=006&volume=.