The UPSKILLS Learning Content

Processing texts and corpora

🕑 5 ECTS (plus a student project worth an additional 1.5 ECTS)

Description and scope

This learning block combines basic notions and techniques of text processing with corpus linguistics skills, such as creating and managing corpora and using corpus query tools for simple and advanced queries.

The philosophy behind the block is that students should take an active approach to learning. While several manuals and online courses exist in this area, including the DigiLing materials that we have reused here, we are not aware of any that address non-specialist BA students using this approach. The research-oriented, autonomy-fostering view of learning embodied by these materials therefore makes them complementary to the more technical, information-rich, but inevitably instructor-centred courses on text and corpus processing currently available.

The structure of the course was designed with input from the UPSKILLS needs analysis. For this reason, besides building technical and theoretical skills, the course also aims to develop students' research, organizational and interactional skills, which are of primary importance for the new generations of language specialists. This learning block therefore takes a largely practical, research-based approach: it links theoretical and technical skills to real-life working situations and aims to spark students' curiosity and interest and to nurture their critical thinking.

The block’s units are modular: lecturers and learners can take (or adapt) them either in sequence or as stand-alone units, depending on their needs.

Disclaimer: Though it might seem obvious, processing text requires profound knowledge of the language(s) in which the text is written. For this reason, the focus of the course is limited to the languages known to the course creators: the examples and methods described here are based on Western languages written from left to right in the Latin alphabet. Any extension to other languages will have to be evaluated on a case-by-case basis by experts in those languages.

Block outline

(The overall workload for the first 6 units of this block amounts to 5 ECTS.)

  1. Why process texts?  
  2. Basics of text processing
  3. Corpus design and construction
  4. Corpus annotation
  5. Corpus consultation
  6. Corpus types, research priorities and applications
  7. Student project – Speaking of consequences: good, bad, neutral? (1.5 ECTS)

Learning outcomes

Overall, the materials and activities in this block will allow students to:

  • distinguish between different data formats and extract text content from them;
  • explain the specifics of plain text;
  • apply the principles of text annotation, incl. morphological analysis and syntactic and semantic tagging, and recognise the issues involved;
  • utilise available resources for text processing and analysis, incl. taggers and concordancers;
  • analyse existing as well as their own corpora with a variety of available corpus-based tools;
  • build complex CQL queries, incl. regular expressions and logical operators;
  • use text processing to address language and discourse research issues.

Target audience

The primary target audience is lecturers who teach, or want to teach, text processing and corpora in linguistics, translation and other language-related fields. Students can also use the materials autonomously, but should be aware that this is not a typical self-study course.

Creative Commons License

This UPSKILLS learning content block is licensed under a Creative Commons Attribution 4.0 International License.

Block designers

Silvia Bernardini

Adriano Ferraresi

Novella Tedesco