Course syllabus







General information


Course name: Natural Language Processing

Course code: 572673

Academic year: 2020-2021

Coordinator: David Buchaca Prats

Department: Facultat de Matemàtiques i Informàtica

Credits: 3

Single program: Yes



Estimated workload

Total hours: 75


In-person and/or remote activities



-  Theory

In person and remote




-  Theory and practice




Supervised/guided work


Independent learning






Learn how to use NumPy properly before joining the course.



Competences to be developed


To know how to gather and extract information from structured and non-structured data sources.


To know how to clean and massage data with the goal of creating valuable, manageable, and informative data sets.


To be able to use storage and processing technologies for handling large data sets.


To efficiently and effectively apply machine learning analytic and predictive techniques.


To communicate results using appropriate communication skills and visualization tools and techniques.





Learning objectives


Knowledge-related

Getting to know the basic concepts in the area of Natural Language Processing


Discovering and using the basic methods used in Natural Language Processing



Thematic blocks


1. Overview of the course

2. Representing text for machines, working with strings, formatting strings, useful functions for strings, parsing text documents and creating a corpus.

2.1. How to format strings.

2.2. Relevant functions for strings

2.3. Parse text documents

3. Regular expressions

3.1. How to define regular expressions for extracting relevant information.

3.2. Regular expressions for tokenizing sentences.

4. Language models and edit distance

4.1. How can we define metrics over strings?

4.2. Detecting similar words to a misspelled word

4.3. Fast retrieval of similar words: the BK-tree data structure, using Cython to speed up distance computations.

4.4. NLTK

5. Document representations for classification

5.1. scipy.sparse matrices: efficient ways to operate with sparse matrices.

5.2. Using machine learning models such as logistic regression, perceptron, support vector machines and multilayer perceptron.

5.3. Using Pipelines to group and train machine learning and preprocessing steps.

6. Document representations for retrieval tasks

6.1. How can we represent documents in a way that is suitable for defining a metric between documents? For example, given a document about sports, how can we recommend "similar" documents to a user?

6.2. Tf-idf feature vector


Techniques and data structures for fast retrieval of similar words

7. Clustering documents

7.1. Understand how to retrieve similar documents to a query document using a model.

7.2. Group related documents using different approaches: k-means and mixtures of Gaussians.

8. Sequence models for NLP tasks: named entity recognition and part-of-speech tagging.

8.1. Understanding how to find the most likely hidden state sequence according to some scores. Use it to anonymize documents.

8.2. Implementing a Hidden Markov model

8.3. Implementing a structured perceptron model

9. Dense representations for words (word2vec and similar models): understand how word embeddings work.

9.1. Learn how to benefit from pretrained embeddings.

9.2. Generate combined feature representations of counts and word embeddings for better document classification.

10. Recurrent neural networks for NLP tasks


How to implement models that use recurrent blocks such as LSTM and GRU cells.
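
As a small illustration of block 4 (defining a metric over strings and detecting words similar to a misspelled word), the following is a minimal sketch in plain Python. It assumes the Levenshtein edit distance as the metric and a linear scan over a toy vocabulary; the function names `edit_distance` and `closest_words`, the `max_distance` threshold and the example vocabulary are illustrative assumptions, not the course's reference implementation. Faster retrieval (for example with a BK-tree, as listed in 4.3) is not shown here.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning source into target."""
    # previous[j] = distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to the empty string
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def closest_words(word, vocabulary, max_distance=2):
    """Return (distance, candidate) pairs within max_distance, nearest first.

    Hypothetical helper: a brute-force scan, fine for small vocabularies only.
    """
    scored = [(edit_distance(word, candidate), candidate) for candidate in vocabulary]
    return sorted(pair for pair in scored if pair[0] <= max_distance)


if __name__ == "__main__":
    toy_vocabulary = ["language", "luggage", "garage"]  # illustrative data
    print(closest_words("langage", toy_vocabulary))
```

The scan compares the query against every vocabulary word, so its cost grows linearly with the vocabulary size; this is the limitation that motivates the indexing structures mentioned in block 4.3.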



Methodology and learning activities


Classes will have two parts:


- Key conceptual ideas and basic theory will be explained using slides and the blackboard.

- Practical knowledge will be shown by running Jupyter notebooks.




Assessment of learning


The evaluation is based on two deliverable projects (70%) and a final exam (30%).


- Projects count for 70% of the final mark.

  - There are two projects to be delivered (with code to replicate results).

  - Each project consists of working on a "challenge": you will be given a task (dataset and metric) and will have some freedom to explore different strategies to solve it.


- The final exam counts for 30% of the final mark.

  - The final exam will cover all the material in the course.



Basic sources of information

Check availability in CERCABIB


Foundations of Statistical Natural Language Processing (Christopher D. Manning and Hinrich Schütze)

Neural Network Methods for Natural Language Processing (Yoav Goldberg, Graeme Hirst)

Deep Learning in Natural Language Processing (Li Deng, Yang Liu)

Regex Quick Syntax Reference (Zsolt Nagy)