Course plan for the subject

 


 


 

General information

 

Course name: Natural Language Processing

Course code: 572673

Academic year: 2019-2020

Coordination: David Buchaca Prats

Department: Facultat de Matemàtiques i Informàtica

Credits: 3

Single program: S

 

 

Estimated hours of dedication

Total hours: 75

 

Face-to-face activities: 30

- Theory: 10

- Theory and practice: 20

Supervised/directed work: 15

Independent learning: 30

 

 

Competences to be developed

 

To know how to gather and extract information from structured and unstructured data sources.

 

To know how to clean and massage data with the goal of creating valuable, manageable, and informative data sets.

 

To be able to use storage and processing technologies for handling large data sets.

 

To apply machine learning analytic and predictive methods efficiently and effectively.

 

To communicate results using appropriate communication skills and visualization tools and techniques.

 

 

 

 

Learning objectives

 

Relating to knowledge

Getting to know the basic concepts in the area of Natural Language Processing

 

Discovering and using the basic methods used in Natural Language Processing

 

 

Thematic blocks

 

1. Overview of the course

2. Representing text for machines: working with strings, formatting strings, useful string functions, and parsing text documents to create a corpus.

2.1. How to format strings.

2.2. Relevant functions for strings

2.3. Parsing text documents
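
By way of illustration, a minimal sketch of building a small corpus with plain string functions (the folder path data/articles is hypothetical):

```python
import os

# Hypothetical folder of plain-text documents.
corpus_dir = "data/articles"

corpus = []
for filename in sorted(os.listdir(corpus_dir)):
    if filename.endswith(".txt"):
        with open(os.path.join(corpus_dir, filename), encoding="utf-8") as f:
            # Relevant string functions: strip whitespace, normalize case.
            corpus.append(f.read().strip().lower())

# String formatting with an f-string.
print(f"Loaded {len(corpus)} documents from {corpus_dir}")
```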

3. Regular expressions

3.1. How to define regular expressions for extracting relevant information.

3.2. Regular expressions for tokenizing sentences.
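
A minimal sketch of both uses of regular expressions in this block (the patterns are illustrative ones, not necessarily those used in class):

```python
import re

text = "Contact us at info@example.com or call 555-123-4567."

# Extracting relevant information: a simple e-mail pattern.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
print(emails)  # ['info@example.com']

# Tokenizing: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['Contact', 'us', 'at', 'info', '@', ...]
```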

4. Language models and edit distance

4.1. How can we define metrics over strings?

4.2. Detecting similar words to a misspelled word

4.3. Fast retrieval of similar words: the BK-tree data structure; using Cython to speed up distance computations

4.4. NLTK
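
A minimal sketch of the Levenshtein edit distance underlying this block, written as a plain dynamic program (the course additionally covers BK-trees and Cython for speed):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings via dynamic programming."""
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```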

5. Document representations for classification

5.1. scipy.sparse matrices and efficient ways to operate on sparse matrices

5.2. Using machine learning models such as logistic regression, the perceptron, support vector machines, and the multilayer perceptron.

5.3. Using pipelines to group and jointly train preprocessing and machine learning steps.
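
A minimal sketch of the pipeline idea with scikit-learn (the toy documents and labels are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy labelled documents, made up for illustration.
docs = ["great match and a late goal", "stocks fell on weak earnings",
        "the striker scored twice", "the central bank raised rates"]
labels = ["sports", "finance", "sports", "finance"]

# CountVectorizer produces a scipy.sparse matrix; the Pipeline groups the
# preprocessing step and the model so they are fit together.
clf = Pipeline([("counts", CountVectorizer()),
                ("model", LogisticRegression())])
clf.fit(docs, labels)
print(clf.predict(["the goalkeeper saved a penalty"]))
```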

6. Document representations for retrieval tasks

6.1. How can we represent documents in a way that makes it possible to define a metric between them? For example, given a sports document, how can we recommend "similar" documents to a user?

6.2. Tf-idf feature vector

6.3. Techniques and data structures for fast retrieval of similar words
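
A minimal sketch of tf-idf feature vectors with cosine similarity as the document metric, assuming scikit-learn (the documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the team won the cup final", "the election results are in",
        "the coach praised the team"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse tf-idf matrix, one row per document

# Rank documents by cosine similarity to a sports-related query.
query = vectorizer.transform(["team cup"])
scores = cosine_similarity(query, X).ravel()
print(scores.argsort()[::-1])  # indices of the most similar documents first
```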

7. Clustering documents

7.1. Understand how to retrieve similar documents to a query document using a model.

7.2. Grouping related documents using different approaches: k-means and mixtures of Gaussians.
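
A minimal sketch of one of the approaches named above, k-means over tf-idf vectors, assuming scikit-learn (toy documents; scikit-learn's GaussianMixture is one option for the mixture-of-Gaussians variant):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["goal scored in the final minute", "parliament passed the new law",
        "the referee showed a red card", "the senate debated the bill"]

# tf-idf vectors as input features; KMeans accepts the sparse matrix directly.
X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignment for each document
```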

8. Sequence models for NLP tasks: named entity recognition and part-of-speech tagging.

8.1. Understanding how to find the most likely hidden state sequence according to some scores. Using it to anonymize documents.

8.2. Implementing a Hidden Markov model

8.3. Implementing a Structured Perceptron model
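
A minimal sketch of Viterbi decoding, the routine for finding the most likely hidden state sequence that both models in this block rely on (all scores below are made up):

```python
import numpy as np

def viterbi(start, trans, emit, obs):
    """Most likely hidden state sequence under log-scores.
    start: (S,) initial, trans: (S, S) transition, emit: (S, V) emission."""
    score = start + emit[:, obs[0]]          # best score ending in each state
    back = []                                # back-pointers, one array per step
    for o in obs[1:]:
        cand = score[:, None] + trans        # score of every (prev, next) pair
        back.append(cand.argmax(axis=0))     # best predecessor for each state
        score = cand.max(axis=0) + emit[:, o]
    path = [int(score.argmax())]             # best final state ...
    for bp in reversed(back):                # ... then walk the pointers back
        path.append(int(bp[path[-1]]))
    return path[::-1]

# Two hidden states, three observable symbols, made-up log-probabilities.
start = np.log([0.6, 0.4])
trans = np.log([[0.7, 0.3], [0.4, 0.6]])
emit = np.log([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(start, trans, emit, [0, 1, 2]))  # [0, 0, 1]
```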

9. Dense word representations (word2vec and similar models): understanding how word embeddings work.

9.1. Learning how to benefit from pretrained embeddings.

9.2. Generating combined feature representations from counts and word embeddings for better document classification.
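
A minimal sketch of building document features from pretrained embeddings, assuming gensim and a word2vec-format vector file (the file name vectors.bin is hypothetical):

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path to pretrained word2vec-format vectors.
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def doc_embedding(tokens):
    """Average the embeddings of the in-vocabulary tokens."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

features = doc_embedding("the team won the match".split())
print(features.shape)  # one dense vector per document
```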

10. Recurrent neural networks for NLP tasks

10.1. How to implement models that use recurrent blocks such as LSTM and GRU cells.
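
A minimal sketch of a recurrent text classifier built on an LSTM cell, assuming PyTorch as the framework (all sizes are arbitrary):

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=64, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer ids.
        embedded = self.embed(token_ids)
        _, (h_n, _) = self.lstm(embedded)  # h_n: final hidden state
        return self.out(h_n[-1])           # class scores per document

model = LSTMClassifier()
logits = model(torch.randint(0, 1000, (4, 12)))  # a batch of 4 toy sequences
print(logits.shape)  # torch.Size([4, 2])
```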
 

 

 

Methodology and learning activities

 

Classes will have two parts:

 

- Key conceptual ideas and basic theory will be explained on slides and the blackboard.

- Practical knowledge will be demonstrated by executing Jupyter notebooks.

 

 

 

Assessment of learning outcomes

 

The evaluation is based on:

 

- Projects count for 70% of the final mark.

  - There are two projects to be delivered (with code to replicate results). 

  - Projects will consist of working on a "challenge": you will be given a task (dataset and metric) and will have the freedom to explore different strategies to solve it.

- Short exams count for 10% of the final mark. This 10% of the overall mark can be re-evaluated in the final exam.

  - There will be several short exams during classes, designed to test basic knowledge of the concepts explained in the lectures.

- The final exam counts for 20% of the final mark.

  - The final exam will cover all the material in the course.

 

 

Basic information sources

Check availability in CERCABIB

Books

Foundations of Statistical Natural Language Processing (Christopher D. Manning and Hinrich Schütze)

Neural Network Methods in Natural Language Processing (Yoav Goldberg, Graeme Hirst)

Deep Learning in Natural Language Processing (Li Deng, Yang Liu)

Regex Quick Syntax Reference (Zsolt Nagy)