Greek Parliament Information Retrieval

A comprehensive system for analyzing and extracting insights from speeches in the Greek Parliament spanning 1989–2024, featuring full-text search, topic modeling, clustering, and sentiment analysis.

github.com/pompos02/Greek_Parliament_Information_Retrieval_1989_2024

Video demonstration of the web application

Overview

This project creates a robust system for analyzing parliamentary discourse over 35 years of Greek political history. By processing and structuring thousands of speeches, the system enables researchers, journalists, and citizens to explore political trends, track key topics, identify ideological alignments, and understand how political sentiment has evolved over time.

Technical Stack

  • Database: PostgreSQL
  • NLP: spaCy (el_core_news_sm) and NLTK for Greek text analysis
  • Machine Learning: scikit-learn for TF-IDF, LSI, NMF, and K-Means clustering
  • Web Application: Interactive interface for search and visualization
  • Data Processing: Python, pandas for data manipulation

Core Features

1. Data Retrieval and Management

  • Automated retrieval of raw speech data from parliamentary records
  • Structured storage in a PostgreSQL database with two tables:
    • speeches: Individual speech records
    • merged_speeches: Consolidated speeches per MP per sitting
  • Preprocessing pipeline for efficient querying and analysis
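
The consolidation of individual records into merged_speeches can be pictured with a small pandas sketch. The column names (member_name, sitting_date, speech_order, speech) and the sample rows are assumptions made for illustration; the actual schema is not described here.

```python
import pandas as pd

# Hypothetical rows from the `speeches` table; all column names are assumptions.
speeches = pd.DataFrame(
    {
        "member_name": ["MP A", "MP B", "MP A", "MP A"],
        "sitting_date": ["1990-01-10", "1990-01-10", "1990-01-10", "1990-02-01"],
        "speech_order": [1, 2, 3, 1],
        "speech": ["first remark", "a reply", "second remark", "another day"],
    }
)


def merge_speeches(df: pd.DataFrame) -> pd.DataFrame:
    """Concatenate each MP's speeches within a sitting, in speaking order."""
    # Sorting first means the join below sees the rows of each group in order.
    ordered = df.sort_values(["sitting_date", "speech_order"])
    return (
        ordered.groupby(["member_name", "sitting_date"])["speech"]
        .agg(" ".join)
        .reset_index()
        .rename(columns={"speech": "merged_speech"})
    )


merged_speeches = merge_speeches(speeches)
print(merged_speeches)
```

Each output row corresponds to one MP in one sitting, which is the granularity the merged_speeches table is described as having.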

2. Full-Text Search Engine

  • Advanced querying capabilities with TF-IDF-based similarity ranking
  • Cosine similarity for relevance scoring of search results
  • Paginated interface displaying MP names, speech summaries, and full text
  • Accessible via the /search_engine route
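
The ranking step described above can be sketched with scikit-learn as follows. This is a minimal illustration rather than the project's actual code: the toy English sentences stand in for preprocessed Greek speeches, and the search function with its page and page_size parameters is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy English stand-ins for preprocessed speeches (the real corpus is Greek).
speeches = [
    "the budget deficit and public debt must be reduced",
    "education reform needs more funding for public schools",
    "tax policy and the state budget for the coming year",
]

# Fit the vocabulary and IDF weights once, over the whole collection.
vectorizer = TfidfVectorizer()
speech_matrix = vectorizer.fit_transform(speeches)


def search(query: str, page: int = 1, page_size: int = 10):
    """Return one page of (speech_index, score) pairs, best match first."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, speech_matrix).ravel()
    # Keep only speeches that share at least one term with the query.
    ranked = [
        (int(i), float(scores[i]))
        for i in scores.argsort()[::-1]
        if scores[i] > 0
    ]
    start = (page - 1) * page_size
    return ranked[start:start + page_size]


results = search("state budget")
print(results)
```

The query is projected into the same TF-IDF space as the speeches, so the cosine score reflects both term overlap and how distinctive the shared terms are across the collection.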

[Figure: Search engine interface before query]

[Figure: Search results ranked by relevance]

[Figure: Full speech view with detailed information]

3. Keyword and Similarity Analysis

  • Top-K keyword extraction at multiple granularities:
    • Individual speeches
    • Members of Parliament (MPs)
    • Political parties
  • Temporal analysis tracking keyword evolution over time
  • MP similarity scoring using cosine similarity on TF-IDF vectors
  • Identification of the top-K most similar MP pairs, which may point to ideological alignments

[Figure: Interactive interface for tracking keyword trends over time]

[Figure: Top 5 MP pairs by similarity score]

[Figure: Top 10 MP pairs revealing ideological alignments]

4. Topic Modeling

  • Latent Semantic Indexing (LSI) for thematic area identification
  • Dimensionality reduction of TF-IDF matrix to reveal latent concepts
  • Key terms extracted per concept with interpretable themes
  • Alternative approach using Non-Negative Matrix Factorization (NMF)
  • Reduced-dimensional representation for visualization and clustering

[Figure: LSI-identified thematic concepts with key terms]

[Figure: NMF-identified topics for comparison]

5. Speech Clustering

  • K-Means clustering applied to LSI-reduced speech vectors
  • Automatic grouping of speeches into thematic clusters
  • Pattern discovery revealing divisions in political discourse
  • Comparative evaluation of LSI vs. NMF for clustering performance
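
A minimal sketch of clustering on LSI-reduced vectors might look like the following. The length normalization step and the silhouette score are assumptions added for illustration; the metric actually used to compare LSI and NMF is not stated here.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Toy stand-ins for preprocessed speeches: two economy texts, two education texts.
speeches = [
    "budget taxes deficit economy",
    "economy taxes budget growth",
    "schools teachers education universities",
    "education schools students teachers",
]

# TF-IDF -> LSI (truncated SVD) -> unit-length vectors, so that Euclidean
# K-Means behaves similarly to clustering by cosine similarity.
lsi = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
vectors = lsi.fit_transform(speeches)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)

# One possible way to compare feature spaces (an assumed metric): a higher
# silhouette score indicates tighter, better separated clusters.
score = silhouette_score(vectors, labels)
print(labels, round(score, 3))
```

Swapping TruncatedSVD for NMF in the pipeline and recomputing the score gives one way to run the LSI versus NMF comparison on equal terms.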

[Figure: K-Means clustering results using LSI features]

[Figure: K-Means clustering results using NMF features]

6. Sentiment Analysis

  • Polarity scoring for political party speeches over time
  • Temporal tracking of sentiment evolution across parties
  • Identification of most positive and negative keywords per political group
  • Frequency analysis of sentiment-bearing terms with visualization
  • Insights into rhetorical strategies and political positioning
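
The sentiment method itself is not specified here, so the following is only one plausible reading: a lexicon-based polarity score averaged per party and year. The lexicon, the record layout, and both helper functions are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical polarity lexicon; the lexicon or model actually used is not
# named in this description.
LEXICON = {"growth": 1.0, "progress": 0.5, "crisis": -1.0, "scandal": -0.5}


def polarity(text: str) -> float:
    """Mean polarity of the lexicon words in a text (0.0 if none occur)."""
    scores = [LEXICON[tok] for tok in text.lower().split() if tok in LEXICON]
    return mean(scores) if scores else 0.0


# Assumed record layout: (party, year, speech text).
speeches = [
    ("Party X", 2010, "growth and progress for the country"),
    ("Party X", 2011, "the crisis continues"),
    ("Party Y", 2010, "a scandal and a crisis"),
]


def polarity_by_party_year(records):
    """Average speech polarity for each (party, year) pair."""
    buckets = defaultdict(list)
    for party, year, text in records:
        buckets[(party, year)].append(polarity(text))
    return {key: mean(values) for key, values in sorted(buckets.items())}


trend = polarity_by_party_year(speeches)
print(trend)
```

Plotting the values for one party across years would give a polarity curve of the kind the temporal tracking feature describes.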

[Figure: Sentiment polarity tracking by parliament group over time]

[Figure: Most positive keywords per political group]

[Figure: Most negative keywords per political group]

Implementation Details

Data Processing Pipeline

  • Greek language processing using the spaCy model el_core_news_sm
  • NLTK integration for additional Greek text preprocessing
  • Speech merging logic consolidating MP contributions per sitting
  • Efficient indexing for rapid search and retrieval
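
To give a feel for the kind of normalization Greek text needs before vectorization, here is a dependency-free sketch. It is a stand-in, not the project's pipeline: the project uses spaCy and NLTK, whereas this example relies only on the standard library, and the stopword list and function names are illustrative assumptions.

```python
import re
import unicodedata

# Tiny illustrative stopword list (an assumption); a real pipeline would
# rely on a much fuller list.
STOPWORDS = {"και", "το", "η", "ο", "να", "για", "την", "του", "της"}


def strip_accents(text: str) -> str:
    """Remove Greek tonos and other combining marks from a string."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize on letters, drop stopwords, strip accents."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [strip_accents(tok) for tok in tokens if tok not in STOPWORDS]


tokens = preprocess("Η οικονομία και η παιδεία")
print(tokens)
```

Stripping accents keeps accented and unaccented spellings of a word in the same vocabulary entry; a lemmatizer such as the one in el_core_news_sm goes further by also collapsing inflected forms.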

Machine Learning Pipeline

  • TF-IDF vectorization capturing term importance across speeches
  • LSI dimensionality reduction revealing latent semantic structures
  • K-Means clustering for unsupervised thematic grouping
  • Cosine similarity metrics for MP alignment analysis