Presto and Trino: Querying Everything From Anywhere
Learn how Presto and Trino enable federated SQL queries across data lakes, warehouses, databases, and object storage without moving data.
You have data in S3, in PostgreSQL, in MongoDB, in Kafka, in your data warehouse, and in three different MySQL instances from acquired companies. Your analysts need to join across all of it. What do you do?
Presto and its fork Trino answer this question with a federated query engine that connects to everything and queries it in place. No ETL required to consolidate before analysis. No separate data movement jobs. SQL spans your entire data estate.
The Problem: Data Silos
Most enterprises end up with data scattered across systems that do not talk to each other. Your BI tool connects to the warehouse. Your data scientists query the data lake directly. Your operational reports hit the OLTP database directly. Customer data lives in three different systems depending on which acquisition it came from.
Consolidating everything into one system is expensive, slow, and politically difficult. Different teams own different systems. Regulatory requirements mandate data residency. Some data changes too frequently to replicate.
Presto and Trino solve this with a virtual approach: instead of moving data to the query, send the query to the data.
Architecture: Coordinator and Workers
Presto and Trino use a classic distributed query architecture. The coordinator receives SQL, parses and plans the query, schedules work across workers, and returns results. Workers execute tasks in parallel, reading from their assigned data sources and exchanging intermediate results over the network.
-- Query across multiple sources in a single SQL statement
SELECT
c.customer_name,
c.region,
SUM(wh.order_amount) AS warehouse_revenue,
SUM(ec.cart_value) AS ecommerce_revenue,
SUM(wh.order_amount) + SUM(ec.cart_value) AS total_revenue
FROM mysql_customers.crm.customers c
JOIN redshift_sales.analytics.warehouse_orders wh ON c.customer_id = wh.customer_id
JOIN glue_catalog.ecommerce.transactions ec ON c.customer_id = ec.customer_id
WHERE c.region = 'EMEA'
GROUP BY c.customer_name, c.region;
The coordinator knows where each catalog’s data lives. It pushes down filter predicates to each source to minimize data transfer. Workers receive only the columns and rows they need for joins and aggregations.
Trino Architecture
flowchart TD
subgraph Client[Client]
BI[BI Tool / JDBC / CLI]
end
subgraph Coordinator[Coordinator Node]
SQL[SQL Parser]
PLAN[Query Optimizer]
SCHEDULE[Task Scheduler]
METRICS[Metrics]
end
subgraph Workers[Worker Nodes]
W1[Worker 1]
W2[Worker 2]
W3[Worker N]
end
subgraph Catalogs[Data Catalogs]
HIVE[Hive / Glue]
POSTGRES[PostgreSQL]
REDSHIFT[Redshift]
S3[S3 / Object Storage]
KAFKA[Kafka]
end
BI -->|SQL query| Coordinator
Coordinator -->|dispatch| W1
Coordinator -->|dispatch| W2
Coordinator -->|dispatch| W3
W1 <-->|exchange| W2
W2 <-->|exchange| W3
W1 -->|read| HIVE
W2 -->|read| POSTGRES
W3 -->|read| REDSHIFT
W1 -->|read| S3
W2 -->|read| KAFKA
The coordinator parses SQL and creates a distributed query plan. Workers exchange intermediate results with each other. Each worker reads from the connectors assigned to its tasks.
Presto Versus Trino: The Fork
Presto was originally developed at Facebook (now Meta) and open-sourced in 2013. For years, it was the only option for federated SQL queries. Facebook continued developing Presto internally, and the open-source version and the internal version gradually diverged.
In 2019, Presto's original creators, who had left Facebook, forked the project as PrestoSQL and co-founded Starburst to build an enterprise offering on top of it. The Presto trademark stayed with the original project (Facebook donated it to the Linux Foundation), so PrestoSQL was renamed Trino in late 2020.
Today, both projects continue. PrestoDB (under the Presto Foundation at the Linux Foundation) and Trino both actively develop, with slightly different priorities. Trino has historically moved faster on new features and has a larger ecosystem of connectors. PrestoDB has the backing of Meta and stability for companies deeply invested in the original codebase.
For most new deployments, Trino is the more active project with broader connector support. The SQL dialect is nearly identical. Performance characteristics are comparable. Starburst’s enterprise distribution adds security, observability, and support features for companies that want them.
Query Execution: How It Works
When you submit a query, the coordinator parses the SQL and creates a logical plan. The optimizer reorganizes joins, pushes down predicates, and determines which data to read from which connectors. The physical plan divides work into stages that run on workers.
EXPLAIN (TYPE DISTRIBUTED)
SELECT
date_trunc('week', o.order_date) AS week,
p.category,
COUNT(DISTINCT o.customer_id) AS unique_customers,
SUM(o.total_amount) AS revenue
FROM redshift.analytics.orders o
JOIN hive.default.products p ON o.product_id = p.id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY 1, 2
ORDER BY 4 DESC;
The EXPLAIN output shows the distributed plan: which stages run on which workers, how data is partitioned, and what pushes down to connectors. Understanding explain plans is essential for tuning Trino queries at scale.
Connectors: Talking to Everything
The connector system is what makes Trino powerful. Each connector understands how to read (and sometimes write) from a specific data source.
The Hive connector reads from data lakes using Hive metastore or AWS Glue as the catalog. It supports Parquet, ORC, JSON, and text files. It pushes filter predicates down to file level when possible.
-- Query a data lake table managed by Glue
SELECT *
FROM hive.production.clickstream_events
WHERE event_date = DATE '2025-03-27'
AND event_type = 'purchase'
LIMIT 100;
The PostgreSQL connector creates a virtual schema that maps to PostgreSQL tables. Queries run directly against PostgreSQL, with predicate and aggregation pushdown.
-- Join PostgreSQL dimension with Redshift facts
SELECT
p.product_name,
p.inventory_count,
r.revenue_30d
FROM postgresql.inventory.products p
LEFT JOIN (
SELECT product_id, SUM(amount) AS revenue_30d
FROM redshift.analytics.sales
WHERE sale_date >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY product_id
) r ON p.product_id = r.product_id
WHERE p.inventory_count < 10;
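Each catalog referenced in these queries (postgresql, redshift, hive) is configured by a properties file in etc/catalog/ on every node; the file name becomes the catalog name. A minimal sketch for a PostgreSQL catalog, with placeholder hostnames and credentials:

```
# etc/catalog/postgresql.properties
connector.name=postgresql
connection-url=jdbc:postgresql://inventory-db.internal:5432/inventory
connection-user=trino_reader
connection-password=<secret>
```

After a restart, tables become addressable as postgresql.<schema>.<table> from any query.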
Memory management is critical. Trino processes data in memory across workers, and queries that exceed memory limits fail. The query.max-memory and query.max-memory-per-node properties cap how much memory a single query may consume cluster-wide and on each worker. For large joins or aggregations, you may need to enable spill to disk or increase the memory allocation.
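Those limits live in config.properties. A sketch with illustrative values (not recommendations—size them against your workload and worker hardware):

```
# config.properties (illustrative values)
query.max-memory=40GB             # distributed limit per query, across all workers
query.max-memory-per-node=8GB     # per-worker limit per query
# Allow large joins/aggregations to spill rather than fail
spill-enabled=true
spiller-spill-path=/mnt/spill
```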
Performance Characteristics
Trino excels at filtering and aggregating large datasets across multiple sources. A query that filters a billion-row fact table in S3 to a small result set is fast because predicate pushdown means workers only read matching data.
Joins between large tables are expensive. If both sides of a join are massive, you need to broadcast one side to all workers or shuffle data by join key. Broadcasting works when one side fits in worker memory. Shuffling works for larger datasets but requires network transfer.
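The broadcast-versus-shuffle tradeoff can be reasoned about with back-of-envelope arithmetic. The sketch below is illustrative only—the threshold and cost formulas are assumptions, not Trino's actual cost-based optimizer:

```python
# Rough cost comparison of broadcast vs. partitioned (shuffle) joins.
# Illustrative heuristic -- Trino's optimizer uses table statistics instead.

def join_strategy(build_side_gb: float, probe_side_gb: float,
                  num_workers: int, worker_memory_gb: float) -> str:
    """Pick broadcast only if the build side fits comfortably in each worker."""
    broadcast_network_gb = build_side_gb * num_workers   # copy build side to every worker
    shuffle_network_gb = build_side_gb + probe_side_gb   # repartition both sides once
    fits_in_memory = build_side_gb < 0.25 * worker_memory_gb  # assumed safety margin
    if fits_in_memory and broadcast_network_gb < shuffle_network_gb:
        return "BROADCAST"
    return "PARTITIONED"

# Small dimension table joined to a large fact table: broadcast wins
print(join_strategy(build_side_gb=0.5, probe_side_gb=500,
                    num_workers=20, worker_memory_gb=16))  # BROADCAST
# Two large tables: must shuffle
print(join_strategy(build_side_gb=80, probe_side_gb=500,
                    num_workers=20, worker_memory_gb=16))  # PARTITIONED
```

In Trino itself the choice is made by the optimizer and can be overridden with the join_distribution_type session property (AUTOMATIC, BROADCAST, or PARTITIONED).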
Capacity Estimation
Trino workers process data in memory. Right-sizing the cluster prevents OOM failures and ensures good parallelism.
# Estimate worker memory needs (back-of-envelope)
import math

query_memory_gb = 10 # distributed memory for a typical analytical query
concurrent_queries = 5
memory_per_worker_gb = 16 # typical worker memory
spill_space_gb = memory_per_worker_gb * 0.5 # spill space = 50% of memory
# Workers needed for concurrent queries (round up -- partial workers don't exist)
workers_for_concurrency = math.ceil(concurrent_queries * query_memory_gb / memory_per_worker_gb)
# Network bandwidth: GB * 8 = gigabits, spread over the query duration
gb_per_query = 50 # data read per query
query_duration_sec = 30
network_gbps = (gb_per_query * 8) / query_duration_sec # ~13 Gbps for 50GB in 30s
print(f"Workers for concurrent queries: {workers_for_concurrency}")
print(f"Network bandwidth per query: {network_gbps:.0f} Gbps")
# Workers for concurrent queries: 4
# Network bandwidth per query: 13 Gbps
At 50GB per query and 30-second target duration, you need ~13 Gbps network throughput per query. For 10 concurrent queries, that's 130 Gbps total. If workers share a 10 Gbps network link, queries queue up. Profile actual network usage per query (the Trino Web UI reports physical input and output bytes for each query) before scaling.
Analytical queries—aggregations, window functions, complex joins across multiple filtered sources—play to Trino’s strengths. Transactional queries that return single rows or update small amounts of data are not what Trino is designed for.
Trino vs Presto vs Native Connectors
| Aspect | Trino | PrestoDB | Native Connector (Spark/Flink) |
|---|---|---|---|
| Federated queries | Yes (primary use case) | Yes | Limited |
| Connector ecosystem | Broad (30+ connectors) | Broad (25+ connectors) | Single source per engine |
| SQL dialect | Standard SQL (ANSI) | Standard SQL | Engine-specific |
| Active development | Very active | Moderate | Varies by engine |
| Memory management | Per-task memory limits | Per-task memory limits | JVM heap |
| Security | Built-in RBAC, LDAP | Built-in RBAC | Depends on engine |
| Best for | Data virtualization, multi-source joins | Meta-internal workloads | Single-source batch/streaming |
For multi-source federation, Trino or PrestoDB. For single-source workloads, use the native engine (Spark for batch ETL, Flink for streaming).
When to Use Trino
Trino is the right tool when you need to query data in place without moving it. Data virtualization, cross-source joins, ad-hoc exploration of data lake contents, and federated queries across multiple warehouses or databases are all natural use cases.
It’s also excellent for data lake query acceleration. Instead of relying on Spark for all data lake queries, you can serve interactive BI queries through Trino. Many teams use Trino as the query layer for their data lake, with Spark handling ETL and batch processing.
Trino is not an operational database. It does not handle high-frequency inserts or updates well. It does not provide ACID transactions across sources. It is a query engine, not a storage system.
Trino Production Failure Scenarios
Coordinator OOM on complex query
A user submits a query with a 10-way join across large tables. The coordinator holds the entire query plan and intermediate results in memory. The coordinator runs out of heap and crashes. All running queries fail, and the cluster becomes unavailable.
Mitigation: Set query.max-memory to limit each query's distributed memory and query.max-memory-per-node to limit its per-worker usage. Monitor coordinator heap usage and alert before it exceeds 80%. Use resource groups to cap concurrency for very large queries.
Memory spill causing query timeouts
A query tries to join two large tables that don’t fit in worker memory. Trino spills intermediate results to disk, but disk is slow. The query runs for 30 minutes and times out before completing.
Mitigation: Estimate memory requirements before running large joins. Use EXPLAIN to see whether a join will broadcast or shuffle. For very large joins, pre-aggregate one side or use approximate algorithms (HyperLogLog for count-distinct). Monitor spilled bytes in per-query stats to catch queries that fall back to disk.
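One concrete memory reducer from the mitigation above: Trino's approx_distinct aggregates with a fixed-size HyperLogLog sketch instead of a hash set, bounding memory regardless of cardinality (at the cost of a small standard error):

```
-- Exact distinct: memory grows with the number of distinct customers
SELECT COUNT(DISTINCT customer_id) FROM redshift.analytics.sales;

-- Approximate distinct: constant-size sketch per group
SELECT approx_distinct(customer_id) FROM redshift.analytics.sales;
```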
Predicate pushdown failure causing connector overload
A user submits a query with a date filter event_date = '2025-03-27'. The PostgreSQL connector fails to push down the predicate because the column type doesn’t match. The connector reads all rows from PostgreSQL and transfers them to workers, overwhelming the PostgreSQL instance.
Mitigation: Test connector pushdown with EXPLAIN. Verify column types match between the query filter and the source. For connectors with poor pushdown, filter as early as possible within the query itself. Compare rows read from the connector against rows output by the scan stage (visible in EXPLAIN ANALYZE) to detect pushdown failures.
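A common way pushdown silently breaks is a filter that forces an expression over the source column. Assuming a hypothetical postgresql.app.events table whose event_date column is a DATE, these two filters return the same rows but behave very differently:

```
-- Pushdown-friendly: literal typed to match the column,
-- so PostgreSQL evaluates the filter and returns few rows
SELECT count(*) FROM postgresql.app.events
WHERE event_date = DATE '2025-03-27';

-- Pushdown-hostile: casting the column typically defeats pushdown,
-- so every row streams to Trino workers before filtering
SELECT count(*) FROM postgresql.app.events
WHERE CAST(event_date AS varchar) = '2025-03-27';
```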
Catalog misconfiguration exposing wrong data
A new catalog is added to Trino for a development data source. The catalog name dev_postgres is similar to prod_postgres. A user accidentally queries the wrong catalog and gets stale development data in a production report.
Mitigation: Use consistent naming conventions (prefix catalogs with environment: prod_, dev_). Restrict catalog access by role. Add query comments that must specify the target environment. Audit query logs for cross-environment joins.
Conclusion
Presto and Trino changed the game for data platform architecture. Instead of ETL pipelines that copy data into a central warehouse, you can query across your entire data estate with SQL. The tradeoff is latency and resource usage—federated queries are slower than querying local data, and they consume resources from every system they touch.
Quick Recap
- Trino and PrestoDB query data in place—no ETL needed to consolidate before analysis.
- The coordinator parses and optimizes SQL; workers execute tasks and exchange intermediate results.
- Predicate pushdown is critical for performance: verify it works with EXPLAIN before running large queries.
- Memory limits prevent single queries from crashing the cluster—set query.max-memory-per-node.
- Use RBAC and catalog naming conventions to prevent cross-environment queries.
- For single-source workloads, use the native engine (Spark, Flink) instead of Trino.
For teams struggling with data silos, Trino provides a pragmatic middle ground. You do not have to move everything to one place to analyze it together. Connect the sources, configure the catalogs, and start querying across systems.
Trino Observability
Key metrics to monitor:
-- Recent failed queries from Trino's built-in system tables
SELECT
query_id,
state,
"user",
query,
created
FROM system.runtime.queries
WHERE state = 'FAILED'
ORDER BY created DESC
LIMIT 10;
-- Worker health
SELECT
node_id,
http_uri,
node_version,
coordinator,
state
FROM system.runtime.nodes;
Monitor per query: rows scanned, bytes scanned, execution time, peak memory. Alert on: coordinator heap > 80%, query failure rate > 5%, worker memory pressure. Track output bytes per query (reported in the Web UI and query completion events) to detect queries returning unexpectedly large results (potential data leak).
Trino Security Checklist
- Enable LDAP or OAuth authentication for all users
- Configure RBAC for catalog and schema access
- Restrict access to the system catalog (information_schema) by role
- Validate catalog names: use environment prefixes (prod_, dev_) to prevent cross-environment queries
- Enable query audit logging with user, query text, and source IP
- Set memory limits per query (query.max-memory-per-node) to prevent single-query OOM from affecting the cluster
- Encrypt connections between workers (internal communication) and to data sources
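The catalog-level RBAC items above can be sketched with Trino's file-based access control (enabled via access-control.name=file in etc/access-control.properties). Group names and the catalog regexes below are hypothetical; rules are evaluated top to bottom and the first match wins:

```
{
  "catalogs": [
    {"group": "analysts", "catalog": "prod_.*", "allow": "read-only"},
    {"group": "platform", "catalog": ".*", "allow": "all"},
    {"catalog": ".*", "allow": "none"}
  ]
}
```

With these rules, analysts can read production catalogs but cannot touch dev_ catalogs at all, which also addresses the cross-environment failure scenario described earlier.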
Trino Anti-Patterns
Using Trino as an operational database. Trino is not designed for high-frequency single-row lookups or write workloads. If you find yourself running millions of small queries through Trino, use the native database or add a caching layer (such as Redis) instead.
Large broadcast joins without memory limits. Forcing a broadcast join on a 100GB table overwhelms worker memory. The query fails, but not before causing GC pressure on all workers. Always check EXPLAIN for memory estimates before running large joins.
No catalog access control. If any user can query any catalog, a misconfigured BI tool could hit production systems. Always configure RBAC per catalog and validate in staging first.
For large-scale batch processing, see Apache Spark. For portable streaming and batch pipelines, read about Apache Beam.