ClickHouse uses a column-oriented architecture that can deliver up to 100x faster query performance than traditional row-based databases, along with storage savings of up to 90% through compression. Today, we’ll explore ClickHouse, the database chosen by global companies like Cloudflare, Uber, and eBay.

Ever found yourself waiting minutes for complex aggregation queries to complete on your traditional database? What if you could run those same queries in just seconds?


1. What Exactly Is ClickHouse?

ClickHouse is an OLAP (Online Analytical Processing) database management system (DBMS) developed by Russian tech giant Yandex and open-sourced in 2016.

Originally built as an internal project for Yandex’s web analytics service, Yandex.Metrica, ClickHouse was engineered to solve a critical problem: generating analytical reports in real-time directly from massive, high-volume data streams.

The project has since been spun off into an independent company, ClickHouse, Inc., which maintains it under the open-source Apache License 2.0.


2. OLAP vs OLTP: Understanding the Difference

To understand ClickHouse, you first need to grasp the difference between OLAP and OLTP.

| Aspect | OLTP (Transaction Processing) | OLAP (Analytical Processing) |
| --- | --- | --- |
| Purpose | Handle daily transactions | Data analysis and reporting |
| Operations | Insert, update, delete | Complex queries and aggregations |
| Data Volume | Small, current data | Large, historical data |
| Response Time | Milliseconds | Seconds to minutes |
| Use Cases | Order processing, banking | Sales analysis, user behavior |
| Examples | PostgreSQL, MySQL | ClickHouse, BigQuery, Redshift |

OLAP systems are designed to perform complex queries on large datasets for business intelligence, reporting, and forecasting purposes.

Think of it this way: “What was the average purchase amount for male customers aged 30-40 in New York last month?” is OLAP. “John Doe just paid $100” is OLTP.
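
In SQL terms, the contrast looks roughly like this (a minimal sketch; the purchases and payments tables and their columns are made up for illustration):

-- OLAP: scan and aggregate millions of historical rows
SELECT avg(purchase_amount)
FROM purchases
WHERE gender = 'M'
  AND age BETWEEN 30 AND 40
  AND city = 'New York'
  AND purchase_date >= '2025-10-01' AND purchase_date < '2025-11-01';

-- OLTP: touch exactly one current row
INSERT INTO payments (user_name, amount, paid_at)
VALUES ('John Doe', 100, now());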


3. Why Is ClickHouse So Fast?

The secret lies in columnar storage.

Row-Based vs Column-Based: The Real Difference

Traditional row-based databases (PostgreSQL, MySQL):

[ID=1, Name=John, Age=25, City=NYC]
[ID=2, Name=Jane, Age=30, City=LA]
[ID=3, Name=Mike, Age=28, City=NYC]

ClickHouse’s column-based storage:

[ID column: 1, 2, 3, ...]
[Name column: John, Jane, Mike, ...]
[Age column: 25, 30, 28, ...]
[City column: NYC, LA, NYC, ...]

Column-oriented databases store all values from the same column together; ClickHouse organizes data by column rather than by row.

Real-world example: For the query “What’s the average age of users over 30?”

  • Row-based DB: Reads all rows, extracts age column → Slow
  • Column-based DB: Reads age column only → Fast

On analytical queries, this difference alone can yield 10x to 100x performance improvements.
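
As a concrete sketch (assuming a hypothetical users table), the query from the example touches exactly one column in ClickHouse:

-- Only the age column is read from disk; name, city, and everything else stay untouched
SELECT avg(age)
FROM users
WHERE age > 30;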

The Magic of Compression

When similar data is stored consecutively, compression becomes far more efficient.

For example:

  • City column: [NYC, NYC, NYC, LA, LA, NYC…]
  • After compression: [NYCx3, LAx2, NYCx1…]

ClickHouse’s columnar format combined with efficient compression algorithms drastically reduces storage costs while dramatically improving query speed.
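
You can lean into this at table-design time. Here is a minimal sketch (table name hypothetical) using ClickHouse's LowCardinality type and per-column compression codecs, plus a query against system.columns to check the results:

-- Dictionary-encode repetitive values and pick an explicit codec
CREATE TABLE visits (
    city LowCardinality(String),         -- ideal for columns with few distinct values like 'NYC', 'LA'
    response_time UInt16 CODEC(ZSTD(3))  -- explicit per-column compression codec
) ENGINE = MergeTree()
ORDER BY city;

-- Inspect compressed vs uncompressed size per column
SELECT name,
       formatReadableSize(data_compressed_bytes) AS compressed,
       formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE table = 'visits';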

Vectorized Processing and Parallel Execution

ClickHouse uses vectorized query execution, processing data in column batches (which lets the CPU apply SIMD instructions) rather than row by row.

Additionally, its distributed architecture supports:

  • Sharding data across multiple servers
  • Parallel query processing for speed
  • Horizontal scaling for performance improvements

Data is distributed across servers, enabling parallel query processing, much like multiple chefs working on different parts of a meal simultaneously.
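
Here is a minimal sketch of that setup, assuming a cluster named my_cluster is already defined in the server configuration:

-- Local shard table, created on every node
CREATE TABLE web_logs_local ON CLUSTER my_cluster (
    timestamp DateTime,
    user_id UInt32
) ENGINE = MergeTree()
ORDER BY timestamp;

-- Umbrella table that fans queries out to all shards in parallel
CREATE TABLE web_logs_all ON CLUSTER my_cluster AS web_logs_local
ENGINE = Distributed(my_cluster, default, web_logs_local, rand());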


4. Real-World Use Cases: Companies at Scale

Let’s look at how actual companies use ClickHouse.

Cloudflare: Processing Hundreds of Millions of Rows Per Second

Cloudflare was one of ClickHouse’s earliest adopters, running it in production by the end of 2016, shortly after the open-source release.

Cloudflare uses ClickHouse for:

  • HTTP and DNS Analytics: Monitoring massive web traffic
  • Logging Analytics: Traffic and bot management
  • Customer Dashboards: Real-time data delivery
  • Firewall Analytics and Cloudflare Radar

Cloudflare processes hundreds of millions of rows per second, distinguishing bots from legitimate users in real-time.

Uber’s Impressive Results

Uber migrated its logging system to ClickHouse, achieving a 10x performance improvement while cutting hardware costs in half.

eBay’s DevOps Transformation

eBay uses ClickHouse for its real-time OLAP events infrastructure, reducing DevOps efforts while using 10x less hardware and achieving better results in visualization and analytics with Grafana.

A Korean Startup’s Journey

A Korean mobile app startup collects over 100 million events daily from more than a million users, processing everything with ClickHouse.

After adopting ClickHouse, their analysis speed increased 10x to 100x compared to their previous system, enabling seamless A/B testing and cohort analysis.

Reference: Real Implementation Case Study (Medium)


5. Key Features at a Glance

| Feature | Description | Notes |
| --- | --- | --- |
| Processing Speed | Billions of rows per second | Up to 100x faster than traditional DBs |
| Storage Efficiency | Up to 90% savings via compression | Massive space reduction |
| Real-Time Analytics | Sub-second query response | Perfect for dashboards |
| Scalability | Horizontal scaling support | Add servers to boost performance |
| SQL Compatible | Standard SQL syntax | Low learning curve |
| Open Source | Apache License 2.0 | Free to use |
| Data Ingestion | Kafka, CSV, API support | Real-time streaming capable |
| Cloud | AWS, GCP, Azure support | Managed services available |


6. When Should You Use ClickHouse?

✅ Ideal Use Cases

1. Real-Time Analytics Dashboards

  • Web analytics
  • User behavior analysis
  • Business intelligence dashboards

2. Log Analytics

  • Application logs
  • System logs
  • Security logs (SIEM)

3. Time-Series Data

  • IoT sensor data
  • Monitoring metrics
  • Performance indicators

4. Large-Scale Aggregations

  • Sales statistics
  • Trend analysis
  • Monthly/annual reporting

❌ Not Suitable For

Steer clear of ClickHouse for workloads like these:

1. Frequent Updates/Deletes

  • E-commerce order management
  • Banking applications
  • Inventory management systems → Alternative: PostgreSQL, MySQL, TiDB

2. Complex Joins

  • ERP systems
  • Relational BI applications
  • Complex star schemas → Alternative: StarRocks or databases optimized for multi-table joins

3. Transaction Integrity (ACID) Critical

  • Financial transactions
  • Payment systems
  • User account management


7. Getting Started: Installation Methods

Let’s get ClickHouse up and running. I’ll cover methods from easiest to most customizable.

Method 1: ClickHouse Cloud (Easiest, Recommended!)

If you want to skip infrastructure setup, cloud is the way to go. A 30-day free trial is available, with support for AWS, GCP, and Azure.

Getting Started:

  1. Visit ClickHouse Cloud
  2. Sign up with email
  3. Choose cloud provider (AWS/GCP/Azure)
  4. Select region
  5. Cluster auto-created!

Cloud Benefits:

  • ✅ Automatic sharding and replication
  • ✅ Auto-upgrades
  • ✅ Auto-scaling
  • ✅ No complex database management

Method 2: Docker for Quick Testing (Local Development)

For local testing, Docker is simplest.

Single Container:

# Start ClickHouse server
docker run -d \
  --name clickhouse-server \
  -p 8123:8123 \
  -p 9000:9000 \
  clickhouse/clickhouse-server

# Connect with client
docker exec -it clickhouse-server clickhouse-client

Docker Compose for Production-Like Setup:

Create docker-compose.yml:

version: '3'
services:
  clickhouse-server:
    image: clickhouse/clickhouse-server
    ports:
      - "8123:8123"  # HTTP interface
      - "9000:9000"  # Native TCP
    volumes:
      - ./data:/var/lib/clickhouse  # Persistent storage
      - ./logs:/var/log/clickhouse-server  # Logs
    ulimits:
      nofile:
        soft: 262144
        hard: 262144

  clickhouse-client:
    image: clickhouse/clickhouse-client
    entrypoint: /bin/sleep
    command: infinity
    depends_on:
      - clickhouse-server

Running it:

# Create directories
mkdir -p data logs

# Start services
docker-compose up -d

# Connect with client
docker-compose exec clickhouse-client clickhouse-client --host clickhouse-server

# Access web UI
# Open http://localhost:8123/play in browser

Port Reference:

  • 8123: HTTP interface (web-based queries, REST API)
  • 9000: Native TCP protocol (client connections)
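
A quick way to confirm the server is up, using the HTTP port:

# Run a query over the HTTP interface
curl 'http://localhost:8123/' --data-binary 'SELECT version()'

# Health check (returns "Ok.")
curl 'http://localhost:8123/ping'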

Method 3: Official Script Installation (Linux/macOS)

For direct server installation:

# Install
curl https://clickhouse.com/ | sh

# Start server
./clickhouse server

# Connect client (in another terminal)
./clickhouse client

Ubuntu/Debian APT Packages:

sudo apt-get install -y apt-transport-https ca-certificates dirmngr
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 8919F6BD2B48D754

echo "deb https://packages.clickhouse.com/deb stable main" | sudo tee \
    /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update

sudo apt-get install -y clickhouse-server clickhouse-client

sudo service clickhouse-server start
clickhouse-client

Official guide: ClickHouse Installation Docs


8. Your First Query: Practical Examples

Now let’s actually use it.

Creating Database and Tables

-- Create database
CREATE DATABASE IF NOT EXISTS tutorial;

-- Use database
USE tutorial;

-- Create web logs table
CREATE TABLE web_logs (
    timestamp DateTime,
    user_id UInt32,
    page_url String,
    response_time UInt16,
    country String,
    device String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id);

Key Concepts:

  • ENGINE = MergeTree(): ClickHouse’s core storage engine
  • PARTITION BY: Monthly data partitioning → query optimization
  • ORDER BY: Sorting key → faster searches
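
Once data is in (see the next step), you can confirm the partitioning is doing its job via the system.parts table:

-- One row per partition: how many rows and how much disk each holds
SELECT partition,
       sum(rows) AS rows,
       formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE table = 'web_logs' AND active
GROUP BY partition
ORDER BY partition;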

Inserting Test Data

-- Insert sample data
INSERT INTO web_logs VALUES 
    ('2025-11-04 10:00:00', 1001, '/home', 120, 'US', 'mobile'),
    ('2025-11-04 10:01:00', 1002, '/products', 250, 'UK', 'desktop'),
    ('2025-11-04 10:02:00', 1001, '/checkout', 180, 'US', 'mobile'),
    ('2025-11-04 10:05:00', 1003, '/home', 95, 'JP', 'tablet'),
    ('2025-11-04 10:10:00', 1002, '/cart', 200, 'UK', 'desktop');

-- Verify data
SELECT * FROM web_logs LIMIT 5;

Practical Analysis Queries

-- 1. Average response time
SELECT avg(response_time) as avg_response 
FROM web_logs;

-- 2. Pageviews by country
SELECT 
    country,
    count() as pageviews,
    avg(response_time) as avg_response
FROM web_logs
GROUP BY country
ORDER BY pageviews DESC;

-- 3. Hourly traffic analysis
SELECT 
    toStartOfHour(timestamp) as hour,
    count() as pageviews,
    uniq(user_id) as unique_users
FROM web_logs
GROUP BY hour
ORDER BY hour;

-- 4. Device performance comparison
SELECT 
    device,
    count() as requests,
    avg(response_time) as avg_response,
    quantile(0.95)(response_time) as p95_response
FROM web_logs
GROUP BY device;

Loading CSV Files

The most common real-world scenario:

-- Query directly from CSV
SELECT * 
FROM file('data.csv', 'CSV', 'timestamp DateTime, user_id UInt32, page String')
LIMIT 10;

-- Import CSV into table
INSERT INTO web_logs 
SELECT * 
FROM file('data.csv', 'CSV');

-- Load directly from S3
SELECT * 
FROM s3('https://mybucket.s3.amazonaws.com/data.csv', 'CSV')
LIMIT 10;
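
The same CSV import also works straight from the shell, without opening a SQL session:

# Stream a local CSV through the native client
clickhouse-client --query="INSERT INTO web_logs FORMAT CSV" < data.csv

# Or push it over the HTTP interface
curl 'http://localhost:8123/?query=INSERT%20INTO%20web_logs%20FORMAT%20CSV' \
    --data-binary @data.csv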


9. Real-Time Data: Kafka Integration

ClickHouse handles everything from batch uploads (like the CSV files above) to real-time streams via Apache Kafka.

Setting Up Kafka Table Engine

-- Create table to read from Kafka
CREATE TABLE kafka_queue (
    timestamp DateTime,
    user_id UInt32,
    event_type String,
    event_data String
)
ENGINE = Kafka
SETTINGS 
    kafka_broker_list = 'localhost:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse_consumer',
    kafka_format = 'JSONEachRow',
    kafka_num_consumers = 2;

-- Permanent storage table
CREATE TABLE events (
    timestamp DateTime,
    user_id UInt32,
    event_type String,
    event_data String,
    processed_at DateTime DEFAULT now()
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id);

-- Materialized View for auto-transfer
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT 
    timestamp,
    user_id,
    event_type,
    event_data
FROM kafka_queue;

Now data flowing into Kafka automatically lands in your events table!
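
To test the pipeline end to end, push a message with Kafka's console producer (a sketch, assuming a local Kafka broker and the standard CLI tools; note the JSON keys must match the kafka_queue columns):

# Produce one JSONEachRow-formatted event
echo '{"timestamp":"2025-11-04 10:00:00","user_id":1001,"event_type":"click","event_data":"/home"}' | \
    kafka-console-producer --bootstrap-server localhost:9092 --topic events

# Verify it landed in ClickHouse
clickhouse-client --query "SELECT * FROM events ORDER BY processed_at DESC LIMIT 1"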


10. Visualization Tool Integration

ClickHouse integrates with various BI tools.

Supported Major Tools

| Tool | Strengths | Best For |
| --- | --- | --- |
| Grafana | Real-time monitoring dashboards | System metrics, log analysis |
| Metabase | User-friendly open-source BI | Business reporting |
| Superset | Apache’s powerful BI platform | Complex data exploration |
| Tableau | Enterprise-grade analytics | Large enterprise BI |
| Redash | SQL-based visualization | Developer-friendly |

Grafana Integration Example

  1. Install the Grafana plugin:

grafana-cli plugins install grafana-clickhouse-datasource

  2. Add the data source in Grafana:
    • Configuration → Data Sources → Add data source
    • Select “ClickHouse”
    • Enter server info (localhost:9000)
  3. Create dashboards and write queries

Official guide: ClickHouse + Grafana Integration


11. Comparing ClickHouse to Other Databases

Let’s make objective comparisons.

ClickHouse vs PostgreSQL

| Aspect | PostgreSQL | ClickHouse |
| --- | --- | --- |
| Primary Use | OLTP (transactions) | OLAP (analytics) |
| Data Integrity | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| Analytics Performance | ⭐⭐ | ⭐⭐⭐⭐⭐ |
| Updates/Deletes | Fast | Slow |
| Large Aggregations | Slow | Very fast |

Best Practice: Use both together!

  • PostgreSQL: Transaction processing (orders, user accounts)
  • ClickHouse: Data analysis (sales stats, user behavior)
  • CDC (Change Data Capture) for PostgreSQL → ClickHouse replication
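
For the last point, full CDC pipelines usually rely on dedicated tooling (Debezium, for example), but for simple pulls ClickHouse can read PostgreSQL directly with the postgresql() table function. A sketch with placeholder host, credentials, and table names:

-- One-off (or scheduled) pull of PostgreSQL rows into ClickHouse
INSERT INTO orders_analytics
SELECT *
FROM postgresql('pg-host:5432', 'shop', 'orders', 'reader', 'secret');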

ClickHouse vs BigQuery/Redshift

| Aspect | BigQuery | Redshift | ClickHouse |
| --- | --- | --- | --- |
| Pricing Model | Per query | Per hour | Infrastructure only |
| Performance | Fast | Average | Very fast |
| Management | Easy | Medium | Challenging |
| Open Source | ❌ | ❌ | ✅ |
| Cloud Lock-in | GCP only | AWS only | Any cloud |

Published benchmarks generally show ClickHouse outperforming Redshift, often by 1.5x to 5x depending on the query.

ClickHouse vs Apache Druid

| Aspect | Apache Druid | ClickHouse |
| --- | --- | --- |
| Strength | Event stream analysis | General-purpose OLAP |
| Time-Series | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| SQL Support | Limited | Full SQL |
| Operational Complexity | High | Medium |
| Performance | Fast | Faster |


12. Best Practices and Gotchas

🚨 Important Warnings

1. Updates/Deletes Are Expensive

-- ❌ Avoid this pattern: ClickHouse spells updates as a "mutation",
-- which rewrites entire data parts
ALTER TABLE users UPDATE status = 'active' WHERE user_id = 123;

-- ✅ Recommended: Insert new row + filter at query time
INSERT INTO users VALUES (123, 'active', now());

SELECT * FROM users 
WHERE user_id = 123 
ORDER BY timestamp DESC 
LIMIT 1;
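
If you want the engine to handle this pattern for you, ReplacingMergeTree deduplicates rows sharing the same sorting key during background merges. A minimal sketch:

-- Keeps the row with the highest updated_at per user_id (after merges)
CREATE TABLE users (
    user_id UInt32,
    status String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

-- FINAL forces deduplication at query time (at some performance cost)
SELECT * FROM users FINAL WHERE user_id = 123;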

2. Use Batch Inserts

-- ❌ Slow: One row at a time
INSERT INTO logs VALUES ('log1');
INSERT INTO logs VALUES ('log2');

-- ✅ Fast: Batch insert (minimum 1000 rows recommended)
INSERT INTO logs VALUES 
    ('log1'), ('log2'), ('log3'), ... ('log1000');

3. Design Partition Keys Carefully

-- ✅ Good example: Monthly partitions
CREATE TABLE events (
    timestamp DateTime,
    data String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)  -- Separate data by month
ORDER BY timestamp;

-- Query scans specific partitions only
SELECT * FROM events 
WHERE timestamp >= '2025-11-01' 
  AND timestamp < '2025-12-01';

💡 Performance Optimization Tips

1. Use Materialized Views

-- Pre-compute frequently used aggregations
-- (SummingMergeTree sums numeric columns on merge, so store summable totals
--  and derive averages at query time; unique counts would need
--  AggregatingMergeTree with uniqState(), omitted here)
CREATE MATERIALIZED VIEW daily_stats
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, country)
AS SELECT
    toDate(timestamp) as date,
    country,
    count() as pageviews,
    sum(response_time) as total_response_time
FROM web_logs
GROUP BY date, country;

-- Now this query is lightning fast!
SELECT
    date,
    country,
    sum(pageviews) AS pageviews,
    sum(total_response_time) / sum(pageviews) AS avg_response
FROM daily_stats
WHERE date = today()
GROUP BY date, country;

2. Choose Appropriate Data Types

-- ❌ Inefficient
CREATE TABLE bad_example (
    id String,           -- Numbers as strings
    amount Float64,      -- Float when not needed
    created String       -- Dates as strings
);

-- ✅ Efficient
CREATE TABLE good_example (
    id UInt64,           -- Proper integer type
    amount Decimal(10,2),-- Decimal for money
    created DateTime     -- Date type
);


13. Learning Resources and Community

📚 Official Documentation and Learning

  • Official docs: https://clickhouse.com/docs
  • GitHub repository: https://github.com/ClickHouse/ClickHouse

👥 Community and Support

  • Community Slack: https://clickhouse.com/slack
  • Stack Overflow: questions tagged “clickhouse”

🎥 Recommended Videos and Tutorials

  • Getting Started official video: ClickHouse Quickstart
  • ClickHouse Meetup videos: Real-world use case presentations


Wrapping Up…

ClickHouse is reshaping the landscape of big data analytics. Anthropic stated that “ClickHouse played an instrumental role in helping us develop and ship Claude 4,” while Tesla noted that “ClickHouse checked every box. We have availability, we have speed, we have durability. We have everything we could want.”

Of course, it’s not a silver bullet for every project. For transaction processing or frequent updates, traditional RDBMS might be more suitable.

But consider ClickHouse if you need:

  • ✅ Fast analysis of large-scale data
  • ✅ Real-time dashboards
  • ✅ Efficient log data storage and analysis
  • ✅ Storage cost reduction
  • ✅ An open-source solution

If you need an OLAP database that is quick to set up and delivers solid performance, from small and medium-scale services on up, I recommend giving ClickHouse a try.

Ready to upgrade your data analytics infrastructure? Start today with ClickHouse Cloud for free!
