When working on data engineering projects, implementing complex ETL operations through coding can be time-consuming. Writing code to connect to, transform, and load data from multiple sources often pulls attention away from the critical business logic.
This guide introduces Apache Hop, an open-source tool that addresses these challenges. With drag-and-drop visual pipeline design, you can increase productivity while reducing code.
1. What is Apache Hop? A New Approach to Visual Data Processing
Apache Hop (Hop Orchestration Platform) is an open-source data integration platform supporting all aspects of data and metadata orchestration. As of October 2025, the latest version is 2.15, released on August 20, 2025.
Beyond being a simple ETL tool, Hop provides a visual development environment that allows data engineers to focus on “what” needs to be done rather than “how” to do it. This means concentrating on implementing business logic instead of writing complex code.
For Pentaho Kettle/PDI Users
If you’ve been researching Apache Hop, you’ve likely encountered references to Pentaho or Kettle. Apache Hop began as a fork of Pentaho Data Integration (Kettle) in late 2019 but has since evolved into a completely independent project with its own roadmap and development direction.
While Pentaho Community Edition hasn’t received updates or security patches since November 2022, Apache Hop continues to evolve with an active community, making it an excellent alternative for Pentaho users. For a detailed comparison, visit the Apache Hop official comparison page.
2. Why Choose Apache Hop? Key Advantages
Intuitive Visual Development Environment (Hop GUI)
Design workflows and pipelines through an intuitive drag-and-drop interface. Complex data processing logic is represented visually, making collaboration and maintenance significantly easier: a diagram is far easier to grasp than pages of code.
Design Once, Run Anywhere
Workflows designed in Hop GUI can run on various platforms:
- Local Hop Engine: Development or small-scale operations
- Apache Spark: Large-scale batch processing
- Apache Flink: Real-time stream processing
- Google Dataflow: Google Cloud environment
- AWS EMR: Amazon Cloud environment
Pipelines created in development environments can run directly on production distributed processing systems without modification.
Metadata-Driven Architecture for Flexible Management
Hop is metadata-driven from the ground up: every object type describes, through metadata, how data should be read, manipulated, and written. This enables flexible configuration changes and easy creation of reusable components. For example, define database connection information once and reuse it across all pipelines.
Development to Deployment: Project and Environment Management
Hop clearly separates Projects and Environments concepts. Apply different configurations for Development, Test, and Production environments, providing an efficient structure from a DevOps perspective. Maintain the same code while using different database servers and file paths for each environment.
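As a minimal sketch of what this looks like at run time, the same pipeline file can be launched against different environments from the command line. This assumes two environments (here called dev and prod) have already been defined for the project, and that your hop-run version accepts the --environment option alongside the flags shown later in this guide:
# Same pipeline file, different environment: only the configuration changes
./hop-run.sh --project myproject --environment dev \
  --file pipelines/data-processing.hpl \
  --runconfig local
./hop-run.sh --project myproject --environment prod \
  --file pipelines/data-processing.hpl \
  --runconfig local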
Lightweight and Fast Execution
Hop’s installation size is 75% smaller than Pentaho’s, and it starts in seconds. Select only the plugins you need to keep the environment lightweight. No coffee break required!
3. Getting Started with Apache Hop: Installation to First Run
System Requirements
Apache Hop requires Java 17 installed on your system. Check the Apache Hop official documentation for detailed information about supported JVMs.
Step-by-Step Installation Guide
Step 1: Install Java
- Download and install Java 17 or higher from the Oracle Java download page
- Verify installation by running java -version in a terminal or command prompt (example output below)
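A successful check prints the installed version; the vendor and build strings vary, but it should report version 17 or newer:
# Confirm that Java 17 or newer is on the PATH
java -version
# Example output (vendor and build details will differ):
# openjdk version "17.0.x" ...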
Step 2: Download Apache Hop
- Download the latest binary from the Apache Hop official download page
- As of October 2025, the latest version is 2.15
- Download the binary (not source code) as it includes all dependencies
Step 3: Extract and Run
Extract the downloaded file to your preferred location.
# For Windows
hop-gui.bat
# For Linux/Mac
./hop-gui.sh
Hop GUI starts within seconds!
Understanding the Hop GUI Interface
When you start Hop GUI, you’ll see a clean, organized interface. Let’s explore the main areas:
- Main Toolbar: Located at the top, providing functions for creating new files, managing Projects and Environments, saving, and more
- Perspectives Toolbar: Menu for switching work views
  - Data Orchestration: Design pipelines and workflows
  - Metadata: Manage connection information and configurations
  - Execution Information: View execution logs and results
- Canvas: Workspace for drawing pipelines and workflows
For detailed interface descriptions, refer to the Hop GUI official guide.
4. Building Your First Pipeline: A Step-by-Step Tutorial
Let’s create your first pipeline to understand Apache Hop’s core concepts. We’ll implement a simple process from data generation to CSV file storage.
Creating a Project
All work in Apache Hop is managed at the project level. First, create a project.
- Click the Projects icon in the Main Toolbar
- Select New Project
- Enter project name (e.g., “my-first-hop-project”)
- Specify Home Folder path (use Browse button to select or create folder)
- Click OK
Once created, all pipelines, workflows, and metadata will be stored in this project folder.
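If you prefer the command line, projects can also be created with the bundled hop-conf tool. A hedged sketch: the --project, --project-create, and --project-home options are assumed here as documented for recent releases, and the name and path simply mirror the GUI steps above:
# Create the same project without the GUI
./hop-conf.sh --project my-first-hop-project \
  --project-create \
  --project-home /path/to/my-first-hop-project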
Creating a Pipeline
To create a Pipeline:
- Click New option in the Main Toolbar
- Select Pipeline option
- Or use shortcut Ctrl+N and select Pipeline
You’re ready when a blank canvas appears!
Step 1: Add Generate Rows Transform
Transforms are the basic units for processing data in pipelines.
- Click on blank space in the pipeline canvas
- Type “generate rows” in the Search box
- Select Generate Rows transform
- Double-click the transform on canvas to open settings:
  - Transform name: Enter “Sample Data Generation”
  - Limit: 10 (generate 10 rows)
  - Never ending: Uncheck
  - Fields tab: Click the “Get Fields” button to auto-add sample fields, or add fields manually (e.g., “name”, “age”, “email”)
- Click OK
Step 2: Add Text File Output Transform
Now add a transform to save the generated data as a CSV file.
- Click on blank space in canvas again
- Search for “text file output”
- Select Text File Output transform
- Double-click to configure:
  - Transform name: “Save to CSV”
  - Filename: Specify the output file path (e.g., C:\temp\output or /tmp/output)
  - Extension: csv
  - Include header: Check (include column names in the first row)
  - Fields tab: Click the “Get Fields” button to auto-import fields from the previous transform
- Click OK
Step 3: Create a Hop (Connection)
Connect the two transforms to allow data flow. In Hop, a ‘Hop’ refers to a connection line.
Connection methods:
- Method 1: Drag from “Sample Data Generation” transform to “Save to CSV” transform
- Method 2: Hold Shift key and click the two transforms in order
Success when the arrow connects them!
Saving and Running the Pipeline
To Save:
- File → Save or Ctrl+S
- Enter filename (e.g., “my-first-pipeline.hpl”)
To Run:
- Click Run → Launch in the Main Toolbar
- In the execution settings window:
  - Pipeline run configuration: Select “local” (run locally)
  - Click the Run button
After execution completes, check the results in the Logging window at the bottom, and verify the CSV file was created at the specified path. Open the file to confirm 10 rows were properly generated!
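To re-run the pipeline outside the GUI, the bundled hop-run tool can execute it headlessly. A small sketch using the project and file names from this tutorial (adjust the file path if you saved it in a subfolder):
# Run the tutorial pipeline from the command line with the local run configuration
./hop-run.sh --project my-first-hop-project \
  --file my-first-pipeline.hpl \
  --runconfig local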
5. Essential Core Concepts
Understanding these core concepts is crucial for effectively using Apache Hop.
Pipelines: Data Processing Executors
Pipelines perform actual data processing tasks: reading, transforming, and storing data. In Apache Hop, transforms within pipelines execute in parallel, enabling fast data processing. This is a key difference from typical sequential processing.
The pipeline file extension is .hpl (Hop Pipeline).
Workflows: Task Orchestrators
Workflows are orchestration layers that execute multiple tasks sequentially. When you create a workflow, Apache Hop automatically adds a Start action.
Workflow use cases:
- Check file existence → Execute pipeline → Send email notification on success
- Database backup → Data extraction → Validation → Loading → Report generation
The workflow file extension is .hwf (Hop Workflow).
Transforms: Data Processing Building Blocks
Transforms are individual units that process data within pipelines. Apache Hop provides hundreds of transforms, each handling specific data processing functions.
Main Transform Categories:
Category | Example Transforms | Purpose |
---|---|---|
Input | Text File Input, Table Input, CSV Input, Excel Input, JSON Input, REST Client | Read data from various sources |
Transform | Filter Rows, Calculator, String Operations, Merge Join, Sort Rows, Select Values | Filter, calculate, process strings, join, sort data |
Output | Text File Output, Table Output, Insert/Update, Delete, Excel Writer | Save data to files or databases |
Lookup | Database Lookup, Stream Lookup, Fuzzy Match | Query information from other data sources |
Utility | Block Until Transforms Finish, Copy Rows to Result, Write to Log | Flow control and debugging |
Each transform’s settings window includes a Help button in the lower left for accessing relevant documentation.
Actions: Workflow Task Units
Actions are task units executed within workflows. While transforms process data, actions execute operations.
Main Action Types:
- File Management: Copy, move, delete, compress/decompress files, FTP/SFTP transfer
- Database: Execute SQL, create/delete tables, bulk load
- Execution: Execute pipelines, workflows, shell scripts
- Communication: Send emails, HTTP requests
- Control: Conditional branching, loops, success/failure handling
Metadata: Reusable Configurations
Metadata objects are configurations reusable across multiple pipelines and workflows:
- Relational Database Connections: Database connection information
- File Definitions: File structure definitions
- Pipeline Log: Logging configuration
- Variable Resolvers: Environment variable management (added in 2.12)
Metadata is managed at the project level, so defining it once makes it available throughout the project.
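Under the hood, each metadata object is saved as a small JSON file inside the project’s metadata folder, which is what makes it easy to share and version. An illustrative layout (folder names such as rdbms can vary by version; CustomerDB is simply the example connection defined in the next section):
my-first-hop-project/
  metadata/
    rdbms/
      CustomerDB.json    (reusable database connection definition)
  my-first-pipeline.hpl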
6. Real-World Use Cases: Practical Applications
Let’s explore specific examples of how to apply this in practice.
Example 1: Extract Data from Database to CSV
Scenario: Daily backup of customer table from MySQL database to CSV file
Implementation Steps:
- Create Database Connection Metadata
  - Right-click Relational Database Connections in the left metadata panel
  - Select New
  - Enter connection information:
    - Name: “CustomerDB”
    - Connection Type: MySQL
    - Host Name: localhost
    - Database Name: customer_db
    - Port: 3306
    - Enter Username and Password
  - Click the Test button to verify the connection
- Configure the Pipeline
  - Add a Table Input transform:
    - Connection: Select “CustomerDB”
    - SQL: SELECT * FROM customers WHERE created_date >= CURDATE()
  - Add a Text File Output transform:
    - Filename: ${PROJECT_HOME}/output/customers_${Internal.Job.Date}.csv
    - Encoding: UTF-8
    - Separator: Comma
  - Connect the two transforms with a Hop
- Schedule with a Workflow
  - Create a workflow
  - Action sequence: Start → Check if file exists → Pipeline → Mail
  - Conditional branching: Skip if the file already exists
Register this workflow with cron (Linux) or Task Scheduler (Windows) for complete automation!
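For the cron route, a nightly entry can simply call the bundled hop-run tool against the workflow. This is a sketch: the installation path, project name, workflow file, and log location are placeholders to adapt to your setup:
# crontab -e: run the backup workflow every day at 01:00
0 1 * * * /opt/hop/hop-run.sh --project my-first-hop-project --file workflows/daily-customer-backup.hwf --runconfig local >> /var/log/hop-daily-backup.log 2>&1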
Example 2: Merge Multiple CSV Files and Cleanse Data
Scenario: Merge CSV files from multiple locations, remove duplicates, and save to database
Implementation Steps:
- CSV File Input transform
  - File or directory: /data/sales/
  - Regular Expression: sales_.*\.csv (all CSV files starting with sales_)
  - Add filenames to result: Check to track the source file
- Select Values transform
  - Select only the needed columns and rename them
  - Remove unnecessary spaces (Trim type: both)
- Filter Rows transform
  - Condition: amount > 0 AND date IS NOT NULL
  - Matching rows go to the next step; non-matching rows are written to the log
- Sort Rows transform
  - Sort criteria: customer_id, date
  - Pre-sort for duplicate removal
- Unique Rows transform
  - Fields to compare: customer_id, date
  - Remove duplicate rows
- Table Output transform
  - Target table: sales_consolidated
  - Commit size: 1000 (performance optimization)
This single pipeline can cleanse and consolidate dozens of CSV files in seconds.
Example 3: Collect and Transform REST API Data
Scenario: Collect JSON data from external API, extract required fields, and save to database
Implementation Steps:
- Add a REST Client transform
  - URL: https://api.example.com/products?page=${page}
  - HTTP Method: GET
  - Header: Authorization: Bearer ${API_TOKEN}
  - Result fieldname: json_response
- JSON Input transform
  - Source: from field
  - Source field: json_response
  - Define JSON paths (a quick way to preview the payload follows this list):
    - $.data[*].id → product_id
    - $.data[*].name → product_name
    - $.data[*].price → price
    - $.data[*].category.name → category
- Calculator transform
  - Add a price calculation (e.g., price with VAT = price * 1.1)
  - Add the current timestamp
- Table Output transform
  - Save to the target table
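Before mapping the JSON paths, it can help to inspect the raw payload once from a terminal. A quick sketch using the placeholder endpoint and token from the steps above:
# One-off request to see the JSON structure the $.data[*] paths refer to
curl -s -H "Authorization: Bearer ${API_TOKEN}" \
  "https://api.example.com/products?page=1"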
Additional Tip: Since Apache Hop 2.11, you can use the Language Model Chat transform to analyze or classify collected text data with LLM. For example, automatically categorize product descriptions!
Find more examples in the config/projects/samples folder of your Hop installation, and explore various samples at the Apache Hop Samples page.
7. Advanced Features: Taking It Further
Beyond the basics, let’s explore Apache Hop’s powerful advanced features.
Distributed Processing with Apache Beam Integration
Hop is one of the first GUI-based pipeline designers with native Apache Beam support. This enables distributed processing of large-scale data.
Execution Configuration Examples:
# Local execution
./hop-run.sh --project myproject \
--file pipelines/data-processing.hpl \
--runconfig local
# Execute on Google Dataflow
./hop-run.sh --project myproject \
--file pipelines/data-processing.hpl \
--runconfig DataflowRunner
# Execute on Apache Spark
./hop-run.sh --project myproject \
--file pipelines/data-processing.hpl \
--runconfig SparkRunner
Pipeline code remains identical; only the Run Configuration changes to execute on different engines. This is the actual implementation of “Design Once, Run Anywhere”!
Separate Configuration with Environment Variables
Setting the HOP_CONFIG_FOLDER environment variable allows storing configuration outside the Hop installation folder, facilitating version upgrades and managing multiple Hop instances.
Configuration for Linux/Mac:
# Add to ~/.bashrc or ~/.zshrc
export HOP_CONFIG_FOLDER=/home/user/hop-config
export HOP_AUDIT_FOLDER=/home/user/hop-audit
# Apply
source ~/.bashrc
Configuration for Windows:
- Search for “environment variables” in Start menu
- Click System Properties → Environment Variables
- Add a new User Variable:
  - Variable name: HOP_CONFIG_FOLDER
  - Variable value: C:\Users\YourName\hop-config
With this configuration, upgrading Hop versions preserves project lists, recent files, and environment settings.
Version Control with Git Integration
Making your project folder a Git repository significantly improves team collaboration. Since Apache Hop 2.11, File Explorer perspective supports direct Git integration.
Git Integration Features:
- Create, switch, delete, and merge branches
- Color-coded staged files
- View commit history
- Enhanced authentication and authorization
Refer to the Git Integration Guide for more details.
Ensure Data Quality with Unit Tests
Write unit tests for pipelines to verify data is processed as expected.
Unit Test Use Cases:
- Validate input data field count and types
- Confirm transformation logic produces correct results
- Test exception handling under specific conditions
Data pipelines are software. Test them like code for stable operations!
8. Performance Optimization: Making It Faster
As you use Apache Hop, you’ll encounter situations requiring performance improvements. Here are practical optimization tips.
Increase Speed with Parallel Processing
Pipelines work like networks—the slowest transform limits overall speed. Hop GUI displays slow transforms with dotted lines during execution, allowing you to identify and optimize them.
Optimization Methods:
- Increase Parallel Copies
  - Right-click the slow transform → Settings
  - Increase “Copies to start” (e.g., 1 → 4)
  - Consider the CPU core count when choosing a value
- Adjust Batch Size
  - Increase the Commit size for database I/O (e.g., 100 → 1000)
  - Larger batches are generally faster than smaller ones
- Remove Unnecessary Transforms
  - Drop unused fields in Select Values
  - Eliminate duplicate sorts or filters
Important Note: More parallelism isn’t always better. Measure, test, and iterate!
Optimize Memory Settings
By default, Hop uses a maximum of 2GB heap memory. Increase this for large-scale data processing.
Configuration Method:
Open hop-gui.sh (or hop-gui.bat on Windows) in a text editor and modify:
# Default setting
HOP_OPTIONS="-Xmx2048m"
# Increase to 4GB
HOP_OPTIONS="-Xmx4096m"
# Increase to 8GB (large-scale data processing)
HOP_OPTIONS="-Xmx8192m"
Or set via environment variable:
export HOP_OPTIONS="-Xmx8192m"
Consider adding JVM garbage collection options:
HOP_OPTIONS="-Xmx8192m -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
Track Issues with Logging Strategy
Configuring logging to capture workflow and pipeline execution is crucial for tracking issues when they occur.
Logging Options:
- Pipeline Log / Workflow Log: Store execution history in database
- Logging Reflection: Record per-transform processing statistics
- Neo4j Logging: Logging to Neo4j graph database enables visual review in Hop GUI’s Neo4j perspective
Configure logging settings using the Apache Hop Logging Guide.
Database Optimization Tips
- Use Indexes: Verify database indexes exist for Lookup operations
- Use Bulk Loaders: For mass data loading, use dedicated transforms like MySQL Bulk Loader, PostgreSQL Bulk Loader instead of Table Output
- Connection Pooling: Enable connection pooling in database connection settings
9. Migrating from Pentaho to Apache Hop
If you’re using Pentaho PDI, consider transitioning to Apache Hop. The migration process is simpler than you might think.
Using Migration Tools
Import in Hop GUI:
- In Hop GUI, select File → Import from Kettle/PDI
- Or use shortcut Ctrl+I
- In the import settings window:
  - Input folder: Select the Kettle/PDI project folder
  - Target folder: Select the Apache Hop project folder
- Click Import button
After a few seconds, a migration summary window appears showing converted file list and notes.
Using Command Line Tool:
# Linux/Mac
./hop-import.sh --input-folder /path/to/kettle/project \
--output-folder /path/to/hop/project
# Windows
hop-import.bat --input-folder C:\kettle\project ^
--output-folder C:\hop\project
Key Changes to Review
During migration, these items are automatically converted:
Pentaho/Kettle | Apache Hop | Description |
---|---|---|
Transformations (.ktr) | Pipelines (.hpl) | File extension change |
Jobs (.kjb) | Workflows (.hwf) | File extension change |
Steps | Transforms | Terminology change |
Entries | Actions | Terminology change |
Kitchen | hop-run | Unified command line execution tool |
Pan | hop-run | Unified command line execution tool |
Post-Migration Notes
Database Connection Cleanup: If multiple database connections with the same name but different settings are found, a connections.csv file is created in the project folder. Open this file to manually clean up duplicate connections.
Removed Features:
- JNDI: Removed because the feature hadn’t been updated in over 10 years
- Formula Step: Replaced with more efficient Calculator transform
- Repositories: File-based and database repositories removed (Git recommended)
Newly Added Features:
- Projects and Environments concepts
- Apache Beam integration
- Enhanced metadata management
- Git integration
For detailed migration guidance, see the Apache Hop official migration documentation.
10. Learning Together: Community and Resources
Apache Hop grows with an active open-source community. Here are resources to help you learn.
Official Resources
- Official Website: https://hop.apache.org/
- Official Documentation: https://hop.apache.org/manual/latest/
- Version-specific documentation (2.10 to latest)
- Getting Started guide
- Transform/Action reference
- GitHub Repository: https://github.com/apache/hop
- Source code
- Issue tracker
- Release notes
- Download Page: https://hop.apache.org/download/
Community Participation
Apache Hop is an Apache Software Foundation project whose development is entirely community-driven. Anyone can participate through various contribution methods:
- Mailing Lists: Ask questions and discuss
- Bug Reporting: Report bugs or suggest improvements on GitHub Issues
- Documentation Improvements: Fix documentation typos or add examples
- Code Contributions: Develop new features or fix bugs
- Translation: Participate in translating Hop GUI to various languages
- Tutorial Creation: Share usage through blogs or videos
Contributing isn’t just about writing code! Check the Contribution Guide for detailed methods.
Recommended Learning Path
Phase 1 – Learn Basics:
- Explore the Samples project included with the Hop installation (config/projects/samples)
- Create a simple pipeline (Read CSV → Transform → Save)
- Use Help button in each transform to check documentation
Phase 2 – Real Projects:
- Connect to data sources you use in practice
- Configure projects and environments
- Start version control with Git
Phase 3 – Advanced Features:
- Test distributed processing with Apache Beam
- Write unit tests
- Develop custom plugins (if needed)
Useful Community Blogs
- know.bi: https://www.know.bi – Blog by Apache Hop founding members
- datavin3.be: https://www.datavin3.be – Apache Hop tutorials and best practices
- Lean With Data: https://www.leanwithdata.com – Pentaho to Hop transition guide
Frequently Asked Questions (FAQ)
Q: Is Apache Hop free? A: Yes! It’s completely free under Apache License 2.0 and can be freely used for commercial purposes.
Q: Is it compatible with Pentaho? A: Not directly compatible, but easy to migrate using import tools.
Q: Which databases are supported? A: Supports most major databases including MySQL, PostgreSQL, Oracle, SQL Server, MongoDB, Neo4j, and Cassandra.
Q: Can it run in the cloud? A: Yes! Deploy as Docker container or run on cloud platforms like AWS EMR and Google Dataflow.
Apache Hop is a powerful yet user-friendly tool that meets modern data engineering requirements. Combining a visual development environment, flexible execution options, and active community support, it scales from small-scale data processing to large enterprise environments.
Apache Hop’s Core Value:
- Productivity: Fast development with drag-and-drop instead of code
- Flexibility: Design once, run anywhere
- Scalability: Process from small-scale to petabyte-level data
- Openness: Fully open-source, community-driven development
For Pentaho PDI users especially, it keeps familiar concepts while providing a significantly improved user experience, making the transition well worth considering. As an open-source project, it can be used freely without licensing costs.
If you’ve been considering data pipeline construction, try Apache Hop based on today’s guide. Start immediately from the official download page, and you’ll quickly become familiar using official documentation and sample projects. 🙂