When working on data engineering projects, implementing complex ETL operations through coding can be time-consuming. Writing code to connect to, transform, and load data from multiple sources often pulls attention away from the critical business logic.
This guide introduces Apache Hop, an open-source tool that addresses these challenges. With drag-and-drop visual pipeline design, you can increase productivity while reducing code.
1. What is Apache Hop? A New Approach to Visual Data Processing
Apache Hop (Hop Orchestration Platform) is an open-source data integration platform supporting all aspects of data and metadata orchestration. As of October 2025, the latest version is 2.15, released on August 20, 2025.
Beyond being a simple ETL tool, Hop provides a visual development environment that allows data engineers to focus on “what” needs to be done rather than “how” to do it. This means concentrating on implementing business logic instead of writing complex code.
For Pentaho Kettle/PDI Users
If you’ve been researching Apache Hop, you’ve likely encountered references to Pentaho or Kettle. Apache Hop began as a fork of Pentaho Data Integration (Kettle) in late 2019 but has since evolved into a completely independent project with its own roadmap and development direction.
While Pentaho Community Edition hasn’t received updates or security patches since November 2022, Apache Hop continues to evolve with an active community, making it an excellent alternative for Pentaho users. For a detailed comparison, visit the Apache Hop official comparison page.
2. Why Choose Apache Hop? Key Advantages
Intuitive Visual Development Environment (Hop GUI)
Design workflows and pipelines through an intuitive drag-and-drop interface. Complex data processing logic is represented visually, making collaboration and maintenance significantly easier: a diagram is far easier to grasp than pages of code.
Design Once, Run Anywhere
Workflows designed in Hop GUI can run on various platforms:
- Local Hop Engine: Development or small-scale operations
- Apache Spark: Large-scale batch processing
- Apache Flink: Real-time stream processing
- Google Dataflow: Google Cloud environment
- AWS EMR: Amazon Cloud environment
Pipelines created in development environments can run directly on production distributed processing systems without modification.
Metadata-Driven Architecture for Flexible Management
Hop is metadata-driven from the ground up: every object type describes, through metadata, how data should be read, manipulated, and written. This enables flexible configuration changes and easy creation of reusable components. For example, define database connection information once and reuse it across all pipelines.
Development to Deployment: Project and Environment Management
Hop clearly separates Projects and Environments concepts. Apply different configurations for Development, Test, and Production environments, providing an efficient structure from a DevOps perspective. Maintain the same code while using different database servers and file paths for each environment.
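As a minimal sketch of what this looks like at run time, the same pipeline file can be launched against different environments from the command line. This assumes two environments (here called dev and prod) have already been defined for the project, and that your hop-run version accepts the --environment option alongside the flags shown later in this guide:
# Same pipeline file, different environment: only the configuration changes
./hop-run.sh --project myproject --environment dev \
  --file pipelines/data-processing.hpl \
  --runconfig local
./hop-run.sh --project myproject --environment prod \
  --file pipelines/data-processing.hpl \
  --runconfig local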
Lightweight and Fast Execution
Hop’s installation size is 75% smaller than Pentaho’s, and it starts in seconds. Select only the plugins you need to keep the environment lightweight. No coffee break required!
3. Getting Started with Apache Hop: Installation to First Run
System Requirements
Apache Hop requires Java 17 installed on your system. Check the Apache Hop official documentation for detailed information about supported JVMs.
Step-by-Step Installation Guide
Step 1: Install Java
- Download and install Java 17 or higher from the Oracle Java download page
- Verify installation by running java -version in a terminal or command prompt (example output below)
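A successful check prints the installed version; the vendor and build strings vary, but it should report version 17 or newer:
# Confirm that Java 17 or newer is on the PATH
java -version
# Example output (vendor and build details will differ):
# openjdk version "17.0.x" ...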
Step 2: Download Apache Hop
- Download the latest binary from the Apache Hop official download page
- As of October 2025, the latest version is 2.15
- Download the binary (not source code) as it includes all dependencies
Step 3: Extract and Run
Extract the downloaded file to your preferred location.
# For Windows
hop-gui.bat
# For Linux/Mac
./hop-gui.sh
Hop GUI starts within seconds!
Understanding the Hop GUI Interface
When you start Hop GUI, you’ll see a clean, organized interface. Let’s explore the main areas:
- Main Toolbar: Located at the top, providing functions for creating new files, managing Projects and Environments, saving, and more
- Perspectives Toolbar: Menu for switching work views
  - Data Orchestration: Design pipelines and workflows
  - Metadata: Manage connection information and configurations
  - Execution Information: View execution logs and results
- Canvas: Workspace for drawing pipelines and workflows
For detailed interface descriptions, refer to the Hop GUI official guide.
4. Building Your First Pipeline: A Step-by-Step Tutorial
Let’s create your first pipeline to understand Apache Hop’s core concepts. We’ll implement a simple process from data generation to CSV file storage.
Creating a Project
All work in Apache Hop is managed at the project level. First, create a project.
- Click the Projects icon in the Main Toolbar
- Select New Project
- Enter project name (e.g., “my-first-hop-project”)
- Specify Home Folder path (use Browse button to select or create folder)
- Click OK
Once created, all pipelines, workflows, and metadata will be stored in this project folder.
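If you prefer the command line, projects can also be created with the bundled hop-conf tool. A hedged sketch: the --project, --project-create, and --project-home options are assumed here as documented for recent releases, and the name and path simply mirror the GUI steps above:
# Create the same project without the GUI
./hop-conf.sh --project my-first-hop-project \
  --project-create \
  --project-home /path/to/my-first-hop-project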
Creating a Pipeline
To create a Pipeline:
- Click New option in the Main Toolbar
- Select Pipeline option
- Or use shortcut Ctrl+N and select Pipeline
You’re ready when a blank canvas appears!
Step 1: Add Generate Rows Transform
Transforms are the basic units for processing data in pipelines.
- Click on blank space in the pipeline canvas
- Type “generate rows” in the Search box
- Select Generate Rows transform
- Double-click the transform on canvas to open settings:
  - Transform name: Enter “Sample Data Generation”
  - Limit: 10 (generate 10 rows)
  - Never ending: Uncheck
  - Fields tab: Click the “Get Fields” button to auto-add sample fields, or add fields manually (e.g., “name”, “age”, “email”)
- Click OK
Step 2: Add Text File Output Transform
Now add a transform to save the generated data as a CSV file.
- Click on blank space in canvas again
- Search for “text file output”
- Select Text File Output transform
- Double-click to configure:
  - Transform name: “Save to CSV”
  - Filename: Specify the output file path (e.g., C:\temp\output or /tmp/output)
  - Extension: csv
  - Include header: Check (include column names in the first row)
  - Fields tab: Click the “Get Fields” button to auto-import fields from the previous transform
- Click OK
Step 3: Create a Hop (Connection)
Connect the two transforms to allow data flow. In Hop, a ‘Hop’ refers to a connection line.
Connection methods:
- Method 1: Drag from “Sample Data Generation” transform to “Save to CSV” transform
- Method 2: Hold Shift key and click the two transforms in order
Success when the arrow connects them!
Saving and Running the Pipeline
To Save:
- File → Save or Ctrl+S
- Enter filename (e.g., “my-first-pipeline.hpl”)
To Run:
- Click Run → Launch in the Main Toolbar
- In the execution settings window:
  - Pipeline run configuration: Select “local” (run locally)
  - Click the Run button
After execution completes, check the results in the Logging window at the bottom, and verify the CSV file was created at the specified path. Open the file to confirm 10 rows were properly generated!
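To re-run the pipeline outside the GUI, the bundled hop-run tool can execute it headlessly. A small sketch using the project and file names from this tutorial (adjust the file path if you saved it in a subfolder):
# Run the tutorial pipeline from the command line with the local run configuration
./hop-run.sh --project my-first-hop-project \
  --file my-first-pipeline.hpl \
  --runconfig local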
5. Essential Core Concepts
Understanding these core concepts is crucial for effectively using Apache Hop.
Pipelines: Data Processing Executors
Pipelines perform actual data processing tasks: reading, transforming, and storing data. In Apache Hop, transforms within pipelines execute in parallel, enabling fast data processing. This is a key difference from typical sequential processing.
The pipeline file extension is .hpl (Hop Pipeline).
Workflows: Task Orchestrators
Workflows are orchestration layers that execute multiple tasks sequentially. When you create a workflow, Apache Hop automatically adds a Start action.
Workflow use cases:
- Check file existence → Execute pipeline → Send email notification on success
- Database backup → Data extraction → Validation → Loading → Report generation
The workflow file extension is .hwf (Hop Workflow).
Transforms: Data Processing Building Blocks
Transforms are individual units that process data within pipelines. Apache Hop provides hundreds of transforms, each handling specific data processing functions.
Main Transform Categories:
Category | Example Transforms | Purpose |
---|---|---|
Input | Text File Input, Table Input, CSV Input, Excel Input, JSON Input, REST Client | Read data from various sources |
Transform | Filter Rows, Calculator, String Operations, Merge Join, Sort Rows, Select Values | Filter, calculate, process strings, join, sort data |
Output | Text File Output, Table Output, Insert/Update, Delete, Excel Writer | Save data to files or databases |
Lookup | Database Lookup, Stream Lookup, Fuzzy Match | Query information from other data sources |
Utility | Block Until Transforms Finish, Copy Rows to Result, Write to Log | Flow control and debugging |
Each transform’s settings window includes a Help button in the lower left for accessing relevant documentation.
Actions: Workflow Task Units
Actions are task units executed within workflows. While transforms process data, actions execute operations.
Main Action Types:
- File Management: Copy, move, delete, compress/decompress files, FTP/SFTP transfer
- Database: Execute SQL, create/delete tables, bulk load
- Execution: Execute pipelines, workflows, shell scripts
- Communication: Send emails, HTTP requests
- Control: Conditional branching, loops, success/failure handling
Metadata: Reusable Configurations
Metadata objects are configurations reusable across multiple pipelines and workflows:
- Relational Database Connections: Database connection information
- File Definitions: File structure definitions
- Pipeline Log: Logging configuration
- Variable Resolvers: Environment variable management (added in 2.12)
Metadata is managed at the project level, so defining it once makes it available throughout the project.
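Under the hood, each metadata object is saved as a small JSON file inside the project’s metadata folder, which is what makes it easy to share and version. An illustrative layout (folder names such as rdbms can vary by version; CustomerDB is simply the example connection defined in the next section):
my-first-hop-project/
  metadata/
    rdbms/
      CustomerDB.json    (reusable database connection definition)
  my-first-pipeline.hpl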
6. Real-World Use Cases: Practical Applications
Let’s explore specific examples of how to apply this in practice.
Example 1: Extract Data from Database to CSV
Scenario: Daily backup of customer table from MySQL database to CSV file
Implementation Steps:
- Create Database Connection Metadata
  - Right-click Relational Database Connections in the left metadata panel
  - Select New
  - Enter connection information:
    - Name: “CustomerDB”
    - Connection Type: MySQL
    - Host Name: localhost
    - Database Name: customer_db
    - Port: 3306
    - Enter Username and Password
  - Click the Test button to verify the connection
- Configure the Pipeline
  - Add a Table Input transform:
    - Connection: Select “CustomerDB”
    - SQL: SELECT * FROM customers WHERE created_date >= CURDATE()
  - Add a Text File Output transform:
    - Filename: ${PROJECT_HOME}/output/customers_${Internal.Job.Date}.csv
    - Encoding: UTF-8
    - Separator: Comma
  - Connect the two transforms with a Hop
- Schedule with a Workflow
  - Create a workflow
  - Action sequence: Start → Check if file exists → Pipeline → Mail
  - Conditional branching: Skip if the file already exists
Register this workflow with cron (Linux) or Task Scheduler (Windows) for complete automation!
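For the cron route, a nightly entry can simply call the bundled hop-run tool against the workflow. This is a sketch: the installation path, project name, workflow file, and log location are placeholders to adapt to your setup:
# crontab -e: run the backup workflow every day at 01:00
0 1 * * * /opt/hop/hop-run.sh --project my-first-hop-project --file workflows/daily-customer-backup.hwf --runconfig local >> /var/log/hop-daily-backup.log 2>&1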
Example 2: Merge Multiple CSV Files and Cleanse Data
Scenario: Merge CSV files from multiple locations, remove duplicates, and save to database
Implementation Steps:
- CSV File Input transform
  - File or directory: /data/sales/
  - Regular Expression: sales_.*\.csv (all CSV files starting with sales_)
  - Add filenames to result: Check to track the source file
- Select Values transform
  - Select only the needed columns and rename them
  - Remove unnecessary spaces (Trim type: both)
- Filter Rows transform
  - Condition: amount > 0 AND date IS NOT NULL
  - Matching rows go to the next step; non-matching rows are written to the log
- Sort Rows transform
  - Sort criteria: customer_id, date
  - Pre-sort for duplicate removal
- Unique Rows transform
  - Fields to compare: customer_id, date
  - Remove duplicate rows
- Table Output transform
  - Target table: sales_consolidated
  - Commit size: 1000 (performance optimization)
This single pipeline can cleanse and consolidate dozens of CSV files in seconds.
Example 3: Collect and Transform REST API Data
Scenario: Collect JSON data from external API, extract required fields, and save to database
Implementation Steps:
- Add a REST Client transform
  - URL: https://api.example.com/products?page=${page}
  - HTTP Method: GET
  - Header: Authorization: Bearer ${API_TOKEN}
  - Result fieldname: json_response
- JSON Input transform
  - Source: from field
  - Source field: json_response
  - Define JSON paths (a quick way to preview the payload follows this list):
    - $.data[*].id → product_id
    - $.data[*].name → product_name
    - $.data[*].price → price
    - $.data[*].category.name → category
- Calculator transform
  - Add a price calculation (e.g., price with VAT = price * 1.1)
  - Add the current timestamp
- Table Output transform
  - Save to the target table
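Before mapping the JSON paths, it can help to inspect the raw payload once from a terminal. A quick sketch using the placeholder endpoint and token from the steps above:
# One-off request to see the JSON structure the $.data[*] paths refer to
curl -s -H "Authorization: Bearer ${API_TOKEN}" \
  "https://api.example.com/products?page=1"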
Additional Tip: Since Apache Hop 2.11, you can use the Language Model Chat transform to analyze or classify collected text data with LLM. For example, automatically categorize product descriptions!
Find more examples in the config/projects/samples folder of your Hop installation, and explore various samples at the Apache Hop Samples page.
7. Advanced Features: Taking It Further
Beyond the basics, let’s explore Apache Hop’s powerful advanced features.
Distributed Processing with Apache Beam Integration
Hop is one of the first GUI-based pipeline designers with native Apache Beam support. This enables distributed processing of large-scale data.
Execution Configuration Examples:
# Local execution
./hop-run.sh --project myproject \
--file pipelines/data-processing.hpl \
--runconfig local
# Execute on Google Dataflow
./hop-run.sh --project myproject \
--file pipelines/data-processing.hpl \
--runconfig DataflowRunner
# Execute on Apache Spark
./hop-run.sh --project myproject \
--file pipelines/data-processing.hpl \
--runconfig SparkRunner
Pipeline code remains identical; only the Run Configuration changes to execute on different engines. This is the actual implementation of “Design Once, Run Anywhere”!
Separate Configuration with Environment Variables
Setting the HOP_CONFIG_FOLDER environment variable allows storing configuration outside the Hop installation folder, facilitating version upgrades and managing multiple Hop instances.
Configuration for Linux/Mac:
# Add to ~/.bashrc or ~/.zshrc
export HOP_CONFIG_FOLDER=/home/user/hop-config
export HOP_AUDIT_FOLDER=/home/user/hop-audit
# Apply
source ~/.bashrc
Configuration for Windows:
- Search for “environment variables” in Start menu
- Click System Properties → Environment Variables
- Add a new User Variable:
  - Variable name: HOP_CONFIG_FOLDER
  - Variable value: C:\Users\YourName\hop-config
With this configuration, upgrading Hop versions preserves project lists, recent files, and environment settings.
Version Control with Git Integration
Making your project folder a Git repository significantly improves team collaboration. Since Apache Hop 2.11, File Explorer perspective supports direct Git integration.
Git Integration Features:
- Create, switch, delete, and merge branches
- Color-coded staged files
- View commit history
- Enhanced authentication and authorization
Refer to the Git Integration Guide for more details.
Ensure Data Quality with Unit Tests
Write unit tests for pipelines to verify data is processed as expected.
Unit Test Use Cases:
- Validate input data field count and types
- Confirm transformation logic produces correct results
- Test exception handling under specific conditions
Data pipelines are software. Test them like code for stable operations!
8. Performance Optimization: Making It Faster
As you use Apache Hop, you’ll encounter situations requiring performance improvements. Here are practical optimization tips.
Increase Speed with Parallel Processing
Pipelines work like networks—the slowest transform limits overall speed. Hop GUI displays slow transforms with dotted lines during execution, allowing you to identify and optimize them.
Optimization Methods:
- Increase Parallel Copies
  - Right-click the slow transform → Settings
  - Increase “Copies to start” (e.g., 1 → 4)
  - Consider the CPU core count when choosing a value
- Adjust Batch Size
  - Increase the Commit size for database I/O (e.g., 100 → 1000)
  - Larger batches are generally faster than smaller ones
- Remove Unnecessary Transforms
  - Drop unused fields in Select Values
  - Eliminate duplicate sorts or filters
Important Note: More parallelism isn’t always better. Measure, test, and iterate!
Optimize Memory Settings
By default, Hop uses a maximum of 2GB heap memory. Increase this for large-scale data processing.
Configuration Method:
Open hop-gui.sh (or hop-gui.bat on Windows) in a text editor and modify:
# Default setting
HOP_OPTIONS="-Xmx2048m"
# Increase to 4GB
HOP_OPTIONS="-Xmx4096m"
# Increase to 8GB (large-scale data processing)
HOP_OPTIONS="-Xmx8192m"
Or set via environment variable:
export HOP_OPTIONS="-Xmx8192m"
Consider adding JVM garbage collection options:
HOP_OPTIONS="-Xmx8192m -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
Track Issues with Logging Strategy
Configuring logging to capture workflow and pipeline execution is crucial for tracking issues when they occur.
Logging Options:
- Pipeline Log / Workflow Log: Store execution history in database
- Logging Reflection: Record per-transform processing statistics
- Neo4j Logging: Logging to Neo4j graph database enables visual review in Hop GUI’s Neo4j perspective
Configure logging settings using the Apache Hop Logging Guide.
Database Optimization Tips
- Use Indexes: Verify database indexes exist for Lookup operations
- Use Bulk Loaders: For mass data loading, use dedicated transforms like MySQL Bulk Loader, PostgreSQL Bulk Loader instead of Table Output
- Connection Pooling: Enable connection pooling in database connection settings
9. Migrating from Pentaho to Apache Hop
If you’re using Pentaho PDI, consider transitioning to Apache Hop. The migration process is simpler than you might think.
Using Migration Tools
Import in Hop GUI:
- In Hop GUI, select File → Import from Kettle/PDI
- Or use shortcut Ctrl+I
- In the import settings window:
  - Input folder: Select the Kettle/PDI project folder
  - Target folder: Select the Apache Hop project folder
- Click Import button
After a few seconds, a migration summary window appears showing converted file list and notes.
Using Command Line Tool:
# Linux/Mac
./hop-import.sh --input-folder /path/to/kettle/project \
--output-folder /path/to/hop/project
# Windows
hop-import.bat --input-folder C:\kettle\project ^
--output-folder C:\hop\project
Key Changes to Review
During migration, these items are automatically converted:
Pentaho/Kettle | Apache Hop | Description |
---|---|---|
Transformations (.ktr) | Pipelines (.hpl) | File extension change |
Jobs (.kjb) | Workflows (.hwf) | File extension change |
Steps | Transforms | Terminology change |
Entries | Actions | Terminology change |
Kitchen | hop-run | Unified command line execution tool |
Pan | hop-run | Unified command line execution tool |
Post-Migration Notes
Database Connection Cleanup: If multiple database connections with the same name but different settings are found, a connections.csv file is created in the project folder. Open this file to manually clean up duplicate connections.
Removed Features:
- JNDI: Removed because the feature hadn’t been updated in over 10 years
- Formula Step: Replaced with more efficient Calculator transform
- Repositories: File-based and database repositories removed (Git recommended)
Newly Added Features:
- Projects and Environments concepts
- Apache Beam integration
- Enhanced metadata management
- Git integration
For detailed migration guidance, see the Apache Hop official migration documentation.
10. Learning Together: Community and Resources
Apache Hop grows with an active open-source community. Here are resources to help you learn.
Official Resources
- Official Website: https://hop.apache.org/
- Official Documentation: https://hop.apache.org/manual/latest/
- Version-specific documentation (2.10 to latest)
- Getting Started guide
- Transform/Action reference
- GitHub Repository: https://github.com/apache/hop
- Source code
- Issue tracker
- Release notes
- Download Page: https://hop.apache.org/download/
Community Participation
Apache Hop is an Apache Software Foundation project whose development is entirely community-driven. Anyone can participate through various contribution methods:
- Mailing Lists: Ask questions and discuss
- Bug Reporting: Report bugs or suggest improvements on GitHub Issues
- Documentation Improvements: Fix documentation typos or add examples
- Code Contributions: Develop new features or fix bugs
- Translation: Participate in translating Hop GUI to various languages
- Tutorial Creation: Share usage through blogs or videos
Contributing isn’t just about writing code! Check the Contribution Guide for detailed methods.
Recommended Learning Path
Phase 1 – Learn Basics:
- Explore the Samples project included with the Hop installation (config/projects/samples)
- Create a simple pipeline (Read CSV → Transform → Save)
- Use Help button in each transform to check documentation
Phase 2 – Real Projects:
- Connect to data sources you use in practice
- Configure projects and environments
- Start version control with Git
Phase 3 – Advanced Features:
- Test distributed processing with Apache Beam
- Write unit tests
- Develop custom plugins (if needed)
Useful Community Blogs
- know.bi: https://www.know.bi – Blog by Apache Hop founding members
- datavin3.be: https://www.datavin3.be – Apache Hop tutorials and best practices
- Lean With Data: https://www.leanwithdata.com – Pentaho to Hop transition guide
Frequently Asked Questions (FAQ)
Q: Is Apache Hop free? A: Yes! It’s completely free under Apache License 2.0 and can be freely used for commercial purposes.
Q: Is it compatible with Pentaho? A: Not directly compatible, but easy to migrate using import tools.
Q: Which databases are supported? A: Supports most major databases including MySQL, PostgreSQL, Oracle, SQL Server, MongoDB, Neo4j, and Cassandra.
Q: Can it run in the cloud? A: Yes! Deploy as Docker container or run on cloud platforms like AWS EMR and Google Dataflow.
Apache Hop is a powerful yet user-friendly tool that meets modern data engineering requirements. Combining a visual development environment, flexible execution options, and active community support, it scales from small-scale data processing to large enterprise environments.
Apache Hop’s Core Value:
- Productivity: Fast development with drag-and-drop instead of code
- Flexibility: Design once, run anywhere
- Scalability: Process from small-scale to petabyte-level data
- Openness: Fully open-source, community-driven development
For Pentaho PDI users especially, it keeps familiar concepts while providing a significantly improved user experience, making the transition well worth considering. As an open-source project, it can be used freely without licensing costs.
If you’ve been considering data pipeline construction, try Apache Hop based on today’s guide. Start immediately from the official download page, and you’ll quickly become familiar using official documentation and sample projects. 🙂