How to Build an Automated Recon Pipeline with Python and Luigi - Part I (Setup and Scope)

Jan 22, 2020 | 13 minutes read

Tags: how to, bug bounty, hack the box, python, recon, luigi

Welcome to part one of a multi-part series demonstrating how to build an automated pipeline for target reconnaissance. The target in question could be the target of a pentest, bug bounty, or capture the flag challenge (shout out to my HTB peoples!). Each post in this series has an associated git tag in the repository for readers’ ease of use. By the end of the series, we’ll have built a functional recon pipeline that can be tailored to fit whatever needs you have.

Part I will:

  • Provide an overview of Luigi and why we’re using it
  • Set up our development environment (python virtual environment, git repository, luigi, etc…)
  • Build stage 0 of our pipeline (Target Scope)

Part I’s git tags:

  • pipenv-install
  • stage-0

As this is a ‘how-to’ series, don’t be concerned if you don’t know about a particular topic to be covered. All of the steps are clearly laid out. The roadmap below outlines topics covered in future posts.


  • Target scope <– this post
  • Port scanning I
  • Port scanning II
  • Subdomain enumeration
  • Web scanning
    • Screenshots
    • Subdomain takeover
    • CORS misconfiguration
    • Forced browsing
    • Tech stack identification
  • Data storage
  • Visualization / reporting
  • Slack integration

All right, enough with the intro, let’s dive in!

Note to Readers: If you find yourself wanting to know more about classes and Object Oriented Programming (OOP) @0xghostwriter recommends this youtube series on the subject. Special thanks to ghostwriter for reaching out and sharing!


Luigi; An Overview

Luigi is a python library written by the folks at Spotify. Its purpose is to chain multiple tasks together and automate them. The tasks can be just about anything. According to the documentation:

Luigi is a Python (2.7, 3.6, 3.7 tested) package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.

Imagine you have a tool that needs to run to produce output. Another tool uses that output as its input (e.g., an nmap scan produces xml, and that xml is sent to the next tool as input). Consider the next logical step: a third tool uses the output from the second tool as its input. This is the type of scenario that Luigi was built to handle.

A naive approach to automating this sort of behavior is to write a wrapper script that executes each tool in turn, hoping that no tool in the chain runs into any errors. If one does, the script likely needs to be rerun from the beginning. Luigi, on the other hand, can resume from the last successful step in the pipeline. For instance, say that we’ve run masscan and nmap successfully, but the pipeline breaks while running the third tool, nikto. On the next run of the pipeline, Luigi picks up where it left off, skipping the two successful scans.
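To make the resume behavior concrete, here is a minimal sketch in plain Python. It is not how Luigi is implemented; it only illustrates the idea of treating each step’s output file as a checkpoint. The run_pipeline function, the step names, and the .out file naming are all assumptions made up for this illustration.

```python
import tempfile
from pathlib import Path


def run_pipeline(steps, workdir):
    """Run each step in order, skipping any whose output file already exists."""
    executed = []
    for name, func in steps:
        output = Path(workdir) / f"{name}.out"
        if output.exists():
            # A previous run finished this step; the file acts as a checkpoint.
            continue
        result = func()  # may raise; later steps are then never reached
        output.write_text(result)
        executed.append(name)
    return executed


def failing_nikto():
    # Stand-in for a tool that crashes on the first run (hypothetical).
    raise RuntimeError("nikto crashed")


workdir = tempfile.mkdtemp()
steps = [
    ("masscan", lambda: "open ports"),
    ("nmap", lambda: "service versions"),
    ("nikto", failing_nikto),
]

try:
    run_pipeline(steps, workdir)
except RuntimeError:
    pass  # first run: masscan and nmap finished, nikto failed

# Second run with a working third step: only nikto is executed.
steps[2] = ("nikto", lambda: "web findings")
second_run = run_pipeline(steps, workdir)
print(second_run)
```

On the second run, the two finished scans are skipped because their output files are already on disk. Luigi applies the same principle through its Targets, which we’ll cover next.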

Luigi also has a lot of pretty cool features, such as its task scheduler, dependency visualizer, process synchronization, error notifications, task status monitoring, admin web panel and a whole bunch of other stuff. We’ll be using some of these pieces naturally as they come up in development. In short, Luigi is pretty legit. Before we move past Luigi, we need to discuss a few fundamental ideas about how it works; let’s do that now.

Luigi; Core Concepts

There are two fundamental building blocks of Luigi: Tasks and Targets. Each Target corresponds to a file on disk or some other observable checkpoint (row in a database, file in an S3 bucket, remote target responsiveness, etc). Targets are fairly straightforward.

Tasks are the more interesting of the two concepts. A Task is a single unit of work and defines what happens during that section of the pipeline. Tasks take Targets as input, and (usually) create Targets as output. Additionally, a Task can specify its dependence on other Tasks. Here is a visualization of a simple Task dependency and the related Targets.


In the image, the Dump Database Task expects a DB Target as input. After successful execution, it produces the dump.txt Target. The Compute Toplist Task uses the dump.txt Target as its input and creates the toplist.txt Target. Also, the Compute Toplist Task requires the Dump Database Task. We’ll see many of these relationships written out in code as we progress.

A simple idea to understand about Luigi is that one can specify what one wants to build, and then backtrack to find out what is required to fulfill the request. If we were executing our above example, we would tell Luigi that we want to run the Compute Toplist Task. Luigi would then walk that Task’s dependencies backward (including any other dependencies found along the way) until reaching the beginning of the pipeline. Once Luigi finds the beginning Task, execution begins. If this sounds similar to how GNU’s Make utility works, it should; Luigi’s creator based its design on Make.
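The backward walk can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Luigi’s scheduler: the TASKS table and the execution_order function are hypothetical, and the DB Target that feeds the Dump Database Task is left out for brevity.

```python
# Hypothetical task table: task name -> (names of required tasks, output target)
TASKS = {
    "Dump Database": ([], "dump.txt"),
    "Compute Toplist": (["Dump Database"], "toplist.txt"),
}


def execution_order(task, existing_targets, order=None):
    """Return the tasks that must run, dependencies first, to build `task`."""
    if order is None:
        order = []
    requires, target = TASKS[task]
    if target in existing_targets or task in order:
        return order  # output already exists, or the task is already planned
    for dependency in requires:
        # Walk backward through the dependencies before planning this task.
        execution_order(dependency, existing_targets, order)
    order.append(task)
    return order


fresh = execution_order("Compute Toplist", existing_targets=set())
resumed = execution_order("Compute Toplist", existing_targets={"dump.txt"})
print(fresh)
print(resumed)
```

We ask for the last Task only. With no Targets on disk, both Tasks are planned, dependency first. When dump.txt already exists, only Compute Toplist is left to run, which mirrors the Make-like behavior described above.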

That’s enough background to get us started. We’ll be diving into code later that demonstrates some of what we’ve already discussed. Before we can get to the code, we need to set up our development environment; let’s begin!

Development Environment

Prerequisites (kind of)

This guide assumes a few things about your operating system/environment.

  • Linux (kali assumed)
  • Running systemd as its init system
  • Has python 3.6+ installed
  • Has git installed

We won’t cover how to install python (though on linux, it should just ‘be there’), and we won’t cover startup scripts for different init systems. If you don’t meet one or more of these requirements, that’s ok. Just understand that where you deviate from requirements, you’re on your own (you can @me on twitter if you’re hard stuck and we’ll work it out).

Install Luigi

Our first step is to install luigi. We’ll do this inside of a python virtual environment. My virtual environment manager preference is pipenv. Let’s get pipenv installed.

apt install pipenv

After that, we’ll clone the git repository we’ll be working with throughout these posts. We’re going to be using git tags to track significant checkpoints within the code. As such, the command below is how we’ll grab the baseline repository.

git clone --branch pipenv-install
git options used:

    clone
        Clone a repository into a new directory
    --branch
        checkout <branch> instead of the remote's HEAD (can be used for tags as well)

Now we have a place to work! Let’s use the Pipfiles included in our repository to install luigi.

cd recon-pipeline
pipenv install

If everything went well, we should see output similar to what’s below.

Creating a virtualenv for this project…
Pipfile: /opt/recon-pipeline/Pipfile
Using /usr/bin/python3.7m (3.7.3) to create virtualenv…
⠴ Creating virtual environment...Using base prefix '/usr'
New python executable in /home/epi/.local/share/virtualenvs/recon-pipeline-nDSyRWzr/bin/python3.7m
Also creating executable in /home/epi/.local/share/virtualenvs/recon-pipeline-nDSyRWzr/bin/python
Installing setuptools, pip, wheel...
Running virtualenv with interpreter /usr/bin/python3.7m

✔ Successfully created virtual environment! 
Virtualenv location: /home/epi/.local/share/virtualenvs/recon-pipeline-nDSyRWzr
Installing dependencies from Pipfile.lock (e32771)…
  🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 7/7 — 00:00:01
To activate this project's virtualenv, run pipenv shell.
Alternatively, run a command inside the virtualenv with pipenv run.

To make use of our virtual environment, we use the command below (it’s also in the pipenv output).

pipenv shell

A simple test while in our virtual environment tells us if everything worked correctly.

python -c 'import luigi'

If there is no error output, we’ve successfully installed luigi!

Stage 0 - Target Scope

Now that we have luigi installed, we can create our first Task. We need a way to feed input into our pipeline. Specifically, we’ll want to define the scope of our target. According to hackerone, scope is defined as:

A collection of assets that hackers are to hack on.

For us, this boils down to either a list of ip addresses or a list of domains. Rather than trying to automate some method of pulling in scope files from bugcrowd, hackerone, synack, or some other platform, we can manually create the scope file and place it on disk for luigi to ingest. This approach allows us to use the pipeline for any of the bug bounty platforms, pentest targets, hack the box/CTFs, etc…

There is, however, a contract of sorts that we’ll place upon ourselves with the scope file. We’ll eventually take different actions later in the pipeline based on whether the file contains ip addresses or domains. That means that, for the sake of simplicity, the scope file should contain either ip addresses or domain names, not both.
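As a rough illustration of that contract, the sketch below checks that a list of scope entries is all ip addresses or all domains. It is an assumption-laden helper, not part of the pipeline: classify_entry and classify_scope are hypothetical names, and anything that the standard library’s ipaddress.ip_interface rejects is simply assumed to be a domain name.

```python
import ipaddress


def classify_entry(entry):
    """Return "ip" if entry parses as an ip address or network, else "domain"."""
    try:
        ipaddress.ip_interface(entry)
    except ValueError:
        # Not a valid ip/network; assume it is a domain name.
        return "domain"
    return "ip"


def classify_scope(lines):
    """Return "ip" or "domain" for a scope list; raise ValueError if mixed or empty."""
    kinds = {classify_entry(line.strip()) for line in lines if line.strip()}
    if len(kinds) != 1:
        raise ValueError(
            f"scope must contain only ips or only domains, found: {sorted(kinds)}"
        )
    return kinds.pop()


print(classify_scope(["10.10.10.1", "10.10.10.0/24"]))
print(classify_scope(["tesla.com", "shop.tesla.com"]))
```

Checking every line is stricter than strictly necessary. A cheaper option is to inspect only the first line and trust the user to honor the contract for the rest of the file.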

Directory Hierarchy

To structure our directory layout, we’ll begin by creating a python module inside of our repository.

mkdir recon
touch recon/

After that, inside of our recon module, we’ll create

touch recon/

Now our directory structure should look like this:

├── Pipfile
├── Pipfile.lock
└── recon

Anatomy of a Task

Let’s spend a few minutes looking at the basics of the luigi Task class.

A Task describes a unit of work and is the basic building block of a luigi pipeline. To create a luigi Task, we’ll need to create a class that inherits from luigi.Task. We’ll also need to override a few methods:

  • run() - contains the logic to be performed by this Task
  • output() - the output Target that this task creates (e.g., a file, database entry, etc…)
  • requires() - the list of Tasks that this Task depends on

Each piece of functionality we add to the pipeline is some form of Task, so it’s essential to cover the basics before continuing.


Now that we have a file to work in, and we’ve covered the bare-bones essentials of the Task class, let’s start taking a look at some code! The file we just created inside the recon module holds our Task class that handles our scope file. Recall that the scope file is generated manually by the user. Typically, luigi Tasks get their input from another Task in the pipeline, so ours is a special case for which the luigi creators planned. In luigi, when we need to say that a source outside of luigi generates the Task’s output, we use an ExternalTask. An ExternalTask is a subclass of the luigi.Task discussed above, and doesn’t require overriding the run() method.

import shutil
import logging
import ipaddress

import luigi


class TargetList(luigi.ExternalTask):
    target_file = luigi.Parameter()
    -------------8<-------------

Each luigi Task can have Parameters. A Parameter handles creating the class’s constructor and a command-line parser option for that particular Task. We’ll see how to use Parameters from the command line shortly.

    def output(self):
        try:
            with open(self.target_file) as f:
                first_line = f.readline()
                ipaddress.ip_interface(first_line.strip())  # is it a valid ip/network?
        except OSError as e:
            # can't open file; log error / return nothing
            return logging.error(f"opening {self.target_file}: {e.strerror}")
        except ValueError as e:
            # exception thrown by ip_interface; domain name assumed
            logging.debug(e)
            with_suffix = f"{self.target_file}.domains"
        else:
            # no exception thrown; ip address found
            with_suffix = f"{self.target_file}.ips"

        shutil.copy(self.target_file, with_suffix)  # copy file with new extension
        return luigi.LocalTarget(with_suffix)

Parameters are how we’ll pass user-controlled input to our class. In this case, it is the path to our scope file. A LocalTarget represents a local file on the file system. The LocalTarget here is what this particular Task produces and what it passes to Tasks further down the pipeline.

The high-level description of this Task is that it opens the file specified by the user in the --target-file command-line option (seen below). It reads the first line to determine whether the file contains ip addresses or domain names (remember our contract of only one or the other?). After making that determination, it copies the target_file with either .ips or .domains appended to the filename. That’s it. The LocalTarget returned from this Task is available to the next Task in the pipeline by calling self.input().

We can update our local source code to what’s seen above (with docstrings/comments) by running the following command.

git checkout stage-0

Luigi Command Execution

To run the pipeline, we’ll need to set our PYTHONPATH environment variable to the path of our project on disk. We can set the environment variable in a few ways; outlined below are two solutions.

  1. Prepend PYTHONPATH=/path/to/recon-pipeline to any luigi pipeline command being run.
  2. Add export PYTHONPATH=/path/to/recon-pipeline to your .bashrc
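To see why the variable matters, recall that luigi’s --module option loads our code by its dotted module name, so the directory that contains the package has to be on python’s import path. The sketch below demonstrates the effect with a throwaway package. The names demo_pkg and tasks.py are hypothetical and exist only for this illustration.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

# Build a throwaway project containing a package named demo_pkg (hypothetical).
project = Path(tempfile.mkdtemp())
(project / "demo_pkg").mkdir()
(project / "demo_pkg" / "__init__.py").write_text("")
(project / "demo_pkg" / "tasks.py").write_text("NAME = 'demo'\n")


def can_import(pythonpath):
    """Try `import demo_pkg.tasks` in a fresh interpreter started elsewhere."""
    env = {k: v for k, v in os.environ.items() if k != "PYTHONPATH"}
    if pythonpath:
        env["PYTHONPATH"] = str(pythonpath)
    result = subprocess.run(
        [sys.executable, "-c", "import demo_pkg.tasks"],
        env=env,
        cwd=tempfile.gettempdir(),  # deliberately not the project directory
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


print(can_import(None))     # no PYTHONPATH set
print(can_import(project))  # PYTHONPATH points at the project directory
```

Without PYTHONPATH the import should fail, and with it the import should succeed. The same applies to luigi when it tries to load our recon module from outside the project directory.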

We also need to specify --local-scheduler on the command line. While the --local-scheduler flag is useful for development purposes, it’s not recommended for production usage. There is also a centralized scheduler that runs as a system service and serves two purposes:

  • Make sure two instances of the same task are not running simultaneously
  • Provide visualization of everything that’s going on.

For now, we’ll stick with --local-scheduler. As our pipeline becomes larger, we’ll swap over to the central scheduler.

With our PYTHONPATH set up, luigi commands take on the following structure (prepend PYTHONPATH if not exported from .bashrc):


We can get options for each module by running luigi --module PACKAGENAME.MODULENAME CLASSNAME --help

An example help statement:

luigi --module recon.targets TargetList --help

usage: luigi [--local-scheduler] [--module CORE_MODULE] [--help] [--help-all]
             [--TargetList-target-file TARGETLIST_TARGET_FILE]
             [--target-file TARGET_FILE]
             [Required root task]

positional arguments:
  Required root task    Task family to run. Is not optional.

optional arguments:
  --local-scheduler     Use an in-memory central scheduler. Useful for
  --module CORE_MODULE  Used for dynamic loading of modules
  --help                Show most common flags and all task-specific flags
  --help-all            Show all command line flags
  --TargetList-target-file TARGETLIST_TARGET_FILE
  --target-file TARGET_FILE

Notice the --target-file option that we specified as a Parameter in our code above. Putting it all together, we can see an example scope file command, where tesla is the name of the file, and it is located in the current directory (ensure you’re in your python virtual environment).

echo > tesla
PYTHONPATH=$(pwd) luigi --local-scheduler --module recon.targets TargetList --target-file tesla

DEBUG: Checking if TargetList(target_file=tesla) is complete
INFO: Informed scheduler that task   TargetList_tesla_591d3b1ff1   has status   DONE
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: There are no more tasks to run at this time
INFO: Worker Worker(salt=092373507, workers=1, host=main, username=epi, pid=13645) was stopped. Shutting down Keep-Alive thread
===== Luigi Execution Summary =====

Scheduled 1 tasks of which:
* 1 complete ones were encountered:
    - 1 TargetList(target_file=tesla)

Did not run any tasks
This progress looks :) because there were no failed tasks or missing dependencies

===== Luigi Execution Summary =====

After running the command above, we see a new file in our current directory named tesla.ips.

cat tesla.ips

Finalized Code

Here we have the finalized code with comments.

import shutil
import logging
import ipaddress

import luigi


class TargetList(luigi.ExternalTask):
    """ External task.  `TARGET_FILE` is generated manually by the user from target's scope. """

    target_file = luigi.Parameter()

    def output(self):
        """ Returns the target output for this task. target_file.ips || target_file.domains

        In this case, it expects a file to be present in the local filesystem.
        By convention, TARGET_NAME should be something like tesla or some other
        target identifier.  The returned target output will either be target_file.ips
        or target_file.domains, depending on what is found on the first line of the file.

        Example:  Given a TARGET_FILE of tesla where the first line is;
        is written to disk.

        Returns:
            luigi.local_target.LocalTarget
        """
        try:
            with open(self.target_file) as f:
                first_line = f.readline()
                ipaddress.ip_interface(first_line.strip())  # is it a valid ip/network?
        except OSError as e:
            # can't open file; log error / return nothing
            return logging.error(f"opening {self.target_file}: {e.strerror}")
        except ValueError as e:
            # exception thrown by ip_interface; domain name assumed
            logging.debug(e)
            with_suffix = f"{self.target_file}.domains"
        else:
            # no exception thrown; ip address found
            with_suffix = f"{self.target_file}.ips"

        shutil.copy(self.target_file, with_suffix)  # copy file with new extension
        return luigi.LocalTarget(with_suffix)

That wraps things up for this post. In the next installment, we’ll add masscan into our pipeline!

Additional Resources

  1. Luigi - Building Workflows
  2. Luigi - External Tasks
  3. Luigi - Parameters
  4. Luigi - LocalTarget
  5. git tags
