How to Build an Automated Recon Pipeline with Python and Luigi - Part IV (Subdomain Enumeration)

Jan 22, 2020 | 16 minutes read

Tags: how-to, bug bounty, hack the box, python, recon, luigi

Welcome back! If you found your way here without reading the prior posts in this series, you may want to start with some of the links to previous posts (below). This post is part four of a multi-part series demonstrating how to build an automated pipeline for target reconnaissance. The target in question could be the target of a pentest, bug bounty, or capture the flag challenge (shout out to my HTB peoples!). By the end of the series, we’ll have built a functional recon pipeline that can be tailored to fit your own needs.

Previous posts:

Part IV will:

  • Add scanning with Amass to our pipeline
  • Parse Amass results for future processing

Part IV’s git tags:

  • stage-5
  • stage-6

To get the repository to the point at which we’ll start, we can run one of the following commands. Which command to use depends on whether the repository is already present: clone it if it isn’t, or check out the tag if it is.

git clone --branch stage-4 https://github.com/epi052/recon-pipeline.git
git checkout tags/stage-4

Roadmap:

  • Target scope
  • Port scanning I
  • Port scanning II
  • Subdomain enumeration <– this post
  • Web scanning
    • Screenshots
    • Subdomain takeover
    • CORS misconfiguration
    • Forced browsing
    • Tech stack identification
  • Data storage
  • Visualization / reporting
  • Slack integration

Stage 5 - Subdomain Enumeration

This post really marks the point at which I anticipate readers taking the pipeline and tweaking it to suit their needs. There are tons of methodologies that can be used to enumerate subdomains when given a top-level domain name (check out some at pentester.land’s compilation of recon workflows). This post will cover adding OWASP’s amass scanner to the pipeline. I don’t plan on covering subdomain enumeration any further than that, mainly because this isn’t a series of posts about finding subdomains; it’s about building a pipeline. To that end, we’ll gather subdomains with amass and then, in later posts, proceed to doing interesting things with the subdomains identified. I invite you to use what you’ve learned so far and incorporate your own subdomain tactics into your own pipeline (or if you’re feeling generous, submit them back in the form of a Pull Request!).

Install amass

Before we get to the code, let’s get amass installed. I’m lazy, and snaps are pretty easy to manage, so we’ll install the amass snap.

If you want to install a different way than what’s shown, head over to the Installation Guide.

snap install amass

Once that’s complete, installation is done. Nice, eh?

Scanning with amass

First, we’ll take a moment to figure out what we want our scans to do. There are a lot of options for amass, but we’re going to focus on active subdomain enumeration. A run of amass against tesla.com would look something like what’s below.

Side Note: If you’ve got the time to spend, the talk from BugCrowd’s LevelUp 0x04 shows a lot of different ways to integrate amass into your recon workflow and is likely to answer any questions you have about amass; check it out here

amass enum -active -ip -brute -min-for-recursive 3 -df tesla -json amass.tesla.json
amass options used:

    enum
        Perform DNS enumeration and network mapping of systems exposed to the Internet
    -active
        Enable active recon methods
    -ip
        Show the IP addresses for discovered names
    -brute
        Perform brute force subdomain enumeration
    -min-for-recursive N
        Number of labels in a subdomain before recursive brute forcing
    -df
        Path to a file providing root domain names
    -json
        Path to the JSON output file

Most of the options are self-explanatory. -min-for-recursive may lead to some confusion, so we’ll turn to the amass project leader Jeff Foley for a brief explanation.

Brute forcing will begin on example.com right away. Recursive brute forcing takes place on additional labels, such as the cs.example.com or careers.example.com subdomain names. What if you do not want to start recursive brute forcing on every new subdomain name you discover? What if you would like some evidence that careers.example.com is worth brute forcing?

If you specify the ‘-min-for-recursive 2’ flag, two labels need to show up on careers.example.com before recursive brute forcing will begin, such as the www.careers.example.com and support.careers.example.com subdomain names. The flag allows you to control when recursive brute forcing will be triggered.

So, the lower the number passed to -min-for-recursive, the more aggressive our recursion profile. Good to know.
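To make the threshold concrete, here is a toy model of the rule described in the quote. It is only a sketch of the idea, not how amass implements it; labels_under and should_recurse are hypothetical helpers, and the hostnames are the ones from the quote.

```python
def labels_under(parent, discovered):
    """Collect the distinct labels seen immediately to the left of `parent`."""
    suffix = "." + parent
    labels = set()
    for name in discovered:
        if name.endswith(suffix):
            # "www.careers.example.com" -> "www"; "a.b.careers.example.com" -> "b"
            labels.add(name[: -len(suffix)].split(".")[-1])
    return labels


def should_recurse(parent, discovered, min_for_recursive):
    """Toy rule: recurse only once enough labels have shown up under `parent`."""
    return len(labels_under(parent, discovered)) >= min_for_recursive


discovered = {
    "cs.example.com",
    "www.careers.example.com",
    "support.careers.example.com",
}

print(should_recurse("careers.example.com", discovered, 2))  # True
print(should_recurse("careers.example.com", discovered, 3))  # False
```

With two labels seen under careers.example.com, a threshold of 2 triggers recursion while our threshold of 3 would hold off until one more name turns up.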

We’ll move forward with the command structure above. However, if it’s too aggressive for your particular use case, please feel free to tweak it as you see fit. The amass user’s guide is a great resource if you want to change the command at all.

amass.AmassScan

With our plan in place, let’s look at the code. For our AmassScan class, we’ll use the ExternalProgramTask class as our base, just like our Masscan class.

@inherits(TargetList)
class AmassScan(ExternalProgramTask):
    exempt_list = luigi.Parameter(default="")

There is an important thing to note in our code above, and that is how execution of the pipeline will flow. The @inherits decorator gives AmassScan all of TargetList’s Parameters (target_file, in this case). Paired with the requires method we’ll see shortly, that places AmassScan hierarchically directly below targets.TargetList and makes it a sibling of masscan.Masscan (remember the first two posts? I know it’s been a minute).

We essentially create a second branch in our pipeline that handles domains while the other handles ip addresses.

[Figure: task-depends]

For now, this is sufficient. Later on in this post we’ll cover how to tie the two branches together!

amass.AmassScan Parameters

We’re including a new Parameter in AmassScan called exempt_list. The reason for this Parameter is that some bug bounty scopes have expressly verboten subdomains and/or top-level domains. At the time of this writing, the Xfinity program on bugcrowd forbade any exploitation of login.xfinity.com (shown below).

[Figure: blacklist]

When a program has out of scope domains/subdomains, we don’t want to waste time by including them in our pipeline. That’s where amass’s -blf option comes in! -blf accepts a Path to a file providing blacklisted subdomains. Using our earlier amass example as a baseline, an amass run against xfinity may look something like what’s below.

amass enum -active -ip -brute -min-for-recursive 3 -df xfinity -json amass.xfinity.json -blf xfinity.blacklist
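The file handed to -blf is just a text file with one out-of-scope name per line. As a minimal sketch, assuming we keep the out-of-scope names in a Python list, a hypothetical helper (write_exempt_list is not part of the pipeline) could produce it like so:

```python
from pathlib import Path


def write_exempt_list(path, subdomains):
    """Write one out-of-scope name per line, skipping blanks and duplicates."""
    seen = []
    for name in subdomains:
        name = name.strip().lower()
        if name and name not in seen:
            seen.append(name)
    Path(path).write_text("\n".join(seen) + "\n")
    return seen


# The duplicate (with trailing whitespace) and the empty entry are dropped
written = write_exempt_list(
    "xfinity.blacklist", ["login.xfinity.com", "login.xfinity.com ", ""]
)
print(written)  # ['login.xfinity.com']
```

The resulting xfinity.blacklist is what we would pass along as exempt_list.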

The Standard Glue

Next, let’s check our standard functions that make up these Tasks.

    def requires(self):
        return TargetList(self.target_file)

    def output(self):
        return luigi.LocalTarget(f"amass.{self.target_file}.json")

Staying true to previous Tasks, we’ll let luigi know that executing this Task will produce a file named amass.TARGET_FILE.json. Additionally, a TARGET_FILE must be present.

Enumerate All the Things!

Normally, this is where we’d explore the .run method, which, as you know by now, constitutes the core logic of a Task. Recall, though, that when we inherit from ExternalProgramTask, all we need to do is return a list from the overridden .program_args method. That list is then passed to the subprocess module for execution.

    def program_args(self):
        command = [
            "amass",
            "enum",
            "-active",
            "-ip",
            "-brute",
            "-min-for-recursive",
            "3",
            "-df",
            self.input().path,
            "-json",
            f"amass.{self.target_file}.json",
        ]

        if self.exempt_list:
            command.append("-blf")  # Path to a file providing blacklisted subdomains
            command.append(self.exempt_list)

        return command

There’s not much going on here. The command is built as a single list, one argument per element. The result of running targets.TargetList is passed to the -df option and we specify the output path of our JSON file. Lastly, if there are out-of-scope domains, the -blf option and its argument are appended to the list. That’s it; eazy peazy lemon squeezy!
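If you’d like to poke at the list-building logic without luigi in the mix, the same idea can be sketched as a plain function. build_amass_command is a hypothetical stand-in for program_args, and the file names are the ones from the xfinity example earlier.

```python
def build_amass_command(domains_file, json_out, exempt_list=""):
    """Build the amass argument list, one argument per list element."""
    command = [
        "amass",
        "enum",
        "-active",
        "-ip",
        "-brute",
        "-min-for-recursive",
        "3",
        "-df",
        domains_file,
        "-json",
        json_out,
    ]

    if exempt_list:
        # only add -blf when there is actually a blacklist to pass
        command += ["-blf", exempt_list]

    return command


with_blacklist = build_amass_command("xfinity", "amass.xfinity.json", "xfinity.blacklist")
print(" ".join(with_blacklist))
# amass enum -active -ip -brute -min-for-recursive 3 -df xfinity -json amass.xfinity.json -blf xfinity.blacklist
```

Keeping each argument as its own list element means no shell quoting is involved when the list is eventually executed.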

Finalized Code

As usual, here’s the finalized code.

import json
import ipaddress

import luigi
from luigi.util import inherits
from luigi.contrib.external_program import ExternalProgramTask

from recon.targets import TargetList


@inherits(TargetList)
class AmassScan(ExternalProgramTask):
    """ Run amass scan to perform subdomain enumeration of given domain(s).

    Expects TARGET_FILE.domains file to be a text file with one top-level domain per line.

    Commands are similar to the following

    amass enum -ip -brute -active -min-for-recursive 3 -df tesla -json amass.tesla.json

    Args:
        exempt_list: Path to a file providing blacklisted subdomains, one per line.
        target_file: specifies the file on disk containing a list of ips or domains *--* Required by upstream Task
    """

    exempt_list = luigi.Parameter(default="")

    def requires(self):
        """ AmassScan depends on TargetList to run.

        TargetList expects target_file as a parameter.

        Returns:
            luigi.ExternalTask - TargetList
        """
        return TargetList(self.target_file)

    def output(self):
        """ Returns the target output for this task.

        Naming convention for the output file is amass.TARGET_FILE.json.

        Returns:
            luigi.local_target.LocalTarget
        """
        return luigi.LocalTarget(f"amass.{self.target_file}.json")

    def program_args(self):
        """ Defines the options/arguments sent to amass after processing.

        Returns:
            list: list of options/arguments, beginning with the name of the executable to run
        """
        command = [
            "amass",
            "enum",
            "-active",
            "-ip",
            "-brute",
            "-min-for-recursive",
            "3",
            "-df",
            self.input().path,
            "-json",
            f"amass.{self.target_file}.json",
        ]

        if self.exempt_list:
            command.append("-blf")  # Path to a file providing blacklisted subdomains
            command.append(self.exempt_list)

        return command

Stage 6 - Processing Amass Output

With amass execution complete, we now need to process the results. Our goal in this section is to take amass’s JSON results and yank out each ip address (v4 and v6) as well as each subdomain. The reasoning is that tools further down the pipeline may expect one or the other, so we’ll be prepared in either case. Let’s goooooooo!

amass.ParseAmassOutput

We’ll begin with more of the same standard code we’re used to.

@inherits(AmassScan)
class ParseAmassOutput(luigi.Task):
    def requires(self):
        args = {"target_file": self.target_file, "exempt_list": self.exempt_list}
        return AmassScan(**args)

Nothing out of the ordinary with the code above. However, we want this particular Task to produce three files: one for ipv4, one for ipv6, and a third for subdomains. We haven’t returned anything except single files or folders thus far, but luigi makes it simple to do exactly what we need, as demonstrated below.

    def output(self):
        return {
            "target-ips": luigi.LocalTarget(f"{self.target_file}.ips"),
            "target-ip6s": luigi.LocalTarget(f"{self.target_file}.ip6s"),
            "target-subdomains": luigi.LocalTarget(f"{self.target_file}.subdomains"),
        }
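One consequence of returning several targets is worth spelling out: luigi’s default completeness check treats a Task as done only when every one of its outputs exists. The sketch below imitates that check with plain file paths instead of LocalTarget objects; output_paths and is_complete are hypothetical helpers, not luigi functions.

```python
import os
import tempfile


def output_paths(target_file):
    """Mirror the naming convention of the dict returned by output()."""
    return {
        "target-ips": f"{target_file}.ips",
        "target-ip6s": f"{target_file}.ip6s",
        "target-subdomains": f"{target_file}.subdomains",
    }


def is_complete(paths):
    """Done only if every output file is present."""
    return all(os.path.exists(path) for path in paths.values())


with tempfile.TemporaryDirectory() as workdir:
    paths = output_paths(os.path.join(workdir, "tesla"))

    # two of the three files exist: not complete yet
    open(paths["target-ips"], "w").close()
    open(paths["target-subdomains"], "w").close()
    partial = is_complete(paths)

    # the third file appears: now the task counts as done
    open(paths["target-ip6s"], "w").close()
    full = is_complete(paths)

print(partial, full)  # False True
```

This is why the .run method we write next opens all three files, even when a scan turns up no IPv6 addresses.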

Parse that JSON

To round out our ParseAmassOutput class, we have the .run method. Our job here is to parse the JSON file produced by AmassScan and categorize the results into ip address and subdomain files.

Before we can start parsing the JSON, we need to take a look at the output file and see what we’re dealing with. Below we see an example entry produced by amass.

{
    "Timestamp": "2019-09-22T19:20:13-05:00",
    "name": "beta-partners.tesla.com",
    "domain": "tesla.com",
    "addresses": [
        {
            "ip": "209.133.79.58",
            "cidr": "209.133.79.0/24",
            "asn": 394161,
            "desc": "TESLA - Tesla"
        }
    ],
    "tag": "ext",
    "source": "Previous Enum"
}

As stated earlier, our goal is to strip out the subdomains and ip addresses from the JSON file. We’ll begin by creating one set per collection of items. We use a set as our container because, by definition, a set is an unordered collection of unique values. So, if for any reason we see the same ip address or subdomain more than once while parsing, it won’t result in additional overhead for the rest of the pipeline.

        unique_ips = set()
        unique_ip6s = set()
        unique_subs = set()

With the data structure selected and initialized, we can open up the JSON file for reading along with one file per set to which we’ll write results.

        amass_json = self.input().open()
        ip_file = self.output().get("target-ips").open("w")
        ip6_file = self.output().get("target-ip6s").open("w")
        subdomain_file = self.output().get("target-subdomains").open("w")

Everything is in place now to iterate over the JSON entries and parse out the information we care about. Recall that ‘name’ is the subdomain returned by amass and ‘ip’ can contain either an IPv4 or IPv6 address, so we check for each and add to the appropriate set.

        with amass_json as aj, ip_file as ip_out, ip6_file as ip6_out, subdomain_file as subdomain_out:
            for line in aj:
                entry = json.loads(line)
                unique_subs.add(entry.get("name"))

                for address in entry.get("addresses"):
                    ipaddr = address.get("ip")
                    if isinstance(ipaddress.ip_address(ipaddr), ipaddress.IPv4Address):  # ipv4 addr
                        unique_ips.add(ipaddr)
                    elif isinstance(ipaddress.ip_address(ipaddr), ipaddress.IPv6Address):  # ipv6
                        unique_ip6s.add(ipaddr)
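The parsing step is easy to try in isolation, away from luigi and real scan output. The sketch below is a variation on the loop above rather than the pipeline’s code: categorize is a hypothetical helper, it uses the version attribute of the parsed address instead of isinstance, and it quietly skips blank lines and unparseable addresses (an assumption about what we’d want, not something amass output is known to require). The second sample entry and its IPv6 address are made up for illustration.

```python
import ipaddress
import json


def categorize(lines):
    """Split amass JSON lines into sets of IPv4 addresses, IPv6 addresses and subdomains."""
    ips, ip6s, subs = set(), set(), set()

    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines

        entry = json.loads(line)
        if entry.get("name"):
            subs.add(entry["name"])

        for address in entry.get("addresses") or []:
            ip = address.get("ip", "")
            try:
                version = ipaddress.ip_address(ip).version
            except ValueError:
                continue  # not a valid address; skip it

            (ips if version == 4 else ip6s).add(ip)

    return ips, ip6s, subs


sample = [
    '{"name": "beta-partners.tesla.com", "addresses": [{"ip": "209.133.79.58"}]}',
    '{"name": "beta-partners.tesla.com", "addresses": [{"ip": "209.133.79.58"}, {"ip": "2001:db8::1"}]}',
]

ips, ip6s, subs = categorize(sample)
print(ips)   # {'209.133.79.58'}
print(ip6s)  # {'2001:db8::1'}
print(subs)  # {'beta-partners.tesla.com'}
```

Even though the subdomain and its IPv4 address appear twice in the input, each shows up once in the results, which is exactly what the sets buy us.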

Finally, we can send our results to their respective files.

            for ip in unique_ips:
                print(ip, file=ip_out)

            for sub in unique_subs:
                print(sub, file=subdomain_out)

            for ip6 in unique_ip6s:
                print(ip6, file=ip6_out)

Finalized Code

Here we have the final code.

@inherits(AmassScan)
class ParseAmassOutput(luigi.Task):
    """ Read amass JSON results and create categorized entries into ip|subdomain files.

    Args:
        target_file: specifies the file on disk containing a list of ips or domains *--* Required by upstream Task
        exempt_list: Path to a file providing blacklisted subdomains, one per line. *--* Optional for upstream Task
    """

    def requires(self):
        """ ParseAmassOutput depends on AmassScan to run.

        TargetList expects target_file as a parameter.
        AmassScan accepts exempt_list as an optional parameter.

        Returns:
            luigi.Task - AmassScan
        """

        args = {"target_file": self.target_file, "exempt_list": self.exempt_list}
        return AmassScan(**args)

    def output(self):
        """ Returns the target output files for this task.

        Naming conventions for the output files are:
            TARGET_FILE.ips
            TARGET_FILE.ip6s
            TARGET_FILE.subdomains

        Returns:
            dict(str: luigi.local_target.LocalTarget)
        """
        return {
            "target-ips": luigi.LocalTarget(f"{self.target_file}.ips"),
            "target-ip6s": luigi.LocalTarget(f"{self.target_file}.ip6s"),
            "target-subdomains": luigi.LocalTarget(f"{self.target_file}.subdomains"),
        }

    def run(self):
        """ Parse the json file produced by AmassScan and categorize the results into ip|subdomain files.

        An example (prettified) entry from the json file is shown below
            {
              "Timestamp": "2019-09-22T19:20:13-05:00",
              "name": "beta-partners.tesla.com",
              "domain": "tesla.com",
              "addresses": [
                {
                  "ip": "209.133.79.58",
                  "cidr": "209.133.79.0/24",
                  "asn": 394161,
                  "desc": "TESLA - Tesla"
                }
              ],
              "tag": "ext",
              "source": "Previous Enum"
            }
        """
        unique_ips = set()
        unique_ip6s = set()
        unique_subs = set()

        amass_json = self.input().open()
        ip_file = self.output().get("target-ips").open("w")
        ip6_file = self.output().get("target-ip6s").open("w")
        subdomain_file = self.output().get("target-subdomains").open("w")

        with amass_json as aj, ip_file as ip_out, ip6_file as ip6_out, subdomain_file as subdomain_out:
            for line in aj:
                entry = json.loads(line)
                unique_subs.add(entry.get("name"))

                for address in entry.get("addresses"):
                    ipaddr = address.get("ip")
                    if isinstance(ipaddress.ip_address(ipaddr), ipaddress.IPv4Address):  # ipv4 addr
                        unique_ips.add(ipaddr)
                    elif isinstance(ipaddress.ip_address(ipaddr), ipaddress.IPv6Address):  # ipv6
                        unique_ip6s.add(ipaddr)

            # send gathered results to their appropriate destination
            for ip in unique_ips:
                print(ip, file=ip_out)

            for sub in unique_subs:
                print(sub, file=subdomain_out)

            for ip6 in unique_ip6s:
                print(ip6, file=ip6_out)

Bonus Round - Linking IP and Domain Branches

In this section, we’re covering the changes we need to make in order to link the two branches. It may be easier to follow while looking at the commit’s diff on github.

As discussed earlier, we have two divergent paths that our pipeline execution can take. It would be much cooler if we could execute the subdomain path and then have it feed into the ip address path (assuming we started with a domain). Fortunately, we can make that dream a reality.

This time around, making our pipeline do what we want is much less intuitive than most of the other luigi code we’ve written. Fear not! Our answer lies in luigi’s handling of dynamic dependencies. Below is an excerpt from the luigi docs.

Sometimes you might not know exactly what other tasks to depend on until runtime. In that case, Luigi provides a mechanism to specify dynamic dependencies. If you yield another Task in the Task.run method, the current task will be suspended and the other task will be run. You can also yield a list of tasks.

So, all we’ll need to do is alter masscan.Masscan a bit to dynamically run the domain path if we receive a list of domains. Recall that our domain path turns subdomains into ips, which can then be fed into masscan.Masscan. Let’s see what that looks like in practice.

First off, we’ll need to import subprocess and our amass.ParseAmassOutput class. We need subprocess because we’re going to change masscan.Masscan to inherit from luigi.Task instead of ExternalProgramTask. That means that we’ll need to handle our own execution of the masscan binary in the .run method. While we’re updating the import section, we can also remove the ExternalProgramTask import (from luigi.contrib.external_program import ExternalProgramTask), since nothing will use it anymore.

[Figure: import-change]

Next, we’ll need to update our inherits decorator. We need to add ParseAmassOutput to our decorator in order to include that Task’s additional Parameters.

After that, we’ll need to change the class from which we’re inheriting. Use of dynamic dependencies dictates that we inherit from luigi.Task in order to have a .run method to override.

[Figure: class-change]

Due to how we’re going to handle linking the two branches, we can actually remove the entire requires function.

[Figure: remove-requires]

With that complete, we’ll rename the program_args method to run.

[Figure: progargs-to-run]

At last, we’re at the real meat of specifying our dynamic dependencies. We’ll begin by yielding from (running) the targets.TargetList Task. The result of the yield statement is the same as if we called self.input() from a normal Task. We can then use the result of running targets.TargetList to determine if we should run amass.ParseAmassOutput or not!

[Figure: the-meat-change]
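If the yield-and-receive dance feels like magic, it helps to see it with plain Python generators. What follows is a toy driver, not luigi’s scheduler: drive, toy_run and resolve are all hypothetical names, tasks are represented as simple tuples, and the output paths returned by resolve (tesla.domains, tesla.ips) are assumptions made for the example.

```python
def drive(task_gen, resolve):
    """Resolve each yielded dependency and send its output back into run()."""
    resolved = []
    try:
        dependency = next(task_gen)
        while True:
            resolved.append(dependency)
            # the value passed to send() becomes the result of the yield expression
            dependency = task_gen.send(resolve(dependency))
    except StopIteration:
        return resolved


def toy_run(target_file):
    """Stand-in for Masscan.run: only ask for the domain branch when needed."""
    target_path = yield ("TargetList", target_file)

    if target_path.endswith("domains"):
        yield ("ParseAmassOutput", target_file)


def resolve(dependency):
    """Pretend to run a task and hand back the path of its output."""
    name, target_file = dependency
    return f"{target_file}.domains" if name == "TargetList" else f"{target_file}.ips"


print(drive(toy_run("tesla"), resolve))
# [('TargetList', 'tesla'), ('ParseAmassOutput', 'tesla')]
```

The decision to yield the second dependency is made at runtime, based on what the first one produced; that is the whole trick behind dynamic dependencies.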

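We have two more small changes to make. The first of those is that we need to change the file that is passed to masscan’s -iL option. Currently, we pass it self.input().path, which corresponds to whatever targets.TargetList would have returned as a result of running the (now deleted) .requires method. Instead, we’ll pass the path we got back from yielding TargetList, with domains swapped for ips, so that a list of domains ends up pointing at the .ips file written by ParseAmassOutput.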

[Figure: tack-il-change]
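One thing to keep in mind when swapping domains for ips with str.replace is that it replaces every occurrence in the path, not just the extension. That’s fine for a target file like tesla, but a stricter, suffix-only swap is easy to sketch. ips_path_for is a hypothetical helper, and it assumes the target list file ends in .domains.

```python
def ips_path_for(target_path):
    """Swap a trailing .domains for .ips; leave any other path untouched."""
    suffix = ".domains"
    if target_path.endswith(suffix):
        return target_path[: -len(suffix)] + ".ips"
    return target_path


print(ips_path_for("tesla.domains"))  # tesla.ips
print(ips_path_for("tesla.ips"))      # tesla.ips

# str.replace touches every occurrence, which matters for unusual file names
print("domains-2020.domains".replace("domains", "ips"))  # ips-2020.ips
print(ips_path_for("domains-2020.domains"))              # domains-2020.ips
```

Whether the extra care is worth it depends on how you name your target files; for the names used in this series, both approaches give the same result.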

Additionally, we need to run subprocess.run ourselves, because we no longer inherit from ExternalProgramTask.

[Figure: subprocess-run-change]
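For anyone who hasn’t called subprocess.run directly before, here’s a small, self-contained sketch. It runs the Python interpreter as a stand-in for masscan so that it works on any machine. It also passes check=True, which is an optional extra and an assumption about what you might want: with it, a non-zero exit status raises an exception instead of passing silently. Every element of the argument list needs to be a string (or bytes/path-like), so numeric values have to be converted first.

```python
import subprocess
import sys

# Stand-in command; in the pipeline this would be the masscan argument list
command = [sys.executable, "-c", "print('scan complete')"]

completed = subprocess.run(command, check=True, capture_output=True, text=True)
print(completed.returncode, completed.stdout.strip())  # 0 scan complete

# With check=True, a failing command raises instead of being ignored
failed_status = None
try:
    subprocess.run([sys.executable, "-c", "raise SystemExit(3)"], check=True)
except subprocess.CalledProcessError as err:
    failed_status = err.returncode

print("exit status", failed_status)  # exit status 3
```

Without check=True, the call simply returns, and it’s up to us to look at returncode if we care whether the scan succeeded.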

With all of those changes in place, we’re left with a dependency graph that looks something like this, huzzah!

[Figure: joined-task-depends]

Finalized Code

Here we have the final code.

import json
import pickle
import logging
import subprocess
from collections import defaultdict

import luigi
from luigi.util import inherits

from recon.targets import TargetList
from recon.amass import ParseAmassOutput
from recon.config import top_tcp_ports, top_udp_ports, masscan_config


@inherits(TargetList, ParseAmassOutput)
class Masscan(luigi.Task):
    """ Run masscan against a target specified via the TargetList Task.

    Masscan commands are structured like the example below.  When specified, --top_ports is processed and
    then ultimately passed to --ports.

    masscan -v --open-only --banners --rate 1000 -e tun0 -oJ masscan.tesla.json --ports 80,443,22,21 -iL tesla.ips

    The corresponding luigi command is shown below.

    PYTHONPATH=$(pwd) luigi --local-scheduler --module recon.masscan Masscan --target-file tesla --ports 80,443,22,21

    Args:
        rate: desired rate for transmitting packets (packets per second)
        interface: use the named raw network interface, such as "eth0"
        top_ports: Scan top N most popular ports
        ports: specifies the port(s) to be scanned
        target_file: specifies the file on disk containing a list of ips or domains *--* Required by upstream Task
        exempt_list: Path to a file providing blacklisted subdomains, one per line. *--* Optional for upstream Task
    """

    rate = luigi.Parameter(default=masscan_config.get("rate"))
    interface = luigi.Parameter(default=masscan_config.get("iface"))
    top_ports = luigi.IntParameter(default=0)  # IntParameter -> top_ports expected as int
    ports = luigi.Parameter(default="")

    def __init__(self, *args, **kwargs):
        super(Masscan, self).__init__(*args, **kwargs)
        self.masscan_output = f"masscan.{self.target_file}.json"

    def output(self):
        """ Returns the target output for this task.

        Naming convention for the output file is masscan.TARGET_FILE.json.

        Returns:
            luigi.local_target.LocalTarget
        """
        return luigi.LocalTarget(self.masscan_output)

    def run(self):
        """ Validates the port options, yields the upstream Tasks as dynamic dependencies, then runs masscan. """
        if self.ports and self.top_ports:
            # can't have both
            logging.error("Only --ports or --top-ports is permitted, not both.")
            exit(1)

        if not self.ports and not self.top_ports:
            # need at least one
            logging.error("Must specify either --top-ports or --ports.")
            exit(2)

        if self.top_ports < 0:
            # sanity check
            logging.error("--top-ports must be greater than 0")
            exit(3)

        if self.top_ports:
            # if --top-ports used, format the top_*_ports lists as strings and then into a proper masscan --ports option
            top_tcp_ports_str = ",".join(str(x) for x in top_tcp_ports[: self.top_ports])
            top_udp_ports_str = ",".join(str(x) for x in top_udp_ports[: self.top_ports])

            self.ports = f"{top_tcp_ports_str},U:{top_udp_ports_str}"
            self.top_ports = 0

        # dynamic dependency: run TargetList and receive its output target
        target_list = yield TargetList(target_file=self.target_file)

        if target_list.path.endswith("domains"):
            # we were given domains, so run the subdomain branch to produce the .ips file
            yield ParseAmassOutput(target_file=self.target_file, exempt_list=self.exempt_list)

        command = [
            "masscan",
            "-v",
            "--open",
            "--banners",
            "--rate",
            self.rate,
            "-e",
            self.interface,
            "-oJ",
            self.masscan_output,
            "--ports",
            self.ports,
            "-iL",
            target_list.path.replace("domains", "ips"),
        ]

        # no ExternalProgramTask anymore, so we execute masscan ourselves
        subprocess.run(command)

That wraps things up for this post. In the next installment, we’ll get started with the web scanning portion of our pipeline!


Additional Resources

  1. amass
  2. amass user’s guide
  3. pentester.land’s compilation of recon workflows
  4. LevelUp 0x04 - OWASP Amass – Discovering Internet Exposure
  5. Luigi - Dynamic Dependencies
