Mach-O File Format: Introduction


I've recently been working a lot on parsing Mach-O files, so I'm beginning to understand in a fair bit of detail how they are structured and how they work. I've been developing a library called libhelper, which can parse Mach-O files. Libhelper-macho also powers Img4helper and HTool.

This is not a complete writeup or documentation covering everything about Mach-O's, and I appreciate this has probably been covered to death. It's not aimed at those who already have an advanced knowledge of how Mach or Darwin works; rather, it's aimed at those who are in the position I was in a few weeks ago, with limited knowledge of how Mach-O's are structured. Even so, I felt this would be a useful resource, and a good way to kick off my blog.

There are multiple types of Mach-O, such as Executable or KEXT Bundles, so I can't cover them all. My aim for this post is to discuss the basics - namely Header, Load Commands and Segment Commands. I may discuss other areas in the future but this is a start.

What are Mach-O files

Mach-O files, or Mach Object files, are the executable format used on Operating Systems based on the Mach Kernel. This includes Apple's Darwin-based systems: iOS, macOS, watchOS etc. There are multiple types of Mach-O file, such as executables, object code, shared and dynamic libraries, kernel extension (KEXT) bundles and even debug companion files.

Mach-O Format

Mach-O files are simply binary files; there isn't anything particularly special about them in that regard. You can read some bytes into a C structure and boom, you've parsed a Mach-O (or at least part of it). Natively, they can only be run on Mach/Darwin/XNU-based systems, although there are some implementations for loading and executing Mach-O files on Linux. While you can run simple applications this way, the majority of applications will not work due to their reliance on certain macOS libraries, such as /usr/lib/libSystem.B.dylib.

A Mach-O is made up of one Mach header, a number of load commands (specified in the header) and the data. The data is organised into Segments, which are made up of 0 to 255 Sections, and there are special load commands to describe them. Mach-O files are organised as follows:

  1. Mach-O Header
  2. Load Commands
  3. Data

The purpose of this article is to discuss, at a higher level, each of these areas of a Mach-O file, how data is organised and how to load this data from a given Mach-O file into relevant C structures.

Starting with the Mach header: its purpose is to describe what the file contains, and how the Kernel and Dynamic Linker should handle it. The first 4 bytes are, as with many file formats, its "Magic Number", which is used to identify the format. In the case of Mach-O's there are three Magic Numbers that one may come across: 0xfeedface for 32-bit, 0xfeedfacf for 64-bit and 0xcafebabe for Universal (fat) binaries, which wrap several Mach-O's for different architectures in one file.
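
As a minimal sketch of the magic number check, the function below classifies a buffer from its first four bytes. The names macho_kind_t, swap32 and macho_classify are hypothetical and not taken from libhelper. It also accepts the byte-swapped form of each magic, which is an addition to what is described above: that is what you read when the file's byte order differs from the host's.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MH_MAGIC     0xfeedfaceu  /* 32-bit Mach-O */
#define MH_MAGIC_64  0xfeedfacfu  /* 64-bit Mach-O */
#define FAT_MAGIC    0xcafebabeu  /* universal (fat) binary */

/* Hypothetical result type for this sketch. */
typedef enum { MACHO_UNKNOWN, MACHO_32, MACHO_64, MACHO_FAT } macho_kind_t;

/* Reverse the byte order of a 32-bit value. */
static uint32_t swap32 (uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Classify a file from its first four bytes. */
static macho_kind_t macho_classify (const unsigned char *data, size_t len)
{
    uint32_t magic;

    assert (data != NULL);
    if (len < sizeof (magic)) return MACHO_UNKNOWN;
    memcpy (&magic, data, sizeof (magic));

    /* Compare in host order first, then in swapped order. */
    if (magic == MH_MAGIC    || swap32 (magic) == MH_MAGIC)    return MACHO_32;
    if (magic == MH_MAGIC_64 || swap32 (magic) == MH_MAGIC_64) return MACHO_64;
    if (magic == FAT_MAGIC   || swap32 (magic) == FAT_MAGIC)   return MACHO_FAT;
    return MACHO_UNKNOWN;
}
```

Because both byte orders are accepted, the result does not depend on the machine the parser runs on, but a real parser would also need to remember whether a swap was required so it can swap the remaining header fields.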

Other properties of a Mach-O Header include the cpu type and subtype, which define the architecture the Mach-O is built for (e.g. arm64, x86_64, arm64_32), the number of Load Commands, the size of the Load Command region, and flags to be passed to the Dynamic Linker. The layout of the header is shown below:

    struct mach_header {
        uint32_t            magic;          // mach magic number
        cpu_type_t          cputype;        // cpu specifier
        cpu_subtype_t       cpusubtype;     // cpu subtype specifier
        uint32_t            filetype;       // type of mach-o e.g. exec, dylib ...
        uint32_t            ncmds;          // number of load commands
        uint32_t            sizeofcmds;     // size of load command region
        uint32_t            flags;          // flags
        uint32_t            reserved;       // *64-bit only* reserved
    };

The Mach-O header takes up 32 bytes for 64-bit files, and 28 bytes for 32-bit files. You can populate the header structure by using memcpy() to copy the correct number of bytes into a mach_header structure, and you'll then be able to access the header elements as normal.
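
A rough sketch of that memcpy() step for the 64-bit case follows. The struct is a local mirror of the header so the example builds without <mach-o/loader.h>, and macho_read_header is a hypothetical name. It assumes the file uses the same byte order as the host, and adds two sanity checks that the text above does not mention.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the 64-bit header; cpu_type_t and cpu_subtype_t
 * are 32-bit integers. */
struct mach_header_64 {
    uint32_t magic;
    int32_t  cputype;
    int32_t  cpusubtype;
    uint32_t filetype;
    uint32_t ncmds;
    uint32_t sizeofcmds;
    uint32_t flags;
    uint32_t reserved;
};

static_assert (sizeof (struct mach_header_64) == 32,
               "the 64-bit header is 32 bytes");

/* Copy the header out of a file buffer. Returns 0 on success, or -1 if
 * the buffer is too small or does not start with the 64-bit magic. */
static int macho_read_header (const unsigned char *data, size_t len,
                              struct mach_header_64 *out)
{
    if (len < sizeof (*out)) return -1;
    memcpy (out, data, sizeof (*out));

    if (out->magic != 0xfeedfacfu) return -1;

    /* The load commands must fit in what is left of the buffer. */
    if (out->sizeofcmds > len - sizeof (*out)) return -1;
    return 0;
}
```

Copying into a separate struct, rather than casting the data pointer, avoids alignment problems when the buffer is not suitably aligned.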

Load Commands

Load Commands are placed directly after the Mach-O header in the file. They specify the logical structure of the file and the layout of the file in virtual memory.

All Load Commands have a common 8 byte structure which identifies the type of the command and its size. This common structure is defined as follows:

    struct load_command {
        uint32_t    cmd;        // type of load command
        uint32_t    cmdsize;    // size of load command
    };

There are several dozen Load Commands; some are common across all Mach-O's and some are only found in certain cases. Load Commands are placed after the Mach-O header, with the first being Segment Commands. These are discussed further under Segment Commands.

But Segment Commands are not the only commands included in the majority of Mach-O files. The LC_DYLD_INFO command specifies rebase, bind, weak bind, lazy bind and export information for the Dynamic Linker, and LC_LOAD_DYLINKER gives the path of the Dynamic Linker the Kernel should use to execute the binary. Mach-O's frequently require Dynamic Libraries, especially /usr/lib/libSystem.B.dylib. The LC_LOAD_DYLIB command defines the path the Dynamic Linker uses to find the Dylib, and there are as many of these commands as there are required Dynamic Libraries.
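
To illustrate how a library path is stored, here is a small sketch that pulls the path out of an LC_LOAD_DYLIB command. The struct is a flattened local mirror of dylib_command from loader.h (the nested name union is reduced to its offset field), and dylib_path is a hypothetical helper. It assumes host byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LC_LOAD_DYLIB 0xc

/* Flattened local mirror of dylib_command. */
struct dylib_command {
    uint32_t cmd;                    /* LC_LOAD_DYLIB */
    uint32_t cmdsize;                /* includes the path string */
    uint32_t name_offset;            /* path offset from the command start */
    uint32_t timestamp;
    uint32_t current_version;
    uint32_t compatibility_version;
};

static_assert (sizeof (struct dylib_command) == 24,
               "the fixed part of the command is 24 bytes");

/* Return a pointer to the path stored inside the command, or NULL if the
 * command is malformed. `avail` is the number of bytes left in the buffer
 * starting at cmd_start. */
static const char *dylib_path (const unsigned char *cmd_start, size_t avail)
{
    struct dylib_command dc;

    if (avail < sizeof (dc)) return NULL;
    memcpy (&dc, cmd_start, sizeof (dc));

    if (dc.cmd != LC_LOAD_DYLIB) return NULL;
    if (dc.cmdsize > avail) return NULL;
    if (dc.name_offset < sizeof (dc) || dc.name_offset >= dc.cmdsize)
        return NULL;

    /* The string has to be terminated inside the command itself. */
    const char *path = (const char *) cmd_start + dc.name_offset;
    if (!memchr (path, '\0', dc.cmdsize - dc.name_offset)) return NULL;
    return path;
}
```

The path is not a fixed-size field: it lives in the variable-length tail of the command, which is why cmdsize differs from one LC_LOAD_DYLIB to the next.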

The offsets and sizes for both the symbol table and the string table are defined with LC_SYMTAB, and offsets for local, external, undefined and other types of dynamic symbols are defined with LC_DYSYMTAB.

The last command that I will discuss here is LC_MAIN, which defines the offset of the entry point, i.e. where execution of the binary should start. This is only used for MH_EXECUTE filetypes.
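
As a small worked example, the sketch below turns the LC_MAIN offset into a virtual address. The struct mirrors entry_point_command from loader.h; entry_address is a hypothetical helper, and it assumes the __TEXT segment maps file offset 0, which is the usual layout for an executable but is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

#define LC_MAIN 0x80000028u

/* Local mirror of entry_point_command. */
struct entry_point_command {
    uint32_t cmd;        /* LC_MAIN */
    uint32_t cmdsize;    /* 24 */
    uint64_t entryoff;   /* file offset of the entry point */
    uint64_t stacksize;  /* initial stack size, or 0 for the default */
};

static_assert (sizeof (struct entry_point_command) == 24,
               "LC_MAIN is 24 bytes");

/* Virtual address of the entry point, ASSUMING __TEXT maps file offset 0
 * so that a file offset can be added directly to the segment's vmaddr. */
static uint64_t entry_address (const struct entry_point_command *ep,
                               uint64_t text_vmaddr)
{
    assert (ep->cmd == LC_MAIN);
    return text_vmaddr + ep->entryoff;
}
```

With the values from the htool listing in this post (entry point 0xd40, __TEXT at 0x100000000) this gives 0x100000d40, which falls inside the __TEXT.__text range reported there.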

Below is output from an experimental version of htool showing all of the Load Commands from itself. I've omitted some parts because the output is rather long.

$ htool_debug -l $(which htool_debug)
HTool Version 1.0.0~Alpha; Sat Jan  4 02:53:36 2020; libhelper-1000.742.56.7.4/ALPHA_X86_64 x86_64

LC 00: LC_SEGMENT_64	Off: 0x000000000-0x100000000	__PAGEZERO
    No Section 64 data
LC 01: LC_SEGMENT_64	Off: 0x100000000-0x100012000	__TEXT
    Off: 0x100000b00-0x10000f4cf	59855 bytes		__TEXT.__text
    Off: 0x10000f4d0-0x10000f656	390 bytes		__TEXT.__stubs
    ...
LC 05: LC_DYLD_INFO_ONLY
    Rebase info:     40	bytes at offset 0x14000 (0x14000-0x14028)
    Bind info:       88	bytes at offset 0x14028 (0x14028-0x14080)
    No Weak Bind info
    Lazy Bind info:  1048	bytes at offset 0x14080 (0x14080-0x14498)
    Export info:     3640	bytes at offset 0x14498 (0x14498-0x152d0)
LC 06: LC_SYMTAB
    1434 symbols in file
    symbol table offset: 0x00015478
    string table offset: 0x0001b034
    string table size: 12680 bytes
LC 07: LC_DYSYMTAB
    1196 local symbols at 0
    169 external symbols at 1196
    69 undefined symbols at 1365
    No TOC
    No modtab
    135 indirect symtab entries at 110104
    No External Relocation Entries
    No Local Relocation Entries
LC 08: LC_LOAD_DYLINKER		/usr/lib/dyld
LC 09: LC_UUID			UUID:			3C7070A4-E053-3DA2-99C6-44DA4D6D2055
LC 10: LC_BUILD_VERSION		Build Version:		Platform: macOS,	Minos: 10.15,	SDK: 10.15
                            Tool 0:	 LD (v520.0.0)
LC 11: LC_SOURCE_VERSION	Source Version:		0.0
LC 12: LC_MAIN			Entry Point:		0xd40
LC 13: LC_LOAD_DYLIB		/usr/lib/libSystem.B.dylib
LC 14: LC_RPATH			@loader_path/../libhelper/src
LC 15: LC_RPATH			@loader_path/../editline
LC 16: LC_FUNCTION_STARTS	Offset:   0x152d0, Size:   376 bytes (0x000152d0-0x00015448)
LC 17: LC_DATA_IN_CODE		Offset:   0x15448, Size:   48 bytes (0x00015448-0x00015478)

Going back to struct load_command: from the perspective of trying to parse Mach-O's, having a constant format for the first 8 bytes of each Load Command makes detecting and parsing them easier. The following shows how we can parse a command, using LC_MAIN as an example. The code is based on XNU's loader.h rather than libhelper.

...
    // Offset of the load commands from the base of the file.
    uint32_t offset = sizeof (mach_header_t);

    // Loop round each load command.
    for (uint32_t i = 0; i < header->ncmds; i++) {

        // Copy the common 8-byte prefix into a base load command.
        struct load_command lc;
        memcpy (&lc, data + offset, sizeof (struct load_command));

        // A cmdsize smaller than the prefix would stall or corrupt the walk.
        if (lc.cmdsize < sizeof (struct load_command)) return NULL;

        // Here you would check the type of command.
        ...
        if (lc.cmd == LC_MAIN) {

            // Allocate the full command and verify the allocation worked
            struct entry_point_command *lc_main = malloc (lc.cmdsize);
            if (!lc_main) return NULL;

            // Now copy the data
            memcpy (lc_main, data + offset, lc.cmdsize);

            // We can now do as we please with the data.

            // Release it once we are done with it.
            free (lc_main);
        }
        ...

        // Increment the offset by the size of the current command
        offset += lc.cmdsize;
    }
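
For anyone who wants something they can compile on its own, the following is a self-contained sketch of the same walk with bounds checks added. find_load_command is a hypothetical name, the struct is a local copy of load_command, and the code assumes the file is in host byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the common 8-byte prefix. */
struct load_command {
    uint32_t cmd;
    uint32_t cmdsize;
};

/* Walk `ncmds` load commands starting at `offset` bytes into `data` and
 * return the file offset of the first one of type `wanted`. Returns -1 if
 * there is no such command or the command list runs outside the buffer. */
static int64_t find_load_command (const unsigned char *data, size_t len,
                                  size_t offset, uint32_t ncmds,
                                  uint32_t wanted)
{
    assert (data != NULL);

    for (uint32_t i = 0; i < ncmds; i++) {
        struct load_command lc;

        /* The 8-byte prefix must lie inside the buffer. */
        if (offset > len || len - offset < sizeof (lc)) return -1;
        memcpy (&lc, data + offset, sizeof (lc));

        /* cmdsize counts the prefix too and must not run past the end. */
        if (lc.cmdsize < sizeof (lc) || lc.cmdsize > len - offset) return -1;

        if (lc.cmd == wanted) return (int64_t) offset;

        /* Step over this command to the next one. */
        offset += lc.cmdsize;
    }
    return -1;
}
```

For a 64-bit file the starting offset would be 32, the size of the header, and ncmds comes from the header. The returned offset can then be used to copy the full command into its specific structure.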

If you are interested in learning more about the different types of Load Commands, you can either check out EXTERNAL_HEADERS/mach-o/loader.h in the XNU sources, or include/libhelper-macho/macho-command-types.h from Libhelper.

Segment Commands

Going back to Segment Commands, the first few Load Commands in a Mach-O are either LC_SEGMENT for 32-bit, or LC_SEGMENT_64 for 64-bit. These define an object file's Segments.

If you are unfamiliar with how object files work, you have a number of these segments. The __TEXT segment contains the instructions that will be executed by the CPU, and the __DATA segment contains both static local variables and global variables. These are both standard, however you may find additional segments such as __PAGEZERO and __LINKEDIT, and in XNU Kernelcaches, you'll get even more funky segment names like __PRELINK_INFO and __LAST.

Segments are further divided into sections, so for example you'll find __cstring in the __TEXT segment, formatted as __TEXT.__cstring, as a common one.
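
One detail worth a sketch when printing these names: segname and sectname are fixed 16-byte fields, and a name that uses all 16 bytes has no terminating NUL. The hypothetical helper below builds the dotted form used in this post with a capped precision, so it never reads past either field.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format "SEGMENT.section" from two 16-byte name fields into `out`.
 * The fields are NUL-padded, but a 16-character name fills the whole
 * field, so %.16s is used to stop after at most 16 bytes. */
static int format_section_name (char *out, size_t outlen,
                                const char segname[16],
                                const char sectname[16])
{
    assert (out != NULL && outlen > 0);
    return snprintf (out, outlen, "%.16s.%.16s", segname, sectname);
}
```

Passing such a field straight to a plain %s, or to strlen(), would read beyond the field whenever the name is exactly 16 characters long.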

The Segment Commands in a Mach-O define which regions of the binary data should be mapped into memory, and where. Looking at the segment_command_64 struct, there's the segment's name as segname, and then two sets of address/size pairs.

The vmaddr and vmsize define the virtual memory address and size of the segment, while fileoff and filesize give the segment's location and size within the file. maxprot and initprot define the maximum and initial virtual memory protection for the segment, which can, for example, prevent it from being both writable and executable at the same time. Finally there are the flags, which are just a way of giving the Kernel options for loading the segment into memory.

    struct segment_command_64 {
        uint32_t    cmd;            /* LC_SEGMENT_64 */
        uint32_t    cmdsize;        /* includes sizeof section_64 structs */
        char        segname[16];    /* segment name */
        uint64_t    vmaddr;         /* memory address of this segment */
        uint64_t    vmsize;         /* memory size of this segment */
        uint64_t    fileoff;        /* file offset of this segment */
        uint64_t    filesize;       /* amount to map from the file */
        vm_prot_t   maxprot;        /* maximum VM protection */
        vm_prot_t   initprot;       /* initial VM protection */
        uint32_t    nsects;         /* number of sections in segment */
        uint32_t    flags;          /* flags */
    };

As mentioned, segments are divided into sections. The section structures are placed directly after the segment command, are included in its cmdsize and are counted by nsects. Sections essentially divide segments up into more meaningful chunks, for example __TEXT.__text or __TEXT.__const.

To load these, we must take the offset of the segment command in the file, add the size of the segment structure, and then loop through nsects times, incrementing the offset by the size of the section struct each time.

To start, the section structure is defined as follows. Again, there are both section_64 and section structures, with the difference being that the 64-bit section_64 struct uses uint64_t for both addr and size, and has a third reserved property at the end of the structure, which is not designated for any particular use:

    struct section_64 {
        char        sectname[16];   /* name of this section */
        char        segname[16];    /* segment this section goes in */
        uint64_t    addr;           /* memory address of this section */
        uint64_t    size;           /* size in bytes of this section */
        uint32_t    offset;         /* file offset of this section */
        uint32_t    align;          /* section alignment (power of 2) */
        uint32_t    reloff;         /* file offset of relocation entries */
        uint32_t    nreloc;         /* number of relocation entries */
        uint32_t    flags;          /* flags (section type and attributes)*/
        uint32_t    reserved1;      /* reserved (for offset or index) */
        uint32_t    reserved2;      /* reserved (for count or sizeof) */
        uint32_t    reserved3;      /* reserved */
    };

As I just stated, we can load the correct data into that structure by adding sizeof (segment_command_64) to the offset of the command in the file, then add sizeof(section_64) for each of segment->nsects. Here is an example of what I mean (note this time I am using libhelper code to demonstrate):

    // Example code from libhelper/src/macho/macho-segment.c
    mach_segment_info_t *mach_segment_info_load (unsigned char *data, uint32_t offset)
    {
        // Create a new segment info struct and load the segment command
        ...

        // The section commands are placed directly after the segment command.
        uint32_t sectoff = offset + sizeof (mach_segment_command_64_t);
        for (int i = 0; i < (int) segment->nsects; i++) {

            // Load a section 64
            uint32_t ssize = sizeof (mach_section_64_t);
            mach_section_64_t *sect = malloc (ssize);

            // Copy ssize-bytes from the offset in data into the sect struct.
            memset (sect, '\0', ssize);
            memcpy (sect, data + sectoff, ssize);

            // This adds the section to the sections list of a mach_segment_info_t struct.
            seg_inf->sections = h_slist_append (seg_inf->sections, sect);
            sectoff += sizeof (mach_section_64_t);
        }

        seg_inf->segcmd = segment;
        return seg_inf;
    }

The mach_segment_info_t struct is not implemented in XNU's standard loader.h, so if you're writing your own Mach-O parser, please ignore references to Libhelper structs.
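
For those writing their own parser, here is a rough libhelper-free equivalent of the offset arithmetic. The two structs are local mirrors of the loader.h definitions (with vm_prot_t written as int32_t), load_section is a hypothetical name, and host byte order is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct segment_command_64 {
    uint32_t cmd, cmdsize;
    char     segname[16];
    uint64_t vmaddr, vmsize, fileoff, filesize;
    int32_t  maxprot, initprot;
    uint32_t nsects, flags;
};

struct section_64 {
    char     sectname[16];
    char     segname[16];
    uint64_t addr, size;
    uint32_t offset, align, reloff, nreloc, flags;
    uint32_t reserved1, reserved2, reserved3;
};

static_assert (sizeof (struct segment_command_64) == 72, "segment is 72 bytes");
static_assert (sizeof (struct section_64) == 80, "section is 80 bytes");

/* Copy section number `index` of the segment command found at `offset`
 * into `out`. Returns 0 on success, -1 if anything is out of range. */
static int load_section (const unsigned char *data, size_t len, size_t offset,
                         uint32_t index, struct section_64 *out)
{
    struct segment_command_64 seg;

    if (offset > len || len - offset < sizeof (seg)) return -1;
    memcpy (&seg, data + offset, sizeof (seg));

    /* The command must fit in the buffer and hold every section. */
    if (seg.cmdsize < sizeof (seg) || seg.cmdsize > len - offset) return -1;
    if ((seg.cmdsize - sizeof (seg)) / sizeof (*out) < seg.nsects) return -1;
    if (index >= seg.nsects) return -1;

    /* Sections follow the segment command back to back. */
    size_t sectoff = offset + sizeof (seg) + (size_t) index * sizeof (*out);
    memcpy (out, data + sectoff, sizeof (*out));
    return 0;
}
```

In other words, section i of a 64-bit segment sits at the segment's offset plus 72 plus 80 times i, and a well-formed cmdsize is at least 72 plus 80 times nsects.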

Looking at mach_segment_info_load in more detail, two arguments are passed to it: an unsigned char *data pointer to the Mach-O loaded in memory, and a uint32_t offset which points to the start of the segment command within that data. This offset is relative to the start of the Mach-O, not the start of the load commands.

Ignoring the code that checks and sets up the mach_segment_command_64_t, it starts by calculating the offset of the first section. This is done by adding the sizeof() the segment command structure to the offset passed to the function.

The segment command has nsects, containing the number of sections placed after the command. So, we loop segment->nsects times and create a mach_section_64_t for each one. We can use memcpy() to copy the ssize bytes we need. We set the start point for the copy by adding sectoff to the data pointer, which results in it pointing to the start of the current section struct.

Calling h_slist_append() can be ignored. This is simply adding the section to a singly-linked list held in the libhelper mach_segment_info_t structure.

The last bit of interest here is to make sure to increment sectoff by the size of the mach_section_64_t struct, so that sectoff points to the next section structure.

If you are interested, please take a look at libhelper. It has a Mach-O parser that I wrote, and you'll find the example above.

Data

The actual data in a Mach-O, that is the instructions and variables, is stored after the Load Commands region. The way this region is used varies depending on the type of Mach-O.

For example, an executable - meaning a Mach-O with the filetype of MH_EXECUTE - would have the segment commands laying out the data region, and an LC_MAIN command specifying the offset of the entry point instruction to jump to once the binary is loaded. The Kernel will also start the Dynamic Linker specified in the LC_LOAD_DYLINKER command, which in turn loads and links any dylibs specified with LC_LOAD_DYLIB.

This entire region is mapped out by the segment commands. We can inspect this mapping with Mash, or Mach-O Shell, which is part of HTool. Loading the file, we can inspect a particular segment like so.

(Mash) p seg __TEXT
Segment: __TEXT			Offset: 0x100000000-0x100012000	Size: 73728 bytes
    Off: 0x100000be0-0x10000f4ef	59663 bytes		__TEXT.__text
    Off: 0x10000f4f0-0x10000f676	390 bytes		__TEXT.__stubs
    Off: 0x10000f678-0x10000f912	666 bytes		__TEXT.__stub_helper
    Off: 0x10000f912-0x100011fa2	9872 bytes		__TEXT.__cstring
    Off: 0x100011fa2-0x100011fa4	2 bytes		    __TEXT.__const
    Off: 0x100011fa4-0x100011ff8	84 bytes		__TEXT.__unwind_info

To print a segment, we use p seg __TEXT. This is the short version; if you prefer, print segment __TEXT would also work fine. The first line of the output displays the start and end addresses of the __TEXT segment, and its total size in bytes.

Underneath, slightly indented, are each of the sections contained within the segment. For example, we can see that the __TEXT.__stubs section is 390 bytes, and is located from 0x10000f4f0 to 0x10000f676.

Two things to note about these addresses: first, they are virtual memory addresses rather than offsets into the file, and second, they start at 0x100000000 rather than 0 because the __TEXT segment is preceded by a __PAGEZERO segment ranging from 0x000000000 to 0x100000000.

Summary

This is only an introduction to Mach-O files. I'd like to continue writing about them and maybe even write a Mach-O loader for Linux.

I hope I covered this fairly well; any feedback would be greatly appreciated. I aim to write these blog posts more often and hopefully they'll improve over time - both in quality and technical accuracy. For now, you can download Img4helper, which you can use to extract Apple Image4 files, from the Downloads page linked above. Libhelper sources are available here if you'd like to look at my Mach-O parser, and htool will be available soon.

You can contact me either via Twitter (@h3adsh0tzz) or email (me@h3adsh0tzz.com).