As a hook tool, fishhook is frequently used in iOS development. Understanding the basic principles of fishhook should be an essential skill for an advanced developer. Unfortunately, although I have explored the basic principles of fishhook many times before, I have always had a partial understanding of it. Recently, I have gained a new understanding of fishhook while sorting out relevant knowledge points. If you are like me and don’t know much about the principles of fishhook, then this article will be suitable for you.

It should be emphasized that this article does not start from the basics of using fishhook, nor does it walk through the source code line by line. It focuses only on a few confusing points. I recommend reading some introductory articles first to build up the basics, then coming back to this one.

Note 1: All code assumes a 64-bit CPU architecture; this will not be pointed out again below.

Note 2: Please download MachOView to open any Mach-O file for verification.

Note 3: Mach-O structure header file address

MachO file structure

0x01

The Mach-O file structure has three parts. The first part is the header, which describes the key information of the Mach-O file. Its data structure is as follows:

struct mach_header_64 {
	uint32_t	magic;		/* mach magic number identifier */
	cpu_type_t	cputype;	/* cpu specifier */
	cpu_subtype_t	cpusubtype;	/* machine specifier */
	uint32_t	filetype;	/* type of file */
	uint32_t	ncmds;		/* number of load commands */
	uint32_t	sizeofcmds;	/* the size of all the load commands */
	uint32_t	flags;		/* flags */
	uint32_t	reserved;	/* reserved */
};

As shown in the structure above, the key information of the Mach-O file header includes:

  • cputype: CPU type supported by the current file
  • filetype: Type of the current Mach-O file
  • ncmds: Number of Load Commands
  • sizeofcmds: Total size of all Load Commands

Each iOS executable file and dynamic library will be loaded into memory starting from the header.

0x02

The second part is the Load Commands. There are different types of Load Commands. Some describe different kinds of data (location in the file, size, type, permissions, etc.), and some simply record information, such as the path of dyld, the address of the main function, or the UUID. Commands that only record information generally have no corresponding content in the data area (Data).

Different types of Load Commands correspond to different structures, but the first two fields (cmd and cmdsize) are the same in all of them. Therefore, any Load Command can be cast to the load_command structure type.

With load_command, you can use each Load Command's cmdsize to calculate the position of the next Load Command.

struct load_command {
	uint32_t cmd;		/* type of load command */
	uint32_t cmdsize;	/* total size of command in bytes */
};

struct segment_command_64 { /* for 64-bit architectures */
	uint32_t	cmd;		/* LC_SEGMENT_64 */
	uint32_t	cmdsize;	/* includes sizeof section_64 structs */
	char		segname[16];	/* segment name */
	uint64_t	vmaddr;		/* memory address of this segment */
	uint64_t	vmsize;		/* memory size of this segment */
	uint64_t	fileoff;	/* file offset of this segment */
	uint64_t	filesize;	/* amount to map from the file */
	vm_prot_t	maxprot;	/* maximum VM protection */
	vm_prot_t	initprot;	/* initial VM protection */
	uint32_t	nsects;		/* number of sections in segment */
	uint32_t	flags;		/* flags */
};

Some articles describe load_command as the base class of all Load Commands, and you can think of it that way (although not at the level of code syntax).
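The cmdsize walk described above can be sketched in a few lines. This is only an illustration under stated assumptions: demo_load_command is a local copy of load_command so that the snippet compiles without the Apple SDK headers, and demo_find_command is a hypothetical helper name, not something fishhook provides.

```c
#include <stdint.h>
#include <string.h>

/* Local stand-in for struct load_command from <mach-o/loader.h>,
   redefined here so the sketch compiles on any platform. */
struct demo_load_command {
    uint32_t cmd;      /* type of load command */
    uint32_t cmdsize;  /* total size of command in bytes */
};

/* Walk `ncmds` load commands stored back to back at `cmds`.
   Returns the byte offset (from `cmds`) of the first command whose cmd
   field equals `wanted`, or -1 if there is no such command. */
long demo_find_command(const uint8_t *cmds, uint32_t ncmds, uint32_t wanted) {
    const uint8_t *cursor = cmds;
    for (uint32_t i = 0; i < ncmds; i++) {
        struct demo_load_command lc;
        /* Every command starts with cmd/cmdsize, whatever its real type is. */
        memcpy(&lc, cursor, sizeof lc);
        if (lc.cmd == wanted) {
            return (long)(cursor - cmds);
        }
        cursor += lc.cmdsize;  /* next command = current command + cmdsize */
    }
    return -1;
}
```

In a real Mach-O, cmds would be the address right after the mach_header_64 and ncmds would come from the header. A caller would then cast the command it found to the concrete structure (for example symtab_command).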

segment_command_64 is the Load Command type to focus on. It is generally used to describe the segments __PAGEZERO, __TEXT, __DATA, __DATA_CONST and __LINKEDIT, which contain the actual code and data (located in the Data area).

Therefore, a Load Command of type segment_command_64 is also called a segment.

A segment also contains an important type internally: the section. A section describes a group of data of the same type. For example, all code is located in the section named __text, and all OC class names are located in the section named __objc_classname; both sections are located in the __TEXT segment.

Key fields of segment_command_64:

  • segname: Name of the current segment, such as __PAGEZERO, __TEXT, __DATA, __DATA_CONST or __LINKEDIT
  • vmaddr: Virtual address of the current segment after it is loaded into memory (the ASLR offset must be added to get the real virtual address)
  • vmsize: Size of the virtual memory occupied by the current segment
  • fileoff: Offset of the current segment in the Mach-O file; actual position = starting address of the header + fileoff
  • filesize: Actual size of the current segment in the Mach-O file; because of memory alignment, vmsize >= filesize
  • nsects: Number of sections contained in the current segment_command_64

Address space layout randomization (ASLR) is not covered here; look it up if you need the details.

section has only one type, and its structure is defined as follows:

struct section_64 { /* for 64-bit architectures */
	char		sectname[16];	/* name of this section */
	char		segname[16];	/* segment this section goes in */
	uint64_t	addr;		/* memory address of this section */
	uint64_t	size;		/* size in bytes of this section */
	uint32_t	offset;		/* file offset of this section */
	uint32_t	align;		/* section alignment (power of 2) */
	uint32_t	reloff;		/* file offset of relocation entries */
	uint32_t	nreloc;		/* number of relocation entries */
	uint32_t	flags;		/* flags (section type and attributes)*/
	uint32_t	reserved1;	/* reserved (for offset or index) */
	uint32_t	reserved2;	/* reserved (for count or sizeof) */
	uint32_t	reserved3;	/* reserved */
};

Introduction to section key fields:

  • sectname: The name of the section, which can be __text, __const, __bss, etc.
  • segname: The name of the segment where the current section is located
  • addr: The location of the current section in virtual memory (the ASLR offset must be added to get the real virtual address)
  • size: The size occupied by the current section (disk size and memory size)
  • reserved1: Different section types have different meanings, generally representing offset and index values.
  • flags: Type & attribute tag bit, fishhook uses this tag to find lazy loading table & non-lazy loading table

Note that only Load Commands of type segment_command_64 contain sections.
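To make the role of the flags field more concrete, here is a minimal sketch of how a section can be recognized as a lazy or non-lazy loading table. The three constants repeat the values of SECTION_TYPE, S_NON_LAZY_SYMBOL_POINTERS and S_LAZY_SYMBOL_POINTERS from <mach-o/loader.h>; the DEMO_ prefix and the function name are my own, chosen so the snippet compiles without the Apple headers.

```c
#include <stdint.h>

/* Values as defined in <mach-o/loader.h>, repeated under a DEMO_ prefix. */
#define DEMO_SECTION_TYPE               0x000000ffu  /* low 8 bits = section type */
#define DEMO_S_NON_LAZY_SYMBOL_POINTERS 0x6u
#define DEMO_S_LAZY_SYMBOL_POINTERS     0x7u

/* Returns 1 if `flags` (the flags field of a section_64) marks a lazy or
   non-lazy symbol pointer section, 0 otherwise. */
int demo_is_symbol_pointer_section(uint32_t flags) {
    uint32_t type = flags & DEMO_SECTION_TYPE;  /* strip the attribute bits */
    return type == DEMO_S_LAZY_SYMBOL_POINTERS ||
           type == DEMO_S_NON_LAZY_SYMBOL_POINTERS;
}
```

Checking the type bits rather than the section name means the check does not depend on what the linker happens to call the section.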

0x03

Finally, there is the data area (Data), which holds the code and data contained in the Mach-O file; all of it is organized and arranged according to the descriptions in the Load Commands. The data or code described by segment_command_64 is organized in the Data area with the section as the smallest unit, and this content makes up the majority. The data described by segments, plus the data described by other types of Load Commands (which actually lies inside the __LINKEDIT segment), together form the data area.

Note: Although the segment named __LINKEDIT (type: segment_command_64) contains 0 sections, a calculation based on its fileoff and filesize shows the following:

The file range covered by the __LINKEDIT segment actually includes the ranges pointed to by other Load Commands (including but not limited to LC_DYLD_INFO_ONLY, LC_FUNCTION_STARTS, LC_SYMTAB, LC_DYSYMTAB and LC_CODE_SIGNATURE).

The derivation process is as follows:

As shown in the Load Commands, the offset of __LINKEDIT in the Mach-O file is 0x394000 and its size is 0x5B510. The starting address of the Mach-O header is 0x41C000. Therefore, the address range of __LINKEDIT in the Mach-O file is {header + fileoff, header + fileoff + filesize}. Substituting the values gives {0x41C000 + 0x394000, 0x41C000 + 0x394000 + 0x5B510}, that is, the address range {0x7B0000, 0x80B510}.
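The same arithmetic, written as a small helper so it can be re-run against any other segment. The structure and function names are hypothetical, made up for this illustration.

```c
#include <stdint.h>

/* A file range: `start` is the first byte, `end` is one past the last byte. */
struct demo_range {
    uint64_t start;
    uint64_t end;
};

/* Range covered by a segment in the file:
   {header + fileoff, header + fileoff + filesize}. */
struct demo_range demo_segment_file_range(uint64_t header,
                                          uint64_t fileoff,
                                          uint64_t filesize) {
    struct demo_range r;
    r.start = header + fileoff;
    r.end = r.start + filesize;
    return r;
}
```

Calling demo_segment_file_range(0x41C000, 0x394000, 0x5B510) reproduces the range {0x7B0000, 0x80B510} derived above.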

As MachOView shows, the first address after the end of the last section of the preceding segment is the start of this range, and the end of the file is the end of the range calculated above (the last data entry occupies 16 bytes).

So it can be understood this way: the Load Command named __LINKEDIT is a virtual Command. It indicates the total range, in both the file and memory, of the data described by Commands such as LC_DYLD_INFO_ONLY, LC_FUNCTION_STARTS, LC_SYMTAB, LC_DYSYMTAB and LC_CODE_SIGNATURE, while these Commands describe their own individual ranges. Judging by the address ranges, __LINKEDIT is the parent of these Commands in the data area, even though it has no sections of its own.

The four key tables of fishhook

The implementation of fishhook involves four “tables”. Once you understand the relationship between these four tables, you understand how fishhook works, and it will be hard to forget.

  • Symbol Table
  • Indirect Symbol Table
  • String Table
  • Lazy loading and non-lazy loading tables (__la_symbol_ptr/__nl_symbol_ptr)

Symbol table & string table

The symbol table (Symbol Table) and string table (String Table) are described in the Load Command of type LC_SYMTAB.

struct symtab_command {
	uint32_t	cmd;		/* LC_SYMTAB */
	uint32_t	cmdsize;	/* sizeof(struct symtab_command) */

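	uint32_t	symoff;		/* file offset of the symbol table */
	uint32_t	nsyms;		/* number of symbol table entries */

	uint32_t	stroff;		/* file offset of the string table */
	uint32_t	strsize;	/* size of the string table in bytes */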
};

The entries of the symbol table (Symbol Table) are represented by the nlist_64 data structure:

struct nlist_64 {
    union {
        uint32_t  n_strx; /* index into the string table */
    } n_un;
    uint8_t n_type;        /* type flag, see below */
    uint8_t n_sect;        /* section number or NO_SECT */
    uint16_t n_desc;       /* see <mach-o/stab.h> */
    uint64_t n_value;      /* value of this symbol (or stab offset) */
};

The first member of nlist_64, n_un, gives the position of the current symbol's name relative to the start of the string table (String Table). The other members do not matter here.

The string table (String Table) is a sequence of ASCII characters in which each string is terminated by '\0'.
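As a small illustration of how n_strx and the string table fit together, the sketch below resolves the name of one symbol table entry. demo_nlist_64 is a simplified local copy of nlist_64 (the n_un union is flattened to its single n_strx member), and the two helper functions are hypothetical names used only here.

```c
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for struct nlist_64 from <mach-o/nlist.h>. */
struct demo_nlist_64 {
    uint32_t n_strx;   /* offset of the name in the string table */
    uint8_t  n_type;
    uint8_t  n_sect;
    uint16_t n_desc;
    uint64_t n_value;
};

/* The name is the '\0'-terminated string that starts n_strx bytes into
   the string table. */
const char *demo_symbol_name(const struct demo_nlist_64 *sym,
                             const char *strtab) {
    return strtab + sym->n_strx;
}

/* Returns 1 if the symbol's name equals `name`, 0 otherwise. */
int demo_symbol_is(const struct demo_nlist_64 *sym, const char *strtab,
                   const char *name) {
    return strcmp(demo_symbol_name(sym, strtab), name) == 0;
}
```

Keep in mind that C symbol names in a Mach-O string table carry a leading underscore, so printf is stored as _printf.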

Indirect symbol table

The indirect symbol table (Indirect Symbol Table) is described in the Load Command of type LC_DYSYMTAB, whose structure is dysymtab_command.

struct dysymtab_command {
    uint32_t cmd;	/* LC_DYSYMTAB */
    uint32_t cmdsize;	/* sizeof(struct dysymtab_command) */

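    /* ... other fields omitted ... */
    
    uint32_t indirectsymoff; /* file offset of the indirect symbol table */
    uint32_t nindirectsyms;  /* number of indirect symbol table entries */

    /* ... other fields omitted ... */
};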

The indirect symbol table is essentially an array of 32-bit integers (uint32_t). The value stored in each element is the index of the corresponding symbol in the symbol table (Symbol Table).

Lazy loading and non-lazy loading tables

The lazy loading and non-lazy loading tables are sections located in the __DATA/__DATA_CONST segments.

Lazy loading and non-lazy loading tables have the following characteristics:

  • When the current executable or dynamic library calls a symbol from an external dynamic library, execution jumps to the address stored for that symbol in the lazy loading or non-lazy loading table.
  • An entry in the lazy loading table is bound when the symbol is called for the first time. Before binding, it points to a stub function, which completes the symbol binding. After binding, it holds the real code address of the symbol.
  • The non-lazy loading table is bound by dyld when the current Mach-O is loaded into memory. Before binding, an entry holds 0x00; after binding, it also holds the real code address of the symbol.

Key point: what fishhook does is change the function addresses saved in the lazy loading and non-lazy loading tables.
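The behaviour of a single lazy loading table entry can be imitated in plain C with a function pointer. This is a toy model, not how dyld is implemented: demo_lazy_slot plays the role of one table entry, demo_bind_then_call plays the role of the stub path, and every name in it is invented for the illustration.

```c
typedef int (*demo_fn)(int);

/* Stands in for the real function inside an external dynamic library. */
static int demo_real_add_one(int x) { return x + 1; }

/* Stands in for the stub path that binds the symbol on first use. */
static int demo_bind_then_call(int x);

/* One pretend table entry. Before the first call it points at the binder,
   like an entry of the lazy loading table that has not been bound yet. */
demo_fn demo_lazy_slot = demo_bind_then_call;
int demo_bind_count = 0;

static int demo_bind_then_call(int x) {
    demo_bind_count++;
    demo_lazy_slot = demo_real_add_one;  /* binding: store the real address */
    return demo_lazy_slot(x);
}

/* Every call to the external function goes through the table entry. */
int demo_call_external(int x) { return demo_lazy_slot(x); }

/* What a hook does: remember the current target, then overwrite the entry. */
demo_fn demo_rebind(demo_fn replacement) {
    demo_fn previous = demo_lazy_slot;
    demo_lazy_slot = replacement;
    return previous;
}

/* A pretend custom function installed by the hook. */
int demo_hooked(int x) { return x * 100; }
```

Because callers only ever read the entry, overwriting it redirects every later call, while the returned previous pointer still lets the hook call the original function.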

Since the lazy loading and non-lazy loading tables contain no symbol names, we cannot find the position of the target function in them directly, and therefore cannot replace it. Instead, we have to use the relationship between the indirect symbol table (Indirect Symbol Table), the symbol table (Symbol Table) and the string table (String Table) to find the symbol name that corresponds to each table entry and thereby confirm its position.

How to find the target function address

Here is the schematic diagram provided by fishhook. Try to work through it yourself before reading on:

When referencing an external function, you need to use the symbol name to determine the location of the function address in the lazy loading and non-lazy loading tables. The specific process is as follows:

  1. The index of a function address in the lazy loading or non-lazy loading table corresponds to a position in the indirect symbol table (Indirect Symbol Table). Taking the i-th function address in the table as an example, the pseudo formula is: index in the indirect symbol table = offset specified by the lazy loading or non-lazy loading table (the reserved1 field of its section) + i
  2. The indirect symbol table is an array of 32-bit integers. Use the index from the previous step to read the array; the value is the index of the symbol in the symbol table. Pseudo formula: index in the symbol table = (start address of the indirect symbol table)[index in the indirect symbol table]
  3. The symbol table stores entries of type nlist_64, and the value of the first field (n_un.n_strx) is the offset of the symbol name in the string table. Pseudo formula: offset of the symbol name in the string table = (start address of the symbol table)[index in the symbol table].n_un.n_strx
  4. With this offset, read the corresponding string (terminated by \0) from the string table. Pseudo formula: name of the i-th function in the lazy loading or non-lazy loading table = start address of the string table + offset of the symbol name in the string table

At this point, substituting from the bottom up combines the four pseudo formulas into one:

name of the i-th function in the lazy loading or non-lazy loading table = start address of the string table + (start address of the symbol table)[(start address of the indirect symbol table)[reserved1 + i]].n_un.n_strx

Now, what is not known in the above formula are the three starting addresses:

  • The starting address of the string table (String Table)
  • The starting address of the symbol table (Symbol Table)
  • Indirect Symbol Table start address

The number of function addresses in a lazy loading or non-lazy loading table can be calculated from the size field of the corresponding section (see the description of the section_64 structure above): section->size / sizeof(void *).

The relationship between fishhook's four tables should be clear by now. All fishhook does is use this formula to find the external function in the lazy loading and non-lazy loading tables whose name matches the target function. Once a match is found, the address stored there is replaced with the address of the custom function.
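Putting the four tables together, the search-and-replace loop can be sketched as follows. This is a simplified model that works on plain arrays, not fishhook's actual source: demo_nlist_64 is a flattened copy of nlist_64, demo_rebind_in_section is a hypothetical name, and details a real implementation has to deal with (special entries in the indirect symbol table, write protection on __DATA_CONST, and so on) are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for struct nlist_64 (n_un flattened to n_strx). */
struct demo_nlist_64 {
    uint32_t n_strx;
    uint8_t  n_type;
    uint8_t  n_sect;
    uint16_t n_desc;
    uint64_t n_value;
};

/* Search one lazy or non-lazy loading table for `name` and, on a match,
   store `replacement` in that entry.
     slots      start of the table in memory
     slot_count section->size / sizeof(void *)
     reserved1  the section's reserved1 field (its first index in the
                indirect symbol table)
   Returns the pointer that was stored before, or NULL if nothing matched. */
void *demo_rebind_in_section(void **slots, uint64_t slot_count,
                             uint32_t reserved1,
                             const uint32_t *indirect_symtab,
                             const struct demo_nlist_64 *symtab,
                             const char *strtab,
                             const char *name, void *replacement) {
    for (uint64_t i = 0; i < slot_count; i++) {
        /* Steps 1 and 2: table index -> indirect symbol table -> symbol index */
        uint32_t symtab_index = indirect_symtab[reserved1 + i];
        /* Step 3: symbol table entry -> offset of the name */
        uint32_t strtab_offset = symtab[symtab_index].n_strx;
        /* Step 4: offset -> '\0'-terminated name in the string table */
        const char *symbol_name = strtab + strtab_offset;
        if (strcmp(symbol_name, name) == 0) {
            void *previous = slots[i];
            slots[i] = replacement;  /* the actual "hook" */
            return previous;
        }
    }
    return NULL;
}
```

The returned previous pointer is what lets a hook keep calling the original implementation after the replacement.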

What is linkedit_base

If other factors are not considered, the starting addresses of the above three tables can actually be obtained directly through the Mach-O header address + the corresponding offset. Take the Symbol Table as an example:

The starting address of the Mach-O header is 0x41C000, as mentioned above, and symoff is 0x3BECD8. Calculate 0x41C000 + 0x3BECD8 = 0x7DACD8, then look at this address in MachOView. It is indeed the location of the symbol table in the file:

This derivation also proves that symtab_command->symoff and symtab_command->stroff are offsets relative to the Mach-O header, not offsets relative to __LINKEDIT.

The way to calculate the starting address of the symbol table in the fishhook source code is:

nlist_t *symtab = (nlist_t *)(linkedit_base + symtab_cmd->symoff);

As a result, many blog posts say that linkedit_base is the base address of the __LINKEDIT segment and that symoff is an offset relative to the __LINKEDIT segment. This is completely wrong. What can be stated clearly here is:

  • linkedit_base is not the starting address of the __LINKEDIT segment in memory.
  • linkedit_base is not the starting address of the __LINKEDIT segment in memory.
  • linkedit_base is not the starting address of the __LINKEDIT segment in memory.

linkedit_base is calculated in fishhook as follows:

uintptr_t linkedit_base = (uintptr_t)slide + linkedit_segment->vmaddr - linkedit_segment->fileoff;

Ignoring slide, the random offset introduced by ASLR, this becomes:

linkedit_base = linkedit_segment->vmaddr - linkedit_segment->fileoff;

linkedit_segment->vmaddr: the starting position of the __LINKEDIT segment in “virtual memory”

linkedit_segment->fileoff: the starting position of the __LINKEDIT segment in the “file”

So what is the significance of subtracting these two?

To answer this question first look at the information given by MachOView:

As shown in the figure above, several facts can be read off from the segments (marked by red boxes) that precede the __LINKEDIT segment:

  • The starting position of each segment in the “Mach-O file” equals the previous segment's File Offset + File Size, and the first segment starts from 0
  • In the same way, the position of each segment in “virtual memory” equals the previous segment's VM Address + VM Size, and the first segment starts from 0
  • For __PAGEZERO and __DATA, VM Size > File Size, while the two values are equal in the other segments. This means that after these two segments are loaded into virtual memory, some “vacancies” appear (caused by memory alignment).
  • __PAGEZERO occupies no storage space in the Mach-O file, but it occupies 16K of virtual memory.

Graphically represented as:

Therefore the meaning of linkedit_base = linkedit_segment->vmaddr - linkedit_segment->fileoff is:

  • The extra space (vacancies) left by the segments before __LINKEDIT, because of memory alignment, after the Mach-O is loaded into memory.
  • The extra space (vacancies) left by the segments before __LINKEDIT, because of memory alignment, after the Mach-O is loaded into memory.
  • The extra space (vacancies) left by the segments before __LINKEDIT, because of memory alignment, after the Mach-O is loaded into memory.

This is the real physical meaning of linkedit_base; any other definition is wrong.

For __LINKEDIT itself, VM Size == File Size, which means the three tables it contains (the symbol table, the string table and the indirect symbol table) are laid out in memory exactly as in the file, with no gaps between them. Therefore their offsets in the file + linkedit_base give their actual locations in memory.

  // Symbol table
  nlist_t *symtab = (nlist_t *)(linkedit_base + symtab_cmd->symoff);
  // String table
  char *strtab = (char *)(linkedit_base + symtab_cmd->stroff);
  // Indirect symbol table
  uint32_t *indirect_symtab = (uint32_t *)(linkedit_base + dysymtab_cmd->indirectsymoff);
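To check the arithmetic behind linkedit_base, here is a small numeric sketch. The function names are hypothetical. In the usage note, fileoff and symoff are the values used earlier in this article, while the slide and vmaddr values are invented purely for the example.

```c
#include <stdint.h>

/* Same expression as in fishhook: slide + vmaddr - fileoff of __LINKEDIT. */
uint64_t demo_linkedit_base(uint64_t slide, uint64_t vmaddr, uint64_t fileoff) {
    return slide + vmaddr - fileoff;
}

/* Memory address of a table, given its offset in the file
   (symoff, stroff or indirectsymoff). */
uint64_t demo_table_address(uint64_t linkedit_base, uint64_t table_fileoff) {
    return linkedit_base + table_fileoff;
}
```

With an assumed slide of 0x10000 and an assumed vmaddr of 0x39C000, fileoff 0x394000 gives linkedit_base = 0x18000, and symoff 0x3BECD8 gives a symbol table address of 0x3D6CD8. That equals slide + vmaddr + (symoff - fileoff), the start of __LINKEDIT in memory plus the table's distance from the start of __LINKEDIT in the file, which only holds because __LINKEDIT has no internal gaps.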

Finally

fishhook has many applications in APM, anti-reverse-engineering, performance optimization and other areas. In essence, fishhook is an in-depth application of the Mach-O file structure, and I believe the structure of Mach-O files will be easier to read once you understand these principles. Other applications of the Mach-O file structure include symbol table restoration. In the next article, I will go through the specific process of symbol table restoration with you (although the folder has not been created yet😂).
