As a hook tool, fishhook is frequently used in iOS development. Understanding the basic principles of fishhook should be an essential skill for an advanced developer. Unfortunately, although I have explored the basic principles of fishhook many times before, I have always had a partial understanding of it. Recently, I have gained a new understanding of fishhook while sorting out relevant knowledge points. If you are like me and don’t know much about the principles of fishhook, then this article will be suitable for you.
It should be emphasized that this article does not start from the basics of using fishhook, nor does it walk through the source code line by line. It only elaborates on some confusing points. It is recommended to read a few related articles first to pick up the basics, then come back to this article.
Note 1: All code assumes a 64-bit CPU architecture; this will not be repeated below.
Note 2: Please download MachOView to open any Mach-O file for verification.
Note 3: Mach-O structure header file address
MachO file structure
0x01
The Mach-O file structure has three parts. The first part is the header, which describes the key information of the Mach-O file. Its data structure is as follows:
struct mach_header_64 {
    uint32_t magic; /* mach magic number identifier */
    cpu_type_t cputype; /* cpu specifier */
    cpu_subtype_t cpusubtype; /* machine specifier */
    uint32_t filetype; /* type of file */
    uint32_t ncmds; /* number of load commands */
    uint32_t sizeofcmds; /* the size of all the load commands */
    uint32_t flags; /* flags */
    uint32_t reserved; /* reserved */
};
As shown in the structure above, the key information of the Mach-O file header includes:
- cputype: CPU type supported by the current file
- filetype: type of the current Mach-O file
- ncmds: number of Load Commands
- sizeofcmds: total size of all Load Commands
Each iOS executable file and dynamic library will be loaded into memory starting from the header.
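As a rough illustration of reading this header, the sketch below checks the magic number of a buffer and locates the first Load Command, which sits directly after the header. The struct `header64` and the helper names are made up for this example (a local mirror of `mach_header_64` so that it builds without `<mach-o/loader.h>`), and it assumes the file has the same byte order as the host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of mach_header_64 (hypothetical name, same field layout). */
struct header64 {
    uint32_t magic;
    int32_t  cputype;
    int32_t  cpusubtype;
    uint32_t filetype;
    uint32_t ncmds;
    uint32_t sizeofcmds;
    uint32_t flags;
    uint32_t reserved;
};

#define MAGIC_64 0xfeedfacfu /* value of MH_MAGIC_64 */

/* Returns 1 and fills *out if buf starts with a 64-bit Mach-O header
 * in host byte order, 0 otherwise. */
static int read_header64(const void *buf, size_t len, struct header64 *out)
{
    assert(buf != NULL && out != NULL);
    if (len < sizeof *out)
        return 0;
    memcpy(out, buf, sizeof *out); /* copy to avoid unaligned reads */
    return out->magic == MAGIC_64;
}

/* The Load Commands begin immediately after the header. */
static const void *first_load_command(const void *header)
{
    return (const uint8_t *)header + sizeof(struct header64);
}
```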
0x02
The second part is the Load Commands. There are different types of Load Commands. Some describe different kinds of data (their location in the file, size, type, permissions, and so on), and some simply record information, such as the path of dyld, the address of the main function, the UUID, etc. Commands that only record information generally have no counterpart in the data area (Data).
Different types of Load Commands correspond to different structures, but the first two fields (cmd and cmdsize) of all Load Commands are the same. Therefore, any Load Command can be cast to the load_command structure type.
With load_command, you can compute the position of the next Load Command from the cmdsize of the current one.
struct load_command {
    uint32_t cmd; /* type of load command */
    uint32_t cmdsize; /* total size of command in bytes */
};

struct segment_command_64 { /* for 64-bit architectures */
    uint32_t cmd; /* LC_SEGMENT_64 */
    uint32_t cmdsize; /* includes sizeof section_64 structs */
    char segname[16]; /* segment name */
    uint64_t vmaddr; /* memory address of this segment */
    uint64_t vmsize; /* memory size of this segment */
    uint64_t fileoff; /* file offset of this segment */
    uint64_t filesize; /* amount to map from the file */
    vm_prot_t maxprot; /* maximum VM protection */
    vm_prot_t initprot; /* initial VM protection */
    uint32_t nsects; /* number of sections in segment */
    uint32_t flags; /* flags */
};
Some articles describe load_command as the base class of all Load Commands. That is a reasonable way to think about it (although it is not true at the level of code syntax).
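The cmd/cmdsize walk described above can be sketched as follows. This is only an illustration: `lc_prefix` and `find_load_command` are names invented here, the command buffer is assumed to be in host byte order, and the bounds checks are an addition of this sketch rather than something fishhook itself performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct lc_prefix {      /* mirrors struct load_command */
    uint32_t cmd;
    uint32_t cmdsize;
};

/* Scans ncmds commands stored back to back in cmds[0..len) and returns
 * the byte offset of the first one whose cmd equals wanted, or -1. */
static long find_load_command(const uint8_t *cmds, size_t len,
                              uint32_t ncmds, uint32_t wanted)
{
    size_t off = 0;
    assert(cmds != NULL);
    for (uint32_t i = 0; i < ncmds; i++) {
        struct lc_prefix lc;
        if (len - off < sizeof lc)
            return -1;                  /* truncated buffer */
        memcpy(&lc, cmds + off, sizeof lc);
        if (lc.cmdsize < sizeof lc || lc.cmdsize > len - off)
            return -1;                  /* malformed cmdsize */
        if (lc.cmd == wanted)
            return (long)off;
        off += lc.cmdsize;              /* step to the next command */
    }
    return -1;
}
```

Once the offset is known, the bytes at that position can be reinterpreted as the specific structure for that command type, which is the "base class" idea in practice.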
segment_command_64 is the most important type of Load Command. It is generally used to describe the __PAGEZERO, __TEXT, __DATA, __DATA_CONST, and __LINKEDIT regions, which contain the actual code and data (located in the Data area).
For this reason, a Load Command of type segment_command_64 is also called a segment.
A segment also contains an important inner type: the section. A section describes a group of data of the same kind. For example, all code logic is located in the section named __text, all OC class names are located in the section named __objc_classname, and both sections belong to the __TEXT segment.
Key fields of segment_command_64:
- segname: name of the current segment, which can be one of __PAGEZERO, __TEXT, __DATA, __DATA_CONST, and __LINKEDIT
- vmaddr: virtual address of the current segment after it is loaded into memory (the ASLR offset must be added to get the real virtual address)
- vmsize: size of the virtual memory occupied by the current segment
- fileoff: offset of the current segment in the Mach-O file; actual position = start address of the header + fileoff
- filesize: actual size of the current segment in the Mach-O file; because of memory alignment, vmsize >= filesize
- nsects: number of sections contained in the current segment_command_64
Address space layout randomization (ASLR) is not covered in detail here; please look up the relevant material yourself.
section has only one type, and its structure is defined as follows:
struct section_64 { /* for 64-bit architectures */
    char sectname[16]; /* name of this section */
    char segname[16]; /* segment this section goes in */
    uint64_t addr; /* memory address of this section */
    uint64_t size; /* size in bytes of this section */
    uint32_t offset; /* file offset of this section */
    uint32_t align; /* section alignment (power of 2) */
    uint32_t reloff; /* file offset of relocation entries */
    uint32_t nreloc; /* number of relocation entries */
    uint32_t flags; /* flags (section type and attributes)*/
    uint32_t reserved1; /* reserved (for offset or index) */
    uint32_t reserved2; /* reserved (for count or sizeof) */
    uint32_t reserved3; /* reserved */
};
Key fields of section_64:
- sectname: name of the section, for example __text, __const, __bss
- segname: name of the segment that contains the current section
- addr: location of the current section in virtual memory (the ASLR offset must be added to get the real virtual address)
- size: size occupied by the current section (both on disk and in memory)
- reserved1: its meaning depends on the section type; it generally holds an offset or an index value
- flags: type and attribute bits; fishhook uses them to find the lazy loading table and the non-lazy loading table
Note that segment_command_64 is the only type of Load Command that contains sections.
0x03
Finally, there is the data area (Data), which holds the code and data contained in the Mach-O file; all of it is laid out according to the descriptions in the Load Commands. The code or data described by segment_command_64 is organized in the data area with the section as the smallest unit, and this accounts for most of the content. These segments, together with the data inside __LINKEDIT (itself a segment) that is described by other types of Load Command, make up the data area.
Note: Although the segment named __LINKEDIT (type: segment_command_64) contains 0 sections, computing its range from fileoff and filesize shows that the file range covered by the __LINKEDIT segment actually includes the ranges pointed to by other Load Commands (including but not limited to: LC_DYLD_INFO_ONLY, LC_FUNCTION_STARTS, LC_SYMTAB, LC_DYSYMTAB, LC_CODE_SIGNATURE).
The derivation process is as follows:
As shown above in Load Commands, the offset of __LINKEDIT in the Mach-O file is 0x394000 and its size is 0x5B510. The starting address of the Mach-O header is 0x41C000. Therefore, the address range of __LINKEDIT in the Mach-O file is {header + fileoff, header + fileoff + filesize}. Substituting the values gives {0x41C000 + 0x394000, 0x41C000 + 0x394000 + 0x5B510}, that is, the final address range {0x7B0000, 0x80B510}.
In the figure below, the first address after the end of the last section of the preceding segment matches the start of the range computed above, and the end address of the file matches the end of that range (the last row of data occupies 16 bytes).
So it can be understood this way: the Load Command named __LINKEDIT is a virtual Command. It indicates the total range, in both the file and memory, of the data described by Commands such as LC_DYLD_INFO_ONLY, LC_FUNCTION_STARTS, LC_SYMTAB, LC_DYSYMTAB, and LC_CODE_SIGNATURE, while each of those Commands describes its own range. Judging by the address ranges, __LINKEDIT is the parent of these Commands in the data area, even though it has no sections of its own.
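The range arithmetic above, and the claim that __LINKEDIT encloses the data of other Commands, can be checked with a few lines of C. The helper names are invented for this sketch. The numbers are the ones used in this article, including the symbol table offset 0x3BECD8 that appears in the linkedit_base discussion later on.

```c
#include <assert.h>
#include <stdint.h>

struct file_range { uint64_t start, end; }; /* half-open: [start, end) */

/* {header + fileoff, header + fileoff + filesize} */
static struct file_range segment_file_range(uint64_t header, uint64_t fileoff,
                                            uint64_t filesize)
{
    struct file_range r = { header + fileoff, header + fileoff + filesize };
    return r;
}

/* 1 if [addr, addr + size) lies entirely inside outer, 0 otherwise. */
static int range_contains(struct file_range outer, uint64_t addr, uint64_t size)
{
    assert(outer.start <= outer.end);
    return addr >= outer.start && addr <= outer.end &&
           size <= outer.end - addr;
}
```

With header = 0x41C000, fileoff = 0x394000 and filesize = 0x5B510 this yields {0x7B0000, 0x80B510}, and the symbol table start 0x41C000 + 0x3BECD8 = 0x7DACD8 falls inside that range.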
The four key tables of fishhook
The implementation of fishhook involves four "tables". Once you understand how these four tables relate to each other, you understand how fishhook works, and you are unlikely to forget it.
- Symbol Table
- Indirect Symbol Table
- String Table
- Lazy loading and non-lazy loading tables (__la_symbol_ptr/__nl_symbol_ptr)
Symbol table & string table
The symbol table (Symbol Table) and the string table (String Table) are described by the Load Command of type LC_SYMTAB.
struct symtab_command {
    uint32_t cmd; /* LC_SYMTAB */
    uint32_t cmdsize; /* sizeof(struct symtab_command) */
    uint32_t symoff; /* symbol table offset */
    uint32_t nsyms; /* number of symbol table entries */
    uint32_t stroff; /* string table offset */
    uint32_t strsize; /* string table size in bytes */
};
Each entry of the symbol table (Symbol Table) is represented by the nlist_64 data structure:
struct nlist_64 {
    union {
        uint32_t n_strx; /* index into the string table */
    } n_un;
    uint8_t n_type; /* type flag, see below */
    uint8_t n_sect; /* section number or NO_SECT */
    uint16_t n_desc; /* see <mach-o/stab.h> */
    uint64_t n_value; /* value of this symbol (or stab offset) */
};
The first member of nlist_64, n_un, gives the relative position of the current symbol's name in the string table (String Table). The other members are not relevant here.
The string table (String Table) is a run of ASCII character data in which each string is terminated by '\0'.
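The relationship between a symbol table entry and the string table can be sketched like this. `sym64` is a simplified local mirror of `nlist_64` (the union is flattened to a plain `n_strx` field) and `symbol_name` is a hypothetical helper. The range checks are an addition for safety and are not part of the format itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct sym64 {          /* simplified mirror of struct nlist_64 */
    uint32_t n_strx;    /* n_un.n_strx in the real header */
    uint8_t  n_type;
    uint8_t  n_sect;
    uint16_t n_desc;
    uint64_t n_value;
};

/* Returns the NUL-terminated name of sym inside strtab, or NULL if n_strx
 * is out of range or the string is not terminated within strsize. */
static const char *symbol_name(const struct sym64 *sym, const char *strtab,
                               uint32_t strsize)
{
    assert(sym != NULL && strtab != NULL);
    if (sym->n_strx >= strsize)
        return NULL;
    if (memchr(strtab + sym->n_strx, '\0', strsize - sym->n_strx) == NULL)
        return NULL;
    return strtab + sym->n_strx;
}
```

For a string table holding `"\0_printf\0_malloc"`, an entry with n_strx = 1 resolves to `_printf` and one with n_strx = 9 resolves to `_malloc`.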
Indirect symbol table
The indirect symbol table (Indirect Symbol Table) is described by the Load Command of type LC_DYSYMTAB, whose structure is dysymtab_command.
struct dysymtab_command {
    uint32_t cmd; /* LC_DYSYMTAB */
    uint32_t cmdsize; /* sizeof(struct dysymtab_command) */
    /* ... */
    uint32_t indirectsymoff; /* file offset to the indirect symbol table */
    uint32_t nindirectsyms; /* number of indirect symbol table entries */
    /* ... */
};
The indirect symbol table is essentially an array of 32-bit integers. The value stored in each element is the position (index) of the corresponding symbol in the symbol table (Symbol Table).
Lazy loading and non-lazy loading tables
The lazy loading and non-lazy loading tables are located in sections under the __DATA/__DATA_CONST segments.
Lazy loading and non-lazy loading tables have the following characteristics:
- When the current executable file or dynamic library references a symbol from an external dynamic library, a call to that symbol jumps to the address stored in the lazy loading or non-lazy loading table.
- The lazy loading table is bound when the symbol is called for the first time. Before binding, it points to the stub function. The stub function completes the symbol binding. After the binding is completed, it is the real code address of the corresponding symbol.
- The non-lazy loading table is bound by dyld when the current Mach-O is loaded into the memory. The value in the non-lazy loading table before binding is 0x00. After binding, it is also the real code address of the corresponding symbol.
Key point: what fishhook does is change the function addresses saved in the lazy loading and non-lazy loading tables.
Since the lazy loading and non-lazy loading tables contain no symbol name information, we cannot locate the target function in those tables by looking at the tables alone, and therefore cannot replace it. Instead, we have to use the relationship between the indirect symbol table (Indirect Symbol Table), the symbol table (Symbol Table), and the string table (String Table) to find the symbol name that belongs to each table entry and so confirm its position.
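The effect of changing a saved address can be modelled with an ordinary array of function pointers. Everything in this sketch is a stand-in: `slots` plays the role of an already bound table, `call_import` plays the role of code that calls through it, and `rebind_slot` is a hypothetical helper. A real implementation additionally has to cope with memory protection on the pages that hold the table, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*unary_fn)(int);

static int real_impl(int x)   { return x + 1; }   /* "library" function */
static int hooked_impl(int x) { return x * 100; } /* custom replacement */

/* Stand-in for a bound lazy/non-lazy table with a single slot. */
static unary_fn slots[1] = { real_impl };

/* Callers never name real_impl directly; they go through the slot. */
static int call_import(int x) { return slots[0](x); }

/* Overwrites slot i and returns the previous value so the original
 * function stays reachable; returns NULL if i is out of range. */
static unary_fn rebind_slot(unary_fn *table, size_t count, size_t i,
                            unary_fn replacement)
{
    unary_fn original;
    assert(table != NULL);
    if (i >= count)
        return NULL;
    original = table[i];
    table[i] = replacement;
    return original;
}
```

After `rebind_slot(slots, 1, 0, hooked_impl)`, every call through `call_import` lands in `hooked_impl`, while the returned pointer still reaches `real_impl`.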
How to find the target function address
Here is a schematic diagram provided by fishhook. You can understand it by yourself before reading below:
When referencing an external function, the symbol name is needed to determine where the function address sits in the lazy loading and non-lazy loading tables. The process is as follows:
- The index of a function address in the lazy loading or non-lazy loading table corresponds to a position in the indirect symbol table (Indirect Symbol Table). Taking the i-th function address in the table as an example, the relationship can be expressed by a pseudo-formula:
  position in the indirect symbol table = start address of the indirect symbol table + offset specified by the lazy loading or non-lazy loading table (the reserved1 field of its section) + i
- The indirect symbol table stores an array of 32-bit integers. Reading the element at the "position in the indirect symbol table" computed in the previous step gives the position of the symbol in the symbol table. Equivalent pseudo-formula:
  offset in the symbol table = value stored at (position in the indirect symbol table)
- The entries stored in the symbol table are of type nlist_64, and the value of the first field (n_un.n_strx) is the offset of the current symbol's name in the string table. Equivalent pseudo-formula:
  offset of the symbol name in the string table = (start address of the symbol table + offset in the symbol table).n_un.n_strx
- Using the offset obtained above, go to the string table and read the corresponding string (terminated by \0). Equivalent pseudo-formula:
  name of the i-th function in the lazy loading or non-lazy loading table = start address of the string table + offset of the symbol name in the string table
At this point, substituting from the bottom up and combining the pseudo-formulas gives:
  name of the i-th function in the lazy loading or non-lazy loading table = start address of the string table + (start address of the symbol table + value stored at (start address of the indirect symbol table + offset specified by the lazy loading or non-lazy loading table (the reserved1 field of its section) + i)).n_un.n_strx
The only unknowns in the formula above are the three start addresses:
- The start address of the string table (String Table)
- The start address of the symbol table (Symbol Table)
- The start address of the indirect symbol table (Indirect Symbol Table)
The number of function addresses in the lazy loading or non-lazy loading table can also be calculated from the size field of the corresponding section (see the description of the section_64 structure above): section->size / sizeof(void *).
The relationship between the four fishhook tables should be very clear at this point. What fishhook does is nothing more than using this formula to find the external function in the lazy loading and non-lazy loading tables whose name matches the target function name. Once a match is found, the address stored there is changed to the address of the custom function.
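Put together, the chain from a table slot to its name can be sketched as below. The three tables are passed in as plain pointers and `tables`, `slot_name` and `find_slot` are names invented for this example. The sketch trusts its input completely (no bounds checks, no handling of special indirect entries), so it only illustrates the formula and is not a substitute for fishhook's actual loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct sym64 {          /* simplified mirror of struct nlist_64 */
    uint32_t n_strx;
    uint8_t  n_type;
    uint8_t  n_sect;
    uint16_t n_desc;
    uint64_t n_value;
};

struct tables {
    const uint32_t     *indirect_symtab;
    const struct sym64 *symtab;
    const char         *strtab;
};

/* Name of slot i in a pointer section whose reserved1 field is given:
 * strtab + symtab[indirect_symtab[reserved1 + i]].n_strx */
static const char *slot_name(const struct tables *t, uint32_t reserved1,
                             uint32_t i)
{
    uint32_t symtab_index, strtab_offset;
    assert(t != NULL);
    symtab_index  = t->indirect_symtab[reserved1 + i];
    strtab_offset = t->symtab[symtab_index].n_strx;
    return t->strtab + strtab_offset;
}

/* Scans all section_size / sizeof(void *) slots of the section and
 * returns the index of the slot whose symbol is name, or -1. */
static long find_slot(const struct tables *t, uint32_t reserved1,
                      uint64_t section_size, const char *name)
{
    uint64_t count = section_size / sizeof(void *);
    for (uint64_t i = 0; i < count; i++)
        if (strcmp(slot_name(t, reserved1, (uint32_t)i), name) == 0)
            return (long)i;
    return -1;
}
```

The index returned by `find_slot` is the position in the lazy loading or non-lazy loading table whose stored address would then be replaced.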
What is linkedit_base
If other factors are not considered, the starting addresses of the above three tables can actually be obtained directly through the Mach-O header address + the corresponding offset. Take the Symbol Table as an example:
The starting address of the Mach-O header is 0x41C000, as mentioned above. Adding the symbol table offset (symoff, 0x3BECD8 here) gives 0x41C000 + 0x3BECD8 = 0x7DACD8. Viewing this address in MachOView confirms that it is indeed the location of the symbol table in the file:
At the same time, the derivation above also proves that symtab_command->symoff and symtab_command->stroff are offsets relative to the Mach-O header, not offsets relative to __LINKEDIT.
The way to calculate the starting address of the symbol table in the fishhook source code is:
nlist_t *symtab = (nlist_t *)(linkedit_base + symtab_cmd->symoff);
As a result, many blog posts say that linkedit_base is the base address of the __LINKEDIT segment and that symoff is an offset relative to the __LINKEDIT segment. This is completely wrong. What can be stated clearly here is:
linkedit_base is not the starting address of the __LINKEDIT segment in memory.
linkedit_base is not the starting address of the __LINKEDIT segment in memory.
linkedit_base is not the starting address of the __LINKEDIT segment in memory.
fishhook calculates linkedit_base as follows:
uintptr_t linkedit_base = (uintptr_t)slide + linkedit_segment->vmaddr - linkedit_segment->fileoff;
After ignoring the random address offset (ASLR) value, slide, this becomes:
linkedit_base = linkedit_segment->vmaddr - linkedit_segment->fileoff;
- linkedit_segment->vmaddr: the relative starting position of the __LINKEDIT segment in "virtual memory"
- linkedit_segment->fileoff: the relative starting position of the __LINKEDIT segment in the "file"
So what is the significance of subtracting these two?
To answer this question first look at the information given by MachOView:
As shown in the figure above, several facts can be read from the segments before the __LINKEDIT segment (marked by red boxes):
- The starting position of each segment in the "Mach-O file" is equal to the previous segment's File Offset + File Size, and the first segment starts from 0
- In the same way, the position of each segment in "virtual memory" is equal to the previous segment's VM Address + VM Size, and the first segment starts from 0
- For __PAGEZERO and __DATA, VM Size > File Size, while the two values are equal in the other segments. This means that some "gaps" (caused by memory alignment) appear after these two segments are loaded into virtual memory.
- __PAGEZERO occupies no storage space in the Mach-O file, but it occupies 16K of space in virtual memory.
Graphically represented as:
Therefore, the meaning of linkedit_base = linkedit_segment->vmaddr - linkedit_segment->fileoff is:
- The extra space (gaps) that the segments before __LINKEDIT gain due to memory alignment after Mach-O is loaded into memory.
- The extra space (gaps) that the segments before __LINKEDIT gain due to memory alignment after Mach-O is loaded into memory.
- The extra space (gaps) that the segments before __LINKEDIT gain due to memory alignment after Mach-O is loaded into memory.
This is the real physical meaning of linkedit_base; any other definition is wrong.
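This reading of vmaddr - fileoff as accumulated gaps can be checked with a small model in which segments are laid out back to back from 0, both in the file and in virtual memory, as described in the bullet points above. The segment sizes used to exercise it are hypothetical, chosen only so that the file offset of the fourth segment comes out as 0x394000; they are not taken from a real binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct seg { uint64_t vmsize, filesize; };

/* Lays the segments out back to back from 0 in both the file and virtual
 * memory, then returns vmaddr - fileoff for the segment at index target.
 * That difference is the sum of (vmsize - filesize) over the segments
 * before it, i.e. the accumulated gaps. */
static uint64_t gap_before(const struct seg *segs, size_t target)
{
    uint64_t vmaddr = 0, fileoff = 0;
    for (size_t i = 0; i < target; i++) {
        assert(segs[i].vmsize >= segs[i].filesize);
        vmaddr  += segs[i].vmsize;
        fileoff += segs[i].filesize;
    }
    return vmaddr - fileoff;
}
```

For example, with a __PAGEZERO-like segment of {0x4000, 0}, a __TEXT-like segment of {0x300000, 0x300000} and a __DATA-like segment of {0x98000, 0x94000}, the next segment starts at file offset 0x394000 and VM address 0x39C000, so the difference is 0x8000: the 0x4000 from the first segment plus the 0x4000 from the third.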
The __LINKEDIT segment itself has VM Size == File Size, which means that the three tables it contains (the symbol table, the string table, and the indirect symbol table) are memory-aligned with no gaps between them. Their offsets in the file + linkedit_base are therefore equal to their actual locations in memory.
// Symbol table
nlist_t *symtab = (nlist_t *)(linkedit_base + symtab_cmd->symoff);
// String table
char *strtab = (char *)(linkedit_base + symtab_cmd->stroff);
// Indirect symbol table
uint32_t *indirect_symtab = (uint32_t *)(linkedit_base + dysymtab_cmd->indirectsymoff);
Finally
fishhook has many applications in APM, anti-reverse-engineering, performance optimization, and other areas. In essence, fishhook is an in-depth application of the Mach-O file structure. I believe the structure of a Mach-O file becomes much easier to read once you understand these principles. Other applications related to the Mach-O file structure include symbol table restoration. In the next article, I will walk through the specific process of symbol table restoration with you (although the folder has not been created yet😂).