Understanding stability

Application stability is one of the most important quality indicators and the foundation of an app's overall quality system. When an application has stability problems, the damage to the product and its users can be fatal. This article organizes application stability optimization along the following lines.

Note that stability in the broad sense is not just a crash problem; it also covers indicators such as UI freezes, power consumption, and device temperature. This article approaches the topic mainly from the perspective of crash rate.

  1. Classification of common stability indicators : the indicators the industry uses to measure stability, and the standards that define excellent stability.
  2. General steps for handling a crash : how to respond when a crash occurs.
  3. Business high availability construction : methods for preventing stability problems across an entire business.
  4. Long-term crash control : how to prevent stability from degrading over a longer period.
  5. Interview FAQs : frequently asked interview questions about stability.

Classification of common indicators of stability

Exception classification: Exception and ANR

Android crash problems can be divided into two categories: Exception and ANR.

  • Exception : an exception occurs inside the application's own code
    • JE : Java Exception, an uncaught exception thrown in Java code
    • NE : Native Exception, e.g. illegal memory access in native code, a deliberate program abort, etc.
  • ANR : the application is unresponsive; this is not necessarily caused by the application's own code logic.

JE, NE, and ANR can be counted independently or aggregated into an overall crash rate.
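Of the three, only JE can be intercepted directly in Java/Kotlin code. Below is a minimal sketch, where `report` is a placeholder callback you supply yourself; NE requires native tooling (e.g. a Breakpad-based collector), and ANR is detected from system traces rather than caught in-process.

```kotlin
// Minimal JE interception sketch; `report` is a hypothetical callback for your
// own persistence/upload logic, not a real library API.
fun installJavaCrashHandler(report: (Throwable) -> Unit) {
    val previous = Thread.getDefaultUncaughtExceptionHandler()
    Thread.setDefaultUncaughtExceptionHandler { thread, throwable ->
        report(throwable)                              // save the crash info first
        previous?.uncaughtException(thread, throwable) // then keep the default behavior
    }
}
```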

PV and UV

PV and UV measure how much an application is used.

  • PV : Page View, the total number of times a page is displayed, regardless of user. If one user opens the page 100 times, PV = 100
  • UV : Unique Visitor, the number of distinct users after deduplication. If one user opens the page 100 times, UV = 1

Therefore, the crash rate can be counted in both PV and UV dimensions.

  • PV crash rate : assesses the overall severity and frequency of the problem
  • UV crash rate : assesses how widely the problem affects the user base

Incremental crash rate and stock (existing) crash rate

  • Incremental crashes : caused by newly added code; these are the main source of fluctuation in the overall crash rate. They need to be detected and fixed early, before release, so they do not turn into stock crashes. Handling strategy: important and urgent
  • Stock crashes : caused by existing code; a problem that requires continuous follow-up. Resolving stock crashes steadily brings down the online crash rate. Handling strategy: important but not urgent

Crash rate evaluation index

How the rate is calculated:

  • Numerator : total number of JE + NE + ANR occurrences
  • Denominator : PV

A rate below 2‰ is considered qualified; below 1‰ (on the order of one in ten thousand) is excellent.
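As a worked example: if a version records 120 JE, 50 NE, and 30 ANR occurrences against 100,000 PV, the crash rate is (120 + 50 + 30) / 100,000 = 2‰, sitting exactly on the qualification boundary.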

General steps for handling a crash

When dealing with a crash, there are two steps: collecting on-site data and analyzing the cause of the crash.

Collect on-site data

A crash that can be reproduced is a good crash

When dealing with a crash, it is crucial to collect on-site information. The scene retains many valuable clues, which serve as a guide for further investigation.

On-site information can be divided into three levels: the crash itself, the runtime status, and the system status.

Mature crash-collection platforms (such as Bugly and Sentry) cover most of the following, but application developers still need to upload custom information such as user operation logs.

The crash itself

Mainly the crash stack plus process and thread information.

  • Crash stack : the most important piece of information, showing the function call stack at the moment of the crash. If the code is obfuscated, the stack must be de-obfuscated before analysis, so it is important to save the mapping file at packaging time (see the build sketch after this list).
  • Process and thread information : which process crashed, whether it was in the foreground or background, and whether the crashing thread was the UI thread or a worker thread
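A minimal build sketch of producing that mapping file, assuming R8/ProGuard with the Android Gradle Plugin's Kotlin DSL; the mapping is emitted under build/outputs/mapping/<variant>/mapping.txt and should be archived per release.

```kotlin
// build.gradle.kts — enable shrinking/obfuscation for release builds; R8 then
// emits a mapping.txt that must be saved for later stack de-obfuscation.
android {
    buildTypes {
        release {
            isMinifyEnabled = true
            proguardFiles(
                getDefaultProguardFile("proguard-android-optimize.txt"),
                "proguard-rules.pro"
            )
        }
    }
}
```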

Runtime status

  • Memory information
    • System memory usage: /proc/meminfo records the system's real-time memory status. When available system memory drops below 10% of the total, GC occurs frequently, leading to OOM, ANR, and similar problems.
    • Application memory usage: from PSS and RSS you can tell how much physical memory the application was using at the time; virtual memory totals are recorded in /proc/self/status, and for the detailed layout you need to check /proc/self/maps
  • Application operation path : the application should record the user's operation path, the currently open page, running services, etc. through instrumentation, logs, and similar means
  • Application information : application version, whether a hotfix has been applied, CPU architecture
  • File information : the number of open file descriptors (fd). A single process may open at most 1024 fds by default; exceeding roughly 800 is a dangerous state. All fds, i.e. the file names they point to, should be recorded and reported (see the collection sketch after this list).
  • Thread information : a single thread occupies roughly 2 MB of virtual memory. If the total number of threads exceeds 400, the process is at risk; all thread IDs and names should be reported.
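A minimal sketch of collecting the fd and thread counts above, reading the process's own /proc entries (which need no special permission for one's own process); `collectRuntimeStatus` is an assumed helper name.

```kotlin
import java.io.File

// Reads the current process's open-fd count and thread count from /proc/self.
fun collectRuntimeStatus(): Map<String, Int> {
    // Each entry under /proc/self/fd is one open file descriptor.
    val fdCount = File("/proc/self/fd").listFiles()?.size ?: -1

    // /proc/self/status contains a "Threads:" line with the live thread count.
    val threadCount = File("/proc/self/status").readLines()
        .firstOrNull { it.startsWith("Threads:") }
        ?.substringAfter(':')?.trim()?.toIntOrNull() ?: -1

    return mapOf("fdCount" to fdCount, "threadCount" to threadCount)
}
```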

System status

  • System hardware information : CPU, ABI, total memory, network connection status
  • System software information : Android version, Linux kernel version, WebView kernel version, OEM software version, whether rooted, whether running on an emulator (see the sketch after this list)
  • System log : Logcat, EventLog, etc.
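A minimal sketch of gathering the device-side fields above via android.os.Build; `collectSystemInfo` is an assumed helper name.

```kotlin
import android.os.Build

// Collects basic hardware/software fields worth attaching to every crash report.
fun collectSystemInfo(): Map<String, String> = mapOf(
    "manufacturer" to Build.MANUFACTURER,
    "model" to Build.MODEL,
    "androidVersion" to Build.VERSION.RELEASE,
    "sdkInt" to Build.VERSION.SDK_INT.toString(),
    "supportedAbis" to Build.SUPPORTED_ABIS.joinToString(",")
)
```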

Analyze the cause of the crash

This step analyzes the crash itself. It feels like detective work: investigating the crime scene and identifying the suspect.

Step 1: Single-point breakthrough

Analyze a single crash log in detail.

  1. Confirm the severity : does it crash the whole application, make the current page unusable, cause an interface request to fail, or go entirely unnoticed by the user?
  2. Confirm the priority : determine the handling priority according to severity; higher-priority issues are handled first.
  3. Examine the collected basic crash information , paying attention to exceptions caused by Android version compatibility : code may run fine on new Android versions yet crash on older ones. The focus differs by crash type:
    • JE: for about 90% of exceptions, the call chain can be located from the stack. For OOM in particular, pay attention to memory usage.
    • NE: examine the signal, code, fault addr, etc. The definitions of crash signals can be found in the official documentation; SIGSEGV (null or illegal pointer access) and SIGABRT (abort() calls, including those from ANR handling) are the most common
    • ANR: first examine the traces.txt file, focusing on the main thread's status, whether locks are held, I/O and CPU usage, and the state before and after GC
  4. Examine the Logcat logs : especially the Warning and Error levels. For ANR, search for the am_anr keyword.
  5. Check resource usage : e.g. memory (physical + virtual), file descriptors, number of threads

Step 2: Group aggregation

For crash reports that the backend has already aggregated, check whether they share commonalities. Commonalities useful for troubleshooting include:

Model, Android version, ROM version, manufacturer, ABI, whether rooted, whether an emulator, network status, the currently open page, process state, services running in the background, and so on. Pay particular attention to the Android system version: previously working code can start crashing after a version change, and a feature that only runs correctly on newer Android versions may crash on older ones. I once handled a case where a BadTokenException occasionally occurred when showing a Toast, and it only happened below Android 8.0.
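For reference, here is a hedged sketch of the widely shared workaround for that pre-8.0 Toast crash: wrap the Toast's internal handler via reflection and swallow the BadTokenException. The field names mTN and mHandler come from pre-Oreo AOSP sources and are not public API; Android 8.0+ catches this exception inside the framework, so the hook is skipped there.

```kotlin
import android.annotation.SuppressLint
import android.os.Build
import android.os.Handler
import android.os.Message
import android.view.WindowManager
import android.widget.Toast

@SuppressLint("DiscouragedPrivateApi")
fun showToastSafely(toast: Toast) {
    if (Build.VERSION.SDK_INT < Build.VERSION_CODES.O) {
        try {
            // Reach into Toast.mTN.mHandler and wrap it with a catching handler.
            val tnField = Toast::class.java.getDeclaredField("mTN").apply { isAccessible = true }
            val tn = tnField.get(toast)!!
            val handlerField = tn.javaClass.getDeclaredField("mHandler").apply { isAccessible = true }
            val original = handlerField.get(tn) as Handler
            handlerField.set(tn, object : Handler(original.looper) {
                override fun handleMessage(msg: Message) {
                    try {
                        original.handleMessage(msg) // original show/hide logic
                    } catch (ignored: WindowManager.BadTokenException) {
                        // The target window is gone; dropping the Toast beats crashing.
                    }
                }
            })
        } catch (ignored: ReflectiveOperationException) {
            // Hook failed on this ROM; fall through and show normally.
        }
    }
    toast.show()
}
```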

Long-term stability governance

Adopt different stability optimization strategies according to the stage of the development process.

Development stage

  • Development is the first gate of quality . The earlier problems are discovered, the lower the cost of fixing them.
  • Unify coding standards , strengthen coding-safety training, strengthen code review, practice pair programming, etc.
  • Optimize the code architecture : encapsulate common capabilities and low-level functionality into reusable modules, design unit tests, and handle situations such as failed interface responses with unified fallbacks and error handling

Testing phase

  • New feature testing : the new feature itself and the parts it affects
  • Main-flow regression testing : the core user flows
  • Upgrade-install testing : when the new version is installed over an old one, caches, databases, etc. must remain compatible
  • Compatibility testing : not only the company’s own phones, but also scenarios where the application is installed on third-party phones
  • Boundary conditions : e.g. server downtime, abnormal response data, weak-network and no-network conditions

Code merge phase

  • Conflict handling : resolve code conflicts first, paying special attention to third-party library dependencies; if different branches introduce different versions, confirm at merge time that the final version satisfies the needs of all branches.
  • Compilation check : after resolving conflicts, build, install, and regression-test the main flow.
  • Static scanning : use tools such as Lint to statically scan the code and uncover potential risks (see the configuration sketch after this list)
  • Automated testing : if the project integrates an automated testing framework such as Appium, run it automatically after the code is merged.
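A minimal configuration sketch for the static-scanning step, assuming the Android Gradle Plugin 7.0+ `lint` block in the Kotlin DSL:

```kotlin
// build.gradle.kts — fail the merge build on Lint errors and keep reports for CI.
android {
    lint {
        abortOnError = true        // block the merge when Lint finds an error
        checkReleaseBuilds = true  // also scan release variants
        xmlReport = true           // machine-readable output for the CI pipeline
    }
}
```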

Release stage

  • Adopt a grayscale strategy (see the bucketing sketch after this list): release the new version to a small group of users first, then gradually widen the rollout. Only when the grayscale version’s stability, business data, and other indicators are qualified (with coverage of no less than 5% of the user base) is it released in full. Multiple grayscale rounds can be run, including targeted grayscale on specific models and OS versions, to catch problems that appear only under particular conditions.
  • Adopt an ABTest strategy : during grayscale, ship a package with the new feature plus a comparison (control) package and compare their stability and business data. The full online release is not used as the control because the two populations differ in size, which would skew some metrics; ABTest avoids this by keeping the two groups the same size. Targeted ABTests can also be run on specified system versions, models, and user groups.
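A minimal bucketing sketch for the grayscale strategy; this is an assumed scheme, not any particular platform's API. Hashing the user ID into 1000 stable buckets keeps the same user in the same rollout group across sessions.

```kotlin
// Returns true if this user falls inside the current rollout range.
// rolloutPermille = 50 means 5% of users receive the new version.
fun inGrayscale(userId: String, rolloutPermille: Int): Boolean {
    val bucket = ((userId.hashCode() % 1000) + 1000) % 1000 // stable 0..999 bucket
    return bucket < rolloutPermille
}
```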

Operational stage

  • After the application goes live, keep watching online stability fluctuations, monitoring via daily reports and similar means. When the crash rate exceeds its threshold or the trend fluctuates, raise alarms promptly and notify the relevant parties.
  • When an unavoidable exception occurs, apply rollback and downgrade strategies (see the “Business high availability solution construction” section below)
  • Early in a version’s life, focus on incremental exceptions . Once those are handled, regularly clean up stock exceptions to bring down the overall crash rate.

Business high availability solution construction

Stability optimization is not just about reducing the crash rate; its fundamental goal is to keep the business highly available. Sometimes an exception does not crash the application, yet the business becomes unavailable: page navigation fails, an interface request fails with no retry, and so on. These are exactly the scenarios we strive to avoid.

To improve business availability, the following ideas are available:

  1. Sort out the business processes and collect statistics on key and core paths, especially the page load success rate, download-and-install success rate, etc.
  2. Collect data non-intrusively through AOP and similar techniques; this gives comprehensive coverage without omissions and also reduces development cost.
  3. Build a data dashboard and push it to stakeholders via IM messages, daily emails, etc. The data falls into two categories: business data and technical data.
  4. Establish alarm strategies, usually including the following
    • Threshold alarms: the absolute value of an indicator, such as the login success rate, payment success rate, or the occurrence rate of exceptions that did not cause a crash.
    • Trend alarms: changes compared with the same period before
    • Specific-indicator alarms: fired on every single occurrence, e.g. a payment failure
  5. For specific user problems that are hard to reproduce in test or development environments, information can be gathered by retrieving the full logs from the user’s device.
  6. After discovering an anomaly, adopt a stop-loss strategy to limit the damage (see the kill-switch sketch after this list).
    • Close the corresponding feature entrance via a switch in the configuration center
    • If the problem lies in a jump parameter delivered by the server, the server can change the data it sends so the jump goes to a working page.
    • Replace the faulty logic using methods such as hot fixes
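A minimal kill-switch sketch for the configuration-center switch in point 6. All names here are assumptions; any config center, self-built or third-party, can back the flag store.

```kotlin
// Hypothetical in-memory flag store fed by your configuration center.
object RemoteConfig {
    private val flags = mutableMapOf<String, Boolean>()

    // Call whenever the config center delivers fresh flags.
    fun update(newFlags: Map<String, Boolean>) = flags.putAll(newFlags)

    // Default to "enabled" so a missing flag never hides a healthy feature.
    fun isEnabled(feature: String): Boolean = flags[feature] ?: true
}

// Example: gate a feature entrance on the remotely controlled switch.
fun shouldShowEntrance(): Boolean = RemoteConfig.isEnabled("new_checkout_flow")
```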

Construction of a client-side disaster recovery plan

For crashes caused by exceptions, if we only react to online user feedback with the traditional process of local debugging, development, testing, release, grayscale, and full rollout, the cycle is long, and the problem’s impact keeps growing in the meantime. Since the number of affected users is an important input to the grading of online incidents, we cannot rely entirely on the traditional development process; a dedicated disaster recovery plan is needed.

  1. New feature configuration switch : for newly developed important features, add a switch in the global configuration interface. When the feature has stability problems online (not necessarily caused by the client; the server, network operator, etc. may also be at fault), turn the switch off and hide the feature entrance
  2. Dynamically configured routing : deliver routing tables dynamically, redirect problematic pages to a default error-handling page, etc.
  3. Hotfix : replace the problematic classes with a hotfix
  4. Dynamic frameworks : if the project uses RN, Weex, or Flutter, use their dynamic update capabilities
  5. Safe mode : refer to Tmall’s safe-mode plan. Count the exceptions that occur during the application’s startup phase; if the count reaches a threshold, clear the application data (a minimal sketch follows this list).
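A minimal safe-mode sketch in the spirit of point 5; the class name, preference keys, and threshold are all assumptions. It counts startup crashes in SharedPreferences and triggers a recovery action once the threshold is reached.

```kotlin
import android.content.Context

class SafeModeGuard(context: Context, private val threshold: Int = 3) {
    private val prefs = context.getSharedPreferences("safe_mode", Context.MODE_PRIVATE)

    // Call as early as possible in Application.onCreate().
    fun onStartupBegin(onRecover: () -> Unit) {
        val crashes = prefs.getInt("startup_crash_count", 0)
        if (crashes >= threshold) {
            prefs.edit().putInt("startup_crash_count", 0).apply()
            onRecover() // e.g. clear caches/databases, disable risky features
            return
        }
        // Pessimistically assume this startup will crash; a successful startup
        // clears the counter via onStartupFinished().
        prefs.edit().putInt("startup_crash_count", crashes + 1).apply()
    }

    // Call once the app is stable (e.g. the first page is fully drawn).
    fun onStartupFinished() {
        prefs.edit().putInt("startup_crash_count", 0).apply()
    }
}
```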