Published: 2025-12-09 16:02:48
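Early mobile platforms usually added a hardware watchdog (WatchDog) to the device. The software had to write a value to the watchdog hardware at regular intervals to show that it was still healthy (commonly called "feeding the dog"); if it failed to do so within the allotted time, the watchdog restarted the device. The principle is roughly this: once the system is running, the watchdog counter is started and counts on its own. If the counter is not cleared in time, it overflows and raises a watchdog interrupt, which resets the system.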
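The counter-and-feed principle can be illustrated with a toy model in plain Java. This is only a sketch: the class name HardwareWatchdogSim, the tick-based timing, and the timeout value are assumptions made for illustration, not the behaviour of any particular chip.

```java
/** Toy model of a hardware watchdog timer; names and numbers are illustrative. */
final class HardwareWatchdogSim {
    private final int timeoutTicks; // counter value at which the device is reset
    private int counter;            // ticks since the last feed
    private int resets;             // how many resets have been triggered

    HardwareWatchdogSim(int timeoutTicks) {
        this.timeoutTicks = timeoutTicks;
    }

    /** Software "feeds the dog": clears the counter. */
    void feed() {
        counter = 0;
    }

    /** One hardware clock tick; returns true if this tick caused a reset. */
    boolean tick() {
        counter++;
        if (counter >= timeoutTicks) {
            counter = 0; // the device restarts, so counting starts over
            resets++;
            return true;
        }
        return false;
    }

    int resets() {
        return resets;
    }
}
```

With a timeout of three ticks, software that calls feed() at least once every two ticks never triggers a reset, while software that stops feeding is reset on the third tick.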
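A phone is really a very powerful microcontroller: it runs many times faster than one, has many times more storage, runs a number of threads, and relies on all kinds of hardware and software working together. Android's SystemServer is a very complex process that hosts more than fifty services, which makes it the process most likely to run into trouble, so the threads running in SystemServer need to be monitored.
However, working the way a hardware watchdog does, with every thread feeding the dog at regular intervals, would waste CPU and complicate the program design. Android therefore provides the Watchdog class as a software watchdog that monitors the threads in SystemServer. When it detects a problem, Watchdog kills the SystemServer process.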
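Watchdog has two main jobs: checking whether a looper is blocked and checking for deadlock.
How to tell whether a thread is stuck

- MessageQueue.isPolling
- Monitor.monitor

HandlerChecker checks whether a looper is blocked; monitor checks for deadlock.
How Watchdog works
https://img-blog.csdnimg.cn/img_convert/e5c8133c7f86583251c775de4ceae9c0.jpeg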
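Watchdog is initialized and started inside the SystemServer process. SystemServer's run method registers and starts the various Android services, and that includes initializing and starting Watchdog. The code is as follows:

    final Watchdog watchdog = Watchdog.getInstance(); // line: 864
    watchdog.init(context, mActivityManagerService);

Watchdog is started in the second half of startOtherServices() in SystemServer, inside the callback passed to the SystemReady interface of AMS (ActivityManagerService):

    Watchdog.getInstance().start(); // line: 1852

The Watchdog constructor creates a number of HandlerChecker objects and adds them to its own monitoring list.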
| Checker name | Handler | Description | Timeout |
| --- | --- | --- | --- |
| foreground thread | FgThread.getHandler() | Foreground thread | 60s |
| main thread | new Handler(Looper.getMainLooper()) | Main thread | 60s |
| ui thread | UiThread.getHandler() | UI thread | 60s |
| i/o thread | IoThread.getHandler() | IO thread | 60s |
| display thread | DisplayThread.getHandler() | Display thread | 60s |
| PackageManager | addThread(mHandler, time) | Thread added by PackageManagerService itself | 10min |
| PackageManager | addThread(mHandler, time) | Thread added by PermissionManagerService itself | 60s |
| PowerManagerService | addThread(mHandler, time) | Thread added by PowerManagerService itself | 60s |
| ActivityManagerService | addThread(mHandler, time) | Thread added by ActivityManagerService itself | 60s |

| Monitor | Description | Timeout |
| --- | --- | --- |
| BinderThreadMonitor | Checks the Binder threads | 60s |
| OpenFdMonitor | Checks file descriptors (fd) | 60s |

| Service | Registration | Lock or object checked |
| --- | --- | --- |
| TvRemoteService | addMonitor(this) | mLock |
| ActivityManagerService | addMonitor(this) | this |
| MediaProjectionManagerService | addMonitor(this) | mLock |
| MediaRouterService | addMonitor(this) | mLock |
| MediaSessionService | addMonitor(this) | mLock |
| InputManagerService | addMonitor(this) | mInputFilterLock, nativeMonitor(mPtr) |
| PowerManagerService | addMonitor(this) | mLock |
| NetworkManagementService | addMonitor(this) | mConnector |
| StorageManagerService | addMonitor(this) | mVold |
| WindowManagerService | addMonitor(this) | mWindowMap |
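HandlerChecker checks the state of a handler thread and schedules the monitor callbacks. It works by looking at the MessageQueue of each Handler's looper to decide whether that thread is stuck. The threads in question all run inside the SystemServer process.
Watchdog builds many HandlerCheckers, which fall into two kinds: mMonitorChecker, which carries the registered monitors, and the other checkers, which have no monitors.
The two kinds have different emphases: the monitor checker looks for deadlock, while the others look for a blocked looper.
scheduleCheckLocked() is called from Watchdog's run method. It is the core method of HandlerChecker and starts the check that reveals whether the watched thread is deadlocked or blocked.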

    public void scheduleCheckLocked() {
        if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
            // If the target looper has recently been polling, then
            // there is no reason to enqueue our checker on it since that
            // is as good as it not being deadlocked.  This avoid having
            // to do a context switch to check the thread.  Note that we
            // only do this if mCheckReboot is false and we have no
            // monitors, since those would need to be executed at this point.
            mCompleted = true;
            return;
        }
        if (!mCompleted) {
            // we already have a check in flight, so no need
            return;
        }
        mCompleted = false;
        mCurrentMonitor = null;
        mStartTime = SystemClock.uptimeMillis();
        mHandler.postAtFrontOfQueue(this);
    }

scheduleCheckLocked mainly deals with the mMonitorChecker case. A HandlerChecker that has no monitors registered and whose looper is in the polling state is not checked at all. UiThread is an example: it is normally idle and therefore in the polling state.
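The "post a probe and see whether it runs" idea can be tried outside Android. The following is a minimal sketch in plain Java, not the Watchdog implementation: SimpleChecker and demo() are hypothetical names, a single-thread ExecutorService stands in for the Looper thread, and the probe is appended to the end of the queue, whereas the real code uses postAtFrontOfQueue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Plain-Java stand-in for a HandlerChecker; an executor plays the role of the Looper thread. */
final class SimpleChecker implements Runnable {
    private final ExecutorService target; // single worker thread being watched
    private boolean completed = true;     // true when no probe is in flight
    private long startNanos;              // when the in-flight probe was enqueued

    SimpleChecker(ExecutorService target) {
        this.target = target;
    }

    /** Enqueue a probe unless one is already in flight. */
    synchronized void scheduleCheck() {
        if (!completed) {
            return; // previous probe has not run yet; keep its start time
        }
        completed = false;
        startNanos = System.nanoTime();
        target.execute(this);
    }

    /** Runs on the watched thread; reaching this point proves the thread is alive. */
    @Override
    public synchronized void run() {
        completed = true;
    }

    synchronized boolean isCompleted() {
        return completed;
    }

    /** How long the current probe has been waiting, or 0 if none is in flight. */
    synchronized long elapsedMillis() {
        return completed ? 0 : TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }

    /** Blocks the worker, probes it, then unblocks it. Returns {whileBlocked, afterRelease}. */
    static boolean[] demo() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CountDownLatch gate = new CountDownLatch(1);
        try {
            worker.execute(() -> {          // simulate a callback that is stuck
                try {
                    gate.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            SimpleChecker checker = new SimpleChecker(worker);
            checker.scheduleCheck();
            boolean whileBlocked = checker.isCompleted(); // probe is queued behind the stuck task
            gate.countDown();                             // the stuck callback finishes
            worker.shutdown();                            // queued probe still runs
            worker.awaitTermination(5, TimeUnit.SECONDS);
            boolean afterRelease = checker.isCompleted();
            return new boolean[] {whileBlocked, afterRelease};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            gate.countDown();
            worker.shutdownNow();
        }
    }
}
```

While the worker is stuck the probe stays queued and isCompleted() is false; once the worker is free the probe runs and the flag flips back. A watcher only has to compare elapsedMillis() with its timeout.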
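mHandler.getLooper().getQueue().isPolling() can be used to judge whether the thread is stuck.
true: the looper is currently polling for events.
The method is implemented in MessageQueue. Its comment says that it returns whether the looper's thread is currently polling for more work to do, and that this is a good signal that the loop is still alive.
frameworks/base/core/java/android/os/MessageQueue.java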
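
    /**
     * Returns whether this looper's thread is currently polling for more work to do.
     * This is a good signal that the loop is still alive rather than being stuck
     * handling a callback.  Note that this method is intrinsically racy, since the
     * state of the loop can change before you get the result back.
     *
     * <p>This method is safe to call from any thread.
     *
     * @return True if the looper is currently polling for events.
     * @hide
     */
    public boolean isPolling() {
        synchronized (this) {
            return isPollingLocked();
        }
    }

| State | Meaning |
| --- | --- |
| COMPLETED | The message has been handled; the thread is not blocked |
| WAITING | The message has been pending for 0 to 29 seconds; keep running |
| WAITED_HALF | The message has been pending for 30 to 59 seconds; the thread may be blocked, so the current AMS stack state is saved and monitoring continues |
| OVERDUE | The message has been pending for more than 60 seconds; prepare to kill the current process. Reaching this state means the 60-second timeout has expired, and everything that follows deals with the timeout |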
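The mapping from waiting time to state in the table can be written as a small pure function. This is a sketch under assumptions: the class name CompletionStates and the method stateOf are made up for illustration, and the numeric values 1 to 3 are assumed (the article only shows COMPLETED = 0 and that larger values are more severe).

```java
/** Sketch of the four completion states and how elapsed time maps onto them. */
final class CompletionStates {
    static final int COMPLETED = 0;   // value shown in the article
    static final int WAITING = 1;     // assumed value
    static final int WAITED_HALF = 2; // assumed value
    static final int OVERDUE = 3;     // assumed value

    /**
     * @param completed whether the probe message has already been handled
     * @param elapsedMs how long the probe has been pending
     * @param maxWaitMs the checker's timeout, e.g. 60_000 for most threads
     */
    static int stateOf(boolean completed, long elapsedMs, long maxWaitMs) {
        if (completed) {
            return COMPLETED;
        }
        if (elapsedMs < maxWaitMs / 2) {
            return WAITING;      // first half of the timeout
        }
        if (elapsedMs < maxWaitMs) {
            return WAITED_HALF;  // second half of the timeout
        }
        return OVERDUE;          // timeout reached
    }
}
```

With a 60-second timeout, a probe pending for 29,999 ms is WAITING, one pending for 30,000 ms is WAITED_HALF, and one pending for 60,000 ms is OVERDUE.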
Each HandlerChecker here is constructed with the Handler of a HandlerThread that was created for it.

    java.lang.Object
     ↳ Thread implements Runnable
       ↳ HandlerThread extends Thread
         ↳ ServiceThread extends HandlerThread
           ↳ FgThread extends ServiceThread

frameworks/base/core/java/android/os/Process.java

    public static final int THREAD_PRIORITY_DEFAULT = 0;          // default thread priority
    public static final int THREAD_PRIORITY_LOWEST = 19;          // lowest thread priority
    public static final int THREAD_PRIORITY_BACKGROUND = 10;      // recommended for background threads
    public static final int THREAD_PRIORITY_FOREGROUND = -2;      // UI thread the user is interacting with; code cannot set this priority, the system adjusts to it as needed
    public static final int THREAD_PRIORITY_DISPLAY = -4;         // also a UI-related priority level, but higher than THREAD_PRIORITY_FOREGROUND
    public static final int THREAD_PRIORITY_URGENT_DISPLAY = -8;  // highest level for display threads, used for drawing the screen and retrieving input events
    public static final int THREAD_PRIORITY_AUDIO = -16;          // standard level for audio threads
    public static final int THREAD_PRIORITY_URGENT_AUDIO = -19;   // highest level for audio threads, higher than THREAD_PRIORITY_AUDIO
    public static final int THREAD_PRIORITY_MORE_FAVORABLE = -1;  // slightly higher than THREAD_PRIORITY_DEFAULT
    public static final int THREAD_PRIORITY_LESS_FAVORABLE = 1;   // slightly lower than THREAD_PRIORITY_DEFAULT

An app sets a thread's priority as shown below. Some levels cannot be set by apps and are assigned by the system.

    Process.setThreadPriority(Process.THREAD_PRIORITY_BACKGROUND + Process.THREAD_PRIORITY_LESS_FAVORABLE);

Printing Monitor information
Monitor is an interface used to check whether a service is deadlocked.

    public interface Monitor {
        void monitor();
    }

ActivityManagerService
WindowManagerService
PowerManagerService
InputManagerService
MediaSessionService
MediaRouterService
StorageManagerService
NetworkManagementService
NativeDaemonConnector
MediaProjectionManagerService
TvRemoteService
BinderThreadMonitor
OpenFdMonitor
Monitor is an interface, and quite a few classes implement it. The original figure showed the result of searching the Android 9.0 source for implementations.
[Figure unavailable: WatchdogImplClass.png, classes implementing Watchdog.Monitor in Android 9.0]
All the classes that implement this interface register themselves with Watchdog. AMS, for example, does so like this:

    public class ActivityManagerService extends IActivityManager.Stub
            implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
        // ......
        public ActivityManagerService(Context systemContext) {
            // ......
            Watchdog.getInstance().addMonitor(this);
            Watchdog.getInstance().addThread(mHandler);
            // ......
        }
        // ......
        /** In this method we try to acquire our lock to make sure that we have not deadlocked */
        public void monitor() {
            synchronized (this) { }
        }
        // ......
    }

run() is Watchdog's core method: it checks for thread deadlock and blocked loopers, collects diagnostic information, and kills the system_server process so that the system restarts.

    @Override
    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final List<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                long timeout = CHECK_INTERVAL;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    // Call scheduleCheckLocked() on every HandlerChecker
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }
                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }
                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }
                boolean fdLimitTriggered = false;
                if (mOpenFdMonitor != null) {
                    fdLimitTriggered = mOpenFdMonitor.monitor();
                }
                if (!fdLimitTriggered) {
                    final int waitState = evaluateCheckerCompletionLocked();
                    if (waitState == COMPLETED) {
                        // Thread state is normal; poll again
                        // The monitors have returned; reset
                        waitedHalf = false;
                        continue;
                    } else if (waitState == WAITING) {
                        // Blocked, but for less than 30s; keep monitoring
                        // still waiting but within their configured intervals; back off and recheck
                        continue;
                    } else if (waitState == WAITED_HALF) {
                        // Blocked for more than 30s: dump some system information, then monitor for another 30s
                        if (!waitedHalf) {
                            // We've waited half the deadlock-detection interval.  Pull a stack
                            // trace and wait another half.
                            ArrayList<Integer> pids = new ArrayList<Integer>();
                            pids.add(Process.myPid());
                            ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                    getInterestingNativePids());
                            waitedHalf = true;
                        }
                        continue;
                    }
                    // something is overdue!
                    blockedCheckers = getBlockedCheckersLocked();
                    subject = describeCheckersLocked(blockedCheckers);
                } else {
                    blockedCheckers = Collections.emptyList();
                    subject = "Open FD high water mark reached";
                }
                allowRestart = mAllowRestart;
            }
            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
            ArrayList<Integer> pids = new ArrayList<>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // Pass !waitedHalf so that just in case we somehow wind up here without having
            // dumped the halfway stacks, we properly re-initialize the trace file.
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, getInterestingNativePids());
            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            SystemClock.sleep(2000);
            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            doSysRq('w');
            doSysRq('l');
            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                public void run() {
                    mActivity.addErrorToDropBox(
                            "watchdog", null, "system_server", null, null,
                            subject, null, stack, null);
                }
            };
            dropboxThread.start();
            try {
                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}
            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            if (controller != null) {
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 = keep waiting, -1 = kill system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }
            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
                Slog.w(TAG, "*** GOODBYE!");
                Process.killProcess(Process.myPid());
                System.exit(10);
            }
            waitedHalf = false;
        }
    }

The run() method is an infinite loop: it keeps walking all the HandlerCheckers, calls their check method, waits thirty seconds, and then evaluates the state.
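The control flow of this loop can be distilled into a small replay function that takes one aggregated state per 30-second round and records what the loop would do. This is only a sketch: WatchdogLoopSketch, replay, and the action strings are invented for illustration, the numeric values of the states other than COMPLETED are assumed, and the debugger, activity-controller and fd branches are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Replays the decision chain of the watchdog loop without any threads or timers. */
final class WatchdogLoopSketch {
    static final int COMPLETED = 0;
    static final int WAITING = 1;     // assumed value
    static final int WAITED_HALF = 2; // assumed value
    static final int OVERDUE = 3;     // assumed value

    /** One entry of statePerRound is the result of evaluating all checkers after one wait. */
    static List<String> replay(int... statePerRound) {
        List<String> actions = new ArrayList<>();
        boolean waitedHalf = false;
        for (int state : statePerRound) {
            if (state == COMPLETED) {
                waitedHalf = false;              // everything answered; start afresh
                actions.add("ok");
            } else if (state == WAITING) {
                actions.add("keep waiting");     // less than half the timeout used
            } else if (state == WAITED_HALF) {
                if (!waitedHalf) {
                    actions.add("dump stacks");  // done only once per incident
                    waitedHalf = true;
                } else {
                    actions.add("keep waiting");
                }
            } else {
                actions.add("collect diagnostics and kill");
                waitedHalf = false;
            }
        }
        return actions;
    }
}
```

For a thread that gets stuck right after a healthy round, replay(COMPLETED, WAITED_HALF, OVERDUE) yields "ok", "dump stacks", "collect diagnostics and kill", which matches the 30-second and 60-second milestones described above.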
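Iterate over all the HandlerCheckers, call scheduleCheckLocked() on each, and record the start time

    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        hc.scheduleCheckLocked();
    }

Wait 30 seconds

    // Wait 30 seconds
    // uptimeMillis is used so that time the phone spends asleep is not counted;
    // while the phone is asleep the system services are asleep as well
    long start = SystemClock.uptimeMillis();
    while (timeout > 0) {
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        try {
            wait(timeout);
        } catch (InterruptedException e) {
            Log.wtf(TAG, e);
        }
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
    }

Evaluate the state of the checkers: this walks all the HandlerCheckers and takes the largest return value.
The largest return value is one of four cases: COMPLETED, WAITING, WAITED_HALF, or OVERDUE.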
fdMonitor

    public boolean monitor() {
        if (mFdHighWaterMark.exists()) {
            dumpOpenDescriptors();
            return true;
        }
        return false;
    }

Collect information
Kill the system process
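evaluateCheckerCompletionLocked() evaluates the state of the checkers by walking all the HandlerCheckers and returning the largest state.

    private int evaluateCheckerCompletionLocked() {
        int state = COMPLETED; // COMPLETED = 0
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            state = Math.max(state, hc.getCompletionStateLocked());
        }
        return state;
    }

The monitor() check targets deadlock between different threads, whereas the blocked-looper check targets a single thread such as a service's main thread. Posting a message with sendMessage() can therefore only tell whether the main thread is blocked; it cannot detect a deadlock. If another thread holds a lock for a long time without releasing it (for example inside a synchronized block) and the main thread is not itself waiting on that lock, a Message sent to the main thread is still handled by its Handler, so the handler check passes even though the service's lock is stuck.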
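Why a monitor catches what a posted message cannot may be easier to see in a small plain-Java experiment. The sketch below is an illustration only: MonitorProbe, FakeService and returnsWithin are hypothetical names, a ReentrantLock stands in for the service's synchronized lock, and a helper thread with a join timeout stands in for Watchdog's own timing.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Shows that monitor() hangs exactly when the service's lock is held by someone else. */
final class MonitorProbe {
    /** Same shape as the Monitor interface shown in the article. */
    interface Monitor {
        void monitor();
    }

    /** A made-up service whose monitor() only acquires and releases its own lock. */
    static final class FakeService implements Monitor {
        final ReentrantLock lock = new ReentrantLock();

        @Override
        public void monitor() {
            lock.lock();   // blocks for as long as another thread holds the lock
            lock.unlock();
        }
    }

    /** Runs monitor() on a helper thread; true if it returned within timeoutMs (must be > 0). */
    static boolean returnsWithin(Monitor m, long timeoutMs) {
        Thread probe = new Thread(m::monitor, "monitor-probe");
        probe.setDaemon(true); // do not keep the JVM alive if the lock is never released
        probe.start();
        try {
            probe.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !probe.isAlive();
    }
}
```

When nobody holds the lock, returnsWithin reports true almost immediately. If the caller first takes svc.lock and then probes, the helper thread stays blocked and the result is false, even though the caller's own thread is perfectly able to run code and handle messages.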
Two kinds of log entry are common: "Blocked in handler" and "Blocked in monitor".
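When triaging logs it can help to sort the entries of a watchdog subject line by these two prefixes. The helper below is a minimal sketch: WatchdogSubject and classify are made-up names, and the assumption that several entries are separated by commas is mine; only the two prefixes come from the text above, and the exact wording of a subject line may differ between Android releases.

```java
import java.util.ArrayList;
import java.util.List;

/** Labels each entry of a watchdog subject line by its prefix. */
final class WatchdogSubject {
    /** Returns one label per entry: HANDLER, MONITOR or UNKNOWN. */
    static List<String> classify(String subject) {
        List<String> kinds = new ArrayList<>();
        // Assumption: entries are comma-separated and contain no commas themselves.
        for (String entry : subject.split(",\\s*")) {
            if (entry.startsWith("Blocked in handler")) {
                kinds.add("HANDLER");   // the looper of that thread is blocked
            } else if (entry.startsWith("Blocked in monitor")) {
                kinds.add("MONITOR");   // a monitor() call is waiting on a lock
            } else {
                kinds.add("UNKNOWN");
            }
        }
        return kinds;
    }
}
```

For a hypothetical subject such as "Blocked in monitor SomeService on foreground thread, Blocked in handler on main thread", classify returns MONITOR followed by HANDLER: the first entry points at a lock, the second at a busy looper.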
After Watchdog detects that a SystemServer thread is deadlocked or that a message queue is blocked, it collects and prints diagnostic information. The code is the part of the run function listed earlier that starts at the comment "If we got here, that means that the system is most likely hung." It performs the following steps.
Write the event log

    EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

Dump stack traces

    final File stack = ActivityManagerService.dumpStackTraces(
            !waitedHalf, pids, null, null, getInterestingNativePids());

Dump kernel info

    // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
    doSysRq('w');
    doSysRq('l');

Collect dropbox information

    // Try to add the error to the dropbox, but assuming that the ActivityManager
    // itself may be deadlocked.  (which has happened, causing this statement to
    // deadlock and the watchdog as a whole to be ineffective)
    Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
        public void run() {
            mActivity.addErrorToDropBox(
                    "watchdog", null, "system_server", null, null,
                    subject, null, stack, null);
        }
    };
    dropboxThread.start();
    try {
        dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
    } catch (InterruptedException ignored) {}

Kill the system process: unless a debugger is attached, Watchdog kills its own process

    // Only kill the process if the debugger is not attached.
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    if (debuggerWasConnected >= 2) {
        Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
    } else if (debuggerWasConnected > 0) {
        Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
    } else if (!allowRestart) {
        Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
    } else {
        Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
        WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
        Slog.w(TAG, "*** GOODBYE!");
        Process.killProcess(Process.myPid());
        System.exit(10);
    }

The trace directory read below refers to /data/anr

    final String tracesDirProp = SystemProperties.get("dalvik.vm.stack-trace-dir", "");