Android Watchdog Mechanism

Published: 2025-12-09 16:02:48

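Early phone platforms usually added a hardware watchdog to the device. The software had to write a value to the watchdog hardware at regular intervals to show that it was still healthy (commonly called "feeding the dog"); if it failed to do so within the specified time, the watchdog restarted the device. In outline: once the system is running, the watchdog counter is started and counts automatically. If the counter is not cleared within a certain time, it overflows, which raises a watchdog interrupt and resets the system.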
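The counter behaviour described above can be modelled in a few lines of plain Java. This is only an illustrative sketch: the class name `WatchdogCounterModel`, the tick granularity and the limit are assumptions made for the example, not part of any real hardware or Android API.

```java
/** Hypothetical model of a hardware watchdog counter (illustrative only, not an Android API). */
final class WatchdogCounterModel {
    private final int limit;      // ticks allowed between two feeds (assumed value)
    private int counter;          // ticks counted since the last feed
    private boolean resetFired;   // true once the counter has overflowed

    WatchdogCounterModel(int limit) {
        this.limit = limit;
    }

    /** "Feeding the dog": the software clears the counter. */
    void feed() {
        counter = 0;
    }

    /** One timer tick; the counter overflows when it reaches the limit. */
    void tick() {
        if (resetFired) {
            return;
        }
        counter++;
        if (counter >= limit) {
            resetFired = true; // a real device would reset itself here
        }
    }

    boolean hasReset() {
        return resetFired;
    }
}
```

As long as `feed()` is called before `limit` ticks have passed, `hasReset()` stays false; one missed feeding window is enough to trigger the reset.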

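A phone is essentially a very powerful microcontroller: it runs far faster and has far more storage than a typical microcontroller, it runs many threads, and its hardware and software components work together. Android's SystemServer is a very complex process that hosts more than fifty services, which makes it the process most likely to run into trouble, so the threads running inside SystemServer need to be monitored.

If this were done the way a hardware watchdog works, with every thread feeding the dog at intervals, it would waste a lot of CPU time and complicate the program design. Android therefore provides the Watchdog class, a software watchdog that monitors the threads in SystemServer. When it detects a problem, Watchdog kills the SystemServer process.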

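What Watchdog does

Watchdog detects two kinds of problems:

  • Blocked in monitor: the monitor() implementation of a monitored service blocks
  • Blocked in handler: the message queue of a monitored thread stops processing messages

How a thread is judged to be stuck

    MessageQueue.isPolling: HandlerChecker uses it to check whether the Looper is blocked
    Monitor.monitor: checks for deadlock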

    How Watchdog works

    Figure: how Watchdog works (https://img-blog.csdnimg.cn/img_convert/e5c8133c7f86583251c775de4ceae9c0.jpeg)

    Starting Watchdog

    Watchdog is initialized and started inside the SystemServer process. SystemServer's run method registers and starts the Android services, and it also initializes Watchdog:

        final Watchdog watchdog = Watchdog.getInstance(); // line: 864
        watchdog.init(context, mActivityManagerService);

    In the second half of startOtherServices() in SystemServer, Watchdog is started from the callback passed to the systemReady interface of AMS (ActivityManagerService):

        Watchdog.getInstance().start(); // line: 1852

    The Watchdog constructor

        super("watchdog");
        // Initialize a handler checker for each thread we want to check.
        // Background threads are not checked here.
        // The shared foreground thread is the main checker; it is also the one
        // that runs the monitors used to check other threads.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add a checker for the main thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add a checker for the shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // Add a checker for the shared IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // Add a checker for the shared display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));

        // Initialize the monitor for Binder threads.
        addMonitor(new BinderThreadMonitor());

        mOpenFdMonitor = OpenFdMonitor.create();

        // See the notes on DEFAULT_TIMEOUT.
        assert DB ||
                DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;

    The Watchdog constructor creates several HandlerChecker objects and adds them to its own list of checkers.

    Handlers monitored by Watchdog

    Thread name | Handler | Description | Timeout
    foreground thread | FgThread.getHandler() | Foreground thread | 60s
    main thread | new Handler(Looper.getMainLooper()) | Main thread | 60s
    ui thread | UiThread.getHandler() | UI thread | 60s
    i/o thread | IoThread.getHandler() | IO thread | 60s
    display thread | DisplayThread.getHandler() | Display thread | 60s
    PackageManager | addThread(mHandler, time) | Thread added by PackageManagerService itself | 10min
    PackageManager | addThread(mHandler, time) | Thread added by PermissionManagerService itself | 60s
    PowerManagerService | addThread(mHandler, time) | Thread added by PowerManagerService itself | 60s
    ActivityManagerService | addThread(mHandler, time) | Thread added by ActivityManagerService itself | 60s

    Monitors registered with Watchdog

    Monitor | Description | Timeout
    BinderThreadMonitor | Checks Binder threads | 60s
    OpenFdMonitor | Checks file descriptors (fd) | 60s

    Service | Registration | Checked in monitor()
    TvRemoteService | addMonitor(this) | mLock
    ActivityManagerService | addMonitor(this) | this
    MediaProjectionManagerService | addMonitor(this) | mLock
    MediaRouterService | addMonitor(this) | mLock
    MediaSessionService | addMonitor(this) | mLock
    InputManagerService | addMonitor(this) | mInputFilterLock, nativeMonitor(mPtr);
    PowerManagerService | addMonitor(this) | mLock
    NetworkManagementService | addMonitor(this) | mConnector
    StorageManagerService | addMonitor(this) | mVold
    WindowManagerService | addMonitor(this) | mWindowMap

    HandlerChecker

    public final class HandlerChecker implements Runnable

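    HandlerChecker checks the state of a handler thread and schedules the monitor callbacks. It works by looking at the MessageQueue of each Handler's Looper to decide whether that thread is stuck. The threads in question all run inside the SystemServer process.

    Watchdog creates many HandlerChecker objects, which fall into two kinds:

    • Monitor Checker: checks for deadlocks on Monitor objects. Core system services such as AMS and WMS are Monitor objects.
    • Looper Checker: checks whether a thread's message queue has been busy for too long. The message queue of the foreground thread used by Watchdog itself and the global ui, io and display message queues are all checked. The message queues of some other important threads, such as those of AMS and PKMS, are also added to the Looper Checkers when the corresponding objects are initialized.

    The two kinds of HandlerChecker guard against different problems:

    • Monitor Checker warns us not to hold the object lock of a core system service for a long time, otherwise many functions are blocked.
    • Looper Checker warns us not to occupy a message queue for a long time, otherwise other messages are not handled.

    The HandlerChecker constructor

        public final class HandlerChecker implements Runnable {
            private final Handler mHandler;
            private final String mName;
            private final long mWaitMax;
            private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
            private boolean mCompleted;
            private Monitor mCurrentMonitor;
            private long mStartTime;

            HandlerChecker(Handler handler, String name, long waitMaxMillis) {
                mHandler = handler;        // handler of the thread being checked
                mName = name;              // name of the checker
                mWaitMax = waitMaxMillis;  // timeout
                mCompleted = true;         // completion state of the current check
            }
        }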

    HandlerChecker::scheduleCheckLocked

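    This method is called from Watchdog's run method. It is the core method of HandlerChecker and schedules a check of whether the checker's thread is blocked or deadlocked.

        public void scheduleCheckLocked() {
            if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
                // If the target looper has recently been polling, then
                // there is no reason to enqueue our checker on it since that
                // is as good as it not being deadlocked.  This avoid having
                // to do a context switch to check the thread.  Note that we
                // only do this if mCheckReboot is false and we have no
                // monitors, since those would need to be executed at this point.
                mCompleted = true;
                return;
            }

            if (!mCompleted) {
                // we already have a check in flight, so no need
                return;
            }

            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            mHandler.postAtFrontOfQueue(this);
        }

  • isPolling() is the key method for judging whether the thread's Looper is idle. If it returns true and the checker has no monitors, the Looper is polling for events and running normally, so mCompleted is set to true and the method returns without posting anything.
  • If mCompleted is false, a check is already in flight, so the method returns.
  • Otherwise mHandler.postAtFrontOfQueue(this) posts the checker itself to the front of the queue, and its run method executes later.
  • In practice scheduleCheckLocked mainly matters for mMonitorChecker. Other HandlerCheckers that have no registered monitors and are in the polling state are not checked; UiThread, for example, is in the polling state most of the time.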
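The "post yourself to the front of the queue, then see whether you got run" idea can be tried without Android. The sketch below is a simplified stand-in, not the real implementation: `MiniChecker` is a hypothetical class, a plain `Deque<Runnable>` plays the role of the target thread's message queue, and the caller supplies the current time.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical, simplified stand-in for HandlerChecker; not the Android class. */
final class MiniChecker implements Runnable {
    private final Deque<Runnable> queue; // stands in for the target thread's MessageQueue
    private boolean completed = true;    // true when no check is in flight
    private long startTimeMs;            // when the current check was scheduled

    MiniChecker(Deque<Runnable> queue) {
        this.queue = queue;
    }

    /** Schedules a check unless one is already in flight. */
    void scheduleCheck(long nowMs) {
        if (!completed) {
            return; // a check is already queued, keep its original start time
        }
        completed = false;
        startTimeMs = nowMs;
        queue.addFirst(this); // same idea as postAtFrontOfQueue(this)
    }

    /** Runs on the target thread; reaching this point proves the queue is moving. */
    @Override
    public void run() {
        completed = true;
    }

    boolean isCompleted() {
        return completed;
    }

    long startTimeMs() {
        return startTimeMs;
    }
}
```

If the thread that drains the queue is stuck in an earlier task, `run()` is never reached and `isCompleted()` stays false; the time elapsed since `startTimeMs()` is then what a watchdog would compare against its timeout.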

    MessageQueue::isPolling

    mHandler.getLooper().getQueue().isPolling() tells whether the thread is idle rather than stuck. A return value of true means the Looper is currently polling for events.

    The method is implemented in MessageQueue. Its comment says that it returns whether the looper's thread is currently polling for more work to do, and that this is a good signal that the loop is still alive.

    frameworks/base/core/java/android/os/MessageQueue.java

        /**
         * Returns whether this looper's thread is currently polling for more work to do.
         * This is a good signal that the loop is still alive rather than being stuck
         * handling a callback.  Note that this method is intrinsically racy, since the
         * state of the loop can change before you get the result back.
         *
         * <p>This method is safe to call from any thread.
         *
         * @return True if the looper is currently polling for events.
         * @hide
         */
        public boolean isPolling() {
            synchronized (this) {
                return isPollingLocked();
            }
        }
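To make the "polling means alive" signal concrete, here is a small plain-Java sketch. It is not how MessageQueue is implemented: `MiniLooper` is a hypothetical class that uses a `BlockingQueue` and a volatile flag, and `demo()` is a helper added only to exercise it.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a Looper thread; not the Android MessageQueue. */
final class MiniLooper extends Thread {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private volatile boolean polling;

    MiniLooper() {
        setDaemon(true);
    }

    void post(Runnable task) {
        queue.add(task);
    }

    /** Racy by nature: the state may change right after the read. */
    boolean isPolling() {
        return polling;
    }

    @Override
    public void run() {
        try {
            while (true) {
                polling = true;               // waiting for more work
                Runnable task = queue.take();
                polling = false;              // busy handling a callback
                task.run();
            }
        } catch (InterruptedException e) {
            // interrupted: leave the loop
        }
    }

    /** Returns true if the flag reads "polling" when idle and "not polling" while a task runs. */
    static boolean demo() {
        MiniLooper looper = new MiniLooper();
        looper.start();
        try {
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(2);
            while (!looper.isPolling() && System.nanoTime() < deadline) {
                Thread.sleep(5);
            }
            boolean idleSeen = looper.isPolling();
            CountDownLatch started = new CountDownLatch(1);
            CountDownLatch release = new CountDownLatch(1);
            looper.post(() -> {
                started.countDown();
                try {
                    release.await(); // simulate a callback that is stuck
                } catch (InterruptedException ignored) {
                    // fall through and finish the task
                }
            });
            boolean running = started.await(2, TimeUnit.SECONDS);
            boolean busySeen = running && !looper.isPolling();
            release.countDown();
            return idleSeen && busySeen;
        } catch (InterruptedException e) {
            return false;
        } finally {
            looper.interrupt();
        }
    }
}
```

While the posted task is parked on the latch, `isPolling()` reports false, which is the situation in which a checker would have to be posted to the queue and timed.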

    HandlerChecker::run

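        @Override
        public void run() {
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                mCurrentMonitor.monitor();
            }

            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }

  • run() iterates over the checker's own monitors and calls monitor() on each. If a monitor blocks, mCompleted stays false.
  • The for loop checks whether anything in the monitor list is blocked; only mMonitorChecker enters this loop.
  • The other HandlerCheckers have an empty mMonitors list, so they never execute the loop body.

    HandlerChecker::getCompletionStateLocked

        public int getCompletionStateLocked() {
            if (mCompleted) {
                return COMPLETED;
            } else {
                long latency = SystemClock.uptimeMillis() - mStartTime;
                if (latency < mWaitMax/2) {
                    return WAITING;
                } else if (latency < mWaitMax) {
                    return WAITED_HALF;
                }
            }
            return OVERDUE;
        }

  • Returns the completion state. mStartTime is set in scheduleCheckLocked.
  • When the check has not completed, the else branch computes the elapsed time and returns the matching state code.

    Thread states (with the default 60s timeout)

    State | Description
    COMPLETED | The message has been handled; the thread is not blocked
    WAITING | Handling has taken 0 to 29 seconds; keep running
    WAITED_HALF | Handling has taken 30 to 59 seconds; the thread may be blocked, so the current AMS stack state is saved and monitoring continues
    OVERDUE | Handling has taken 60 seconds or more; the process is about to be killed. Everything after this point deals with the timeout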
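The thresholds in the table follow directly from the latency comparison. The sketch below restates it as a pure function so the boundaries can be checked; the class name `CompletionStates` is hypothetical, and the numeric values of the constants (0 to 3, ordered by severity) are an assumption of this example.

```java
/** Sketch of the state thresholds; constant values are assumed to be ordered by severity. */
final class CompletionStates {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Maps "is the check done" plus the elapsed time to one of the four states. */
    static int of(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMs < waitMaxMs / 2) {
            return WAITING;      // less than half the timeout has passed
        }
        if (latencyMs < waitMaxMs) {
            return WAITED_HALF;  // at least half, but not yet the full timeout
        }
        return OVERDUE;          // the full timeout has passed
    }
}
```

With a 60000 ms timeout, 29999 ms maps to WAITING, 30000 ms to WAITED_HALF and 60000 ms to OVERDUE.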

    Inheritance of the HandlerThread classes

    The Handlers passed to these HandlerCheckers all belong to HandlerThread threads:

        java.lang.Object
          ↳ Thread implements Runnable
            ↳ HandlerThread extends Thread
              ↳ ServiceThread extends HandlerThread
                ↳ FgThread extends ServiceThread

    Threads behind the initial HandlerCheckers

        public ServiceThread(String name, int priority, boolean allowIo)

        private FgThread() {
            super("android.fg", android.os.Process.THREAD_PRIORITY_DEFAULT, true /*allowIo*/);
        }

        private UiThread() {
            super("android.ui", Process.THREAD_PRIORITY_FOREGROUND, false /*allowIo*/);
        }

        private IoThread() {
            super("android.io", android.os.Process.THREAD_PRIORITY_DEFAULT, true /*allowIo*/);
        }

        private DisplayThread() {
            // DisplayThread runs important stuff, but these are not as important as things
            // running in AnimationThread. Thus, set the priority to one lower.
            super("android.display", Process.THREAD_PRIORITY_DISPLAY + 1, false /*allowIo*/);
        }

    Android thread priorities

    frameworks/base/core/java/android/os/Process.java

        public static final int THREAD_PRIORITY_DEFAULT = 0;          // default thread priority
        public static final int THREAD_PRIORITY_LOWEST = 19;          // lowest thread priority
        public static final int THREAD_PRIORITY_BACKGROUND = 10;      // recommended for background threads
        public static final int THREAD_PRIORITY_FOREGROUND = -2;      // UI thread the user is interacting with; normally not set from code, the system adjusts threads to it as needed
        public static final int THREAD_PRIORITY_DISPLAY = -4;         // also related to UI interaction, but higher than THREAD_PRIORITY_FOREGROUND
        public static final int THREAD_PRIORITY_URGENT_DISPLAY = -8;  // highest display priority, used for drawing the screen and retrieving input events
        public static final int THREAD_PRIORITY_AUDIO = -16;          // standard priority of audio threads
        public static final int THREAD_PRIORITY_URGENT_AUDIO = -19;   // highest audio priority, higher than THREAD_PRIORITY_AUDIO
        public static final int THREAD_PRIORITY_MORE_FAVORABLE = -1;  // slightly higher priority relative to THREAD_PRIORITY_DEFAULT
        public static final int THREAD_PRIORITY_LESS_FAVORABLE = 1;   // slightly lower priority relative to THREAD_PRIORITY_DEFAULT

    An application can set a thread's priority as shown below. Some levels cannot be set by applications and are assigned by the system.

        Process.setThreadPriority(Process.THREAD_PRIORITY_BACKGROUND + Process.THREAD_PRIORITY_LESS_FAVORABLE);

    describeBlockedStateLocked

        public String describeBlockedStateLocked() {
            if (mCurrentMonitor == null) {
                return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
            } else {
                return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                        + " on " + mName + " (" + getThread().getName() + ")";
            }
        }

    This method builds the text that describes the blocked handler or monitor and that later appears in the log.

    Monitor

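    Monitor is an interface. Watchdog calls its single monitor() method to check whether the implementing object is deadlocked.

        public interface Monitor {
            void monitor();
        }

    Classes that implement the Watchdog.Monitor interface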

    ActivityManagerService
    WindowManagerService
    PowerManagerService
    InputManagerService
    MediaSessionService
    MediaRouterService
    StorageManagerService
    NetworkManagementService
    NativeDaemonConnector
    MediaProjectionManagerService
    TvRemoteService

    BinderThreadMonitor
    OpenFdMonitor

    Quite a few classes implement the Monitor interface. The classes listed above are the result of a search in the Android 9.0 source.


    Using Watchdog

    All of these classes implement the interface and register themselves with Watchdog. AMS, for example:

        public class ActivityManagerService extends IActivityManager.Stub
                implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
            ......
            public ActivityManagerService(Context systemContext) {
                ......
                Watchdog.getInstance().addMonitor(this);
                Watchdog.getInstance().addThread(mHandler);
                ......
            }
            ......
            /** In this method we try to acquire our lock to make sure that we have not deadlocked */
            public void monitor() {
                synchronized (this) { }
            }
            ......
        }
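Why an empty synchronized block is enough to detect a stuck lock can be shown outside Android. In this sketch `LockProbe`, `Service` and `returnsWithin` are hypothetical names; the probe runs monitor() on a helper thread and waits with a timeout, which is a simplification of what Watchdog does with its HandlerChecker and 60s limit.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of a lock probe; not the Android Watchdog implementation. */
final class LockProbe {
    interface Monitor {
        void monitor();
    }

    /** A service whose monitor() only tries to take and release its own lock. */
    static final class Service implements Monitor {
        final Object lock = new Object();

        @Override
        public void monitor() {
            synchronized (lock) {
                // nothing to do: getting here proves the lock could be acquired
            }
        }
    }

    /** Returns true if monitor() came back within timeoutMs, false if it is still blocked. */
    static boolean returnsWithin(Monitor m, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "lock-probe");
            t.setDaemon(true); // a blocked probe must not keep the JVM alive
            return t;
        });
        try {
            Runnable probe = m::monitor;
            Future<?> result = executor.submit(probe);
            result.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // the lock is held by someone else for too long
        } catch (InterruptedException | ExecutionException e) {
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

If some thread sits inside `synchronized (service.lock)` for longer than the timeout, `returnsWithin` reports false even though every message queue may still be running, which is the "Blocked in monitor" case.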

    Watchdog::addThread

        public void addThread(Handler thread) {
            addThread(thread, DEFAULT_TIMEOUT); // 60s
        }

        public void addThread(Handler thread, long timeoutMillis) {
            synchronized (this) {
                if (isAlive()) {
                    throw new RuntimeException("Threads can't be added once the Watchdog is running");
                }
                final String name = thread.getLooper().getThread().getName();
                mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
            }
        }

  • addThread passes a thread's Handler to Watchdog, which creates a new HandlerChecker for it
  • and adds the new HandlerChecker to its list of checkers.

    Watchdog::addMonitor

        public void addMonitor(Monitor monitor) {
            synchronized (this) {
                if (isAlive()) {
                    throw new RuntimeException("Monitors can't be added once the Watchdog is running");
                }
                mMonitorChecker.addMonitor(monitor);
            }
        }

  • addMonitor passes in a monitor; Watchdog later calls its monitor() method to determine whether it is blocked.
  • All Monitors are added to mMonitorChecker, so mMonitorChecker is the only checker that holds Monitors.
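Both methods refuse registration once the watchdog thread is running. A minimal sketch of that guard, with a boolean flag standing in for Thread.isAlive() and a hypothetical `MiniWatchdog` class that stores only checker names:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical registry mirroring the "no registration after start" rule. */
final class MiniWatchdog {
    private final List<String> checkerNames = new ArrayList<>();
    private boolean started; // stands in for Thread.isAlive() in this sketch

    synchronized void addThread(String name) {
        if (started) {
            throw new RuntimeException("Threads can't be added once the watchdog is running");
        }
        checkerNames.add(name);
    }

    synchronized void start() {
        started = true;
    }

    synchronized int checkerCount() {
        return checkerNames.size();
    }
}
```

Freezing the list at start means the checking loop never has to deal with checkers appearing halfway through an interval; services therefore have to register during their own initialization.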
    Watchdog::run()

    The core method of Watchdog. It checks for thread deadlocks and blocked Loopers, collects diagnostic information, and kills the system_server process so that the system restarts.

        @Override
        public void run() {
            boolean waitedHalf = false;
            while (true) {
                final List<HandlerChecker> blockedCheckers;
                final String subject;
                final boolean allowRestart;
                int debuggerWasConnected = 0;
                synchronized (this) {
                    long timeout = CHECK_INTERVAL;
                    // Make sure we (re)spin the checkers that have become idle within
                    // this wait-and-check interval
                    for (int i=0; i<mHandlerCheckers.size(); i++) {
                        // Call scheduleCheckLocked() on every HandlerChecker
                        HandlerChecker hc = mHandlerCheckers.get(i);
                        hc.scheduleCheckLocked();
                    }

                    if (debuggerWasConnected > 0) {
                        debuggerWasConnected--;
                    }

                    // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                    // wait while asleep. If the device is asleep then the thing that we are waiting
                    // to timeout on is asleep as well and won't have a chance to run, causing a false
                    // positive on when to kill things.
                    long start = SystemClock.uptimeMillis();
                    while (timeout > 0) {
                        if (Debug.isDebuggerConnected()) {
                            debuggerWasConnected = 2;
                        }
                        try {
                            wait(timeout);
                        } catch (InterruptedException e) {
                            Log.wtf(TAG, e);
                        }
                        if (Debug.isDebuggerConnected()) {
                            debuggerWasConnected = 2;
                        }
                        timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                    }

                    boolean fdLimitTriggered = false;
                    if (mOpenFdMonitor != null) {
                        fdLimitTriggered = mOpenFdMonitor.monitor();
                    }

                    if (!fdLimitTriggered) {
                        final int waitState = evaluateCheckerCompletionLocked();
                        if (waitState == COMPLETED) {
                            // Thread state is normal; poll again
                            // The monitors have returned; reset
                            waitedHalf = false;
                            continue;
                        } else if (waitState == WAITING) {
                            // Blocked, but for less than 30s; keep monitoring
                            // still waiting but within their configured intervals; back off and recheck
                            continue;
                        } else if (waitState == WAITED_HALF) {
                            // Blocked for more than 30s; dump some system information,
                            // then keep monitoring for another 30s
                            if (!waitedHalf) {
                                // We've waited half the deadlock-detection interval.  Pull a stack
                                // trace and wait another half.
                                ArrayList<Integer> pids = new ArrayList<Integer>();
                                pids.add(Process.myPid());
                                ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                        getInterestingNativePids());
                                waitedHalf = true;
                            }
                            continue;
                        }

                        // something is overdue!
                        blockedCheckers = getBlockedCheckersLocked();
                        subject = describeCheckersLocked(blockedCheckers);
                    } else {
                        blockedCheckers = Collections.emptyList();
                        subject = "Open FD high water mark reached";
                    }
                    allowRestart = mAllowRestart;
                }

                // If we got here, that means that the system is most likely hung.
                // First collect stack traces from all threads of the system process.
                // Then kill this process so that the system will restart.
                EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

                ArrayList<Integer> pids = new ArrayList<>();
                pids.add(Process.myPid());
                if (mPhonePid > 0) pids.add(mPhonePid);
                // Pass !waitedHalf so that just in case we somehow wind up here without having
                // dumped the halfway stacks, we properly re-initialize the trace file.
                final File stack = ActivityManagerService.dumpStackTraces(
                        !waitedHalf, pids, null, null, getInterestingNativePids());

                // Give some extra time to make sure the stack traces get written.
                // The system's been hanging for a minute, another second or two won't hurt much.
                SystemClock.sleep(2000);

                // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
                doSysRq('w');
                doSysRq('l');

                // Try to add the error to the dropbox, but assuming that the ActivityManager
                // itself may be deadlocked.  (which has happened, causing this statement to
                // deadlock and the watchdog as a whole to be ineffective)
                Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, stack, null);
                    }
                };
                dropboxThread.start();
                try {
                    dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
                } catch (InterruptedException ignored) {}

                IActivityController controller;
                synchronized (this) {
                    controller = mController;
                }
                if (controller != null) {
                    Slog.i(TAG, "Reporting stuck state to activity controller");
                    try {
                        Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                        // 1 = keep waiting, -1 = kill system
                        int res = controller.systemNotResponding(subject);
                        if (res >= 0) {
                            Slog.i(TAG, "Activity controller requested to coninue to wait");
                            waitedHalf = false;
                            continue;
                        }
                    } catch (RemoteException e) {
                    }
                }

                // Only kill the process if the debugger is not attached.
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                if (debuggerWasConnected >= 2) {
                    Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
                } else if (debuggerWasConnected > 0) {
                    Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
                } else if (!allowRestart) {
                    Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
                } else {
                    Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                    WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
                    Slog.w(TAG, "*** GOODBYE!");
                    Process.killProcess(Process.myPid());
                    System.exit(10);
                }

                waitedHalf = false;
            }
        }
  • run() is an endless loop: it repeatedly walks all HandlerCheckers, calls their check method, waits thirty seconds, and then evaluates the state.

  • Walk all HandlerCheckers and call scheduleCheckLocked on each, which records the start time

        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            hc.scheduleCheckLocked();
        }
  • Wait 30 seconds

        // Wait 30 seconds.
        // uptimeMillis is used so that time spent asleep is not counted;
        // while the phone sleeps, the system services sleep as well.
        long start = SystemClock.uptimeMillis();
        while (timeout > 0) {
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            try {
                wait(timeout);
            } catch (InterruptedException e) {
                Log.wtf(TAG, e);
            }
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
        }
  • Evaluate the checkers' state: this walks all HandlerCheckers and takes the largest return value.
    The largest return value is one of four cases:

    • COMPLETED: the message has been handled; the thread is not blocked
    • WAITING: handling has taken 0 to 29 seconds; keep running
    • WAITED_HALF: handling has taken 30 to 59 seconds; the thread may be blocked, so the current AMS stack state is saved and monitoring continues
    • OVERDUE: handling has taken 60 seconds or more; the process is about to be killed. Everything after this point deals with the timeout

        boolean fdLimitTriggered = false;
        if (mOpenFdMonitor != null) {
            fdLimitTriggered = mOpenFdMonitor.monitor();
        }

        if (!fdLimitTriggered) {
            final int waitState = evaluateCheckerCompletionLocked();
            if (waitState == COMPLETED) {
                // The monitors have returned; reset
                waitedHalf = false;
                continue;
            } else if (waitState == WAITING) {
                // still waiting but within their configured intervals; back off and recheck
                continue;
            } else if (waitState == WAITED_HALF) {
                if (!waitedHalf) {
                    // We've waited half the deadlock-detection interval.  Pull a stack
                    // trace and wait another half.
                    ArrayList<Integer> pids = new ArrayList<Integer>();
                    pids.add(Process.myPid());
                    ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            getInterestingNativePids());
                    waitedHalf = true;
                }
                continue;
            }

            // something is overdue!
            blockedCheckers = getBlockedCheckersLocked();
            subject = describeCheckersLocked(blockedCheckers);
        } else {
            blockedCheckers = Collections.emptyList();
            subject = "Open FD high water mark reached";
        }
  • fdMonitor

        public boolean monitor() {
            if (mFdHighWaterMark.exists()) {
                dumpOpenDescriptors();
                return true;
            }
            return false;
        }
  • Collect information

  • Kill the system process

        Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
        WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
        Slog.w(TAG, "*** GOODBYE!");
        Process.killProcess(Process.myPid());
        System.exit(10);
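The 30-second wait step in the list above recomputes the remaining time after every wake-up, so an early return from wait() does not shorten the interval. A plain-Java sketch of the same loop follows; `IntervalWait` is a hypothetical class, and System.nanoTime() is used here in place of SystemClock.uptimeMillis(), which exists only on Android.

```java
/** Sketch of a "wait the full interval" loop; System.nanoTime() replaces SystemClock.uptimeMillis(). */
final class IntervalWait {
    /** Blocks for at least intervalMs and returns the elapsed milliseconds. */
    static long waitFullInterval(Object lock, long intervalMs) {
        final long start = System.nanoTime();
        long elapsedMs = 0;
        synchronized (lock) {
            long remaining = intervalMs;
            while (remaining > 0) {
                try {
                    lock.wait(remaining); // may return early (notify or spurious wake-up)
                } catch (InterruptedException e) {
                    // like the loop in run(), keep waiting for the rest of the interval
                }
                elapsedMs = (System.nanoTime() - start) / 1_000_000L;
                remaining = intervalMs - elapsedMs;
            }
        }
        return elapsedMs;
    }
}
```

Calling `IntervalWait.waitFullInterval(new Object(), 50)` returns only after at least 50 ms have elapsed, however many times the inner wait() wakes up.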

    HandlerChecker::scheduleCheckLocked

    HandlerChecker::run

    Watchdog::evaluateCheckerCompletionLocked

    Evaluates the checkers' state: it walks all HandlerCheckers and returns the largest state value.

        private int evaluateCheckerCompletionLocked() {
            int state = COMPLETED; // COMPLETED = 0
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                state = Math.max(state, hc.getCompletionStateLocked());
            }
            return state;
        }
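Taking the maximum works because the state constants grow with severity, so the largest value is the state of the worst checker. A tiny sketch under that assumption (the class name `WorstState` and the 0 to 3 numbering are illustrative):

```java
/** Sketch: the overall state is the state of the worst checker. */
final class WorstState {
    /** States are assumed to be numbered 0 (COMPLETED) to 3 (OVERDUE). */
    static int worstOf(int... states) {
        int worst = 0; // COMPLETED
        for (int state : states) {
            worst = Math.max(worst, state);
        }
        return worst;
    }
}
```

A single overdue checker is therefore enough to make the whole evaluation overdue, no matter how many other checkers have completed.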

    HandlerChecker::getCompletionStateLocked

    Watchdog::getBlockedCheckersLocked

    Watchdog::describeCheckersLocked

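        private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
            ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                if (hc.isOverdueLocked()) {
                    checkers.add(hc);
                }
            }
            return checkers;
        }

        private String describeCheckersLocked(List<HandlerChecker> checkers) {
            StringBuilder builder = new StringBuilder(128);
            for (int i=0; i<checkers.size(); i++) {
                if (builder.length() > 0) {
                    builder.append(", ");
                }
                builder.append(checkers.get(i).describeBlockedStateLocked());
            }
            return builder.toString();
        }

  • These methods build the description of the blocked or deadlocked threads that is printed in the log.

    Note

    monitor() detects deadlocks between different threads, whereas the handler check detects whether a service's main thread is blocked. Posting a message (the sendMessage() approach) can therefore only tell whether the main thread is blocked; it cannot detect a deadlock. For example, if another thread holds a lock for a long time through the synchronized keyword and never releases it, a Message sent to the main thread is still handled by the main thread's Handler as long as the main thread is not itself waiting on that lock.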

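    How to trigger it

  • Blocked in monitor
    Hold the lock used by a Monitor implementation and never release it.
  • Blocked in handler
    Crash or stall in a Service's onCreate; if this lasts long enough, SystemServer is restarted.

    Logs when triggered

    Two kinds of log are common: "Blocked in handler" and "Blocked in monitor".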

    Blocked in handler

        11-15 06:56:39.696 24203 24902 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in handler on main thread (main), Blocked in handler on ui thread (android.ui)
        11-15 06:56:39.696 24203 24902 W Watchdog: main thread stack trace:
        11-15 06:56:39.696 24203 24902 W Watchdog: at android.os.MessageQueue.nativePollOnce(Native Method)
        11-15 06:56:39.696 24203 24902 W Watchdog: at android.os.MessageQueue.next(MessageQueue.java:323)
        11-15 06:56:39.696 24203 24902 W Watchdog: at android.os.Looper.loop(Looper.java:142)
        11-15 06:56:39.696 24203 24902 W Watchdog: at com.android.server.SystemServer.run(SystemServer.java:377)
        11-15 06:56:39.696 24203 24902 W Watchdog: at com.android.server.SystemServer.main(SystemServer.java:239)
        11-15 06:56:39.696 24203 24902 W Watchdog: at java.lang.reflect.Method.invoke(Native Method)
        11-15 06:56:39.696 24203 24902 W Watchdog: at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:901)
        11-15 06:56:39.696 24203 24902 W Watchdog: at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:791)
        11-15 06:56:39.696 24203 24902 W Watchdog: ui thread stack trace:
        ......

    Blocked in monitor

        10-26 00:07:00.884 1000 17132 17312 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg)
        10-26 00:07:00.884 1000 17132 17312 W Watchdog: foreground thread stack trace:
        10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Binder.blockUntilThreadAvailable(Native Method)
        10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$BinderThreadMonitor.monitor(Watchdog.java:381)
        10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:353)
        10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Handler.handleCallback(Handler.java:873)
        10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:99)
        10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Looper.loop(Looper.java:193)
        10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)
        10-26 00:07:00.886 1000 17132 17312 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:44)
        10-26 00:07:00.886 1000 17132 17312 W Watchdog: *** GOODBYE!
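When triaging such logs it helps to pull the blocked checkers out of the "KILLING SYSTEM PROCESS" line. The helper below is a hypothetical sketch, not part of Android: the class `WatchdogLogParser` and its regular expression are assumptions based only on the message format shown in the two logs above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for reading the "KILLING SYSTEM PROCESS" line; not part of Android. */
final class WatchdogLogParser {
    static final class Entry {
        final String monitorClass; // null for "Blocked in handler"
        final String checkerName;  // e.g. "main thread"
        final String threadName;   // e.g. "android.fg"

        Entry(String monitorClass, String checkerName, String threadName) {
            this.monitorClass = monitorClass;
            this.checkerName = checkerName;
            this.threadName = threadName;
        }
    }

    // Matches "Blocked in handler on <checker> (<thread>)" and
    // "Blocked in monitor <class> on <checker> (<thread>)".
    private static final Pattern ENTRY = Pattern.compile(
            "Blocked in (?:handler|monitor (\\S+)) on (.+?) \\(([^)]*)\\)");

    static List<Entry> parse(String logLine) {
        List<Entry> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(logLine);
        while (m.find()) {
            entries.add(new Entry(m.group(1), m.group(2), m.group(3)));
        }
        return entries;
    }
}
```

For the first log above the parser yields two handler entries (main thread and ui thread); for the second it yields one monitor entry whose `monitorClass` is the BinderThreadMonitor class name. The listed stack trace of the named thread is then the place to look for what was holding things up.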


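    Watchdog log analysis

    After Watchdog determines that a SystemServer thread is deadlocked or that a message queue is blocked, it collects and prints diagnostic information. The code is in the run function: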

  • Write the event log

        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
  • Dump stack traces

        ArrayList<Integer> pids = new ArrayList<>();
        pids.add(Process.myPid());
        if (mPhonePid > 0) pids.add(mPhonePid);
        // Pass !waitedHalf so that just in case we somehow wind up here without having
        // dumped the halfway stacks, we properly re-initialize the trace file.
        final File stack = ActivityManagerService.dumpStackTraces(
                !waitedHalf, pids, null, null, getInterestingNativePids());

        // Give some extra time to make sure the stack traces get written.
        // The system's been hanging for a minute, another second or two won't hurt much.
        SystemClock.sleep(2000);
  • Dump kernel info

        // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
        doSysRq('w');
        doSysRq('l');
  • Collect dropbox information

        // Try to add the error to the dropbox, but assuming that the ActivityManager
        // itself may be deadlocked.  (which has happened, causing this statement to
        // deadlock and the watchdog as a whole to be ineffective)
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
            public void run() {
                mActivity.addErrorToDropBox(
                        "watchdog", null, "system_server", null, null,
                        subject, null, stack, null);
            }
        };
        dropboxThread.start();
        try {
            dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
        } catch (InterruptedException ignored) {}
  • Kill the system process: unless a debugger is attached, Watchdog kills its own process

        // Only kill the process if the debugger is not attached.
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
            Slog.w(TAG, "*** GOODBYE!");
            Process.killProcess(Process.myPid());
            System.exit(10);
        }
  • The dalvik.vm.stack-trace-dir property

    It refers to /data/anr.

        final String tracesDirProp = SystemProperties.get("dalvik.vm.stack-trace-dir", "");
