使用混沌工程进行自动化故障演练。

配置

<dependency>
    <groupId>de.codecentric</groupId>
    <artifactId>chaos-monkey-spring-boot</artifactId>
    <version>2.2.0</version>
</dependency>

配置类详见：

ChaosMonkeyProperties
AssaultProperties
WatcherProperties

一个demo如下：

chaos:
  monkey:
    enabled: true
    assaults:
      level: 10
      latencyRangeStart: 500
      latencyRangeEnd: 10000
      exceptionsActive: true
      killApplicationActive: false
    watcher:
      repository: true
      restController: true

其中level用于判断是否触发request级别攻击，见ChaosMonkeyRequestScope：

private boolean isTrouble() {
    return chaosMonkeySettings.getAssaultProperties().getTroubleRandom() >= chaosMonkeySettings.getAssaultProperties().getLevel();
}

level取值区间为[1, 10000]。

Assault 攻击类型

ChaosMonkeyAssault是攻击类型的抽象，包含2个子类：

ChaosMonkeyRuntimeAssault：运行时攻击，例如退出程序、内存飙升。
ChaosMonkeyRequestAssault：请求级别攻击，例如延迟、请求异常。

public interface ChaosMonkeyAssault {
    boolean isActive();

    // 攻击方法的实现
    void attack();
}

KillAppAssault

直接退出程序。

System.exit(exit)

MemoryAssault

消耗可用内存。方法很简单，往Vector填充byte[]数组，然后停顿一个间隔，再触发gc。

    private void eatFreeMemory() {
        @SuppressWarnings("MismatchedQueryAndUpdateOfCollection")
        Vector<byte[]> memoryVector = new Vector<>();

        long stolenMemoryTotal = 0L;

        while (isActive()) {
            // overview of memory methods in java https://stackoverflow.com/a/18375641
            long freeMemory = runtime.freeMemory();
            long usedMemory = runtime.totalMemory() - freeMemory;

            if (cannotAllocateMoreMemory()) {
                LOGGER.debug("Cannot allocate more memory");
                break;
            }

            LOGGER.debug("Used memory in bytes: " + usedMemory);

            stolenMemoryTotal = stealMemory(memoryVector, stolenMemoryTotal, getBytesToSteal());
            waitUntil(settings.getAssaultProperties().getMemoryMillisecondsWaitNextIncrease());
        }

        // Hold memory level and cleanUp after, only if experiment is running
        if (isActive()) {
            LOGGER.info("Memory fill reached, now sleeping and holding memory");
            waitUntil(settings.getAssaultProperties().getMemoryMillisecondsHoldFilledMemory());
        }

        // clean Vector
        memoryVector.clear();
        // quickly run gc for reuse
        runtime.gc();

        long stolenAfterComplete = MemoryAssault.stolenMemory.addAndGet(-stolenMemoryTotal);
        metricEventPublisher.publishMetricEvent(MetricType.MEMORY_ASSAULT_MEMORY_STOLEN, stolenAfterComplete);
    }

    private long stealMemory(Vector<byte[]> memoryVector, long stolenMemoryTotal,
                             int bytesToSteal) {
        memoryVector.add(createDirtyMemorySlice(bytesToSteal));

        stolenMemoryTotal += bytesToSteal;
        long newStolenTotal = MemoryAssault.stolenMemory.addAndGet(bytesToSteal);
        metricEventPublisher.publishMetricEvent(MetricType.MEMORY_ASSAULT_MEMORY_STOLEN, newStolenTotal);
        LOGGER.debug("Chaos Monkey - memory assault increase, free memory: " + SizeConverter.toMegabytes(runtime
                .freeMemory()));

        return stolenMemoryTotal;
    }

    private byte[] createDirtyMemorySlice(int size) {
        byte[] b = new byte[size];
        for (int idx = 0; idx < size; idx += 4096) { // 4096
            // is commonly the size of a memory page, forcing a commit
            b[idx] = 19;
        }

        return b;
    }

LatencyAssault

给请求RT增加延迟。如果不配置延迟的实践区间，则使用随机数。

    public void attack() {
        LOGGER.debug("Chaos Monkey - timeout");

        atomicTimeoutGauge.set(determineLatency());

        // metrics
        if (metricEventPublisher != null) {
            metricEventPublisher.publishMetricEvent(MetricType.LATENCY_ASSAULT);
            metricEventPublisher.publishMetricEvent(MetricType.LATENCY_ASSAULT, atomicTimeoutGauge);
        }

        assaultExecutor.execute(atomicTimeoutGauge.get());
    }

    private int determineLatency() {
        final int latencyRangeStart =
                settings.getAssaultProperties().getLatencyRangeStart();
        final int latencyRangeEnd =
                settings.getAssaultProperties().getLatencyRangeEnd();

        if (latencyRangeStart == latencyRangeEnd) {
            return latencyRangeStart;
        } else {
            return ThreadLocalRandom.current().nextInt(latencyRangeStart,
                    latencyRangeEnd);
        }
    }

请求的执行委派给ChaosMonkeyLatencyAssaultExecutor。实际上就是Thread.sleep()。

public class LatencyAssaultExecutor implements ChaosMonkeyLatencyAssaultExecutor {
    @Override
    public void execute(long durationInMillis) {
        try {
            Thread.sleep(durationInMillis);
        } catch (InterruptedException e) {
            // do nothing
        }
    }
}

ExceptionAssault

抛出指定的异常。

public void attack() {
    LOGGER.info("Chaos Monkey - exception");
    AssaultException assaultException = this.settings.getAssaultProperties().getException();
    // metrics
    if (metricEventPublisher != null)
        metricEventPublisher.publishMetricEvent(MetricType.EXCEPTION_ASSAULT);
    assaultException.throwExceptionInstance();
}

metrics

对接io.micrometer，每个攻击都会发送metrics，很方便在dashboard观察攻击效果。这里使用spring的事件机制。 MetricEvent转换为ApplicationEvent事件。

public class MetricEvent extends ApplicationEvent {
    private final MetricType metricType;
    private final double metricValue;
    private final String methodSignature;
    private final String[] tags;

    public MetricEvent(Object source, MetricType metricType, long metricValue, String methodSignature, String... tags) {
        super(source);
        this.metricType = metricType;
        this.tags = tags;
        this.methodSignature = methodSignature;
        this.metricValue = metricValue;
    }
}

MetricEventPublisher向spring容器发送事件。

public class MetricEventPublisher implements ApplicationEventPublisherAware {
    private ApplicationEventPublisher publisher;

}

每个攻击开始，都手动发送事件：

public void attack() {
    LOGGER.info("Chaos Monkey - exception");
    AssaultException assaultException = this.settings.getAssaultProperties().getException();
    // metrics
    if (metricEventPublisher != null)
        metricEventPublisher.publishMetricEvent(MetricType.EXCEPTION_ASSAULT);
    assaultException.throwExceptionInstance();
}

开启actuator端点，就可以在dashboard观察效果。

management:
  endpoint:
    chaosmonkey:
      enabled: true
  endpoints:
    web:
      exposure:
        include: health,info,chaosmonkey

控制端点

提供了jmx和rest两种方式：

ChaosMonkeyJmxEndpoint
ChaosMonkeyRestEndpoint

切面

通过LTW方式开启切面。

@Configuration
@Profile("chaos-monkey")
@EnableLoadTimeWeaving(aspectjWeaving= EnableLoadTimeWeaving.AspectJWeaving.ENABLED)
public class ChaosMonkeyLoadTimeWeaving extends LoadTimeWeavingConfiguration {
    @Override
    public LoadTimeWeaver loadTimeWeaver() {
        return new ReflectiveLoadTimeWeaver();
    }
}

watcher包定义了几个切面，就不展开了。