A Service That Went Down for No Apparent Reason

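A scheduled-job service went down for no apparent reason, and its own logs contained no error entries at all.
First, report to your manager
Restore the service
Investigate the cause
Add monitoring for the service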

The service went down

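Because the service had no monitoring, nobody noticed when it went down. We only found out through anomalies in the statistics data, and then immediately went to check the logs.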

  • Strangely, the application logs contained no error entries at all, which made troubleshooting much harder.

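The next step was to look at the JVM fatal error log, hs_err_pid*.log, which records the JVM crash information. Analyzing this file lets us pinpoint what caused the JVM to crash, so we can fix it and keep the system stable.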
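
When the file is large, it can help to print only the summary at the top first. The following is a minimal sketch that assumes the summary is the leading run of lines starting with `#`, as in the log shown in this post. The class name `HsErrHeader` and the command-line argument are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class HsErrHeader {

    /** Returns the leading '#' lines of an hs_err file, without the '#' prefix. */
    public static List<String> header(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (!line.startsWith("#")) {
                break; // the summary ends at the first line that is not a '#' line
            }
            String text = line.substring(1).trim();
            if (!text.isEmpty()) {
                out.add(text); // skip the bare "#" separator lines
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to an hs_err_pid*.log file (supplied by the caller)
        header(Files.readAllLines(Paths.get(args[0]))).forEach(System.out::println);
    }
}
```

By default HotSpot writes the file into the working directory of the crashed process, so that is usually the first place to look.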

Log header

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# Possible reasons:
# The system is out of physical RAM or swap space
# In 32 bit mode, the process size limit was hit
# Possible solutions:
# Reduce memory load on the system
# Increase physical memory or swap space
# Check if swap backing store is full
# Use 64 bit Java on a 64 bit OS
# Decrease Java heap size (-Xmx/-Xms)
# Decrease number of Java threads
# Decrease Java thread stack sizes (-Xss)
# Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
# Out of Memory Error (os_linux.cpp:2640), pid=114181, tid=0x00007f9340e91700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_171-b11) (build 1.8.0_171-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.171-b11 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
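  • Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
    • Reduce the thread stack size.
    • Be careful once the thread count reaches roughly 3000 to 5000. On 64-bit Linux the JVM's default thread stack size (-Xss) is 1024 KB, so a large number of threads can exhaust native virtual memory. In practice a stack of 128 KB or 256 KB is usually enough, so it helps to set the size explicitly with -Xss, for example -Xss256k.
  • Out of Memory Error (os_linux.cpp:2640), pid=114181, tid=0x00007f9340e91700
    • The log header clearly shows an Out of Memory Error: the process could not get more native memory.
    • Options for 64-bit Linux
      • Decrease the Java heap size (-Xmx / -Xms)
      • Decrease the number of Java threads (starting from what the business logic actually needs)
      • Decrease the Java thread stack size (-Xss)
      • Set a larger code cache with -XX:ReservedCodeCacheSize=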
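
To get a feel for why the stack size matters, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes every thread reserves its full stack as virtual memory and ignores guard pages and other per-thread overhead. The class and method names are made up for illustration.

```java
public class ThreadStackEstimate {

    /** Virtual memory reserved for thread stacks, in bytes (rough estimate). */
    public static long reservedStackBytes(int threadCount, int stackSizeKb) {
        return (long) threadCount * stackSizeKb * 1024L;
    }

    public static void main(String[] args) {
        int threads = 5000; // upper end of the range mentioned above
        for (int stackKb : new int[] {1024, 256, 128}) {
            long mb = reservedStackBytes(threads, stackKb) / (1024 * 1024);
            System.out.println(threads + " threads x " + stackKb + " KB = " + mb + " MB");
        }
        // prints:
        // 5000 threads x 1024 KB = 5000 MB
        // 5000 threads x 256 KB = 1250 MB
        // 5000 threads x 128 KB = 625 MB
    }
}
```

Stack memory is reserved up front but committed lazily, so these numbers describe address space rather than resident memory.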

Thread stack information

---------------  P R O C E S S  ---------------

Java Threads: ( => current thread )
=>0x00007f9b9447d800 JavaThread "pool-32458-thread-1" [_thread_new, id=2337, stack(0x00007f9340d91000,0x00007f9340e92000)]
0x00007f9b8c471000 JavaThread "pool-32456-thread-1" [_thread_blocked, id=2336, stack(0x00007f932a62b000,0x00007f932a72c000)]
0x00007f9ba44a4000 JavaThread "pool-32455-thread-1" [_thread_blocked, id=2330, stack(0x00007f932a72c000,0x00007f932a82d000)]
0x00007f9b745ed800 JavaThread "pool-32454-thread-1" [_thread_blocked, id=2319, stack(0x00007f932a82d000,0x00007f932a92e000)]
0x00007f9b7862a000 JavaThread "pool-32453-thread-1" [_thread_blocked, id=2318, stack(0x00007f932a92e000,0x00007f932aa2f000)]
0x00007f9b6c5cd800 JavaThread "pool-32452-thread-1" [_thread_blocked, id=2302, stack(0x00007f932aa2f000,0x00007f932ab30000)]
0x00007f9b98bf0000 JavaThread "pool-32451-thread-1" [_thread_blocked, id=2297, stack(0x00007f932ab30000,0x00007f932ac31000)]
0x00007f9b44633000 JavaThread "Keep-Alive-Timer" daemon [_thread_blocked, id=2285, stack(0x00007f9330e93000,0x00007f9330f94000)]
0x00007f9b6450b000 JavaThread "pool-32450-thread-1" [_thread_blocked, id=2187, stack(0x00007f932ac31000,0x00007f932ad32000)]
0x00007f9b9447b000 JavaThread "pool-32449-thread-1" [_thread_blocked, id=2159, stack(0x00007f932ad32000,0x00007f932ae33000)]
0x00007f9b8c46f000 JavaThread "pool-32448-thread-1" [_thread_blocked, id=2100, stack(0x00007f932ae33000,0x00007f932af34000)]
0x00007f9b8059b800 JavaThread "pool-32447-thread-1" [_thread_blocked, id=2068, stack(0x00007f932af34000,0x00007f932b035000)]
0x00007f9ba44a2000 JavaThread "pool-32446-thread-1" [_thread_blocked, id=1895, stack(0x00007f932b035000,0x00007f932b136000)]
0x00007f9b745eb000 JavaThread "pool-32445-thread-1" [_thread_blocked, id=1865, stack(0x00007f932b136000,0x00007f932b237000)]
0x00007f9b78628000 JavaThread "pool-32444-thread-1" [_thread_blocked, id=1864, stack(0x00007f932b237000,0x00007f932b338000)]
0x00007f9b6c5cb800 JavaThread "pool-32443-thread-1" [_thread_blocked, id=1854, stack(0x00007f932b338000,0x00007f932b439000)]
0x00007f9b98bed800 JavaThread "pool-32442-thread-1" [_thread_blocked, id=1850, stack(0x00007f932b439000,0x00007f932b53a000)]
0x00007f9b64508800 JavaThread "pool-32441-thread-1" [_thread_blocked, id=1849, stack(0x00007f932b53a000,0x00007f932b63b000)]
0x00007f9b94479000 JavaThread "pool-32440-thread-1" [_thread_blocked, id=1835, stack(0x00007f932b63b000,0x00007f932b73c000)]
0x00007f9b8c46d000 JavaThread "pool-32439-thread-1" [_thread_blocked, id=1832, stack(0x00007f932b73c000,0x00007f932b83d000)]
0x00007f9b80599000 JavaThread "pool-32438-thread-1" [_thread_blocked, id=1729, stack(0x00007f932b83d000,0x00007f932b93e000)]
0x00007f9ba449f800 JavaThread "pool-32437-thread-1" [_thread_blocked, id=1657, stack(0x00007f932b93e000,0x00007f932ba3f000)]
0x00007f9b78625800 JavaThread "pool-32436-thread-1" [_thread_blocked, id=1412, stack(0x00007f932ba3f000,0x00007f932bb40000)]
0x00007f9b54782000 JavaThread "pool-32435-thread-1" [_thread_blocked, id=1183, stack(0x00007f932bb40000,0x00007f932bc41000)]
0x00007f9b486df800 JavaThread "pool-32434-thread-1" [_thread_blocked, id=1182, stack(0x00007f932bc41000,0x00007f932bd42000)]
0x00007f9b44631000 JavaThread "pool-2-thread-16487" [_thread_blocked, id=1180, stack(0x00007f932bd42000,0x00007f932be43000)]
0x00007f9b4462f000 JavaThread "pool-2-thread-16486" [_thread_blocked, id=1177, stack(0x00007f932be43000,0x00007f932bf44000)]
0x0000000001d29800 JavaThread "pool-32433-thread-1" [_thread_blocked, id=1176, stack(0x00007f932bf44000,0x00007f932c045000)]
0x00007f9c5458a800 JavaThread "pool-32432-thread-1" [_thread_blocked, id=1175, stack(0x00007f932c045000,0x00007f932c146000)]
0x00007f9b4462d000 JavaThread "pool-2-thread-16485" [_thread_blocked, id=1174, stack(0x00007f932c146000,0x00007f932c247000)]
0x00007f9c4465c800 JavaThread "pool-32431-thread-1" [_thread_blocked, id=1173, stack(0x00007f932c247000,0x00007f932c348000)]
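  • The Java thread list is full of thread-pool threads in the blocked state, with pool numbers running as high as 32458. This is the root cause: a new pool was created every time the task ran.
  • The Java thread pool was misused: every use created a brand-new pool with newSingleThreadScheduledExecutor.
  • Each new pool really does consume off-heap memory. I did not trace this down to the lowest level, but a thread pool manages threads, and JVM threads have to request thread resources from the OS. On Linux a thread is a lightweight process, so every thread that is created consumes OS resources, which from the point of view of the JVM heap is off-heap memory. Inside the JVM, because the pools are never released, the old generation also grows slowly, though far less dramatically than the off-heap memory; once the old generation reaches 100%, the JVM itself reports an out-of-memory error. At the OS level, so much VIRT is occupied that even a simple top sometimes fails to run because it cannot allocate memory.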
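
The misuse is easy to reproduce on a small scale. The sketch below is not the service's actual code, just an illustration of the pattern: each task gets its own newSingleThreadScheduledExecutor that is never shut down, so its worker thread stays alive. The class and method names are made up, and the pools are returned only so that the demo can clean up and exit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ExecutorLeakDemo {

    /** Mimics the misuse: a fresh single-thread scheduler per task, no shutdown(). */
    public static ScheduledExecutorService runTaskLeaky() {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        try {
            pool.schedule(() -> { }, 0, TimeUnit.MILLISECONDS).get(); // wait for the task
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        // The buggy pattern stops here and simply drops the reference.
        return pool;
    }

    /** Counts live threads that carry the default "pool-N-thread-M" name. */
    public static long livePoolThreads() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.isAlive() && t.getName().matches("pool-\\d+-thread-\\d+"))
                .count();
    }

    public static void main(String[] args) {
        long before = livePoolThreads();
        List<ScheduledExecutorService> pools = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            pools.add(runTaskLeaky());
        }
        System.out.println("pool threads left behind: " + (livePoolThreads() - before));
        pools.forEach(ScheduledExecutorService::shutdown); // clean up so the JVM can exit
    }
}
```

Every task has already finished, yet each pool still holds one idle worker thread named pool-N-thread-1, which matches the pattern in the thread list above.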

[hs_err_pid file]

Fixes

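  • A thread pool must be shut down with shutdown() once it is no longer needed.
  • Avoid creating new thread pools over and over.
  • The server has 16 GB of memory in total. This service was started with 2 GB; its maximum memory was raised to 3 GB, and the thread stack size was set to 256 KB.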
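
The first two points can be sketched as follows. This is one possible shape, not the code that was actually deployed: a single scheduler shared by the whole application, plus a try/finally for the rare case where a short-lived pool really is needed. The class name `SharedScheduler` and its methods are made up for illustration.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class SharedScheduler {

    // One scheduler for the whole application, created once and reused.
    private static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor();

    /** Schedules a task on the shared scheduler instead of creating a new pool. */
    public static ScheduledFuture<?> runLater(Runnable task, long delayMs) {
        return SCHEDULER.schedule(task, delayMs, TimeUnit.MILLISECONDS);
    }

    /** If a temporary pool is unavoidable, always shut it down in finally. */
    public static void runOnceWithTemporaryPool(Runnable task) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        try {
            pool.schedule(task, 0, TimeUnit.MILLISECONDS).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown(); // releases the worker thread even if the task failed
        }
    }

    /** Call once when the application stops. */
    public static void shutdown() {
        SCHEDULER.shutdown();
    }

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 100; i++) {
            runLater(() -> { }, 0).get(); // all 100 tasks reuse the same worker thread
        }
        shutdown(); // release the worker thread so the JVM can exit
    }
}
```

With the shared scheduler the thread count stays constant no matter how many times the task runs.
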
Author: 陈 武
Article link: http://www.updatecg.xin/2020/06/20/服务无缘无故宕机/
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit 我的学习记录 when reposting.