Looking at the symptoms first, the error log (taken from the JobManager) reads:
[2019-02-16 09:18:50,218] INFO Diagnostics for container container_e31_1548733575161_1174_01_000003 in state COMPLETE : exitStatus=1 diagnostics=Exception from container-launch.
Container id: container_e31_1548733575161_1174_01_000003
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
    at org.apache.hadoop.util.Shell.run(Shell.java:507)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 1
The JobManager launches TaskManagers through the NodeManager, so I went to the NodeManager to inspect its logs. The relevant entries:
2019-02-16 09:19:32,065 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e31_1548733575161_1175_01_000004 transitioned from LOCALIZING to LOCALIZED
2019-02-16 09:19:32,078 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e31_1548733575161_1175_01_000004 transitioned from LOCALIZED to RUNNING
2019-02-16 09:19:32,079 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /data/yarn/nm/usercache/flink/appcache/application_1548733575161_1175/container_e31_1548733575161_1175_01_000004/default_container_executor.sh]
2019-02-16 09:19:32,166 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_e31_1548733575161_1175_01_000004
2019-02-16 09:19:32,181 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 18989 for container-id container_e31_1548733575161_1175_01_000004: 15.8 MB of 1 GB physical memory used; 1.7 GB of 2.1 GB virtual memory used
2019-02-16 09:19:32,194 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 2073 for container-id container_e31_1548733575161_0314_01_000003: 514.5 MB of 1 GB physical memory used; 2.0 GB of 2.1 GB virtual memory used
2019-02-16 09:19:32,202 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 5863 for container-id container_e31_1548733575161_0909_01_000001: 376.8 MB of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used
2019-02-16 09:19:34,381 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_e31_1548733575161_1175_01_000004 is : 1
2019-02-16 09:19:34,381 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_e31_1548733575161_1175_01_000004 and exit code: 1
ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
    at org.apache.hadoop.util.Shell.run(Shell.java:507)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
2019-02-16 09:19:34,381 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_e31_1548733575161_1175_01_000004
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Stack trace: ExitCodeException exitCode=1:
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.util.Shell.run(Shell.java:507)
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.lang.Thread.run(Thread.java:748)
2019-02-16 09:19:34,382 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 1
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e31_1548733575161_1175_01_000004 transitioned from RUNNING to EXITED_WITH_FAILURE
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_e31_1548733575161_1175_01_000004
2019-02-16 09:19:34,396 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /data/yarn/nm/usercache/flink/appcache/application_1548733575161_1175/container_e31_1548733575161_1175_01_000004
2019-02-16 09:19:34,396 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=flink OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1548733575161_1175 CONTAINERID=container_e31_1548733575161_1175_01_000004
2019-02-16 09:19:34,396 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e31_1548733575161_1175_01_000004 transitioned from EXITED_WITH_FAILURE to DONE
2019-02-16 09:19:34,396 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_e31_1548733575161_1175_01_000004 from application application_1548733575161_1175
2019-02-16 09:19:34,396 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_e31_1548733575161_1175_01_000004 for log-aggregation
2019-02-16 09:19:34,396 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1548733575161_1175
2019-02-16 09:19:34,396 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_e31_1548733575161_1175_01_000004
2019-02-16 09:19:35,204 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_e31_1548733575161_1175_01_000004
So the failure is confirmed here too, but none of this tells us clearly why the launch script exited with an error!
Then I remembered a friend once mentioned that deletion of container files can be delayed. A quick search turned up these articles:
https://blog.csdn.net/xiao_jun_0820/article/details/76081321
https://blog.csdn.net/wangming520liwei/article/details/78923216
According to those articles:

yarn.nodemanager.delete.debug-delay-sec (default: 0, meaning local files are deleted as soon as the application finishes) is the number of seconds the NodeManager's DeletionService waits before deleting an application's localized files and log directories after the application completes. To diagnose YARN application problems, set this property to a value large enough (for example 600 seconds, i.e. 10 minutes) to allow those directories to be inspected.
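As a sketch, enabling this delay would look roughly like the following yarn-site.xml fragment on each NodeManager (the 600-second value is just the example from the quote above, not a production recommendation; NodeManagers need a restart to pick it up):

```xml
<!-- yarn-site.xml: keep finished containers' local dirs and logs around
     for 10 minutes so launch scripts can be inspected after a failure.
     Default is 0, i.e. delete immediately on application completion. -->
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>600</value>
</property>
```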
PS: on log aggregation, see https://blog.csdn.net/lrf2454224026/article/details/82700129
Time for another attempt: I asked ops to redeploy with this setting. Now the TaskManager log is visible:
[2019-02-16 11:17:01,557] INFO Unable to start Queryable State Server. All ports in provided range ([9067]) are occupied. org.apache.flink.queryablestate.network.AbstractServerBase.start(AbstractServerBase.java:197)
[2019-02-16 11:17:01,557] INFO Shutting down Queryable State Server @ null org.apache.flink.queryablestate.network.AbstractServerBase.shutdownServer(AbstractServerBase.java:288)
[2019-02-16 11:17:01,558] INFO Queryable State Server was shutdown successfully. org.apache.flink.queryablestate.server.KvStateServerImpl.shutdown(KvStateServerImpl.java:107)
[2019-02-16 11:17:01,559] ERROR Error while starting up taskManager grizzled.slf4j.Logger.error(slf4j.scala:116)
java.io.IOException: Failed to start the Queryable State Data Server.
    at org.apache.flink.runtime.io.network.NetworkEnvironment.start(NetworkEnvironment.java:319)
    at org.apache.flink.runtime.taskexecutor.TaskManagerServices.fromConfiguration(TaskManagerServices.java:240)
    at org.apache.flink.runtime.taskmanager.TaskManager$.startTaskManagerComponentsAndActor(TaskManager.scala:2023)
    at org.apache.flink.runtime.taskmanager.TaskManager$.runTaskManager(TaskManager.scala:1854)
    at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$1.apply$mcV$sp(TaskManager.scala:1964)
    at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$1.apply(TaskManager.scala:1942)
    at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$1.apply(TaskManager.scala:1942)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.flink.runtime.akka.AkkaUtils$.retryOnBindException(AkkaUtils.scala:766)
    at org.apache.flink.runtime.taskmanager.TaskManager$.runTaskManager(TaskManager.scala:1942)
    at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1713)
    at org.apache.flink.runtime.taskmanager.TaskManager.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala)
    at org.apache.flink.yarn.YarnTaskManagerRunnerFactory$Runner.call(YarnTaskManagerRunnerFactory.java:70)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.yarn.YarnTaskManager$.main(YarnTaskManager.scala:78)
    at org.apache.flink.yarn.YarnTaskManager.main(YarnTaskManager.scala)
Caused by: org.apache.flink.util.FlinkRuntimeException: Unable to start Queryable State Server. All ports in provided range are occupied.
    at org.apache.flink.queryablestate.network.AbstractServerBase.start(AbstractServerBase.java:198)
    at org.apache.flink.queryablestate.server.KvStateServerImpl.start(KvStateServerImpl.java:95)
    at org.apache.flink.runtime.io.network.NetworkEnvironment.start(NetworkEnvironment.java:315)
    ... 18 more
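The root cause is now plain: the configured queryable-state port "range" contained a single port, 9067, and every TaskManager on the same machine tries to bind it, so all but the first fail. The collision is easy to reproduce with plain sockets (a generic sketch, not Flink code):

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to this port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

# Simulate the failure: one process already holds the only port in the
# configured range, so a second bind attempt on the same port must fail.
blocker = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
blocker.bind(("127.0.0.1", 0))          # let the OS pick a free port
occupied_port = blocker.getsockname()[1]
blocker.listen(1)

print(port_is_free(occupied_port))      # False while the blocker holds it
blocker.close()
print(port_is_free(occupied_port))      # True again after release
```

This is exactly why the fix below widens the range: with more candidate ports, each TaskManager instance on a host can pick a different one.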
---
Solution:
Add the following to flink-conf.yaml (reference: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/queryable_state.html#Configuration; the relevant class is org.apache.flink.configuration.QueryableStateOptions):
query.proxy.ports: 50100-50200,50300-59900,59999
query.server.ports: 50100-50200,50300-59900,59999
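These keys accept a comma-separated list of single ports and hyphenated ranges, so each TaskManager on a host gets thousands of candidate ports to try instead of just one. A rough sketch of how such a spec expands (a hypothetical helper for illustration, not Flink's actual parser):

```python
def expand_port_ranges(spec: str) -> list[int]:
    """Expand a Flink-style port spec like '50100-50200,59999' into ports."""
    ports = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ports.extend(range(int(lo), int(hi) + 1))  # ranges are inclusive
        else:
            ports.append(int(part))
    return ports

candidates = expand_port_ranges("50100-50200,50300-59900,59999")
print(len(candidates))                 # 9703 candidate ports instead of one
print(candidates[0], candidates[-1])   # 50100 59999
```

Compare this with the failing setup, where the spec was effectively a single port: expand_port_ranges("9067") yields just [9067], so a second TaskManager on the same machine has nothing left to bind.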