西安移动网站建设,上海网站建设集中,蜜芽tv跳转接口点击进入网页,做餐饮培训网站广告背景 
本文基于 
Spark 3.1.1    
open-jdk-1.8.0.352目前在排查 Spark 任务的时候#xff0c;遇到了一个很奇怪的问题#xff0c;在此记录一下。 
现象描述 
一个 Spark Application, Driver端的内存为 5GB,一直以来都是能正常调度运行#xff0c;突然有一天#xff0c;报…背景 
本文基于 
Spark 3.1.1    
open-jdk-1.8.0.352目前在排查 Spark 任务的时候遇到了一个很奇怪的问题在此记录一下。 
现象描述 
一个 Spark Application, Driver端的内存为 5GB,一直以来都是能正常调度运行突然有一天报错了 
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange hashpartitioning(user_lable_id#530L, 500), ENSURE_REQUIREMENTS, [id#1564]
- *(16) Project [xxx]- *(16) BroadcastHashJoin ...- *(14) ColumnarToRow- FileScan parquet xxxat org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:169)at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)at org.apache.spark.sql.execution.InputAdapter.inputRDD(WholeStageCodegenExec.scala:525)at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs(WholeStageCodegenExec.scala:453)at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs$(WholeStageCodegenExec.scala:452)at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:496)at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:132)at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:746)at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)at org.apache.spark.sql.execution.InputAdapter.doExecute(WholeStageCodegenExec.scala:511)at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)at org.apache.spark.sql.execution.joins.SortMergeJoinExec.inputRDDs(SortMergeJoinExec.scala:378)at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:50)at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:746)at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD$lzycompute(ShuffleExchangeExec.scala:123)at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD(ShuffleExchangeExec.scala:123)at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.shuffleDependency$lzycompute(ShuffleExchangeExec.scala:157)at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.shuffleDependency(ShuffleExchangeExec.scala:155)at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:172)at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)... 291 more
Caused by: java.util.concurrent.ExecutionException: org.apache.spark.util.SparkFatalExceptionat java.util.concurrent.FutureTask.report(FutureTask.java:122)at java.util.concurrent.FutureTask.get(FutureTask.java:206)at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:199)at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:515)at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeBroadcast$1(SparkPlan.scala:193)at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:189)at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:203)at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareRelation(BroadcastHashJoinExec.scala:217)at org.apache.spark.sql.execution.joins.HashJoin.codegenOuter(HashJoin.scala:497)at org.apache.spark.sql.execution.joins.HashJoin.codegenOuter$(HashJoin.scala:496)at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:40)at org.apache.spark.sql.execution.joins.HashJoin.doConsume(HashJoin.scala:352)at org.apache.spark.sql.execution.joins.HashJoin.doConsume$(HashJoin.scala:349)at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:40)at org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:194)at org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:149)at org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:41)at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:87)at org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:194)at org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:149)at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.consume(BroadcastHashJoinExec.scala:40)at org.apache.spark.sql.execution.joins.HashJoin.codegenOuter(HashJoin.scala:542)at org.apache.spark.sql.execution.joins.HashJoin.codegenOuter$(HashJoin.scala:496)at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:40)at org.apache.spark.sql.execution.joins.HashJoin.doConsume(HashJoin.scala:352)at org.apache.spark.sql.execution.joins.HashJoin.doConsume$(HashJoin.scala:349)at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:40)at org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:194)at org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:149)at org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:41)at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:87)at org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:194)at org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:149)at org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:496)at org.apache.spark.sql.execution.InputRDDCodegen.doProduce(WholeStageCodegenExec.scala:483)at org.apache.spark.sql.execution.InputRDDCodegen.doProduce$(WholeStageCodegenExec.scala:456)at org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:496)at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:95)at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)at org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:90)at org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:90)at org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:496)at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:54)at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:95)at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)at org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:90)at org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:90)at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:41)at org.apache.spark.sql.execution.joins.HashJoin.doProduce(HashJoin.scala:346)at org.apache.spark.sql.execution.joins.HashJoin.doProduce$(HashJoin.scala:345)at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:40)at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:95)at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)at org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:90)at org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:90)at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.produce(BroadcastHashJoinExec.scala:40)at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:54)at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:95)at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)at org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:90)at org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:90)at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:41)at org.apache.spark.sql.execution.joins.HashJoin.doProduce(HashJoin.scala:346)at org.apache.spark.sql.execution.joins.HashJoin.doProduce$(HashJoin.scala:345)at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:40)at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:95)at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)at org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:90)at org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:90)at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.produce(BroadcastHashJoinExec.scala:40)at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:54)at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:95)at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)at org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:90)at org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:90)at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:41)at org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:655)at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:718)at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD$lzycompute(ShuffleExchangeExec.scala:123)at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD(ShuffleExchangeExec.scala:123)at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.shuffleDependency$lzycompute(ShuffleExchangeExec.scala:157)at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.shuffleDependency(ShuffleExchangeExec.scala:155)at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:172)at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)... 328 more
Caused by: org.apache.spark.util.SparkFatalExceptionat org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:173)at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:190)at java.util.concurrent.FutureTask.run(FutureTask.java:266)at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)at java.lang.Thread.run(Thread.java:750)注意处于安全考虑本文隐藏了具体的物理执行计划 对于一个在大数据行业摸爬滚打了多年的老手来说第一眼肯定是跟着堆栈信息进行排查 理所当然的就是会找到BroadcastExchangeExec这个类但是就算把代码全看一遍也不会有所发现。 
蓦然回首 
这个问题折腾了我大约2个小时错误发生的上下文都看了不止十遍了还是没找到一丝头绪可能是上帝的旨意在离错误不到50行的地方刚好是一个页面的距离发现了以下错误 
53.024: [Full GC (Ergonomics) [PSYoungGen: 802227K-698101K(1191424K)] [ParOldGen: 3085945K-3085781K(3495424K)] 3888173K-3783883K(4686848K), [Metaspace: 135862K-135862K(1185792K)], 0.9651630 secs] [Times: user25.51 sys0.39, real0.96 secs] 
53.990: [Full GC (Allocation Failure) [PSYoungGen: 698101K-698047K(1191424K)] [ParOldGen: 3085781K-3079721K(3495424K)] 3783883K-3777769K(4686848K), [Metaspace: 135862K-134900K(1185792K)], 0.6236139 secs] [Times: user14.05 sys0.28, real0.63 secs] 
java.lang.OutOfMemoryError: Java heap space
Dumping heap to panda_dump ...
Heap dump file created [3938522340 bytes in 5.708 secs]真是 众人寻他千百度蓦然回首 没想到是 OOM 问题。 
结论 
在查找错误的时候还是得在错误的上下文中多翻几页。