Spark on YARN


1、spark-env.sh

export SPARK_LOCAL_DIRS=/home/hadoop/spark/tmp
export SPARK_HOME=/usr/install/spark

2、spark-defaults.conf

// This requires spark.shuffle.service.enabled to be set. The following configurations are also relevant: spark.dynamicAllocation.minExecutors, spark.dynamicAllocation.maxExecutors, and spark.dynamicAllocation.initialExecutors
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.minExecutors 0
spark.dynamicAllocation.maxExecutors 20
spark.dynamicAllocation.executorIdleTimeout 120s
spark.dynamicAllocation.cachedExecutorIdleTimeout 1800s
spark.shuffle.service.port 7338
spark.shuffle.io.connectionTimeout 600s
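Dynamic allocation on YARN also needs the external shuffle service running on every NodeManager (org.apache.spark.network.yarn.YarnShuffleService registered as an auxiliary service in yarn-site.xml), so that shuffle output stays available after an idle executor is released. As a minimal sketch, the same settings can also be passed per job through spark-submit; the class and jar names below are placeholders:

# Illustrative only: enable dynamic allocation for a single submission
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=0 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --class com.example.MyApp \
  myapp.jar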

spark.yarn.jars hdfs://master:9000/user/yarn_jars/spark2.0/*
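spark.yarn.jars lets YARN reuse the Spark runtime jars from HDFS instead of uploading them on every submission, but the directory has to be populated first. A minimal sketch, assuming the jars come from the local Spark installation (paths are illustrative):

# Illustrative: upload the local Spark jars once so YARN containers can use them
hdfs dfs -mkdir -p /user/yarn_jars/spark2.0
hdfs dfs -put $SPARK_HOME/jars/*.jar /user/yarn_jars/spark2.0/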

spark.yarn.executor.memoryOverhead 3g
spark.driver.memory 3g
spark.yarn.am.memory 3g
spark.executor.memory 8g
spark.executor.cores 3
spark.yarn.queue test
spark.ui.enabled true
spark.port.maxRetries 50
spark.locality.wait 0s
spark.master yarn
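With these values each executor asks YARN for roughly spark.executor.memory + spark.yarn.executor.memoryOverhead = 8g + 3g = 11g per container (rounded up to a multiple of yarn.scheduler.minimum-allocation-mb), so the NodeManagers in the test queue need at least that much headroom per executor; the application master container is sized the same way from spark.yarn.am.memory plus its own overhead.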

Replication factor for files the application uploads to HDFS:
spark.yarn.submit.file.replication 3

spark.yarn.am.waitTime 100s

When set to true, the staging files are kept at the end of the job instead of being deleted (there is usually no need to keep them, so leave it false):
spark.yarn.preserve.staging.files false

Interval (ms) at which the Spark application master sends heartbeats to the YARN ResourceManager:
spark.yarn.scheduler.heartbeat.interval-ms 5000

This only applies to the HashShuffleManager implementation. It also targets the problem of generating too many files, by reusing shuffle output files across map tasks run in different batches; in other words, the output data of map tasks from different batches is merged. However, the number of files each map task needs still depends on the number of reduce partitions, so it does not reduce the number of output files open at the same time and therefore does not help reduce memory usage. It is only a compromise within HashShuffleManager.
spark.shuffle.consolidateFiles true

spark.serializer org.apache.spark.serializer.KryoSerializer
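Kryo is generally faster and more compact than Java serialization, and frequently serialized classes can additionally be registered with it so that full class names do not have to be written into the serialized stream. A minimal sketch of doing this at submission time; the class names below are placeholders:

# Illustrative: switch to Kryo and register an application class
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.classesToRegister=com.example.MyRecord \
  --class com.example.MyApp \
  myapp.jar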

spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

spark.driver.cores 1
spark.driver.maxResultSize 1g
spark.driver.memory 1g
spark.executor.memory 1g
// Scratch space, including map output files and RDDs that get stored on disk
spark.local.dir /tmp
spark.submit.deployMode client/cluster
spark.reducer.maxSizeInFlight 48m
spark.shuffle.compress true
spark.shuffle.file.buffer 32k
spark.shuffle.io.maxRetries 3
spark.shuffle.io.preferDirectBufs true
spark.shuffle.io.retryWait 5s
// This must be enabled if spark.dynamicAllocation.enabled is "true".
spark.shuffle.service.enabled false

spark.shuffle.service.port 7337
// In sort-based shuffle, if there is no map-side aggregation, avoid merge-sorting the data as long as there are at most this many reduce partitions
spark.shuffle.sort.bypassMergeThreshold 200

spark.shuffle.spill.compress true
// Available codecs: org.apache.spark.io.LZ4CompressionCodec, org.apache.spark.io.LZFCompressionCodec, and org.apache.spark.io.SnappyCompressionCodec
spark.io.compression.codec lz4

spark.broadcast.compress true
spark.io.compression.snappy.blockSize 32k
spark.io.compression.lz4.blockSize 32k
spark.kryoserializer.buffer.max 64m
spark.kryoserializer.buffer 64k
spark.rdd.compress false
spark.memory.fraction 0.6
spark.memory.storageFraction 0.5
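As a rough worked example with the 8g executor heap configured above: the unified memory region is about (8192MB − 300MB reserved) × 0.6 ≈ 4.6g, and spark.memory.storageFraction 0.5 means roughly half of that (≈ 2.3g) is storage memory that execution cannot evict; the rest of the heap is left for user data structures and internal metadata.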

spark.memory.offHeap.enabled false
spark.memory.offHeap.size 0

spark.executor.cores 1
spark.default.parallelism 2
spark.executor.heartbeatInterval 10s
spark.files.useFetchCache true
spark.storage.memoryMapThreshold 2m

// This config will be used in place of spark.core.connection.ack.wait.timeout, spark.storage.blockManagerSlaveTimeoutMs, spark.shuffle.io.connectionTimeout, spark.rpc.askTimeout or spark.rpc.lookupTimeout
spark.network.timeout 120s

spark.cores.max (not set)
spark.locality.wait 3s

// Can be set to FAIR, which is useful for multi-user services; the default is FIFO
spark.scheduler.mode FIFO
// Speculative execution of tasks
spark.speculation false
// How often to check for tasks to speculate
spark.speculation.interval 100ms
// Fraction of tasks that must be complete before speculation is enabled for a stage
spark.speculation.quantile 0.75

// How many times slower than the median a task must run before it is considered for speculation
spark.speculation.multiplier 1.5
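Read together, these defaults mean: once 75% of the tasks in a stage have finished (spark.speculation.quantile), any still-running task that has taken more than 1.5 times the median task duration (spark.speculation.multiplier) becomes a candidate for a speculative duplicate, checked every 100ms (spark.speculation.interval); whichever copy finishes first is used.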

spark.sql.autoBroadcastJoinThreshold -1

spark.sql.shuffle.partitions 800
spark.shuffle.manager tungsten-sort
// Spark SQL first compiles each SQL query to Java bytecode before executing it. For long-running or frequently executed queries this setting can speed things up, because specialized bytecode is generated for execution; for very short queries it may add overhead, because every query has to be compiled first
spark.sql.codegen true
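spark.sql.shuffle.partitions applies to every shuffle that Spark SQL introduces (joins, aggregations), so 800 tasks are launched per such stage even for small inputs; for lighter jobs it can be lowered per submission. A minimal sketch with an illustrative value and placeholder names:

# Illustrative: override the SQL shuffle parallelism for one job
spark-submit \
  --conf spark.sql.shuffle.partitions=200 \
  --class com.example.MyApp \
  myapp.jar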

// By default a shuffle produces map tasks * reduce tasks files; setting this to true makes Spark consolidate the intermediate shuffle files down to the number of reduce tasks
spark.shuffle.consolidateFiles true
