Table of Contents
1. Preface
2. Parameter Reference
3. Best Practice for Standalone Clusters

1. Preface
When deploying and submitting an application to a Spark cluster, you will likely use the spark-submit tool. The quality of blog posts on this topic is uneven, with many configurations that are simply invalid or wrong, and few of them explain the relationship and differences between --total-executor-cores, --executor-cores, and --num-executors. It is therefore worth documenting the meaning of these parameters in detail against the official documentation on submitting-applications.
2. Parameter Reference
The general usage is spark-submit [options] xx.jar/xx.py. The full help output is as follows:
Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
                              k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor. File paths of these files
                              in executors can be accessed via SparkFiles.get(fileName).
  --archives ARCHIVES         Comma-separated list of archives to be extracted into the
                              working directory of each executor.
  --conf, -c PROP=VALUE       Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.
  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.
  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.
  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Spark Connect only:
  --remote CONNECT_URL        URL to connect to the server for Spark Connect, e.g.,
                              sc://host:port. --master and --deploy-mode cannot be set
                              together with this option. This option is experimental, and
                              might change between minor releases.

 Cluster deploy mode only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.

 Spark standalone, Mesos or K8s with cluster deploy mode only:
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone, YARN and Kubernetes only:
  --executor-cores NUM        Number of cores used by each executor. (Default: 1 in
                              YARN and K8S modes, or all available cores on the worker
                              in standalone mode).

 Spark on YARN and Kubernetes only:
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --principal PRINCIPAL       Principal to be used to login to KDC.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above.

 Spark on YARN only:
  --queue QUEUE_NAME          The YARN queue to submit to (Default: default).

The main parameters are summarized below:
--master MASTER_URL, where MASTER_URL is one of:

- local: run the application locally with one worker thread
- local[K]: run the application locally with K worker threads
- local[K,F]: run the application locally with K worker threads, tolerating at most F task failures
- local[*]: run the application locally with as many worker threads as there are logical cores
- local[*,F]: run the application locally with as many worker threads as there are logical cores, tolerating at most F task failures
- local-cluster[N,C,M]: for unit tests only; simulates a distributed cluster inside a single JVM with N workers, C cores per worker, and M MiB of memory per worker
- spark://host:port: connect to the master of a standalone cluster (default port 7077)
- spark://HOST1:PORT1,HOST2:PORT2: connect to a standalone cluster with Zookeeper-backed standby masters; the list must contain all master hosts of the high-availability setup (default port 7077)
- mesos://host:port: connect to a Mesos cluster (default port 5050)
- yarn: connect to a YARN cluster; --deploy-mode decides between client and cluster mode
- k8s://https://host:port: connect to a Kubernetes cluster; --deploy-mode decides between client and cluster mode

--deploy-mode: either cluster or client. cluster deploys the driver on a worker node; client runs the driver locally as an external client (Default: client)
--driver-memory MEM: memory allocated to the driver (Default: 1024M)
--executor-memory MEM: memory allocated to each executor (Default: 1G)
--driver-cores NUM: number of cores the driver may use (Default: 1). Note: only effective in cluster deploy mode
--total-executor-cores NUM: total number of cores across all executors. Note: only effective on Spark standalone and Mesos
--executor-cores NUM: number of cores per executor (Default: 1 on YARN and K8s, or all available cores on the worker in standalone mode). Note: only effective on Spark standalone, YARN, and Kubernetes
--num-executors NUM: number of executors to launch (Default: 2). Note: only effective on Spark on YARN and Kubernetes

The interplay of the last three flags is sketched below.
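To make the interaction concrete: on a standalone cluster you never set the executor count directly. The master keeps launching executors with --executor-cores cores each until the --total-executor-cores budget (the spark.cores.max property) is exhausted, so the executor count is derived. A minimal sketch of that arithmetic, using a hypothetical helper name that is not part of any Spark API:

# Hypothetical helper illustrating standalone scheduling arithmetic; not a Spark API.
def standalone_executor_count(total_executor_cores: int, executor_cores: int) -> int:
    # The standalone master launches executors of `executor_cores` cores each
    # until the `total_executor_cores` (spark.cores.max) budget is spent,
    # so the number of executors falls out of the two flags.
    return total_executor_cores // executor_cores

# 45 total cores at 15 cores per executor -> 3 executors,
# i.e. one executor per machine on a 3-node cluster.
print(standalone_executor_count(45, 15))  # 3

On YARN and Kubernetes the logic runs the other way around: you set --num-executors and --executor-cores explicitly, and the total core count is simply their product.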
3. Best Practice for Standalone Clusters
Because --num-executors has no effect on a Spark standalone cluster, and --driver-cores likewise has no effect unless you submit with --deploy-mode cluster, one workable set of submit parameters is:
spark-submit \
  --master spark://master:7077 \
  --name spark-app \
  --total-executor-cores {number of machines} * {logical cores per machine - 1} \
  --executor-cores {logical cores per machine - 1} \
  --executor-memory {memory per machine - 3GB} \
  xxx.py

For example, if the standalone cluster has 3 machines, each with 16 logical CPU cores and 16 GB of memory, you could submit like this:
spark-submit \
  --master spark://master:7077 \
  --name spark-app \
  --total-executor-cores 45 \
  --executor-cores 15 \
  --executor-memory 13GB \
  xxx.py

Of course, --executor-memory can be tuned to the actual situation. First get a rough picture of how much memory is free:

free -h

and then adjust the size accordingly.
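Once the application is up, it can also be worth checking that the cluster actually granted what was requested. A minimal PySpark sketch (the configuration keys are the standard Spark properties behind the flags above; the app name is just an example):

from pyspark.sql import SparkSession

# Sanity-check the resources actually granted to this application.
spark = SparkSession.builder.appName("resource-check").getOrCreate()
sc = spark.sparkContext
conf = sc.getConf()

# --executor-cores maps to spark.executor.cores;
# --total-executor-cores maps to spark.cores.max (standalone/Mesos only);
# --executor-memory maps to spark.executor.memory.
print("spark.executor.cores  =", conf.get("spark.executor.cores", "unset"))
print("spark.cores.max       =", conf.get("spark.cores.max", "unset"))
print("spark.executor.memory =", conf.get("spark.executor.memory", "unset"))
print("default parallelism   =", sc.defaultParallelism)

spark.stop()

With the 3-machine submission above, defaultParallelism should typically come back as 45, i.e. the total executor cores; the master's web UI (port 8080 by default) shows the same breakdown per executor.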