Linux version: CentOS 7
Hadoop version: 3.1.1
Spark version: 2.3.2
Hadoop was already set up in part 1; the next step is Spark.
For convenience, a shell script handles downloading Spark and Hive (Hive will be set up later; the goal for now is just to get Spark running).
download_file.sh
-------------------------------
#!/bin/bash
TARGET=files              # not used further down in this script
HADOOP_VERSION=3.1.1
HIVE_VERSION=2.3.3
SPARK_VERSION=2.3.2
HADOOP_FILE=hadoop-$HADOOP_VERSION.tar.gz
HIVE_FILE=apache-hive-$HIVE_VERSION-bin.tar.gz
SPARK_FILE=spark-$SPARK_VERSION-bin-hadoop2.7.tgz

# download each archive only if it is not already present
if [ ! -f "$HADOOP_FILE" ]; then
    echo "https://www-us.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/$HADOOP_FILE is downloading"
    curl -O https://www-us.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/$HADOOP_FILE
fi
echo "Hadoop is completed!"

if [ ! -f "$HIVE_FILE" ]; then
    echo "https://www-us.apache.org/dist/hive/hive-$HIVE_VERSION/$HIVE_FILE is downloading"
    curl -O https://www-us.apache.org/dist/hive/hive-$HIVE_VERSION/$HIVE_FILE
fi
echo "HIVE is completed!"

if [ ! -f "$SPARK_FILE" ]; then
    echo "https://www-us.apache.org/dist/spark/spark-$SPARK_VERSION/$SPARK_FILE is downloading"
    curl -O https://www-us.apache.org/dist/spark/spark-$SPARK_VERSION/$SPARK_FILE
fi
echo "$SPARK_FILE completed!"
-------------------------------
Run the script to download Spark and Hive.
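The original does not show the invocation; assuming the script above was saved as download_file.sh in the current directory, running it is simply:

$ bash download_file.sh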
Extract the archives into ~/hadoop (as the hadoop user), for example as sketched below.
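The exact extraction commands are not shown in the article; this is one possible way, assuming the tarballs sit in the directory where download_file.sh was run and that the extracted Hive and Spark directories are renamed to the shorter names seen in the listing that follows (hadoop-3.1.1 is already in place from part 1):

$ tar -xzf apache-hive-2.3.3-bin.tar.gz -C ~/hadoop
$ tar -xzf spark-2.3.2-bin-hadoop2.7.tgz -C ~/hadoop
$ cd ~/hadoop
$ mv apache-hive-2.3.3-bin hive-2.3.3
$ mv spark-2.3.2-bin-hadoop2.7 spark-2.3.2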
$ cd ~/hadoop
$ ls
hadoop-3.1.1  hive-2.3.3  spark-2.3.2

Download Anaconda3
Since I plan to use pyspark and want Python 3, I install Anaconda3.
$ curl -O https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-5.3.0-Linux-x86_64.sh
$ bash ./Anaconda3-5.3.0-Linux-x86_64.sh
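Not spelled out in the article: after the installer finishes (assuming you accepted its offer to add Anaconda to PATH in ~/.bashrc), you can check that python now resolves to Anaconda's Python 3:

$ source ~/.bashrc      # pick up the PATH entry the installer added
$ which python          # should point at ~/anaconda3/bin/python
$ python --version      # should report a Python 3.x version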
Configuration files

spark-env.sh: append the following content (the file lives under $SPARK_HOME/conf; create it from spark-env.sh.template if it does not exist yet).
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-0.el7_5.x86_64
export JRE_HOME=${JAVA_HOME}/jre
export HADOOP_HOME=/home/hadoop/hadoop/hadoop-3.1.1
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_HOST=master

slaves: list the worker hostnames, one per line.

slave1
slave2

log4j.properties: to have the logs record only warnings, change log4j.rootCategory=INFO, console to log4j.rootCategory=WARN, console (copy the file from log4j.properties.template if it does not exist yet).
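The article does not show how Spark is pointed at the Anaconda interpreter for pyspark; one common approach (an assumption here, not something the original states) is to also set PYSPARK_PYTHON in spark-env.sh:

export PYSPARK_PYTHON=/home/hadoop/anaconda3/bin/python   # assumed default Anaconda install path for the hadoop user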
/etc/profile.d/spark-2.3.2.sh: create this file with the following content.

export SPARK_HOME=/home/hadoop/hadoop/spark-2.3.2
export PATH=$SPARK_HOME/bin:$PATH

$ source /etc/profile

All of the above has to be done on both the master and the slave machines; once the master is configured, the files can simply be copied over (see the sketch below).
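The copy step itself is not shown; one way to do it, using the slave1/slave2 hostnames from the slaves file and the paths configured above, is scp from the master:

$ scp -r ~/hadoop/spark-2.3.2 hadoop@slave1:~/hadoop/
$ scp -r ~/hadoop/spark-2.3.2 hadoop@slave2:~/hadoop/
# /etc/profile.d/spark-2.3.2.sh also needs to be created on each slave (root access required there)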
Run Spark on the master machine
$ cd $SPARK_HOME/sbin
$ ./start-all.sh    # runs start-master first, then start-slave
$ jps
16000 Master
15348 NameNode
15598 SecondaryNameNode
16158 Jps
# only HDFS and Spark have been started here

Open http://master:8080 (public address) in a browser; it shows "Alive Workers: 2", and the setup is done. A quick job run to double-check the cluster is sketched below.
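Not part of the original write-up, but as a sanity check you can submit the bundled SparkPi example against the standalone master (the examples jar ships under $SPARK_HOME/examples/jars; the wildcard avoids hard-coding its exact name):

$ spark-submit --master spark://master:7077 \
    --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_*.jar 100
# the driver output should contain a line like "Pi is roughly 3.14..."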