Scala Version
1) Create the Project

Add the Scala Plugin

Spark is developed in Scala. The Spark version used here is 3.2.0, which is compiled against Scala 2.13 by default, so we will keep using that Scala version for development. Before you start, make sure the Scala plugin is installed in your IDEA development environment.
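If you need to confirm which versions are actually on the classpath, a quick check from the Scala REPL or spark-shell works (a minimal sketch; the second line assumes spark-core is on the classpath):

// Prints the Scala version of the running REPL/compiler, e.g. "version 2.13.5"
println(scala.util.Properties.versionString)
// Prints the Spark version; SPARK_VERSION is defined in the org.apache.spark package object
println(org.apache.spark.SPARK_VERSION)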
Create the Maven Projects

Create a Maven Project with the following GAV:

GroupId: com.clear.spark, ArtifactId: bigdata-spark_2.13, Version: 1.0

Create a Maven Module with the following GAV:

GroupId: com.clear.spark, ArtifactId: spark-core, Version: 1.0
2) POM
POM of the parent project bigdata-spark_2.13:

<repositories>
    <!-- Repositories, in order: aliyun, cloudera, jboss -->
    <repository>
        <id>aliyun</id>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    </repository>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
    <repository>
        <id>jboss</id>
        <url>https://repository.jboss.com/nexus/content/groups/public/</url>
    </repository>
</repositories>

<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <scala.version>2.13.5</scala.version>
    <scala.binary.version>2.13</scala.binary.version>
    <spark.version>3.2.0</spark.version>
    <hadoop.version>3.1.3</hadoop.version>
</properties>

<dependencies>
    <!-- Scala language -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <!-- Spark Core -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Hadoop Client -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>

<build>
    <outputDirectory>target/classes</outputDirectory>
    <testOutputDirectory>target/test-classes</testOutputDirectory>
    <resources>
        <resource>
            <directory>${project.basedir}/src/main/resources</directory>
        </resource>
    </resources>
    <plugins>
        <!-- Maven compiler plugin -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.10.1</version>
            <configuration>
                <source>${maven.compiler.source}</source>
                <target>${maven.compiler.target}</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
        <!-- Compiles Scala sources to class files -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <!-- Bound to Maven's compile phase -->
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
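Because the Scala binary version and the Spark version live in the properties block, the Spark artifact can be written as spark-core_${scala.binary.version} (resolving to spark-core_2.13); a later upgrade of Scala or Spark then only touches the properties.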
POM of the spark-core module:

<dependencies>
    <!-- spark-core dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.13</artifactId>
        <version>3.2.0</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <!-- Compiles Scala sources to class files -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <!-- Bound to Maven's compile phase -->
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <!-- Builds a single jar containing all dependencies -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.1.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

Configuration Files
Place the following three files in the src/main/resources directory (they can be copied from the server):

core-site.xml
hdfs-site.xml
log4j.properties
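If no log4j.properties is at hand, a minimal one modeled on the log4j.properties.template that Spark ships in its conf directory looks like this (an illustrative sketch; adjust the log level to taste):

log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n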
3) Write the Code
package com.clear.spark

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Word frequency counting (WordCount) with Spark Core in Scala:
 * read a file from HDFS, compute the word count, and save the result back to HDFS.
 */
object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // todo: create the SparkContext; it takes a SparkConf carrying the application settings
    val conf = new SparkConf().setAppName("词频统计").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // todo: read the data and wrap it in an RDD
    val inputRDD = sc.textFile("/opt/data/wc/README.md")

    // Process the data with RDD operators
    val resultRDD = inputRDD
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((tmp, item) => tmp + item)

    // Save the final RDD result to external storage
    resultRDD.foreach(tuple => println(tuple))
    resultRDD.saveAsTextFile(s"/opt/data/wc-${System.nanoTime()}")

    // Shut down and release resources when the application ends
    sc.stop()
  }
}
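To see what each step of the chain produces, the same pipeline can be run against a small in-memory collection (a minimal sketch, assuming a live SparkContext named sc, e.g. inside spark-shell):

val demo = sc.parallelize(Seq("spark and scala", "spark and java"))
demo.flatMap(_.split("\\s+"))   // one element per word
    .map((_, 1))                // pair each word with an initial count of 1
    .reduceByKey(_ + _)         // sum the counts per key
    .collect()
    .foreach(println)           // e.g. (spark,2), (and,2), (scala,1), (java,1)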
4) Test
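Build the jar first: with the assembly plugin bound to the package phase, running mvn clean package in the module directory produces both the plain jar and a *-jar-with-dependencies.jar under target/ (the path below assumes the jar was then copied to /opt/data/wordcount/).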
[nhk@kk01 wordcount]$ $SPARK_HOME/bin/spark-submit --class com.clear.spark.SparkWordCount /opt/data/wordcount/spark-core-scala-1.0.jar

Java Version
1) POM
<dependencies>
    <!-- spark-core dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.13</artifactId>
        <version>3.2.0</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <version>2.4</version>
            <configuration>
                <archive>
                    <manifest>
                        <!-- mainClass names the program entry point -->
                        <mainClass>com.clear.wordcount.JavaSparkWordCount</mainClass>
                        <addClasspath>true</addClasspath>
                        <classpathPrefix>lib/</classpathPrefix>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
        <!-- Copies the dependencies into the build directory -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-dependency-plugin</artifactId>
            <version>3.1.1</version>
            <executions>
                <execution>
                    <id>copy-dependencies</id>
                    <phase>package</phase>
                    <goals>
                        <goal>copy-dependencies</goal>
                    </goals>
                    <configuration>
                        <outputDirectory>${project.build.directory}/lib</outputDirectory>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

2) Code
package com.clear.wordcount;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class JavaSparkWordCount {
    public static void main(String[] args) {
        // Create the SparkConf object and configure the application
        SparkConf conf = new SparkConf().setAppName("JavaSparkWordCount").setMaster("local");
        // Create the JavaSparkContext from the SparkConf
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Load the file contents
        JavaRDD<String> lines = jsc.textFile("file:///opt/data/wordcount/README.md");
        // Split the lines into words
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        // Count how many times each word occurs
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((x, y) -> x + y);

        // Write out the result
        counts.saveAsTextFile("file:///opt/data/wordcount/wc");

        // Close the JavaSparkContext
        jsc.stop();
    }
}
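Two Java-API details are worth noting: since Spark 2.0 the flatMap lambda must return an Iterator rather than an Iterable, hence the Arrays.asList(...).iterator() call, and the pairs are built with scala.Tuple2 because Java has no tuple type of its own.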
3) Test
Run:
[nhk@kk01 wordcount]$ $SPARK_HOME/bin/spark-submit --class com.clear.wordcount.JavaSparkWordCount /opt/data/wordcount/spark-core-demo-1.0.jar

Check the results:
[nhk@kk01 wc]$ pwd
/opt/data/wordcount/wc
[nhk@kk01 wc]$ ll
total 8
-rw-r--r--. 1 nhk nhk 4591 Jul 30 17:48 part-00000
-rw-r--r--. 1 nhk nhk 0 Jul 30 17:49 _SUCCESS
[nhk@kk01 wc]$ head part-00000
(package,1)
(For,3)
(Programs,1)
(processing.,2)
(Because,1)
(The,1)
(cluster.,1)
(its,1)
([run,1)
(APIs,1)
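Each part-NNNNN file holds the output of one RDD partition, and the empty _SUCCESS marker is written once the job finishes; this run used a single partition, so every tuple landed in part-00000. The pairs are unordered because reduceByKey only groups by key, it does not sort the output.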