This article walks through how to build Hive on Spark from source and get a first query running.
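Prerequisites
Hive on Spark means Hive running on Spark: queries use Spark as the execution engine instead of MapReduce, in the same way that Hive on Tez uses Tez.
Since Hive 1.1, Hive on Spark has been part of the Hive codebase, and it is developed on the spark branch.
Downloading the source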
git clone https://github.com/apache/hive.git hive_on_spark
Building
cd hive_on_spark/
git branch -r
origin/HEAD -> origin/master
origin/HIVE-4115
origin/HIVE-8065
origin/beeline-cli
origin/branch-0.10
origin/branch-0.11
origin/branch-0.12
origin/branch-0.13
origin/branch-0.14
origin/branch-0.2
origin/branch-0.3
origin/branch-0.4
origin/branch-0.5
origin/branch-0.6
origin/branch-0.7
origin/branch-0.8
origin/branch-0.8-r2
origin/branch-0.9
origin/branch-1
origin/branch-1.0
origin/branch-1.0.1
origin/branch-1.1
origin/branch-1.1.1
origin/branch-1.2
origin/cbo
origin/hbase-metastore
origin/llap
origin/master
origin/maven
origin/next
origin/parquet
origin/ptf-windowing
origin/release-1.1
origin/spark
origin/spark-new
origin/spark2
origin/tez
origin/vectorization
git checkout origin/spark
git branch
* (detached from origin/spark)
  master
Edit $HIVE_ON_SPARK/pom.xml.
Change the Spark version to 1.4.1:
<spark.version>1.4.1</spark.version>
Change the Hadoop version to 2.3.0-cdh6.1.0:
<hadoop-23.version>2.3.0-cdh6.1.0</hadoop-23.version>
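The two property edits above can also be scripted. The following is a minimal sketch, assuming the pom.xml keeps each property on a single line in the form <name>value</name>. The helper name set_pom_property is hypothetical and introduced here only for illustration; it is not part of Hive or Maven.

```shell
#!/bin/sh
# set_pom_property FILE NAME VALUE
# Hypothetical helper: rewrites <NAME>old</NAME> in FILE so that it holds VALUE.
# Assumes the opening and closing tags sit on the same line.
set_pom_property() {
  spp_file="$1"
  spp_name="$2"
  spp_value="$3"
  spp_tmp="$(mktemp)" || return 1
  # '|' is the sed delimiter; the '.' in a property name is treated as a
  # regex wildcard, which is harmless for names like spark.version.
  if sed "s|<${spp_name}>[^<]*</${spp_name}>|<${spp_name}>${spp_value}</${spp_name}>|" \
      "$spp_file" > "$spp_tmp"; then
    # Copy the result back in place so the file keeps its permissions.
    cat "$spp_tmp" > "$spp_file"
    spp_status=$?
  else
    spp_status=1
  fi
  rm -f "$spp_tmp"
  return "$spp_status"
}

# Example (run from the checkout root):
#   set_pom_property pom.xml spark.version 1.4.1
#   set_pom_property pom.xml hadoop-23.version 2.3.0-cdh6.1.0
```

Running git diff pom.xml afterwards is a quick way to confirm that only the two intended lines changed.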
Build command
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn clean package -Phadoop-2 -DskipTests
Ways to add the Spark dependency to Hive
Spark home: /home/cluster/apps/spark/spark-1.4.1
Hive home: /home/cluster/apps/hive_on_spark
1. Set the property 'spark.home' to point to the Spark installation:
hive> set spark.home=/home/cluster/apps/spark/spark-1.4.1;
2. Define the SPARK_HOME environment variable before starting Hive CLI/HiveServer2:
export SPARK_HOME=/home/cluster/apps/spark/spark-1.4.1
3. Set the spark-assembly jar on the Hive auxpath:
hive --auxpath /home/cluster/apps/spark/spark-1.4.1/lib/spark-assembly-*.jar
4. Add the spark-assembly jar for the current user session:
hive> add jar /home/cluster/apps/spark/spark-1.4.1/lib/spark-assembly-*.jar;
5. Link the spark-assembly jar to $HIVE_HOME/lib.
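The last option, linking the jar into Hive's lib directory, has no command in the list above. A minimal sketch of it follows, assuming the assembly jar lives under lib/ of the Spark installation as in the other options. The helper name link_spark_assembly is hypothetical.

```shell
#!/bin/sh
# link_spark_assembly SPARK_HOME HIVE_HOME
# Hypothetical helper: symlinks every spark-assembly jar found under
# SPARK_HOME/lib into HIVE_HOME/lib. Returns non-zero if no jar was found.
link_spark_assembly() {
  lsa_spark_home="$1"
  lsa_hive_home="$2"
  lsa_status=1
  for lsa_jar in "$lsa_spark_home"/lib/spark-assembly-*.jar; do
    # An unmatched glob stays literal, so skip anything that does not exist.
    [ -e "$lsa_jar" ] || continue
    ln -sf "$lsa_jar" "$lsa_hive_home/lib/" || return 1
    lsa_status=0
  done
  return "$lsa_status"
}

# Example with the paths used in this article:
#   link_spark_assembly /home/cluster/apps/spark/spark-1.4.1 /home/cluster/apps/hive_on_spark
```

A symlink rather than a copy means the link has to be recreated if the Spark installation directory is later moved or replaced.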
An error that may occur when starting Hive:
[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
at jline.TerminalFactory.create(TerminalFactory.java:101)
at jline.TerminalFactory.get(TerminalFactory.java:158)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:229)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:221)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:209)
at org.apache.hadoop.hive.cli.CliDriver.getConsoleReader(CliDriver.java:773)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:715)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
Fix: export HADOOP_USER_CLASSPATH_FIRST=true
For fixes to errors in other scenarios, see: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
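The jline error typically means that two incompatible jline versions are on the classpath: in jline 2, jline.Terminal is an interface, while in the older 0.9.x releases it is a class. Setting HADOOP_USER_CLASSPATH_FIRST=true lets the jars supplied by Hive take precedence over the ones bundled with Hadoop. To check which jline jars are actually present, a small sketch such as the following may help. The helper name find_jline_jars is hypothetical, and the directories to scan are assumptions that depend on the installation.

```shell
#!/bin/sh
# find_jline_jars DIR...
# Hypothetical helper: prints every jline jar found under the given
# directories, skipping directories that do not exist.
find_jline_jars() {
  for fjj_dir in "$@"; do
    [ -d "$fjj_dir" ] || continue
    find "$fjj_dir" -name 'jline*.jar'
  done
}

# Example, assuming HADOOP_HOME and HIVE_HOME are set for this installation:
#   find_jline_jars "$HADOOP_HOME" "$HIVE_HOME/lib"
```

If the output lists both a 0.9.x jar under Hadoop and a 2.x jar under Hive, the version clash described above is the likely cause.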
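The spark.eventLog.dir parameter must also be set to an existing directory, for example:
set spark.eventLog.dir=hdfs://master:8020/directory;
Otherwise queries keep failing with an error saying that a directory such as /tmp/spark-events does not exist.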
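After starting Hive, set the execution engine to Spark:
hive> set hive.execution.engine=spark;
Set the Spark master, which determines the run mode:
hive> set spark.master=spark://master:7077;
Or, to run on YARN: spark.master=yarn (Spark 1.x releases such as 1.4.1 expect yarn-client or yarn-cluster as the value)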
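The same two settings can be supplied on the command line instead of being typed at the prompt. This is a sketch under assumptions: it uses the Hive CLI options --hiveconf and -e, assumes that spark.master passed via --hiveconf is picked up the same way as a set statement, and introduces a hypothetical wrapper run_on_spark plus a HIVE_BIN variable only so that the executable can be swapped out.

```shell
#!/bin/sh
# run_on_spark MASTER QUERY
# Hypothetical wrapper: runs QUERY through the Hive CLI with Spark as the
# execution engine. HIVE_BIN defaults to "hive" and can be overridden.
run_on_spark() {
  ros_master="$1"
  ros_query="$2"
  "${HIVE_BIN:-hive}" \
    --hiveconf hive.execution.engine=spark \
    --hiveconf spark.master="$ros_master" \
    -e "$ros_query"
}

# Example:
#   run_on_spark spark://master:7077 'select count(*) from city_info;'
```

Settings passed this way apply only to that invocation, which is convenient for comparing engines on the same query without editing hive-site.xml.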
Configure Spark-application configs for Hive
These can be set in spark-defaults.conf or in hive-site.xml:
spark.master=<Spark Master URL>
spark.eventLog.enabled=true
spark.executor.memory=512m
spark.serializer=org.apache.spark.serializer.KryoSerializer
Further tuning parameters:
spark.executor.memory=...              # Amount of memory to use per executor process.
spark.executor.cores=...               # Number of cores per executor.
spark.yarn.executor.memoryOverhead=...
spark.executor.instances=...           # The number of executors assigned to each application.
spark.driver.memory=...                # The amount of memory assigned to the Remote Spark Context (RSC). We recommend 4GB.
spark.yarn.driver.memoryOverhead=...   # We recommend 400 (MB).
For details on these parameters, see the documentation: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
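For the spark-defaults.conf route, the file takes one whitespace-separated key and value per line. The sketch below appends the four basic settings listed above to such a file. The helper name append_spark_defaults is hypothetical, and the master URL and file location in the example are assumptions taken from the paths used elsewhere in this article.

```shell
#!/bin/sh
# append_spark_defaults FILE MASTER_URL
# Hypothetical helper: appends the basic Hive on Spark settings to a
# spark-defaults.conf file, one "key value" pair per line.
append_spark_defaults() {
  asd_file="$1"
  asd_master="$2"
  cat >> "$asd_file" <<EOF
spark.master            $asd_master
spark.eventLog.enabled  true
spark.executor.memory   512m
spark.serializer        org.apache.spark.serializer.KryoSerializer
EOF
}

# Example, assuming SPARK_HOME points at the Spark installation:
#   append_spark_defaults "$SPARK_HOME/conf/spark-defaults.conf" spark://master:7077
```

Because the helper appends, running it twice leaves duplicate keys in the file, so it is best run once against a fresh or reviewed configuration.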
After running a SQL statement, the job and stage information can be viewed on the Spark monitoring page:
hive (default)> select city_id, count(*) c from city_info group by city_id order by c desc limit 5;
Query ID = spark_20150309173838_444cb5b1-b72e-4fc3-87db-4162e364cb1e
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
state = SENT
state = STARTED
state = STARTED
state = STARTED
state = STARTED
Query Hive on Spark job[0] stages:1
Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2015-03-09 17:38:11,822 Stage-0_0: 0(+1)/1 Stage-1_0: 0/1 Stage-2_0: 0/1
state = STARTED
state = STARTED
state = STARTED
2015-03-09 17:38:14,845 Stage-0_0: 0(+1)/1 Stage-1_0: 0/1 Stage-2_0: 0/1
state = STARTED
state = STARTED
2015-03-09 17:38:16,861 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1 Stage-2_0: 0/1
state = SUCCEEDED
2015-03-09 17:38:17,867 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished Stage-2_0: 1/1 Finished
Status: Finished successfully in 10.07 seconds
OK
city_id c
-1000 22826
-10 17294
-20 10608
-1 6186
4158
Time taken: 18.417 seconds, Fetched: 5 row(s)