1 | 1 | = Stackable Operator for Apache Hive |
2 | 2 | :description: The Stackable Operator for Apache Hive is a Kubernetes operator that can manage Apache Hive metastores. Learn about its features, resources, dependencies and demos, and see the list of supported Hive versions. |
3 | 3 | :keywords: Stackable Operator, Hadoop, Apache Hive, Kubernetes, k8s, operator, engineer, big data, metadata, storage, query |
4 | | - |
5 | | -This is an operator for Kubernetes that can manage https://hive.apache.org[Apache Hive] metastores. The Apache Hive |
6 | | -metastore (HMS) was originally developed as part of Apache Hive. It stores information on the location of tables and |
7 | | -partitions in file and blob storages such as xref:hdfs:index.adoc[Apache HDFS] and S3 and is now used by other tools |
8 | | -besides Hive as well to access tables in files. This Operator does not support deploying Hive itself, but |
9 | | -xref:trino:index.adoc[Trino] is recommended as an alternative query engine. |
| 4 | +:hive: https://hive.apache.org |
| 5 | +:github: https://github.com/stackabletech/hive-operator/ |
| 6 | +:crd: {crd-docs-base-url}/hive-operator/{crd-docs-version}/ |
| 7 | +:crd-hivecluster: {crd-docs}/hive.stackable.tech/hivecluster/v1alpha1/ |
| 8 | +:feature-tracker: https://features.stackable.tech/unified |
| 9 | + |
| 10 | +[.link-bar] |
| 11 | +* {github}[GitHub {external-link-icon}^] |
| 12 | +* {feature-tracker}[Feature Tracker {external-link-icon}^] |
| 13 | +* {crd}[CRD documentation {external-link-icon}^] |
| 14 | + |
| 15 | +This is an operator for Kubernetes that can manage {hive}[Apache Hive] metastores. |
| 16 | +The Apache Hive metastore (HMS) was originally developed as part of Apache Hive. |
| 17 | +It stores information on the location of tables and partitions in file and blob storages such as xref:hdfs:index.adoc[Apache HDFS] and S3 and is now used by other tools besides Hive as well to access tables in files. |
| 18 | +This operator does not support deploying Hive itself, but xref:trino:index.adoc[Trino] is recommended as an alternative query engine. |
10 | 19 |
|
11 | 20 | == Getting started |
12 | 21 |
|
13 | | -Follow the xref:getting_started/index.adoc[Getting started guide] which will guide you through installing the Stackable |
14 | | -Hive Operator and its dependencies. It walks you through setting up a Hive metastore and connecting it to a demo |
15 | | -Postgres database and a Minio instance to store data in. |
| 22 | +Follow the xref:getting_started/index.adoc[Getting started guide] which will guide you through installing the Stackable Hive operator and its dependencies. |
| 23 | +It walks you through setting up a Hive metastore and connecting it to a demo Postgres database and a Minio instance to store data in. |
16 | 24 |
|
17 | | -Afterwards you can consult the xref:usage-guide/index.adoc[] to learn more about tailoring your Hive metastore |
18 | | -configuration to your needs, or have a look at the <<demos, demos>> for some example setups with either |
19 | | -xref:trino:index.adoc[Trino] or xref:spark-k8s:index.adoc[Spark]. |
| 25 | +Afterwards you can consult the xref:usage-guide/index.adoc[] to learn more about tailoring your Hive metastore configuration to your needs, or have a look at the <<demos, demos>> for some example setups with either xref:trino:index.adoc[Trino] or xref:spark-k8s:index.adoc[Spark]. |
20 | 26 |
|
21 | 27 | == Operator model |
22 | 28 |
|
23 | | -The Operator manages the _HiveCluster_ custom resource. The cluster implements a single `metastore` |
24 | | -xref:concepts:roles-and-role-groups.adoc[role]. |
| 29 | +The operator manages the _HiveCluster_ custom resource. |
| 30 | +The cluster implements a single `metastore` xref:concepts:roles-and-role-groups.adoc[role]. |
25 | 31 |
|
26 | | -image::hive_overview.drawio.svg[A diagram depicting the Kubernetes resources created by the Stackable Operator for Apache Hive] |
| 32 | +image::hive_overview.drawio.svg[A diagram depicting the Kubernetes resources created by the Stackable operator for Apache Hive] |
27 | 33 |
|
28 | | -For every role group the Operator creates a ConfigMap and StatefulSet which can have multiple replicas (Pods). Every |
29 | | -role group is accessible through its own Service, and there is a Service for the whole cluster. |
| 34 | +For every role group the operator creates a ConfigMap and StatefulSet which can have multiple replicas (Pods). |
| 35 | +Every role group is accessible through its own Service, and there is a Service for the whole cluster. |
30 | 36 |
|
31 | | -The Operator creates a xref:concepts:service_discovery.adoc[service discovery ConfigMap] for the Hive metastore |
32 | | -instance. The discovery ConfigMap contains information on how to connect to the HMS. |
| 37 | +The operator creates a xref:concepts:service_discovery.adoc[service discovery ConfigMap] for the Hive metastore instance. |
| 38 | +The discovery ConfigMap contains information on how to connect to the HMS. |
33 | 39 |
|
34 | 40 | == Dependencies |
35 | 41 |
|
36 | | -The Stackable Operator for Apache Hive depends on the Stackable xref:commons-operator:index.adoc[commons], |
37 | | -xref:secret-operator:index.adoc[secret] and xref:listener-operator:index.adoc[listener] operators. |
| 42 | +The Stackable operator for Apache Hive depends on the Stackable xref:commons-operator:index.adoc[commons], xref:secret-operator:index.adoc[secret] and xref:listener-operator:index.adoc[listener] operators. |
38 | 43 |
|
39 | 44 | == Required external component: An SQL database |
40 | 45 |
|
41 | | -The Hive metastore requires a database to store metadata. Consult the xref:required-external-components.adoc[required |
42 | | -external components page] for an overview of the supported databases and minimum supported versions. |
| 46 | +The Hive metastore requires an SQL database to store metadata. |
| 47 | +Consult the xref:required-external-components.adoc[required external components page] for an overview of the supported databases and minimum supported versions. |
43 | 48 |
|
44 | | -== [[demos]]Demos |
| 49 | +== [[demos]]Demos |
45 | 50 |
|
46 | 51 | Three demos make use of the Hive metastore. |
47 | 52 |
|
48 | | -The xref:demos:spark-k8s-anomaly-detection-taxi-data.adoc[] and xref:demos:trino-taxi-data.adoc[] use the HMS to store |
49 | | -metadata information about taxi data. The first demo then analyzes the data using xref:spark-k8s:index.adoc[Apache Spark] |
50 | | -and the second one using xref:trino:index.adoc[Trino]. |
| 53 | +The xref:demos:spark-k8s-anomaly-detection-taxi-data.adoc[] and xref:demos:trino-taxi-data.adoc[] use the HMS to store metadata information about taxi data. |
| 54 | +The first demo then analyzes the data using xref:spark-k8s:index.adoc[Apache Spark] and the second one using xref:trino:index.adoc[Trino]. |
51 | 55 |
|
52 | | -The xref:demos:data-lakehouse-iceberg-trino-spark.adoc[] demo is the biggest demo available. It uses both Spark and |
53 | | -Trino for analysis. |
| 56 | +The xref:demos:data-lakehouse-iceberg-trino-spark.adoc[] demo is the biggest demo available. |
| 57 | +It uses both Spark and Trino for analysis. |
54 | 58 |
|
55 | 59 | == Why is the Hive query engine not supported? |
56 | 60 |
|
57 | | -Only the metastore is supported, not Hive itself. There are several reasons why running Hive on Kubernetes may not be an |
58 | | -optimal solution. The most obvious reason is that Hive requires YARN as an execution framework, and YARN assumes much of |
59 | | -the same role as Kubernetes - i.e. assigning resources. For this reason we provide xref:trino:index.adoc[Trino] as a |
60 | | -query engine in the Stackable Data Platform instead of Hive. Trino still uses the Hive Metastore, hence the inclusion of |
61 | | -this operator as well. Trino should offer all the capabilities Hive offers including a lot of additional functionality, |
62 | | -such as connections to other data sources. |
| 61 | +Only the metastore is supported, not Hive itself. |
| 62 | +There are several reasons why running Hive on Kubernetes may not be an optimal solution. |
| 63 | +The most obvious reason is that Hive requires YARN as an execution framework, and YARN assumes much of the same role as Kubernetes - i.e. assigning resources. |
| 64 | +For this reason we provide xref:trino:index.adoc[Trino] as a query engine in the Stackable Data Platform instead of Hive. |
| 65 | +Trino still uses the Hive Metastore, hence the inclusion of this operator as well. |
| 66 | +Trino should offer all the capabilities Hive offers including a lot of additional functionality, such as connections to other data sources. |
63 | 67 |
|
64 | 68 | Additionally, tables in the HMS can also be accessed from xref:spark-k8s:index.adoc[Apache Spark]. |
65 | 69 |
|
66 | 70 | == Supported versions |
67 | 71 |
|
68 | | -The Stackable Operator for Apache Hive currently supports the Hive versions listed below. |
| 72 | +The Stackable operator for Apache Hive currently supports the Hive versions listed below. |
69 | 73 | To use a specific Hive version in your HiveCluster, you have to specify an image - this is explained in the xref:concepts:product-image-selection.adoc[] documentation. |
70 | 74 | The operator also supports running images from a custom registry or running entirely customized images; both of these cases are explained under xref:concepts:product-image-selection.adoc[] as well. |
71 | 75 |
|
72 | 76 | include::partial$supported-versions.adoc[] |
| 77 | + |
| 78 | +== Useful links |
| 79 | + |
| 80 | +* The {github}[hive-operator {external-link-icon}^] GitHub repository |
| 81 | +* The operator feature overview in the {feature-tracker}[feature tracker {external-link-icon}^] |
| 82 | +* The {crd-hivecluster}[HiveCluster {external-link-icon}^] CRD documentation |