# VeloDB
> "Timeout expired. The timeout period elapsed prior to completion of the
---
# Source: https://docs.velodb.io/cloud/4.x/best-practice/bi-faq
Version: 4.x
# BI FAQ
## Power BI
### Q1. An error occurs when using JDBC to pull data into Power BI Desktop: "Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding".
This is usually Power BI timing out while pulling data from the data source. When filling in the data source server and database, open the advanced options, which include a timeout setting, and increase the timeout value.
### Q2. When connecting version 2.1.x to Power BI via JDBC, the error "An error happened while reading data from the provider: the given key was not present in the dictionary" occurs.
First run `show collation` in the database. Generally only `utf8mb4_900_bin` is returned, and the charset is `utf8mb4`. The main cause of this error is that Power BI needs to look up collation ID 33 when connecting, that is, the result set needs to contain a row with ID 33. Upgrade to version 2.1.5 or later to fix this.
### Q3. Connecting to Doris reports the error "Reading data from the provider: index and count must refer to a location within the string".
The cause of the problem is that global variables are loaded during the connection process with the following SQL, whose result has identical column names and values:
SELECT
@@max_allowed_packet as max_allowed_packet, @@character_set_client, @@character_set_connection,
@@license, @@sql_mode, @@lower_case_table_names, @@autocommit;
You can turn off the new optimizer on the current version, or upgrade to version 2.0.7 / 2.1.6 or later.
### Q4. JDBC connection to version 2.1.x reports the error "Character set 'utf8mb3' is not supported by .NET Framework".
This problem is easily encountered in version 2.1.x. If it occurs, upgrade the JDBC driver to 8.0.32.
## Tableau
### Q1. Version 2.0.x reports that Tableau cannot connect to the data source, with error code 37CE01A3.
Turn off the new optimizer in the current version, or upgrade to 2.0.7 or later.
### Q2. SSL connection error: "protocol version mismatch. Failed to connect to the MySQL server"
The cause of this error is that SSL authentication is enabled on Doris, but the connection does not use SSL. You need to disable the enable_ssl variable in fe.conf.
### Q3. Connection error "Unsupported command(COM_STMT_PREPARED)"
An incorrect MySQL driver version is installed. Install the MySQL 5.1.x connector driver instead.
---
# Source: https://docs.velodb.io/cloud/4.x/best-practice/data-faq
Version: 4.x
# Data Operation Error
This document is mainly used to record common problems of data operation
during the use of Doris. It will be updated from time to time.
### Q1. Using Stream Load to import data via FE's public network address, the request is redirected to an intranet IP?
When the target of a Stream Load connection is the HTTP port of FE, FE randomly selects a BE node and performs an HTTP 307 redirect to it, so the user's request is actually handled by a BE assigned by FE. The redirect returns the IP of that BE, that is, the intranet IP. So if you send the request through FE's public IP, it is very likely that you cannot connect, because you are redirected to an internal network address.
The usual approach is either to make sure you can access the intranet IP addresses, or to set up a load balancer in front of all BE nodes and send the Stream Load request directly to the load balancer, which transparently forwards the request to a BE node.
### Q2. Does Doris support changing column names?
After version 1.2.0, when the `"light_schema_change"="true"` option is
enabled, column names can be modified.
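As a minimal sketch (the database, table, and column names here are placeholders, not from the original article), renaming a column on a table with light schema change enabled looks like this:
```sql
-- Hypothetical table created with light schema change enabled
CREATE TABLE example_db.example_tbl (
    k1      INT,
    old_col VARCHAR(32)
)
DUPLICATE KEY(k1)
DISTRIBUTED BY HASH(k1) BUCKETS 1
PROPERTIES (
    "replication_num" = "1",
    "light_schema_change" = "true"
);

-- Rename old_col to new_col; this only works when light schema change is enabled
ALTER TABLE example_db.example_tbl RENAME COLUMN old_col new_col;
```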
Before version 1.2.0 or when the `"light_schema_change"="true"` option is not
enabled, modifying column names is not supported. The reasons are as follows:
Doris supports modifying database name, table name, partition name,
materialized view (Rollup) name, as well as column type, comment, default
value, etc. But unfortunately, modifying column names is currently not
supported.
For historical reasons, column names are currently written directly into the data files. When Doris executes a query, it also locates the corresponding column by its column name. Therefore, renaming a column is not just a simple metadata change; it also involves rewriting data, which is a very heavy operation.
We do not rule out some compatible means to support lightweight column name
modification operations in the future.
### Q3. Does a table using the Unique Key model support creating a materialized view?
No, it does not.
The table of the Unique Key model is a business-friendly table. Because of its
unique function of deduplication according to the primary key, it can easily
synchronize business databases with frequently changed data. Therefore, many
users will first consider using the Unique Key model when accessing data into
Doris.
But unfortunately, tables using the Unique Key model cannot have materialized views. The reason is that the essence of a materialized view is to speed up queries by pre-computing data, so that the pre-computed results can be returned directly at query time. In a materialized view, the pre-computed data is usually aggregated metrics such as sum and count. When the underlying data changes, for example through an update or delete, the pre-computed data has already lost the detailed information and cannot be updated synchronously. For example, a sum value of 5 may come from 1+4 or from 2+3; because the detail is lost, we cannot tell how the sum was computed, and therefore cannot update it correctly.
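To illustrate the limitation, here is a hedged sketch with hypothetical table and column names; the aggregated materialized view below is the kind of statement that is rejected on a Unique Key table, while it would be accepted on a Duplicate or Aggregate Key table:
```sql
CREATE TABLE orders (
    order_id BIGINT,
    user_id  BIGINT,
    amount   DECIMAL(10, 2)
)
UNIQUE KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 1
PROPERTIES ("replication_num" = "1");

-- Rejected: an aggregated materialized view cannot be kept consistent
-- when rows in the base table are updated or deleted by key.
CREATE MATERIALIZED VIEW mv_user_amount AS
SELECT user_id, SUM(amount) FROM orders GROUP BY user_id;
```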
### Q4. tablet writer write failed, tablet_id=27306172, txn_id=28573520, err=-235 or -238
This error usually occurs during data import operations. The error code is
-235. The meaning of this error is that the data version of the corresponding
tablet exceeds the maximum limit (default 500, controlled by the BE parameter
`max_tablet_version_num`), and subsequent writes will be rejected. For
example, the error in the question means that the data version of the tablet
27306172 exceeds the limit.
This error is usually caused by an import frequency that is too high, faster than the compaction of the underlying data, causing versions to pile up and eventually exceed the limit. At this point, you can first run the `show tablet 27306172` statement and then execute the `show proc` statement returned in the result to check the status of each replica of the tablet. The `versionCount` in the result represents the number of versions. If you find that a replica has too many versions, you need to reduce the import frequency or stop importing, and observe whether the number of versions drops. If the number of versions does not decrease after imports are stopped, go to the corresponding BE node, check the be.INFO log, search for the tablet id and the compaction keyword, and check whether compaction is running normally. For compaction tuning, you can refer to the Apache Doris WeChat official account article: [Doris Best Practices - Compaction Tuning (3)](https://mp.weixin.qq.com/s/cZmXEsNPeRMLHp379kc2aA)
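A sketch of the version check described above; the tablet ID comes from the error message, and the exact `SHOW PROC` path is taken from the command returned by `SHOW TABLET` (shown here only as a commented placeholder):
```sql
-- Locate the tablet and obtain the detailed proc command for it
SHOW TABLET 27306172;

-- Then run the SHOW PROC statement returned in the DetailCmd column, e.g.
--   SHOW PROC '/dbs/<db_id>/<table_id>/partitions/<partition_id>/<index_id>/27306172';
-- and check the VersionCount of each replica in the output.
```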
The -238 error usually occurs when the same batch of imported data is too
large, resulting in too many Segment files for a tablet (default is 200,
controlled by the BE parameter `max_segment_num_per_rowset`). At this time, it
is recommended to reduce the amount of data imported in one batch, or
appropriately increase the BE configuration parameter value to solve the
problem. Since version 2.0, users can enable the segment compaction feature to reduce the number of segment files by setting `enable_segcompaction=true` in the BE config.
### Q5. tablet 110309738 has few replicas: 1, alive backends: [10003]
This error can occur during a query or an import operation. It usually means that a replica of the corresponding tablet is abnormal.
At this point, you can first check whether a BE node is down using the `show backends` command; for example, the isAlive field is false, or the LastStartTime is very recent (indicating that it was restarted recently). If a BE is down, go to the node where that BE runs and check the be.out log. If the BE went down abnormally, the exception stack is usually printed in be.out to help troubleshoot the problem. If there is no error stack in be.out, you can use the Linux command `dmesg -T` to check whether the process was killed by the system because of OOM.
If no BE node is down, run the `show tablet 110309738` statement and then execute the `show proc` statement returned in the result to check the status of each replica of the tablet for further investigation.
### Q6. Calling Stream Load to import data through a Java program may result in a Broken Pipe error when a batch of data is large.
Apart from Broken Pipe, some other weird errors may occur.
This situation usually occurs after httpv2 is enabled. httpv2 is an HTTP service implemented with Spring Boot, which uses Tomcat as the default embedded container. Tomcat appears to have some problems handling 307 redirects, so the embedded container was later changed to Jetty. In addition, the Apache HttpClient version used in the Java program should be 4.5.13 or later; earlier versions also had problems handling redirects.
So this problem can be solved in two ways:
1. Disable httpv2
Restart FE after adding enable_http_server_v2=false in fe.conf. However, the new version of the web UI can no longer be used, and some new interfaces based on httpv2 become unavailable (normal imports and queries are not affected).
2. Upgrade
Upgrading to Doris 0.15 and later has fixed this issue.
### Q7. Error -214 is reported when importing and querying
When performing operations such as import, query, etc., you may encounter the
following errors:
failed to initialize storage reader. tablet=63416.1050661139.aa4d304e7a7aff9c-f0fa7579928c85a0, res=-214, backend=192.168.100.10
A -214 error means that a data version of the corresponding tablet is missing. For example, the above error indicates that the replica of tablet 63416 on the BE 192.168.100.10 is missing a data version. (There may be other similar error codes, which can be checked and repaired in the following ways.)
Typically, if your data has multiple replicas, the system will automatically repair the problematic replicas. You can troubleshoot with the following steps:
First, check the status of each replica of the corresponding tablet by executing the `show tablet 63416` statement and then executing the `show proc xxx` statement returned in the result. Usually we need to look at the `Version` column. Normally, the Version of all replicas of a tablet should be the same, and the same as the VisibleVersion of the corresponding partition.
You can view the corresponding partition version with `show partitions from tblx` (the partition corresponding to the tablet can be obtained from the `show tablet` statement).
At the same time, you can also visit the URL in the CompactionStatus column in
the `show proc` statement (just open it in a browser) to view more specific
version information to check which versions are missing.
If there is no automatic repair for a long time, you need to use the `show
proc "/cluster_balance"` statement to view the tablet repair and scheduling
tasks currently being executed by the system. It may be because there are a
large number of tablets waiting to be scheduled, resulting in a longer repair
time. You can follow records in `pending_tablets` and `running_tablets`.
Further, you can use the `admin repair` statement to specify a table or
partition to be repaired first. For details, please refer to `help admin
repair`;
If it still cannot be repaired, then in the case of multiple replicas, you can use the `admin set replica status` command to force the replica in question offline. For details, see the example of setting the replica status to bad in `help admin set replica status`. (After being set to bad, the replica will no longer be accessed and will be automatically repaired later. Before doing this, make sure the other replicas are normal.)
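Putting the statements above together, a hedged sketch of the troubleshooting sequence (the tablet ID, backend ID, table, and partition names are placeholders modeled on the examples in this section):
```sql
-- 1. Check the replica versions of the problematic tablet, then run the
--    SHOW PROC statement returned in its DetailCmd column.
SHOW TABLET 63416;

-- 2. Compare the replica versions with the partition's VisibleVersion.
SHOW PARTITIONS FROM tblx;

-- 3. See whether the tablet is queued in the repair/balance scheduler.
SHOW PROC "/cluster_balance";

-- 4. Ask the system to repair a table or partition with higher priority.
ADMIN REPAIR TABLE tblx PARTITION (p1);

-- 5. Last resort with multiple replicas: mark the bad replica so it is no
--    longer read and gets repaired automatically afterwards.
ADMIN SET REPLICA STATUS
PROPERTIES ("tablet_id" = "63416", "backend_id" = "10003", "status" = "bad");
```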
### Q8. Not connected to 192.168.100.1:8060 yet, server_id=384
We may encounter this error when importing or querying. If you go to the
corresponding BE log, you may also find similar errors.
This is an RPC error, and there are usually two possibilities: 1. The
corresponding BE node is down. 2. rpc congestion or other errors.
If the BE node is down, you need to check the specific downtime reason. Only
the problem of rpc congestion is discussed here.
One case is OVERCROWDED, which means that the rpc source has a large amount of
unsent data that exceeds the threshold. BE has two parameters associated with
it:
1. `brpc_socket_max_unwritten_bytes`: The default value is 1GB. If the unsent data exceeds this value, an error will be reported. This value can be modified appropriately to avoid OVERCROWDED errors. (But this cures the symptoms but not the root cause, and there is still congestion in essence).
2. `tablet_writer_ignore_eovercrowded`: Default is false. If set to true, Doris will ignore OVERCROWDED errors during import. This parameter is mainly to avoid import failure and improve the stability of import.
The second case is that the RPC packet size exceeds max_body_size. This problem may occur if the query contains very large String or Bitmap values. It can be circumvented by modifying the following BE parameter:
`brpc_max_body_size`: default 3GB.
### Q9. [Broker Load] org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe
The error `org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe` occurs during import.
The reason may be that, when importing data from external storage (such as HDFS), there are too many files in the directory and listing them takes too long. The Broker RPC timeout defaults to 10 seconds, so the timeout needs to be increased appropriately.
Modify the `fe.conf` configuration file to add the following parameter, then restart the FE service:
broker_timeout_ms = 10000
# The default is 10 seconds; increase this value appropriately.
### Q10. [Routine Load] ReasonOfStateChanged: ErrorReason{code=errCode = 104, msg='be 10004 abort task with reason: fetch failed due to requested offset not available on the broker: Broker: Offset out of range'}
The reason is that Kafka's log cleanup policy defaults to 7 days. If a Routine Load task is suspended for some reason and not resumed for a long time, the job still remembers the consumption offset recorded before the pause; if Kafka has already cleaned up that offset by the time the task is resumed, this error occurs.
This can be fixed with ALTER ROUTINE LOAD: look up the smallest available offset in Kafka, modify the job's offset with the ALTER ROUTINE LOAD command, and then resume the task:
ALTER ROUTINE LOAD FOR db.tb
FROM kafka
(
"kafka_partitions" = "0",
"kafka_offsets" = "xxx",
"property.group.id" = "xxx"
);
### Q11. ERROR 1105 (HY000): errCode = 2, detailMessage = (192.168.90.91)[CANCELLED][INTERNAL_ERROR]error setting certificate verify locations: CAfile: /etc/ssl/certs/ca-certificates.crt CApath: none
This indicates that the CA certificate bundle on the node is missing or outdated. Install the certificates and create the expected symlink:
yum install -y ca-certificates
ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-certificates.crt
### Q12. create partition failed. partition numbers will exceed limit variable max_auto_partition_num
To prevent accidental creation of too many partitions when importing data for
auto-partitioned tables, we use the FE configuration item
`max_auto_partition_num` to control the maximum number of partitions to be
created automatically for such tables. If you really need to create more
partitions, please modify this config item of FE Master node.
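A minimal sketch of raising the limit at runtime on the FE Master, assuming the config item is mutable in your version (the value is only an example; runtime changes made this way are not persistent, so also add the item to fe.conf if it should survive a restart):
```sql
ADMIN SET FRONTEND CONFIG ("max_auto_partition_num" = "5000");
```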
---
# Source: https://docs.velodb.io/cloud/4.x/best-practice/lakehouse-faq
Version: 4.x
# Data Lakehouse FAQ
## Certificate Issues
1. When querying, an error `curl 77: Problem with the SSL CA cert.` occurs. This indicates that the current system certificate is too old and needs to be updated locally.
* You can download the latest CA certificate from `https://curl.haxx.se/docs/caextract.html`.
* Place the downloaded `cacert-xxx.pem` into the `/etc/ssl/certs/` directory, for example: `sudo cp cacert-xxx.pem /etc/ssl/certs/ca-certificates.crt`.
2. When querying, an error occurs: `ERROR 1105 (HY000): errCode = 2, detailMessage = (x.x.x.x)[CANCELLED][INTERNAL_ERROR]error setting certificate verify locations: CAfile: /etc/ssl/certs/ca-certificates.crt CApath: none`.
yum install -y ca-certificates
ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-certificates.crt
## Kerberos
1. When connecting to a Hive Metastore authenticated with Kerberos, an error `GSS initiate failed` is encountered.
This is usually due to incorrect Kerberos authentication information. You can
troubleshoot by following these steps:
1. In versions prior to 1.2.1, the libhdfs3 library that Doris depends on did not enable gsasl. Please update to versions 1.2.2 and later.
2. Ensure that correct keytab and principal are set for each component and verify that the keytab file exists on all FE and BE nodes.
* `hadoop.kerberos.keytab`/`hadoop.kerberos.principal`: Used for Hadoop hdfs access, fill in the corresponding values for hdfs.
* `hive.metastore.kerberos.principal`: Used for hive metastore.
3. Try replacing the IP in the principal with a domain name (do not use the default `_HOST` placeholder).
4. Ensure that the `/etc/krb5.conf` file exists on all FE and BE nodes.
2. When connecting to a Hive database through the Hive Catalog, an error occurs: `RemoteException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`.
If the error occurs during the query when there are no issues with `show
databases` and `show tables`, follow these two steps:
* Place core-site.xml and hdfs-site.xml in the fe/conf and be/conf directories.
* Execute Kerberos kinit on the BE node, restart BE, and then proceed with the query.
When encountering the error `GSSException: No valid credentials provided
(Mechanism level: Failed to find any Kerberos Ticket)` while querying a table
configured with Kerberos, restarting FE and BE nodes usually resolves the
issue.
* Before restarting all nodes, configure `-Djavax.security.auth.useSubjectCredsOnly=false` in the JAVA_OPTS parameter in `"${DORIS_HOME}/be/conf/be.conf"` to obtain JAAS credentials information through the underlying mechanism rather than the application.
* Refer to [JAAS Troubleshooting](https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/Troubleshooting.html) for solutions to common JAAS errors.
To resolve the error `Unable to obtain password from user` when configuring
Kerberos in the Catalog:
* Ensure the principal used is listed in klist by checking with `klist -kt your.keytab`.
* Verify the catalog configuration for any missing settings such as `yarn.resourcemanager.principal`.
* If the above checks are fine, it may be due to the JDK version installed by the system's package manager not supporting certain encryption algorithms. Consider installing JDK manually and setting the `JAVA_HOME` environment variable.
* Kerberos typically uses AES-256 for encryption. For Oracle JDK, JCE must be installed. Some distributions of OpenJDK automatically provide unlimited strength JCE, eliminating the need for separate installation.
* JCE versions correspond to JDK versions; download the appropriate JCE zip package and extract it to the `$JAVA_HOME/jre/lib/security` directory based on the JDK version:
* JDK6: [JCE6](http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html)
* JDK7: [JCE7](http://www.oracle.com/technetwork/java/embedded/embedded-se/downloads/jce-7-download-432124.html)
* JDK8: [JCE8](http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html)
When encountering the error `java.security.InvalidKeyException: Illegal key
size` while accessing HDFS with KMS, upgrade the JDK version to >= Java 8 u162
or install the corresponding JCE Unlimited Strength Jurisdiction Policy Files.
If configuring Kerberos in the Catalog results in the error `SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`, place the `core-site.xml` file in the `"${DORIS_HOME}/be/conf"` directory.
If accessing HDFS results in the error `No common protection layer between
client and server`, ensure that the `hadoop.rpc.protection` properties on the
client and server are consistent.
hadoop.security.authentication = kerberos
When using Broker Load with Kerberos configured and encountering the error
`Cannot locate default realm.`:
Add the configuration item `-Djava.security.krb5.conf=/your-path` to the
`JAVA_OPTS` in the `start_broker.sh` script for Broker Load.
3. When using Kerberos configuration in the Catalog, the `hadoop.username` property cannot be used simultaneously.
4. Accessing Kerberos with JDK 17
When running Doris with JDK 17 and accessing Kerberos services, you may
encounter issues accessing due to the use of deprecated encryption algorithms.
You need to add the `allow_weak_crypto=true` property in krb5.conf or upgrade
the encryption algorithm in Kerberos.
## JDBC Catalog
1. Error connecting to SQLServer via JDBC Catalog: `unable to find valid certification path to requested target`
Add the `trustServerCertificate=true` option in the `jdbc_url`.
2. Connecting to MySQL database via JDBC Catalog results in Chinese character garbling or incorrect Chinese character query conditions
Add `useUnicode=true&characterEncoding=utf-8` in the `jdbc_url`.
> Note: Starting from version 1.2.3, when connecting to MySQL database via
> JDBC Catalog, these parameters will be automatically added.
3. Error connecting to MySQL database via JDBC Catalog: `Establishing SSL connection without server's identity verification is not recommended`
Add `useSSL=true` in the `jdbc_url`.
4. When synchronizing MySQL data to Doris using the JDBC Catalog, if date values are synchronized incorrectly, verify that the MySQL driver package matches the MySQL version; for example, MySQL 8 and above requires the driver com.mysql.cj.jdbc.Driver. A consolidated `jdbc_url` example covering the options above is sketched after this list.
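A consolidated sketch of a MySQL JDBC Catalog with the `jdbc_url` options from the items above; the host, database, credentials, and driver file name are placeholders:
```sql
CREATE CATALOG mysql_jdbc PROPERTIES (
    "type" = "jdbc",
    "user" = "example_user",
    "password" = "example_password",
    -- useSSL / useUnicode / characterEncoding cover items 2 and 3 above
    "jdbc_url" = "jdbc:mysql://127.0.0.1:3306/demo?useSSL=true&useUnicode=true&characterEncoding=utf-8",
    "driver_url" = "mysql-connector-java-8.0.25.jar",
    "driver_class" = "com.mysql.cj.jdbc.Driver"
);
```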
## Hive Catalog
1. Accessing Iceberg or Hive table through Hive Catalog reports an error: `failed to get schema` or `Storage schema reading not supported`
You can try the following methods:
* Put the `iceberg` runtime-related jar package in the lib/ directory of Hive.
* Configure in `hive-site.xml`:
metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader
After the configuration is completed, you need to restart the Hive Metastore.
* Add `"get_schema_from_table" = "true"` in the Catalog properties
This parameter is supported since versions 2.1.10 and 3.0.6.
2. Error connecting to Hive Catalog: `Caused by: java.lang.NullPointerException`
If the fe.log contains the following stack trace:
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook.getFilteredObjects(AuthorizationMetaStoreFilterHook.java:78) ~[hive-exec-3.1.3-core.jar:3.1.3]
at org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook.filterDatabases(AuthorizationMetaStoreFilterHook.java:55) ~[hive-exec-3.1.3-core.jar:3.1.3]
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1548) ~[doris-fe.jar:3.1.3]
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1542) ~[doris-fe.jar:3.1.3]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_181]
Try adding `"metastore.filter.hook" =
"org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl"` in the
`create catalog` statement to resolve.
3. If after creating Hive Catalog, `show tables` works fine but querying results in `java.net.UnknownHostException: xxxxx`
Add the following in the CATALOG's PROPERTIES:
'fs.defaultFS' = 'hdfs://'
4. Tables in orc format in Hive 1.x may encounter system column names in the underlying orc file schema as `_col0`, `_col1`, `_col2`, etc. In this case, add `hive.version` as 1.x.x in the catalog configuration to map with the column names in the hive table.
CREATE CATALOG hive PROPERTIES (
'hive.version' = '1.x.x'
);
5. If errors related to the Hive Metastore such as `Invalid method name` are encountered when querying table data through the Catalog, set the `hive.version` parameter.
6. When querying a table in ORC format, if the FE reports `Could not obtain block` or `Caused by: java.lang.NoSuchFieldError: types`, it may be due to the FE accessing HDFS to retrieve file information and perform file splitting by default. In some cases, the FE may not be able to access HDFS. This can be resolved by adding the following parameter: `"hive.exec.orc.split.strategy" = "BI"`. Other options include HYBRID (default) and ETL.
7. In Hive, you can find the partition field values of a Hudi table, but in Doris, you cannot. Doris and Hive currently have different ways of querying Hudi. In Doris, you need to add the partition fields in the avsc file structure of the Hudi table. If not added, Doris will query with partition_val being empty (even if `hoodie.datasource.hive_sync.partition_fields=partition_val` is set).
{
"type": "record",
"name": "record",
"fields": [{
"name": "partition_val",
"type": [
"null",
"string"
],
"doc": "Preset partition field, empty string when not partitioned",
"default": null
},
{
"name": "name",
"type": "string",
"doc": "Name"
},
{
"name": "create_time",
"type": "string",
"doc": "Creation time"
}
]
}
8. When querying a Hive external table, if you encounter the error `java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found`, search for `hadoop-lzo-*.jar` in the Hadoop environment, place it in the `"${DORIS_HOME}/fe/lib/"` directory, and restart the FE. Starting from version 2.0.2, you can place this file in the `custom_lib/` directory of the FE (if it does not exist, create it manually) to prevent file loss when upgrading the cluster due to the lib directory being replaced.
9. When creating a Hive table specifying the serde as `org.apache.hadoop.hive.contrib.serde2.MultiDelimitserDe`, and encountering the error `storage schema reading not supported` when accessing the table, add the following configuration to the hive-site.xml file and restart the HMS service:
metastore.storage.schema.reader.impl = org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader
10. Error: `java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty`. The complete error message in the FE log is as follows:
org.apache.doris.common.UserException: errCode = 2, detailMessage = S3 list path failed. path=s3://bucket/part-*,msg=errors while get file status listStatus on s3://bucket: com.amazonaws.SdkClientException: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
org.apache.doris.common.UserException: errCode = 2, detailMessage = S3 list path exception. path=s3://bucket/part-*, err: errCode = 2, detailMessage = S3 list path failed. path=s3://bucket/part-*,msg=errors while get file status listStatus on s3://bucket: com.amazonaws.SdkClientException: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
org.apache.hadoop.fs.s3a.AWSClientIOException: listStatus on s3://bucket: com.amazonaws.SdkClientException: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
Caused by: javax.net.ssl.SSLException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
Caused by: java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
Caused by: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
Try updating the CA certificate on the FE node using `update-ca-trust
(CentOS/RockyLinux)`, and then restart the FE process.
11. BE error: `java.lang.InternalError`. If you see an error similar to the following in `be.INFO`:
W20240506 15:19:57.553396 266457 jni-util.cpp:259] java.lang.InternalError
at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.init(Native Method)
at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.<init>(ZlibDecompressor.java:114)
at org.apache.hadoop.io.compress.GzipCodec$GzipZlibDecompressor.<init>(GzipCodec.java:229)
at org.apache.hadoop.io.compress.GzipCodec.createDecompressor(GzipCodec.java:188)
at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:183)
at org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.<init>(CodecFactory.java:99)
at org.apache.parquet.hadoop.CodecFactory.createDecompressor(CodecFactory.java:223)
at org.apache.parquet.hadoop.CodecFactory.getDecompressor(CodecFactory.java:212)
at org.apache.parquet.hadoop.CodecFactory.getDecompressor(CodecFactory.java:43)
It is because the Doris built-in `libz.a` conflicts with the system
environment's `libz.so`. To resolve this issue, first execute `export
LD_LIBRARY_PATH=/path/to/be/lib:$LD_LIBRARY_PATH`, and then restart the BE
process.
12. When inserting data into Hive, an error occurred as `HiveAccessControlException Permission denied: user [user_a] does not have [UPDATE] privilege on [database/table]`.
After data is inserted, the corresponding table statistics need to be updated, and this update requires the ALTER privilege. Therefore, the ALTER privilege needs to be granted to this user in Ranger.
## HDFS
1. When accessing HDFS 3.x, if you encounter the error `java.lang.VerifyError: xxx`: in versions prior to 1.2.1, Doris depended on Hadoop 2.8. Update Hadoop to 2.10.2, or upgrade Doris to 1.2.2 or later.
2. Using Hedged Read to optimize slow HDFS reads. In some cases, high load on HDFS may lead to longer read times for data replicas on a specific HDFS, thereby slowing down overall query efficiency. The HDFS Client provides the Hedged Read feature. This feature initiates another read thread to read the same data if a read request exceeds a certain threshold without returning, and the result returned first is used.
Note: This feature may increase the load on the HDFS cluster, so use it
judiciously.
You can enable this feature by:
create catalog regression properties (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.16.47:7004',
'dfs.client.hedged.read.threadpool.size' = '128',
'dfs.client.hedged.read.threshold.millis' = "500"
);
`dfs.client.hedged.read.threadpool.size` represents the number of threads used
for Hedged Read, which are shared by an HDFS Client. Typically, for an HDFS
cluster, BE nodes will share an HDFS Client.
`dfs.client.hedged.read.threshold.millis` is the read threshold in
milliseconds. When a read request exceeds this threshold without returning, a
Hedged Read is triggered.
When enabled, you can see the related parameters in the Query Profile:
`TotalHedgedRead`: Number of times Hedged Read was initiated.
`HedgedReadWins`: Number of successful Hedged Reads (times when the request
was initiated and returned faster than the original request)
Note that these values are cumulative for a single HDFS Client, not for a
single query. The same HDFS Client can be reused by multiple queries.
3. `Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider`
In the start scripts of FE and BE, the environment variable `HADOOP_CONF_DIR`
is added to the CLASSPATH. If `HADOOP_CONF_DIR` is set incorrectly, for example pointing to a non-existent or wrong path, the wrong xxx-site.xml files may be loaded, resulting in reading incorrect information.
Check if `HADOOP_CONF_DIR` is configured correctly or remove this environment
variable.
4. `BlockMissingException: Could not obtain block: BP-XXXXXXXXX No live nodes contain current block`
Possible solutions include:
* Use `hdfs fsck file -files -blocks -locations` to check if the file is healthy.
* Check connectivity with datanodes using `telnet`.
* Check datanode logs.
If you encounter the following error:
`org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to read expected SASL
data transfer protection handshake from client at /XXX.XXX.XXX.XXX:XXXXX.
Perhaps the client is running an older version of Hadoop which does not
support SASL data transfer protection` it means that the current hdfs has
enabled encrypted transmission, but the client has not, causing the error.
Use any of the following solutions:
* Copy hdfs-site.xml and core-site.xml to be/conf and fe/conf directories. (Recommended)
* In hdfs-site.xml, find the corresponding configuration `dfs.data.transfer.protection` and set this parameter in the catalog.
## DLF Catalog
1. When using the DLF Catalog, if `Invalid address` occurs during BE reading JindoFS data, add the domain name appearing in the logs to IP mapping in `/etc/hosts`.
2. If there is no permission to read data, use the `hadoop.username` property to specify a user with permission.
3. The metadata in the DLF Catalog should be consistent with DLF. When managing metadata using DLF, newly imported partitions in Hive may not be synchronized by DLF, leading to inconsistencies between DLF and Hive metadata. To address this, ensure that Hive metadata is fully synchronized by DLF.
## Other Issues
1. Query results in garbled characters after mapping Binary type to Doris
Doris natively does not support the Binary type, so when mapping Binary types
from various data lakes or databases to Doris, it is usually done using the
String type. The String type can only display printable characters. If you
need to query the content of Binary data, you can use the `TO_BASE64()`
function to convert it to Base64 encoding before further processing.
2. Analyzing Parquet files
When querying Parquet files, due to potential differences in the format of
Parquet files generated by different systems, such as the number of RowGroups,
index values, etc., sometimes it is necessary to check the metadata of Parquet
files for issue identification or performance analysis. Here is a tool
provided to help users analyze Parquet files more conveniently:
1. Download and unzip [Apache Parquet Cli 1.14.0](https://github.com/morningman/tools/releases/download/apache-parquet-cli-1.14.0/apache-parquet-cli-1.14.0.tar.xz)
2. Download the Parquet file to be analyzed to your local machine, assuming the path is `/path/to/file.parquet`
3. Use the following command to analyze the metadata of the Parquet file:
`./parquet-tools meta /path/to/file.parquet`
4. For more functionalities, refer to [Apache Parquet Cli documentation](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli)
---
# Source: https://docs.velodb.io/cloud/4.x/best-practice/load-faq
Version: 4.x
# Load FAQ
## General Load FAQ
### Error "[DATA_QUALITY_ERROR] Encountered unqualified data"
**Problem Description** : Data quality error during loading.
**Solution** :
* Stream Load and Insert Into operations will return an error URL, while for Broker Load you can check the error URL through the `Show Load` command (see the sketch after this list).
* Use a browser or curl command to access the error URL to view the specific data quality error reasons.
* Use the strict_mode and max_filter_ratio parameters to control the acceptable error rate.
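For Broker Load, a minimal sketch of locating the error URL mentioned above (the label is a placeholder):
```sql
-- Find the load job and its error URL by label
SHOW LOAD WHERE LABEL = "my_broker_load_label"\G
-- Open the URL shown in the result (in a browser or with curl) to see which
-- rows failed the data-quality checks and why.
```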
### Error "[E-235] Failed to init rowset builder"
**Problem Description** : Error -235 occurs when the load frequency is too
high and data hasn't been compacted in time, exceeding version limits.
**Solution** :
* Increase the batch size of data loading and reduce loading frequency.
* Increase the `max_tablet_version_num` parameter in `be.conf`; it is recommended not to exceed 5000.
### Error "[E-238] Too many segments in rowset"
**Problem Description** : Error -238 occurs when the number of segments under
a single rowset exceeds the limit.
**Common Causes** :
* The bucket number configured during table creation is too small.
* Data skew occurs; consider using more balanced bucket keys.
### Error "Transaction commit successfully, BUT data will be visible later"
**Problem Description** : Data load is successful but temporarily not visible.
**Cause** : Usually due to transaction publish delay caused by system resource
pressure.
### Error "Failed to commit kv txn [...] Transaction exceeds byte limit"
**Problem Description** : In shared-nothing mode, too many partitions and
tablets are involved in a single load, exceeding the transaction size limit.
**Solution** :
* Load data by partition in batches to reduce the number of partitions involved in a single load.
* Optimize table structure to reduce the number of partitions and tablets.
### Extra "\r" in the last column of CSV file
**Problem Description** : Usually caused by Windows line endings.
**Solution** : Specify the correct line delimiter: `-H "line_delimiter:\r\n"`
### CSV data with quotes imported as null
**Problem Description** : CSV data with quotes becomes null after import.
**Solution** : Use the `trim_double_quotes` parameter to remove double quotes
around fields.
## Stream Load
### Reasons for Slow Loading
* Bottlenecks in CPU, IO, memory, or network card resources.
* Slow network between the client machine and the BE machines; this can be initially diagnosed by checking ping latency from the client to the BE machines.
* Webserver thread count bottleneck: too many concurrent Stream Loads on a single BE (exceeding the be.conf webserver_num_workers configuration) may exhaust the webserver threads.
* Memtable flush thread count bottleneck: check the BE metric doris_be_flush_thread_pool_queue_size to see whether queuing is severe. This can be resolved by increasing the be.conf flush_thread_num_per_store parameter.
### Handling Special Characters in Column Names
When column names contain special characters, use single quotes with backticks
to specify the columns parameter:
curl --location-trusted -u root:"" \
-H 'columns:`@coltime`,colint,colvar' \
-T a.csv \
-H "column_separator:," \
http://127.0.0.1:8030/api/db/loadtest/_stream_load
## Routine Load
### Major Bug Fixes
| Issue Description | Trigger Conditions | Impact Scope | Temporary Solution | Affected Versions | Fixed Versions | Fix PR |
|---|---|---|---|---|---|---|
| When at least one job times out while connecting to Kafka, it affects the import of other jobs, slowing down global Routine Load imports. | At least one job times out while connecting to Kafka. | Shared-nothing and shared-storage | Stop or manually pause the job to resolve the issue. | < 2.1.9, < 3.0.5 | 2.1.9, 3.0.5 | [#47530](https://github.com/apache/doris/pull/47530) |
| User data may be lost after restarting the FE Master. | The job's offset is set to OFFSET_END, and the FE is restarted. | Shared-storage | Change the consumption mode to OFFSET_BEGINNING. | 3.0.2-3.0.4 | 3.0.5 | [#46149](https://github.com/apache/doris/pull/46149) |
| A large number of small transactions are generated during import, causing compaction to fail and resulting in continuous -235 errors. | Doris consumes data too quickly, or Kafka data flow is in small batches. | Shared-nothing and shared-storage | Pause the Routine Load job and execute the following command: `ALTER ROUTINE LOAD FOR jobname FROM kafka ("property.enable.partition.eof" = "false");` | < 2.1.8, < 3.0.4 | 2.1.8, 3.0.4 | [#45528](https://github.com/apache/doris/pull/45528), [#44949](https://github.com/apache/doris/pull/44949), [#39975](https://github.com/apache/doris/pull/39975) |
| Kafka third-party library destructor hangs, causing data consumption to fail. | Kafka topic deletion (possibly other conditions). | Shared-nothing and shared-storage | Restart all BE nodes. | < 2.1.8, < 3.0.4 | 2.1.8, 3.0.4 | [#44913](https://github.com/apache/doris/pull/44913) |
| Routine Load scheduling hangs. | Timeout occurs when FE aborts a transaction in Meta Service. | Shared-storage | Restart the FE node. | < 3.0.2 | 3.0.2 | [#41267](https://github.com/apache/doris/pull/41267) |
| Routine Load restart issue. | Restarting BE nodes. | Shared-nothing and shared-storage | Manually resume the job. | < 2.1.7, < 3.0.2 | 2.1.7, 3.0.2 | [#41134](https://github.com/apache/doris/pull/41134) |
### Default Configuration Optimizations
| Optimization Content | Applied Versions | Corresponding PR |
|---|---|---|
| Increased the timeout duration for Routine Load. | 2.1.7, 3.0.3 | [#42042](https://github.com/apache/doris/pull/42042), [#40818](https://github.com/apache/doris/pull/40818) |
| Adjusted the default value of `max_batch_interval`. | 2.1.8, 3.0.3 | [#42491](https://github.com/apache/doris/pull/42491) |
| Removed the restriction on `max_batch_interval`. | 2.1.5, 3.0.0 | [#29071](https://github.com/apache/doris/pull/29071) |
| Adjusted the default values of `max_batch_rows` and `max_batch_size`. | 2.1.5, 3.0.0 | [#36632](https://github.com/apache/doris/pull/36632) |
### Observability Optimizations
| Optimization Content | Applied Versions | Corresponding PR |
|---|---|---|
| Added observability-related metrics. | 3.0.5 | [#48209](https://github.com/apache/doris/pull/48209), [#48171](https://github.com/apache/doris/pull/48171), [#48963](https://github.com/apache/doris/pull/48963) |
### Error "failed to get latest offset"
**Problem Description** : Routine Load cannot get the latest Kafka offset.
**Common Causes** :
* Usually due to network connectivity issues with Kafka. Verify by pinging or using telnet to test the Kafka domain name.
* Timeout caused by third-party library bug, error: java.util.concurrent.TimeoutException: Waited X seconds
### Error "failed to get partition meta: Local:'Broker transport failure"
**Problem Description** : Routine Load cannot get Kafka Topic Partition Meta.
**Common Causes** :
* Usually due to network connectivity issues with Kafka. Verify by pinging or using telnet to test the Kafka domain name.
* If using domain names, try configuring domain name mapping in /etc/hosts
### Error "Broker: Offset out of range"
**Problem Description** : The consumed offset doesn't exist in Kafka, possibly
because it has been cleaned up by Kafka.
**Solution** :
* You need to specify a new offset for consumption, for example set the offset to OFFSET_BEGINNING (see the sketch after this list).
* Need to set appropriate Kafka log cleanup parameters based on import speed: log.retention.hours, log.retention.bytes, etc.
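A sketch of resetting the offset and resuming the job, following the same ALTER ROUTINE LOAD pattern shown in the Data Operation FAQ above (the database, job name, and partition are placeholders):
```sql
PAUSE ROUTINE LOAD FOR example_db.example_job;

ALTER ROUTINE LOAD FOR example_db.example_job
FROM kafka (
    "kafka_partitions" = "0",
    "kafka_offsets" = "OFFSET_BEGINNING"
);

RESUME ROUTINE LOAD FOR example_db.example_job;
```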
---
# Source: https://docs.velodb.io/cloud/4.x/best-practice/sql-faq
Version: 4.x
# SQL Error
### Q1. The information shown by show backends/frontends is incomplete
After executing certain statements such as `show backends/frontends`, some
columns may be found to be incomplete in the results. For example, the disk
capacity information cannot be seen in the show backends result.
Usually this problem occurs when the cluster has multiple FEs. If users
connect to non-Master FE nodes to execute these statements, they will see
incomplete information. This is because some information exists only on the
Master FE node. For example, BE's disk usage information, etc. Therefore,
complete information can only be obtained after a direct connection to the
Master FE.
Of course, users can also execute `set forward_to_master=true;` before
executing these statements. After the session variable is set to true, some
information viewing statements executed subsequently will be automatically
forwarded to the Master FE to obtain the results. In this way, no matter which
FE the user is connected to, the complete result can be obtained.
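For example, a minimal sketch of this session-level switch:
```sql
SET forward_to_master = true;
-- Subsequent information statements are forwarded to the Master FE, so the
-- result now includes Master-only details such as BE disk usage.
SHOW BACKENDS;
```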
### Q2. invalid cluster id: xxxx
This error may appear in the results of the show backends or show frontends
commands. Usually appears in the error message column of an FE or BE node. The
meaning of this error is that after the Master FE sends the heartbeat
information to the node, the node finds that the cluster id carried in the
heartbeat information is different from the cluster id stored locally, so it
refuses to respond to the heartbeat.
The Master FE node of Doris will actively send heartbeats to each FE or BE
node, and will carry a cluster_id in the heartbeat information. cluster_id is
the unique cluster ID generated by the Master FE when a cluster is
initialized. When the FE or BE receives the heartbeat information for the
first time, the cluster_id will be saved locally in the form of a file. The
file of FE is in the image/ directory of the metadata directory, and the BE
has a cluster_id file in all data directories. After that, each time the node
receives the heartbeat, it will compare the content of the local cluster_id
with the content in the heartbeat. If it is inconsistent, it will refuse to
respond to the heartbeat.
This mechanism is a node authentication mechanism to prevent receiving false
heartbeat messages sent by nodes outside the cluster.
To recover from this error, first make sure that all nodes belong to the correct cluster. After that, for an FE node, you can try to modify the cluster_id value in the image/VERSION file in the metadata directory and restart the FE. For a BE node, you can delete the cluster_id files in all data directories and restart the BE.
### Q3. Unique Key model query results are inconsistent
In some cases, when a user runs the same SQL query against a table using the Unique Key model, the results of repeated queries may be inconsistent, typically alternating among two or three variants.
This may be because, in the same batch of imported data, there are data with
the same key but different values, which will lead to inconsistent results
between different replicas due to the uncertainty of the sequence of data
overwriting.
For example, the table is defined as k1, v1. A batch of imported data is as
follows:
1, "abc"
1, "def"
Then maybe the result of copy 1 is `1, "abc"`, and the result of copy 2 is `1,
"def"`. As a result, the query results are inconsistent.
To ensure that the data sequence between different replicas is unique, you can
refer to the [Sequence Column](/cloud/4.x/user-guide/data-
modification/update/update-of-unique-model) function.
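As a hedged sketch (table and column names are hypothetical), a Unique Key table can declare a sequence column so that, within a batch, the row with the larger sequence value wins regardless of arrival order:
```sql
CREATE TABLE example_db.user_latest (
    k1          INT,
    v1          VARCHAR(64),
    update_time DATETIME
)
UNIQUE KEY(k1)
DISTRIBUTED BY HASH(k1) BUCKETS 1
-- rows with the same key are merged according to the larger update_time
PROPERTIES (
    "replication_num" = "1",
    "function_column.sequence_col" = "update_time"
);
```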
### Q4. The problem of querying bitmap/hll type data returns NULL
In version 1.1.x, with vectorization enabled, querying a bitmap/HLL type field may return NULL results. To work around this:
1. First you have to `set return_object_data_as_binary=true;`
2. Turn off vectorization `set enable_vectorized_engine=false;`
3. Turn off SQL cache `set [global] enable_sql_cache = false;`
This is because, in the vectorized execution engine, when the input of a bitmap/HLL column is all NULL, the output is also NULL instead of 0.
### Q5. Error when accessing object storage: curl 77: Problem with the SSL CA cert
If the `curl 77: Problem with the SSL CA cert` error appears in the be.INFO log, you can try to resolve it in the following way:
1. Download the certificate at : cacert.pem
2. Copy the certificate to the specified location: `sudo cp /tmp/cacert.pem /etc/ssl/certs/ca-certificates.crt`
3. Restart the BE node.
### Q7. import error:"Message": "[INTERNAL_ERROR]single replica load is
disabled on BE."
1. Make sure this parameters `enable_single_replica_load` in be.conf is set true
2. Restart the BE node.
---
# Source: https://docs.velodb.io/cloud/4.x/ecosystem/observability/logstash
Version: 4.x
# Logstash Doris output plugin
## Introduction
Logstash is a log ETL framework (collect, preprocess, send to storage systems)
that supports custom output plugins to write data into storage systems. The
Logstash Doris output plugin is a plugin for outputting data to Doris.
The Logstash Doris output plugin calls the [Doris Stream
Load](/cloud/4.x/user-guide/data-operate/import/import-way/stream-load-manual)
HTTP interface to write data into Doris in real-time, offering capabilities
such as multi-threaded concurrency, failure retries, custom Stream Load
formats and parameters, and reporting of the write speed.
Using the Logstash Doris output plugin mainly involves three steps:
1. Install the plugin into Logstash
2. Configure the Doris output address and other parameters
3. Start Logstash to write data into Doris in real-time
## Installation
### Obtaining the Plugin
You can download the plugin from the official website or compile it from the
source code yourself.
* Download from the official website
* Installation package without dependencies
* Compile from source code
cd extension/logstash/
gem build logstash-output-doris.gemspec
### Installing the Plugin
* Standard Installation
`${LOGSTASH_HOME}` is the installation directory of Logstash. Run the
`bin/logstash-plugin` command under it to install the plugin.
${LOGSTASH_HOME}/bin/logstash-plugin install logstash-output-doris-1.2.0.gem
Validating logstash-output-doris-1.2.0.gem
Installing logstash-output-doris
Installation successful
The standard installation mode automatically installs the Ruby modules that the plugin depends on. If the network is unavailable, the installation will get stuck and cannot complete. In that case, you can download the zip installation package that bundles the dependencies for a completely offline installation, noting that you need to use `file://` to specify the local file system.
* Offline Installation
${LOGSTASH_HOME}/bin/logstash-plugin install file:///tmp/logstash-output-doris-1.2.0.zip
Installing file: logstash-output-doris-1.2.0.zip
Resolving dependencies.........................
Install successful
## Configuration
The configuration for the Logstash Doris output plugin is as follows:
| Configuration | Description |
|---|---|
| `http_hosts` | Stream Load HTTP address, formatted as a string array, can have one or more elements, each element is host:port. For example: `["http://fe1:8030", "http://fe2:8030"]` |
| `user` | Doris username; this user needs to have import permissions for the corresponding Doris database and table |
| `password` | Password for the Doris user |
| `db` | The Doris database name to write into |
| `table` | The Doris table name to write into |
| `label_prefix` | Doris Stream Load Label prefix; the final generated Label is `{label_prefix}_{db}_{table}_{yyyymmdd_hhmmss}_{uuid}`, the default value is logstash |
| `headers` | Doris Stream Load headers parameter; the syntax format is a Ruby map, for example: `headers => { "format" => "json", "read_json_by_line" => "true" }` |
| `mapping` | Mapping from Logstash fields to Doris table fields; refer to the usage examples in the subsequent sections |
| `message_only` | A special form of mapping, only outputs the Logstash @message field to Doris; default is false |
| `max_retries` | Number of retries for Doris Stream Load requests on failure; default is -1 for infinite retries to ensure data reliability |
| `log_request` | Whether to output Doris Stream Load request and response metadata in logs for troubleshooting; default is false |
| `log_speed_interval` | Time interval for outputting speed in logs, in seconds; default is 10, and setting it to 0 disables this type of logging |
## Usage Example
### TEXT Log Collection Example
This example demonstrates TEXT log collection using Doris FE logs as an
example.
**1\. Data**
FE log files are typically located at the fe/log/fe.log file under the Doris
installation directory. They are typical Java program logs, including fields
such as timestamp, log level, thread name, code location, and log content. Not
only do they contain normal logs, but also exception logs with stacktraces,
which are multiline. Log collection and storage need to combine the main log
and stacktrace into a single log entry.
2024-07-08 21:18:01,432 INFO (Statistics Job Appender|61) [StatisticsJobAppender.runAfterCatalogReady():70] Stats table not available, skip
2024-07-08 21:18:53,710 WARN (STATS_FETCH-0|208) [StmtExecutor.executeInternalQuery():3332] Failed to run internal SQL: OriginStatement{originStmt='SELECT * FROM __internal_schema.column_statistics WHERE part_id is NULL ORDER BY update_time DESC LIMIT 500000', idx=0}
org.apache.doris.common.UserException: errCode = 2, detailMessage = tablet 10031 has no queryable replicas. err: replica 10032's backend 10008 does not exist or not alive
at org.apache.doris.planner.OlapScanNode.addScanRangeLocations(OlapScanNode.java:931) ~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.planner.OlapScanNode.computeTabletInfo(OlapScanNode.java:1197) ~[doris-fe.jar:1.2-SNAPSHOT]
**2\. Table Creation**
The table structure includes fields such as the log's creation time,
collection time, hostname, log file path, log type, log level, thread name,
code location, and log content.
CREATE TABLE `doris_log` (
`log_time` datetime NULL COMMENT 'log content time',
`collect_time` datetime NULL COMMENT 'log agent collect time',
`host` text NULL COMMENT 'hostname or ip',
`path` text NULL COMMENT 'log file path',
`type` text NULL COMMENT 'log type',
`level` text NULL COMMENT 'log level',
`thread` text NULL COMMENT 'log thread',
`position` text NULL COMMENT 'log code position',
`message` text NULL COMMENT 'log message',
INDEX idx_host (`host`) USING INVERTED COMMENT '',
INDEX idx_path (`path`) USING INVERTED COMMENT '',
INDEX idx_type (`type`) USING INVERTED COMMENT '',
INDEX idx_level (`level`) USING INVERTED COMMENT '',
INDEX idx_thread (`thread`) USING INVERTED COMMENT '',
INDEX idx_position (`position`) USING INVERTED COMMENT '',
INDEX idx_message (`message`) USING INVERTED PROPERTIES("parser" = "unicode", "support_phrase" = "true") COMMENT ''
) ENGINE=OLAP
DUPLICATE KEY(`log_time`)
COMMENT 'OLAP'
PARTITION BY RANGE(`log_time`) ()
DISTRIBUTED BY RANDOM BUCKETS 10
PROPERTIES (
"replication_num" = "1",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.start" = "-7",
"dynamic_partition.end" = "1",
"dynamic_partition.prefix" = "p",
"dynamic_partition.buckets" = "10",
"dynamic_partition.create_history_partition" = "true",
"compaction_policy" = "time_series"
);
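Once logs have been loaded, the inverted indexes defined above support fast filtering and full-text search on the message field. A minimal illustrative query (not from the original document; adjust the filter values to your own data):
SELECT log_time, level, thread, message
FROM doris_log
WHERE level = 'WARN'
  AND message MATCH_ANY 'replicas'
ORDER BY log_time DESC
LIMIT 10;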
**3\. Logstash Configuration**
Logstash mainly has two types of configuration files: one for the entire
Logstash system and another for a specific log collection.
The configuration file for the entire Logstash system is usually located at
config/logstash.yml. To improve performance when writing to Doris, it is
necessary to modify the batch size and batch delay. For logs with an average
size of a few hundred bytes per line, a batch size of 1,000,000 lines and a
batch delay of 10 seconds are recommended.
pipeline.batch.size: 1000000
pipeline.batch.delay: 10000
The configuration file for a specific log collection, such as
logstash_doris_log.conf, mainly consists of three parts corresponding to the
various stages of ETL:
1. Input is responsible for reading the raw data.
2. Filter is responsible for data transformation.
3. Output is responsible for sending the data to the output destination.
# 1. input is responsible for reading raw data
# File input is an input plugin that can be configured to read the log file of the configured path. It uses the multiline codec to concatenate lines that do not start with a timestamp to the end of the previous line, achieving the effect of merging stacktraces with the main log. File input saves the log content in the @message field, and there are also some metadata fields such as host, log.file.path. Here, we manually add a field named type through add_field, with its value set to fe.log.
input {
file {
path => "/mnt/disk2/xiaokang/opt/doris_master/fe/log/fe.log"
add_field => {"type" => "fe.log"}
codec => multiline {
# valid line starts with timestamp
pattern => "^%{TIMESTAMP_ISO8601} "
# any line not starting with a timestamp should be merged with the previous line
negate => true
what => "previous"
}
}
}
# 2. filter section is responsible for data transformation
# grok is a commonly used data transformation plugin that has some built-in patterns, such as TIMESTAMP_ISO8601 for parsing timestamps, and also supports writing regular expressions to extract fields.
filter {
grok {
match => {
# parse log_time, level, thread, position fields from message
"message" => "%{TIMESTAMP_ISO8601:log_time} (?[A-Z]+) \((?[^\[]*)\) \[(?[^\]]*)\]"
}
}
}
# 3. output section is responsible for data output
# Doris output sends data to Doris using the Stream Load HTTP interface. The data format for Stream Load is specified as JSON through the headers parameter, and the mapping parameter specifies the mapping from Logstash fields to JSON fields. Since headers specify "format" => "json", Stream Load will automatically parse the JSON fields and write them into the corresponding fields of the Doris table.
output {
doris {
http_hosts => ["http://localhost:8630"]
user => "root"
password => ""
db => "log_db"
table => "doris_log"
headers => {
"format" => "json"
"read_json_by_line" => "true"
"load_to_single_tablet" => "true"
}
mapping => {
"log_time" => "%{log_time}"
"collect_time" => "%{@timestamp}"
"host" => "%{[host][name]}"
"path" => "%{[log][file][path]}"
"type" => "%{type}"
"level" => "%{level}"
"thread" => "%{thread}"
"position" => "%{position}"
"message" => "%{message}"
}
log_request => true
}
}
**4\. Running Logstash**
${LOGSTASH_HOME}/bin/logstash -f config/logstash_doris_log.conf
# When log_request is set to true, the log will output the request parameters and response results of each Stream Load.
[2024-07-08T22:35:34,772][INFO ][logstash.outputs.doris ][main][e44d2a24f17d764647ce56f5fed24b9bbf08d3020c7fddcc3298800daface80a] doris stream load response:
{
"TxnId": 45464,
"Label": "logstash_log_db_doris_log_20240708_223532_539_6c20a0d1-dcab-4b8e-9bc0-76b46a929bd1",
"Comment": "",
"TwoPhaseCommit": "false",
"Status": "Success",
"Message": "OK",
"NumberTotalRows": 452,
"NumberLoadedRows": 452,
"NumberFilteredRows": 0,
"NumberUnselectedRows": 0,
"LoadBytes": 277230,
"LoadTimeMs": 1797,
"BeginTxnTimeMs": 0,
"StreamLoadPutTimeMs": 18,
"ReadDataTimeMs": 9,
"WriteDataTimeMs": 1758,
"CommitAndPublishTimeMs": 18
}
# By default, speed information is logged every 10 seconds, including the amount of data since startup (in MB and ROWS), the total speed (in MB/s and R/s), and the speed in the last 10 seconds.
[2024-07-08T22:35:38,285][INFO ][logstash.outputs.doris ][main] total 11 MB 18978 ROWS, total speed 0 MB/s 632 R/s, last 10 seconds speed 1 MB/s 1897 R/s
### JSON Log Collection Example
This example demonstrates JSON log collection using data from the GitHub
events archive.
**1\. Data**
The GitHub events archive contains archived data of GitHub user actions,
formatted as JSON. It can be downloaded from
[here](https://data.gharchive.org/), for example, the data for January 1,
2024, at 3 PM.
wget https://data.gharchive.org/2024-01-01-15.json.gz
Below is a sample of the data. Normally, each piece of data is on a single
line, but for ease of display, it has been formatted here.
{
"id": "37066529221",
"type": "PushEvent",
"actor": {
"id": 46139131,
"login": "Bard89",
"display_login": "Bard89",
"gravatar_id": "",
"url": "https://api.github.com/users/Bard89",
"avatar_url": "https://avatars.githubusercontent.com/u/46139131?"
},
"repo": {
"id": 780125623,
"name": "Bard89/talk-to-me",
"url": "https://api.github.com/repos/Bard89/talk-to-me"
},
"payload": {
"repository_id": 780125623,
"push_id": 17799451992,
"size": 1,
"distinct_size": 1,
"ref": "refs/heads/add_mvcs",
"head": "f03baa2de66f88f5f1754ce3fa30972667f87e81",
"before": "85e6544ede4ae3f132fe2f5f1ce0ce35a3169d21"
},
"public": true,
"created_at": "2024-04-01T23:00:00Z"
}
**2\. Table Creation**
CREATE DATABASE log_db;
USE log_db;
CREATE TABLE github_events
(
`created_at` DATETIME,
`id` BIGINT,
`type` TEXT,
`public` BOOLEAN,
`actor.id` BIGINT,
`actor.login` TEXT,
`actor.display_login` TEXT,
`actor.gravatar_id` TEXT,
`actor.url` TEXT,
`actor.avatar_url` TEXT,
`repo.id` BIGINT,
`repo.name` TEXT,
`repo.url` TEXT,
`payload` TEXT,
`host` TEXT,
`path` TEXT,
INDEX `idx_id` (`id`) USING INVERTED,
INDEX `idx_type` (`type`) USING INVERTED,
INDEX `idx_actor.id` (`actor.id`) USING INVERTED,
INDEX `idx_actor.login` (`actor.login`) USING INVERTED,
INDEX `idx_repo.id` (`repo.id`) USING INVERTED,
INDEX `idx_repo.name` (`repo.name`) USING INVERTED,
INDEX `idx_host` (`host`) USING INVERTED,
INDEX `idx_path` (`path`) USING INVERTED,
INDEX `idx_payload` (`payload`) USING INVERTED PROPERTIES("parser" = "unicode", "support_phrase" = "true")
)
ENGINE = OLAP
DUPLICATE KEY(`created_at`)
PARTITION BY RANGE(`created_at`) ()
DISTRIBUTED BY RANDOM BUCKETS 10
PROPERTIES (
"replication_num" = "1",
"compaction_policy" = "time_series",
"enable_single_replica_compaction" = "true",
"dynamic_partition.enable" = "true",
"dynamic_partition.create_history_partition" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.start" = "-30",
"dynamic_partition.end" = "1",
"dynamic_partition.prefix" = "p",
"dynamic_partition.buckets" = "10",
"dynamic_partition.replication_num" = "1"
);
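After data is loaded by Logstash, a simple aggregation can verify the result and exercise the inverted indexes. A minimal illustrative query (not from the original document):
SELECT `repo.name`, COUNT(*) AS push_count
FROM github_events
WHERE `type` = 'PushEvent'
GROUP BY `repo.name`
ORDER BY push_count DESC
LIMIT 10;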
**3\. Logstash Configuration**
The configuration file differs from the previous TEXT log collection in the
following aspects:
1. The codec parameter for file input is json. Logstash will parse each line of text as JSON format and use the parsed fields for subsequent processing.
2. No filter plugin is used because no additional processing or transformation is needed.
input {
file {
path => "/tmp/github_events/2024-04-01-23.json"
codec => json
}
}
output {
doris {
http_hosts => ["http://fe1:8630", "http://fe2:8630", "http://fe3:8630"]
user => "root"
password => ""
db => "log_db"
table => "github_events"
headers => {
"format" => "json"
"read_json_by_line" => "true"
"load_to_single_tablet" => "true"
}
mapping => {
"created_at" => "%{created_at}"
"id" => "%{id}"
"type" => "%{type}"
"public" => "%{public}"
"actor.id" => "%{[actor][id]}"
"actor.login" => "%{[actor][login]}"
"actor.display_login" => "%{[actor][display_login]}"
"actor.gravatar_id" => "%{[actor][gravatar_id]}"
"actor.url" => "%{[actor][url]}"
"actor.avatar_url" => "%{[actor][avatar_url]}"
"repo.id" => "%{[repo][id]}"
"repo.name" => "%{[repo][name]}"
"repo.url" => "%{[repo][url]}"
"payload" => "%{[payload]}"
"host" => "%{[host][name]}"
"path" => "%{[log][file][path]}"
}
log_request => true
}
}
**4\. Running Logstash**
${LOGSTASH_HOME}/bin/logstash -f logstash_github_events.conf
---
# Source: https://docs.velodb.io/cloud/4.x/getting-started/overview
Version: 4.x
# Introduction
VeloDB Cloud is a new generation of multi-cloud native real-time data
warehouse based on Apache Doris, focusing on meeting the real-time analysis
needs of enterprise-level big data, and providing customers with extremely
cost-effective, easy-to-use data analysis services.
VeloDB Cloud is publicly available to customers. If customers want to deploy
VeloDB data warehouse to AWS (Amazon Web Services), Microsoft Azure, GCP
(Google Cloud Platform), please visit and log in to [VeloDB
Cloud](https://www.velodb.cloud/passport/login).
## Key Features
* **Extreme Performance** : In terms of storage, VeloDB Cloud adopts efficient columnar storage and data indexing; in terms of computing, VeloDB Cloud relies on the MPP distributed computing architecture and a vectorized execution engine optimized for X64 and ARM64; VeloDB Cloud ranks among the global leaders in the ClickBench public performance benchmark.
* **Cost-Effective** : VeloDB Cloud adopts a cloud-native architecture that separates storage and computing, and is designed and developed based on cloud infrastructure. In terms of storage, shared object storage achieves extremely low cost; in terms of computing, VeloDB Cloud supports on-demand scaling and start-stop to maximize resource utilization.
* **Easy-to-Use** : One-click deployment, out-of-the-box; supports MySQL-compatible network connection protocols; provides integrated connectors with Kafka/Flink/Spark/DBT; has a powerful and easy-to-use visual operation and maintenance management console and data development tools.
* **Single-Unified** : On a single product, multiple analytical workloads can be run. Supports real-time/interactive/batch computing types, structured/semi-structured data types, and federated analysis of external data lakes (such as Hive, Iceberg, Hudi, etc.) and databases (such as MySQL, Elasticsearch, etc.).
* **Open** : VeloDB Cloud is developed on top of open source Apache Doris and continues to contribute innovations back to the open source community. VeloDB Cloud is fully compatible with the Apache Doris syntax and protocol, and data can be migrated freely between VeloDB Cloud and Apache Doris. It remains compatible with, and mutually certified against, ecosystem products and tools at home and abroad, and cooperates openly with cloud platforms worldwide, so the product runs on multiple clouds with a consistent user experience.
* **Safe and Stable** : In terms of data security, VeloDB Cloud provides complete authority control, data encryption, backup and recovery mechanisms; in terms of operation and maintenance management, VeloDB Cloud provides comprehensive observability metrics collection and visual management of data warehouse service; in terms of technical support, VeloDB Cloud has a complete ticketing management system and remote assistance platform, providing multiple levels of expert support services.
## Key Concepts

Key Concepts of VeloDB Cloud
* **Organization** : An organization represents an enterprise or a relatively independent group, and users can use the service as an organization after registering with VeloDB Cloud. Organizations are billing and settlement objects in VeloDB Cloud, and billing, resources, and data between different organizations are isolated from each other.
* **Warehouse** : A warehouse is a logical concept that includes computing and storage resources. Each organization can create multiple warehouses to meet the data analysis needs of different businesses, such as orders, advertising, logistics and other businesses. Similarly, resources and data between different warehouses are also isolated from each other, which can be used to meet the security requirements within the organization.
* **Cluster** : A cluster is a computing resource in the warehouse, including one or more computing nodes, which can be elastically scaled. A warehouse can contain multiple clusters, which share the underlying data. Different clusters can meet different workloads, such as statistical reports, interactive analysis, etc., and the workloads between multiple clusters do not interfere with each other.
* **Storage** : Use a mature and stable object storage system to store the full amount of data, and support multi-computing cluster shared storage, which brings extremely low storage cost, high data reliability and almost unlimited storage capacity to the data warehouse, and greatly simplifies the implementation complexity of the upper computing cluster.
## Product Architecture

Cloud-Native Storage and Computing Separation Architecture
* **Cloud Service Layer** : The cloud service layer is a collection of supporting services provided by VeloDB Cloud, including: authentication, access control, cloud infrastructure management, metadata management, query parsing and optimization, etc., expressed in the form of a "warehouse". Warehouses are isolated from each other.
* **Computing Cluster Layer** : The computing layer is decoupled from the storage layer, supporting flexible elastic scaling and smooth upgrades. The computing layer consists of several computing clusters. Multiple computing clusters share storage, and workloads are isolated between multiple clusters. Each cluster contains one or more computing nodes. Computing nodes use high-speed hard disks to build hot data caches (Cache), and avoid unnecessary cold data reading through leading query optimizers and rich indexing technologies, which significantly optimizes the problem of high response delay of object storage, providing customers with the ultimate data analysis performance.
* **Shared Storage Layer** : The bottom layer of VeloDB Cloud uses cheap, highly available, and nearly infinitely scalable object storage as the shared storage layer, and is based on object storage for deep optimization design, which can help customers reduce the cost of data analysis by multiples, and easily support PB-level data analysis needs. The unified standard and maturity of object storage in different cloud environments also strengthens the consistent use experience of VeloDB Cloud in multiple clouds.
## Application Scenario
* **High Concurrent Real-time Reporting and Analysis** : Use VeloDB Cloud to process online high-concurrency reports to obtain real-time, fast, stable, and highly available services. It supports real-time data writing, sub-second query response, and high-concurrency point queries to meet the high-availability deployment requirements of clusters.
* **User Portrait and Behavior Analysis** : Build a layered CDP (Customer Data Platform) data warehouse on VeloDB Cloud. It supports millisecond-level column addition and dynamic tables to respond flexibly to business changes, provides rich behavior-analysis functions that simplify development and improve efficiency, and supports high-performance orthogonal bitmaps for second-level audience selection in user-profiling scenarios.
* **Log Storage and Analysis** : Integrating the VeloDB Cloud data warehouse into the logging system to realize real-time log query, low-cost storage, and efficient processing, reduce the overall cost of the enterprise log system, and improve the performance and reliability of the log system.
* **Lake Warehouse Integration and Federated Analysis** : Unified integration of data lakes, databases, and data warehouses into a single platform, relying on the data federation query acceleration capability of VeloDB Cloud, provides high-performance business intelligence reports, Adhoc analysis, and incremental ETL/ELT data processing services.
## Relationship to Apache Doris
VeloDB Inc ("**VeloDB** ") is a commercial company with products based on
Apache Doris. VeloDB was founded in May 2023 by the founding team of Apache
Doris. VeloDB is an important driving force of Apache Doris. It has 7 PMC
members and 20 Committers, and has led the release of a series of core
versions of Apache Doris. VeloDB vigorously promotes the open source Apache
Doris, the technology benefits open source users and developers, and launches
commercial products based on Apache Doris, the business empowers commercial
customers, and the two-wheel drive achieves healthy growth of open source and
business.
VeloDB Cloud is a new generation of multi-cloud native real-time data
warehouse built by VeloDB based on Apache Doris. Compared with Apache Doris,
VeloDB Cloud has the following main differences:
* The core version is more mature and stable, with more enterprise-level features and cloud-native features.
* Provides a built-in visualized operation and maintenance management console and data development tools; users do not need to install or deploy anything, and it works out of the box with minimal operation, maintenance, and management.
---
# Source: https://docs.velodb.io/cloud/4.x/getting-started/quick-start
Version: 4.x
# Getting Started
## New User Registration and Organization Creation
### Register and Login
Click to enter the VeloDB Cloud registration and
trial page and fill in the relevant information to complete the registration.

> **Tip** VeloDB Cloud includes two independent account systems: One is used
> for logging into the console, as described in this topic. The other one is
> used to connect to the warehouse, which is described in the Connections
> topic.
### Change Password
After login, click **User Menu** > **User Center** to change the login
password for the VeloDB Cloud console.

Once you have successfully changed the password for the first time, you can
use the password for subsequent logins.
## Warehouse and Cluster Creation
In VeloDB Cloud, the warehouse is a logical concept that includes physical
objects such as warehouse metadata, clusters, and data storage.
Under each organization, you can create multiple warehouses to meet the needs
of different business systems, and the resources and data between these
warehouses are isolated.
### Create Warehouse
A wizard page will be displayed if the organization does not have a warehouse.
You can create the first warehouse following the prompts.

You can use a free-tier warehouse or directly purchase a paid warehouse
based on your analytical requirements.
> **Tip:**
>
> 1. For more information about SaaS and BYOC, see [Overview of
> Warehouses](/cloud/4.x/management-guide/warehouse-management/).
> 2. If you need to activate a free BYOC, please refer to [Create a BYOC Warehouse](/cloud/4.x/management-guide/warehouse-management/create-byoc-warehouse).
>
### Create Cluster
If you have activated the trial warehouse, you will see a trial cluster in
that warehouse.
In the trial warehouse, you may try the features by importing small amounts of
data. You may not create paid clusters under the trial warehouse. If you are
happy with the trial experience, you can upgrade the trial warehouse to a paid
one, and then you can create paid clusters under the paid warehouse.
## Change Warehouse Password
The username and password are required when connecting to a warehouse. VeloDB
Cloud initializes the username ('admin') and password for you. You can change
the password on the **Settings** page.

> **Warning** The password may only contain uppercase letters, lowercase letters,
> numbers, and the special characters ~!@#$%^&*()_+|<>,.?/:;'[]". It must include
> at least 3 of these character types and be 8-20 characters long.
## Connect to Warehouse
Click **Query** in the left navigation bar, open the login page, enter the
username and password, and enter the WebUI interface after completing the
login.

### Create Database
Execute the following statement in the query editor:
create database demo;
### Create Data Table
Execute the following statement in the query editor:
use demo;
create table mytable
(
k1 TINYINT,
k2 DECIMAL(10, 2) DEFAULT "10.05",
k3 CHAR(10) COMMENT "string column",
k4 INT NOT NULL DEFAULT "1" COMMENT "int column"
)
COMMENT "my first table"
DISTRIBUTED BY HASH(k1) BUCKETS 1;
You can see the fields of mytable through desc mytable.
### Insert Data
Execute the following statement in the query editor:
INSERT INTO mytable (k1, k2, k3, k4) VALUES
(1, 0.14, 'a1', 20),
(2, 1.04, 'b2', 21),
(3, 3.14, 'c3', 22),
(4, 4.35, 'd4', 23);
### Query Data
The table creation and data import are completed above, and the query can be
performed below.
select * from mytable;

## (Optional) Connect to Warehouse Using MySQL Client
### IP Whitelist Management
On the **Connections** page, switch to the **Public Link** tab to manage IP
whitelist. Click **Add IP Whitelist** to add new IP addresses.

In the IP whitelist, users can add or delete IP addresses to enable or disable
their access to the warehouse.
### MySQL Client
You may download MySQL Client from the official MySQL website. Here we provide an installation-free [MySQL Client](https://doris-build-hk.oss-cn-hongkong.aliyuncs.com/mysql-client/mysql-5.7.22-linux-glibc2.12-x86_64.tar.gz) for Linux. If you need MySQL Client for Mac or Windows, please go to the official MySQL website.
Currently, VeloDB is compatible with MySQL Client 5.7 and above.
You may read details about connections by clicking "Connections" on the target
warehouse on the VeloDB Cloud console.
> Note:
>
> 1. The warehouse supports public network connection and private network
> (PrivateLink) connection. Different connection methods require different
> connection information.
>
> 2. The public network connection is open by default, and the IP whitelist
> is also open to the public by default. If you no longer need to connect to
> the warehouse from the public network, please close it.
>
> 3. For the first connection, please use the user admin and its password.
> You can initialize or reset it in the **Setting** page on VeloDB Cloud
> console.
>
>
Supposing that you are connecting to a warehouse using the following public
link:

Download MySQL Client and unzip the file, then find the `mysql` command line tool
under the `bin/` directory. Execute the following command to connect to VeloDB.
mysql -h 34.199.74.195 -P 33641 -u admin -p
When connecting, if you see the following error, it usually means that your
client IP address has not been added to the connection whitelist on the
console.
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
If the following is displayed, that means the connection succeeds.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 119952
Server version: 5.7.37 VeloDB Core version: 3.0.4
Copyright (c) 2000, 2022, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
### Create Database and Table
#### Create Database
create database demo;
#### Create Table
use demo;
create table mytable
(
k1 TINYINT,
k2 DECIMAL(10, 2) DEFAULT "10.05",
k3 CHAR(10) COMMENT "string column",
k4 INT NOT NULL DEFAULT "1" COMMENT "int column"
)
COMMENT "my first table"
DISTRIBUTED BY HASH(k1) BUCKETS 1;
You may check details of `mytable` via `desc mytable`.
#### Load Data
Save the following sample data in the local data.csv:
1,0.14,a1,20
2,1.04,b2,21
3,3.14,c3,22
4,4.35,d4,23
**Upload data via HTTP protocol** :
curl -u admin:admin_123 -H "fileName:dir1/data.csv" -T data.csv -L '34.199.74.195:39173/copy/upload'
You can upload multiple files by repeating this command.
**Load data by the copy into command:**
curl -u admin:admin_123 -H "Content-Type: application/json" '34.199.74.195:39173/copy/query' -d '{"sql": "copy into demo.mytable from @~(\"dir1/data.csv\") PROPERTIES (\"file.column_separator\"=\",\", \"copy.async\"=\"false\")"}'
`dir1/data.csv` refers to the file uploaded in the previous step. Wildcard and
glob pattern matching are supported here.
The service side can automatically identify general formats such as csv.
`file.column_separator=","` specifies comma as the separator in the csv
format.
Since the copy into command is submitted asynchronously by default,
`"copy.async"="false"` is specified here to make the submission synchronous.
That is, the command only returns after the data has been loaded successfully.
If you see the following response, that means the data are successfully
loaded.
{
"msg": "success",
"code": 0,
"data": {
"result": {
"msg": "",
"loadedRows": "4",
"id": "d33e62f655c4a1a-9827d5561adfb93d",
"state": "FINISHED",
"type": "",
"filterRows": "0",
"unselectRows": "0",
"url": null
},
"time": 5007,
"type": "result_set"
},
"count": 0
}
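For reference, the SQL statement carried in the JSON payload of the curl command above is a plain COPY INTO statement; unescaped, it reads:
COPY INTO demo.mytable
FROM @~("dir1/data.csv")
PROPERTIES ("file.column_separator" = ",", "copy.async" = "false");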
### Query Data
After table creation and data loading, you may execute queries on the data.
mysql> use demo;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> select * from mytable;
+------+------+------+------+
| k1 | k2 | k3 | k4 |
+------+------+------+------+
| 1 | 0.14 | a1 | 20 |
| 2 | 1.04 | b2 | 21 |
| 3 | 3.14 | c3 | 22 |
| 4 | 4.35 | d4 | 23 |
+------+------+------+------+
4 rows in set (0.15 sec)
---
# Source: https://docs.velodb.io/cloud/4.x/integration/bi/tableau
Version: 4.x
# Tableau
VeloDB provides an official Tableau connector. This connector accesses data
based on the MySQL JDBC Driver.
The connector has been tested by the [TDVT
framework](https://tableau.github.io/connector-plugin-sdk/docs/tdvt) with a
100% pass rate.
With this connector, Tableau can integrate Doris databases and tables as data
sources. To enable this, follow the setup guide below:
* Install Tableau and the Doris connector
* Configure a Doris data source in Tableau
* Build visualizations in Tableau
* Connection and usage tips
* Summary
## Install Tableau and Doris connector
1. Download and install [Tableau desktop](https://www.tableau.com/products/desktop/download).
2. Get the [tableau-doris](https://velodb-bi-connector-1316291683.cos.ap-hongkong.myqcloud.com/Tableau/latest/doris_jdbc-latest.taco) custom connector (doris_jdbc-***.taco).
3. Get [MySQL JDBC](https://velodb-bi-connector-1316291683.cos.ap-hongkong.myqcloud.com/Tableau/latest/mysql-connector-j-8.3.0.jar) (version 8.3.0).
4. Locations to place the Connector and JDBC driver:
* MacOS: place the `doris_jdbc-latest.taco` custom connector file under `~/Documents/My Tableau Repository/Connectors` (if the path does not exist, create it manually), and the JDBC driver jar under `~/Library/Tableau/Drivers`.
* Windows: assume `tableau_path` is the Tableau installation directory, which typically defaults to `tableau_path = C:\Program Files\Tableau`. Place the `doris_jdbc-latest.taco` custom connector file under `%tableau_path%\Connectors\` (if the path does not exist, create it manually), and the JDBC driver jar under `%tableau_path%\Drivers\`.
Next, you can configure a Doris data source in Tableau and start building data
visualizations!
## Configure a Doris data source in Tableau
Now that you have installed and set up the **JDBC and Connector** drivers,
let's look at how to define a data source in Tableau that connects to the tpch
database in Doris.
1. Gather your connection details
To connect to Doris via JDBC, you need the following information:
| Parameter | Meaning | Example |
| --- | --- | --- |
| Server | Database host | 127.0.1.28 |
| Port | Database MySQL port | 9030 |
| Catalog | Doris catalog, used when querying external tables and data lakes (set in Advanced) | internal |
| Database | Database name | tpch |
| Authentication | Database authentication method: Username, or Username and Password | Username and Password |
| Username | Username | testuser |
| Password | Password |  |
| Init SQL Statement | Initial SQL statement | `select * from database.table` |
2. Launch Tableau. (If you were already running it before placing the connector, please restart.)
3. From the left menu, click **More** under the **To a Server** section. In the list of available connectors, search for **Doris JDBC by VeloDB** :

4. Click **Doris JDBC by VeloDB**, and the following dialog will pop up:

5. Enter the corresponding connection information as prompted in the dialog.
6. Optional advanced configuration:
* You can enter preset SQL in Initial SQL to define the data source.
* In Advanced, you can set Catalog to access data lake data sources; the default value is internal.
7. After completing the above input fields, click the **Sign In** button, and you should see a new Tableau workbook: 
Next, you can build some visualizations in Tableau!
## Build visualizations in Tableau
We choose TPC-H data as the data source, refer to [this
document](/cloud/4.x/benchmark/tpch) for the construction method of the Doris
TPC-H data source
Now that we have configured the Doris data source in Tableau, let's visualize
the data
1. Drag the customer table and the orders table to the workbook, and select Custkey as the join field between them below.

2. Drag the nation table to the workbook and select the table join field Nationkey with the customer table 
3. Now that you have associated the customer table, orders table and nation table as a data source, you can use this relationship to handle questions about the data. Select the `Sheet 1` tab at the bottom of the workbook to enter the workspace. 
4. Suppose you want to know the summary of the number of users per year. Drag OrderDate from orders to the `Columns` area (horizontal field), and then drag customer(count) from customer to `Rows`. Tableau will generate the following line chart: 
This completes a simple line chart. Note that the dataset is generated
automatically by the TPC-H script using default rules, so the values are
synthetic; it is intended only to verify availability and not as a reference
for real analysis.
5. Suppose you want to know the average order amount (USD) by region (country) and year:
* Click the `New Worksheet` tab to create a new sheet
* Drag Name from the nation table to `Rows`
* Drag OrderDate from the orders table to `Columns`
You should see the following:

6. Note: The `Abc` value is just a placeholder because you have not yet defined aggregation logic for that mark, so you need to drag a measure onto the table. Drag Totalprice from the orders table to the middle of the table. Note that the default calculation is a SUM on Totalprice:
7. Click `SUM` and change `Measure` to `Average`. 
8. From the same dropdown menu, select `Format ` and change `Numbers` to `Currency (Standard)`: 
9. Get a table that meets expectations: 
So far, Tableau has been successfully connected to Doris, and data analysis
and visualization dashboard production has been achieved.
## Connection and usage tips
**Performance optimization**
* Design Doris databases and tables according to actual needs; partitioning and bucketing by time can prune data effectively with predicates and greatly reduce the amount of data transferred.
* Appropriate pre-aggregation can be achieved by creating materialized views on the Doris side, as sketched after this list.
* Set a reasonable refresh plan to balance the computing resource consumption of refreshes against the timeliness of dashboard data.
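As a sketch of the pre-aggregation idea above, the following creates a single-table materialized view on a hypothetical orders table (the table and column names are illustrative, not part of this document); Doris maintains such views automatically as new data is loaded:
-- Hypothetical example: pre-aggregate order totals per day
CREATE MATERIALIZED VIEW orders_daily_sum AS
SELECT o_orderdate, SUM(o_totalprice)
FROM orders
GROUP BY o_orderdate;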
**Security configuration**
* It is recommended to use VPC private connections to avoid security risks introduced by public network access.
* Configure security groups to restrict access.
* Enable access methods such as SSL/TLS connections.
* Refine Doris user account roles and access permissions to avoid excessive delegation of permissions.
---
# Source: https://docs.velodb.io/cloud/4.x/integration/data-processing/flink-doris-connector
Version: 4.x
# Flink Doris Connector
The [Flink Doris Connector](https://github.com/apache/doris-flink-connector)
is used to read from and write data to a Doris cluster through Flink. It also
integrates [FlinkCDC](https://nightlies.apache.org/flink/flink-cdc-docs-
release-3.2/docs/connectors/flink-sources/overview/), which allows for more
convenient full database synchronization with upstream databases such as
MySQL.
Using the Flink Connector, you can perform the following operations:
* **Read data from Doris** : Flink Connector supports parallel reading from BE, improving data retrieval efficiency.
* **Write data to Doris** : After batching in Flink, data is imported into Doris in bulk using Stream Load.
* **Perform dimension table joins with Lookup Join** : Batching and asynchronous queries accelerate dimension table joins.
* **Full database synchronization** : Using Flink CDC, you can synchronize entire databases such as MySQL, Oracle, and PostgreSQL, including automatic table creation and DDL operations.
## Version Description
| Connector Version | Flink Version | Doris Version | Java Version | Scala Version |
| --- | --- | --- | --- | --- |
| 1.0.3 | 1.11, 1.12, 1.13, 1.14 | 0.15+ | 8 | 2.11, 2.12 |
| 1.1.1 | 1.14 | 1.0+ | 8 | 2.11, 2.12 |
| 1.2.1 | 1.15 | 1.0+ | 8 | - |
| 1.3.0 | 1.16 | 1.0+ | 8 | - |
| 1.4.0 | 1.15, 1.16, 1.17 | 1.0+ | 8 | - |
| 1.5.2 | 1.15, 1.16, 1.17, 1.18 | 1.0+ | 8 | - |
| 1.6.1 | 1.15, 1.16, 1.17, 1.18, 1.19 | 1.0+ | 8 | - |
| 24.0.1 | 1.15, 1.16, 1.17, 1.18, 1.19, 1.20 | 1.0+ | 8 | - |
| 24.1.0 | 1.15, 1.16, 1.17, 1.18, 1.19, 1.20 | 1.0+ | 8 | - |
| 25.0.0 | 1.15, 1.16, 1.17, 1.18, 1.19, 1.20 | 1.0+ | 8 | - |
| 25.1.0 | 1.15, 1.16, 1.17, 1.18, 1.19, 1.20 | 1.0+ | 8 | - |
## Usage
The Flink Doris Connector can be used in two ways: via Jar or Maven.
#### Jar
You can download the corresponding version of the Flink Doris Connector Jar
file [here](https://doris.apache.org/download#doris-ecosystem), then copy this
file to the `classpath` of your `Flink` setup to use the `Flink-Doris-
Connector`. For a `Standalone` mode Flink deployment, place this file under
the `lib/` folder. For a Flink cluster running in `Yarn` mode, place the file
into the pre-deployment package.
#### Maven
To use it with Maven, simply add the following dependency to your POM file:
<dependency>
  <groupId>org.apache.doris</groupId>
  <artifactId>flink-doris-connector-${flink.version}</artifactId>
  <version>${connector.version}</version>
</dependency>
For example:
<dependency>
  <groupId>org.apache.doris</groupId>
  <artifactId>flink-doris-connector-1.16</artifactId>
  <version>25.1.0</version>
</dependency>
## Working Principles
### Reading Data from Doris

When reading data, Flink Doris Connector offers higher performance compared to
Flink JDBC Connector and is recommended for use:
* **Flink JDBC Connector** : Although Doris is compatible with the MySQL protocol, using Flink JDBC Connector for reading and writing to a Doris cluster is not recommended. This approach results in serial read/write operations on a single FE node, creating a bottleneck and affecting performance.
* **Flink Doris Connector** : Starting from Doris 2.1, ADBC is the default protocol for Flink Doris Connector. The reading process follows these steps:
a. Flink Doris Connector first retrieves Tablet ID information from FE based
on the query plan.
b. It generates the query statement: `SELECT * FROM tbs TABLET(id1, id2,
id3)`.
c. The query is then executed through the ADBC port of FE.
d. Data is returned directly from BE, bypassing FE to eliminate the single-
point bottleneck.
### Writing Data to Doris
When using Flink Doris Connector for data writing, batch processing is
performed in Flink's memory before bulk import via Stream Load. Doris Flink
Connector provides two batching modes, with Flink Checkpoint-based streaming
writes as the default:
|  | Streaming Write | Batch Write |
| --- | --- | --- |
| **Trigger Condition** | Relies on Flink Checkpoints and follows Flink's checkpoint cycle to write to Doris | Periodic submission based on connector-defined time or data volume thresholds |
| **Consistency** | Exactly-Once | At-Least-Once; Exactly-Once can be ensured with the primary key model |
| **Latency** | Limited by the Flink checkpoint interval, generally higher | Independent batch mechanism with flexible adjustment |
| **Fault Tolerance & Recovery** | Fully consistent with Flink state recovery | Relies on external deduplication logic (e.g., Doris primary key deduplication) |
## Quick Start
#### Preparation
#### Flink Cluster Deployment
Taking a Standalone cluster as an example:
1. Download the Flink installation package, e.g., [Flink 1.18.1](https://archive.apache.org/dist/flink/flink-1.18.1/flink-1.18.1-bin-scala_2.12.tgz);
2. After extraction, place the Flink Doris Connector package in the `lib/` directory;
3. Navigate to the extracted Flink directory and run `bin/start-cluster.sh` to start the Flink cluster;
4. You can verify if the Flink cluster started successfully using the `jps` command.
#### Initialize Doris Tables
Run the following statements to create Doris tables:
CREATE DATABASE test;
CREATE TABLE test.student (
`id` INT,
`name` VARCHAR(256),
`age` INT
)
UNIQUE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 3"
);
INSERT INTO test.student values(1,"James",18);
INSERT INTO test.student values(2,"Emily",28);
CREATE TABLE test.student_trans (
`id` INT,
`name` VARCHAR(256),
`age` INT
)
UNIQUE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 3"
);
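Optionally, verify the seed data before starting the Flink job; the two rows inserted above should be returned:
SELECT * FROM test.student ORDER BY id;
-- expected: (1, 'James', 18) and (2, 'Emily', 28)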
#### Run FlinkSQL Task
**Start FlinkSQL Client**
bin/sql-client.sh
**Run FlinkSQL**
CREATE TABLE Student (
id STRING,
name STRING,
age INT
)
WITH (
'connector' = 'doris',
'fenodes' = '127.0.0.1:8030',
'table.identifier' = 'test.student',
'username' = 'root',
'password' = ''
);
CREATE TABLE StudentTrans (
id STRING,
name STRING,
age INT
)
WITH (
'connector' = 'doris',
'fenodes' = '127.0.0.1:8030',
'table.identifier' = 'test.student_trans',
'username' = 'root',
'password' = '',
'sink.label-prefix' = 'doris_label'
);
INSERT INTO StudentTrans SELECT id, concat('prefix_',name), age+1 FROM Student;
#### Query Data
mysql> select * from test.student_trans;
+------+--------------+------+
| id | name | age |
+------+--------------+------+
| 1 | prefix_James | 19 |
| 2 | prefix_Emily | 29 |
+------+--------------+------+
2 rows in set (0.02 sec)
## Scenarios and Operations
### Reading Data from Doris
When Flink reads data from Doris, the Doris Source is currently a bounded
stream and does not support continuous reading in a CDC manner. Data can be
read from Doris using Thrift or ArrowFlightSQL (supported from version 24.0.0
onward). Starting from version 2.1, ArrowFlightSQL is the recommended
approach.
* **Thrift** : Data is read by calling the BE's Thrift interface. For detailed steps, refer to [Reading Data via Thrift Interface](https://github.com/apache/doris/blob/master/samples/doris-demo/doris-source-demo/README.md).
* **ArrowFlightSQL** : Based on Doris 2.1, this method allows high-speed reading of large volumes of data using the Arrow Flight SQL protocol. For more information, refer to [High-speed Data Transfer via Arrow Flight SQL](https://doris.apache.org/docs/dev/db-connect/arrow-flight-sql-connect/).
#### Using FlinkSQL to Read Data
##### Thrift Method
CREATE TABLE student (
id INT,
name STRING,
age INT
)
WITH (
'connector' = 'doris',
'fenodes' = '127.0.0.1:8030', -- FE host:http_port
'table.identifier' = 'test.student',
'username' = 'root',
'password' = ''
);
SELECT * FROM student;
##### ArrowFlightSQL
CREATE TABLE student (
id INT,
name STRING,
age INT
)
WITH (
'connector' = 'doris',
'fenodes' = '{fe.conf:http_port}',
'table.identifier' = 'test.student',
'source.use-flight-sql' = 'true',
'source.flight-sql-port' = '{fe.conf:arrow_flight_sql_port}',
'username' = 'root',
'password' = ''
);
SELECT * FROM student;
#### Using DataStream API to Read Data
When using the DataStream API to read data, you need to include the
dependencies in your program's POM file in advance, as described in the
"Usage" section.
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DorisOptions option = DorisOptions.builder()
.setFenodes("127.0.0.1:8030")
.setTableIdentifier("test.student")
.setUsername("root")
.setPassword("")
.build();
DorisReadOptions readOptions = DorisReadOptions.builder().build();
DorisSource<List<?>> dorisSource = DorisSource.<List<?>>builder()
.setDorisOptions(option)
.setDorisReadOptions(readOptions)
.setDeserializer(new SimpleListDeserializationSchema())
.build();
env.fromSource(dorisSource, WatermarkStrategy.noWatermarks(), "doris source").print();
env.execute("Doris Source Test");
For the complete code, refer
to:[DorisSourceDataStream.java](https://github.com/apache/doris-flink-
connector/blob/master/flink-doris-
connector/src/test/java/org/apache/doris/flink/example/DorisSourceDataStream.java)
### Writing Data to Doris
Flink writes data to Doris using the Stream Load method, supporting both
streaming and batch-insertion modes.
Difference Between Streaming and Batch-insertion
Starting from Connector 1.5.0, batch-insertion is supported. Batch-insertion
does not rely on Checkpoints; it buffers data in memory and controls the
writing timing based on batch parameters. Streaming insertion requires
Checkpoints to be enabled, continuously writing upstream data to Doris during
the entire Checkpoint period, without keeping data in memory continuously.
#### Using FlinkSQL to Write Data
For testing, Flink's [Datagen](https://nightlies.apache.org/flink/flink-docs-
master/docs/connectors/table/datagen/) is used to simulate the continuously
generated upstream data.
-- enable checkpoint
SET 'execution.checkpointing.interval' = '30s';
CREATE TABLE student_source (
id INT,
name STRING,
age INT
) WITH (
'connector' = 'datagen',
'rows-per-second' = '1',
'fields.name.length' = '20',
'fields.id.min' = '1',
'fields.id.max' = '100000',
'fields.age.min' = '3',
'fields.age.max' = '30'
);
-- doris sink
CREATE TABLE student_sink (
id INT,
name STRING,
age INT
)
WITH (
'connector' = 'doris',
'fenodes' = '10.16.10.6:28737',
'table.identifier' = 'test.student',
'username' = 'root',
'password' = 'password',
'sink.label-prefix' = 'doris_label'
--'sink.enable.batch-mode' = 'true' Adding this configuration enables batch writing
);
INSERT INTO student_sink SELECT * FROM student_source;
#### Using DataStream API to Write Data
When using the DataStream API to write data, different serialization methods
can be used to serialize the upstream data before writing it to the Doris
table.
> **Info** The connector already bundles HttpClient 4.5.13. If you reference
> HttpClient separately in your project, make sure the versions are consistent.
##### Standard String Format
When the upstream data is in CSV or JSON format, you can directly use the
`SimpleStringSerializer` to serialize the data.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(30000);
DorisSink.Builder<String> builder = DorisSink.builder();
DorisOptions dorisOptions = DorisOptions.builder()
.setFenodes("10.16.10.6:28737")
.setTableIdentifier("test.student")
.setUsername("root")
.setPassword("")
.build();
Properties properties = new Properties();
// When the upstream data is in json format, the following configuration needs to be enabled
properties.setProperty("read_json_by_line", "true");
properties.setProperty("format", "json");
// When writing csv data from the upstream, the following configurations need to be enabled
//properties.setProperty("format", "csv");
//properties.setProperty("column_separator", ",");
DorisExecutionOptions executionOptions = DorisExecutionOptions.builder()
.setLabelPrefix("label-doris")
.setDeletable(false)
//.setBatchMode(true) Enable batch writing
.setStreamLoadProp(properties)
.build();
builder.setDorisReadOptions(DorisReadOptions.builder().build())
.setDorisExecutionOptions(executionOptions)
.setSerializer(new SimpleStringSerializer())
.setDorisOptions(dorisOptions);
List<String> data = new ArrayList<>();
data.add("{\"id\":3,\"name\":\"Michael\",\"age\":28}");
data.add("{\"id\":4,\"name\":\"David\",\"age\":38}");
env.fromCollection(data).sinkTo(builder.build());
env.execute("doris test");
For the complete code, refer
to:[DorisSinkExample.java](https://github.com/apache/doris-flink-
connector/blob/master/flink-doris-
connector/src/test/java/org/apache/doris/flink/example/DorisSinkExample.java)
##### RowData Format
RowData is the internal format of Flink. If the upstream data is in RowData
format, you need to use the `RowDataSerializer` to serialize the data.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(10000);
env.setParallelism(1);
DorisSink.Builder<RowData> builder = DorisSink.builder();
Properties properties = new Properties();
properties.setProperty("column_separator", ",");
properties.setProperty("line_delimiter", "\n");
properties.setProperty("format", "csv");
// When writing json data from the upstream, the following configuration needs to be enabled
// properties.setProperty("read_json_by_line", "true");
// properties.setProperty("format", "json");
DorisOptions.Builder dorisBuilder = DorisOptions.builder();
dorisBuilder
.setFenodes("10.16.10.6:28737")
.setTableIdentifier("test.student")
.setUsername("root")
.setPassword("");
DorisExecutionOptions.Builder executionBuilder = DorisExecutionOptions.builder();
executionBuilder.setLabelPrefix(UUID.randomUUID().toString()).setDeletable(false).setStreamLoadProp(properties);
// Flink RowData's schema
String[] fields = {"id","name", "age"};
DataType[] types = {DataTypes.INT(), DataTypes.VARCHAR(256), DataTypes.INT()};
builder.setDorisExecutionOptions(executionBuilder.build())
.setSerializer(
RowDataSerializer.builder() // serialize according to rowdata
.setType(LoadConstants.CSV)
.setFieldDelimiter(",")
.setFieldNames(fields)
.setFieldType(types)
.build())
.setDorisOptions(dorisBuilder.build());
// mock rowdata source
DataStream<RowData> source =
env.fromElements("")
.flatMap(
new FlatMapFunction<String, RowData>() {
@Override
public void flatMap(String s, Collector<RowData> out)
throws Exception {
GenericRowData genericRowData = new GenericRowData(3);
genericRowData.setField(0, 1);
genericRowData.setField(1, StringData.fromString("Michael"));
genericRowData.setField(2, 18);
out.collect(genericRowData);
GenericRowData genericRowData2 = new GenericRowData(3);
genericRowData2.setField(0, 2);
genericRowData2.setField(1, StringData.fromString("David"));
genericRowData2.setField(2, 38);
out.collect(genericRowData2);
}
});
source.sinkTo(builder.build());
env.execute("doris test");
For the complete code, refer
to:[DorisSinkExampleRowData.java](https://github.com/apache/doris-flink-
connector/blob/master/flink-doris-
connector/src/test/java/org/apache/doris/flink/example/DorisSinkExampleRowData.java)
##### Debezium Format
For upstream data in Debezium format, such as data from FlinkCDC or Debezium
format in Kafka, you can use the `JsonDebeziumSchemaSerializer` to serialize
the data.
// enable checkpoint
env.enableCheckpointing(10000);
Properties props = new Properties();
props.setProperty("format", "json");
props.setProperty("read_json_by_line", "true");
DorisOptions dorisOptions = DorisOptions.builder()
.setFenodes("127.0.0.1:8030")
.setTableIdentifier("test.student")
.setUsername("root")
.setPassword("").build();
DorisExecutionOptions.Builder executionBuilder = DorisExecutionOptions.builder();
executionBuilder.setLabelPrefix("label-prefix")
.setStreamLoadProp(props)
.setDeletable(true);
DorisSink.Builder<String> builder = DorisSink.builder();
builder.setDorisReadOptions(DorisReadOptions.builder().build())
.setDorisExecutionOptions(executionBuilder.build())
.setDorisOptions(dorisOptions)
.setSerializer(JsonDebeziumSchemaSerializer.builder().setDorisOptions(dorisOptions).build());
env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL Source")
.sinkTo(builder.build());
For the complete code, refer
to:[CDCSchemaChangeExample.java](https://github.com/apache/doris-flink-
connector/blob/master/flink-doris-
connector/src/test/java/org/apache/doris/flink/example/CDCSchemaChangeExample.java)
##### Multi-table Write Format
Currently, DorisSink supports synchronizing multiple tables with a single
Sink. You need to pass both the data and the database/table information to the
Sink, and serialize it using the `RecordWithMetaSerializer`.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DorisSink.Builder<RecordWithMeta> builder = DorisSink.builder();
Properties properties = new Properties();
properties.setProperty("column_separator", ",");
properties.setProperty("line_delimiter", "\n");
properties.setProperty("format", "csv");
DorisOptions.Builder dorisBuilder = DorisOptions.builder();
dorisBuilder
.setFenodes("10.16.10.6:28737")
.setTableIdentifier("")
.setUsername("root")
.setPassword("");
DorisExecutionOptions.Builder executionBuilder = DorisExecutionOptions.builder();
executionBuilder
.setLabelPrefix("label-doris")
.setStreamLoadProp(properties)
.setDeletable(false)
.setBatchMode(true);
builder.setDorisReadOptions(DorisReadOptions.builder().build())
.setDorisExecutionOptions(executionBuilder.build())
.setDorisOptions(dorisBuilder.build())
.setSerializer(new RecordWithMetaSerializer());
RecordWithMeta record = new RecordWithMeta("test", "student_1", "1,David,18");
RecordWithMeta record1 = new RecordWithMeta("test", "student_2", "1,Jack,28");
env.fromCollection(Arrays.asList(record, record1)).sinkTo(builder.build());
For the complete code, refer
to:[DorisSinkMultiTableExample.java](https://github.com/apache/doris-flink-
connector/blob/master/flink-doris-
connector/src/test/java/org/apache/doris/flink/example/DorisSinkMultiTableExample.java)
### Lookup Join
Using Lookup Join can optimize dimension table joins in Flink. When using
Flink JDBC Connector for dimension table joins, the following issues may
arise:
* Flink JDBC Connector uses a synchronous query mode, meaning that after upstream data (e.g., from Kafka) sends a record, it immediately queries the Doris dimension table. This results in high query latency under high-concurrency scenarios.
* Queries executed via JDBC are typically point lookups per record, whereas Doris recommends batch queries for better efficiency.
Using [Lookup Join](https://nightlies.apache.org/flink/flink-docs-
release-1.20/docs/dev/table/sql/queries/joins/#lookup-join) for dimension
table joins in Flink Doris Connector provides the following advantages:
* **Batch caching of upstream data** , avoiding the high latency and database load caused by per-record queries.
* **Asynchronous execution of join queries** , improving data throughput and reducing the query load on Doris.
CREATE TABLE fact_table (
`id` BIGINT,
`name` STRING,
`city` STRING,
`process_time` as proctime()
) WITH (
'connector' = 'kafka',
...
);
create table dim_city(
`city` STRING,
`level` INT ,
`province` STRING,
`country` STRING
) WITH (
'connector' = 'doris',
'fenodes' = '127.0.0.1:8030',
'jdbc-url' = 'jdbc:mysql://127.0.0.1:9030',
'table.identifier' = 'dim.dim_city',
'username' = 'root',
'password' = ''
);
SELECT a.id, a.name, a.city, c.province, c.country,c.level
FROM fact_table a
LEFT JOIN dim_city FOR SYSTEM_TIME AS OF a.process_time AS c
ON a.city = c.city
### Full Database Synchronization
The Flink Doris Connector integrates **Flink CDC** ([Flink CDC
Documentation](https://nightlies.apache.org/flink/flink-cdc-docs-
release-3.2/docs/connectors/flink-sources/overview/)), making it easier to
synchronize relational databases like MySQL to Doris. This integration also
includes automatic table creation, schema changes, etc. Supported databases
for synchronization include: MySQL, Oracle, PostgreSQL, SQLServer, MongoDB,
and DB2.
Note
1. When using full database synchronization, you need to add the corresponding Flink CDC dependencies in the `$FLINK_HOME/lib` directory (Fat Jar), such as **flink-sql-connector-mysql-cdc-${version}.jar** , **flink-sql-connector-oracle-cdc-${version}.jar**. FlinkCDC version 3.1 and later is not compatible with previous versions. You can download the dependencies from the following links: [FlinkCDC 3.x](https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-mysql-cdc/), [FlinkCDC 2.x](https://repo.maven.apache.org/maven2/com/ververica/flink-sql-connector-mysql-cdc/).
2. For versions after Connector 24.0.0, the required Flink CDC version must be 3.1 or higher. You can download it [here](https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-mysql-cdc/). If Flink CDC is used to synchronize MySQL and Oracle, you must also add the relevant JDBC drivers under `$FLINK_HOME/lib`.
#### MySQL Whole Database Synchronization
After starting the Flink cluster, you can directly run the following command:
bin/flink run \
-Dexecution.checkpointing.interval=10s \
-Dparallelism.default=1 \
-c org.apache.doris.flink.tools.cdc.CdcTools \
lib/flink-doris-connector-1.16-24.0.1.jar \
mysql-sync-database \
--database test_db \
--mysql-conf hostname=127.0.0.1 \
--mysql-conf port=3306 \
--mysql-conf username=root \
--mysql-conf password=123456 \
--mysql-conf database-name=mysql_db \
--including-tables "tbl1|test.*" \
--sink-conf fenodes=127.0.0.1:8030 \
--sink-conf username=root \
--sink-conf password=123456 \
--sink-conf jdbc-url=jdbc:mysql://127.0.0.1:9030 \
--sink-conf sink.label-prefix=label \
--table-conf replication_num=1
#### Oracle Whole Database Synchronization
bin/flink run \
-Dexecution.checkpointing.interval=10s \
-Dparallelism.default=1 \
-c org.apache.doris.flink.tools.cdc.CdcTools \
./lib/flink-doris-connector-1.16-24.0.1.jar \
oracle-sync-database \
--database test_db \
--oracle-conf hostname=127.0.0.1 \
--oracle-conf port=1521 \
--oracle-conf username=admin \
--oracle-conf password="password" \
--oracle-conf database-name=XE \
--oracle-conf schema-name=ADMIN \
--including-tables "tbl1|tbl2" \
--sink-conf fenodes=127.0.0.1:8030 \
--sink-conf username=root \
--sink-conf password=\
--sink-conf jdbc-url=jdbc:mysql://127.0.0.1:9030 \
--sink-conf sink.label-prefix=label \
--table-conf replication_num=1
#### PostgreSQL Whole Database Synchronization
/bin/flink run \
-Dexecution.checkpointing.interval=10s \
-Dparallelism.default=1\
-c org.apache.doris.flink.tools.cdc.CdcTools \
./lib/flink-doris-connector-1.16-24.0.1.jar \
postgres-sync-database \
--database db1\
--postgres-conf hostname=127.0.0.1 \
--postgres-conf port=5432 \
--postgres-conf username=postgres \
--postgres-conf password="123456" \
--postgres-conf database-name=postgres \
--postgres-conf schema-name=public \
--postgres-conf slot.name=test \
--postgres-conf decoding.plugin.name=pgoutput \
--including-tables "tbl1|tbl2" \
--sink-conf fenodes=127.0.0.1:8030 \
--sink-conf username=root \
--sink-conf password= \
--sink-conf jdbc-url=jdbc:mysql://127.0.0.1:9030 \
--sink-conf sink.label-prefix=label \
--table-conf replication_num=1
#### SQLServer Whole Database Synchronization
bin/flink run \
-Dexecution.checkpointing.interval=10s \
-Dparallelism.default=1 \
-c org.apache.doris.flink.tools.cdc.CdcTools \
./lib/flink-doris-connector-1.16-24.0.1.jar \
sqlserver-sync-database \
--database db1 \
--sqlserver-conf hostname=127.0.0.1 \
--sqlserver-conf port=1433 \
--sqlserver-conf username=sa \
--sqlserver-conf password="123456" \
--sqlserver-conf database-name=CDC_DB \
--sqlserver-conf schema-name=dbo \
--including-tables "tbl1|tbl2" \
--sink-conf fenodes=127.0.0.1:8030 \
--sink-conf username=root \
--sink-conf password= \
--sink-conf jdbc-url=jdbc:mysql://127.0.0.1:9030 \
--sink-conf sink.label-prefix=label \
--table-conf replication_num=1
#### DB2 Whole Database Synchronization
bin/flink run \
-Dexecution.checkpointing.interval=10s \
-Dparallelism.default=1 \
-c org.apache.doris.flink.tools.cdc.CdcTools \
lib/flink-doris-connector-1.16-24.0.1.jar \
db2-sync-database \
--database db2_test \
--db2-conf hostname=127.0.0.1 \
--db2-conf port=50000 \
--db2-conf username=db2inst1 \
--db2-conf password=doris123456 \
--db2-conf database-name=testdb \
--db2-conf schema-name=DB2INST1 \
--including-tables "FULL_TYPES|CUSTOMERS" \
--single-sink true \
--use-new-schema-change true \
--sink-conf fenodes=127.0.0.1:8030 \
--sink-conf username=root \
--sink-conf password=123456 \
--sink-conf jdbc-url=jdbc:mysql://127.0.0.1:9030 \
--sink-conf sink.label-prefix=label \
--table-conf replication_num=1
#### MongoDB Whole Database Synchronization
bin/flink run \
-Dexecution.checkpointing.interval=10s \
-Dparallelism.default=1 \
-c org.apache.doris.flink.tools.cdc.CdcTools \
./lib/flink-doris-connector-1.18-24.0.1.jar \
mongodb-sync-database \
--database doris_db \
--schema-change-mode debezium_structure \
--mongodb-conf hosts=127.0.0.1:27017 \
--mongodb-conf username=flinkuser \
--mongodb-conf password=flinkpwd \
--mongodb-conf database=test \
--mongodb-conf scan.startup.mode=initial \
--mongodb-conf schema.sample-percent=0.2 \
--including-tables "tbl1|tbl2" \
--sink-conf fenodes=127.0.0.1:8030 \
--sink-conf username=root \
--sink-conf password= \
--sink-conf jdbc-url=jdbc:mysql://127.0.0.1:9030 \
--sink-conf sink.label-prefix=label \
--sink-conf sink.enable-2pc=false \
--table-conf replication_num=1
#### AWS Aurora MySQL Whole Database Synchronization
bin/flink run \
-Dexecution.checkpointing.interval=10s \
-Dparallelism.default=1 \
-c org.apache.doris.flink.tools.cdc.CdcTools \
lib/flink-doris-connector-1.18-25.0.0.jar \
mysql-sync-database \
--database testwd \
--mysql-conf hostname=xxx.us-east-1.rds.amazonaws.com \
--mysql-conf port=3306 \
--mysql-conf username=admin \
--mysql-conf password=123456 \
--mysql-conf database-name=test \
--mysql-conf server-time-zone=UTC \
--including-tables "student" \
--sink-conf fenodes=127.0.0.1:8030 \
--sink-conf username=root \
--sink-conf password= \
--sink-conf jdbc-url=jdbc:mysql://127.0.0.1:9030 \
--sink-conf sink.label-prefix=label \
--table-conf replication_num=1
#### AWS RDS MySQL Whole Database Synchronization
bin/flink run \
-Dexecution.checkpointing.interval=10s \
-Dparallelism.default=1 \
-c org.apache.doris.flink.tools.cdc.CdcTools \
lib/flink-doris-connector-1.18-25.0.0.jar \
mysql-sync-database \
--database testwd \
--mysql-conf hostname=xxx.ap-southeast-1.rds.amazonaws.com \
--mysql-conf port=3306 \
--mysql-conf username=admin \
--mysql-conf password=123456 \
--mysql-conf database-name=test \
--mysql-conf server-time-zone=UTC \
--including-tables "student" \
--sink-conf fenodes=127.0.0.1:8030 \
--sink-conf username=root \
--sink-conf password= \
--sink-conf jdbc-url=jdbc:mysql://127.0.0.1:9030 \
--sink-conf sink.label-prefix=label \
--table-conf replication_num=1
## Usage Instructions
### Parameter Configuration
#### General Configuration Items
| Key | Default Value | Required | Comment |
|---|---|---|---|
| fenodes | -- | Y | Doris FE http addresses. Multiple addresses are supported, separated by commas. |
| benodes | -- | N | Doris BE http addresses. Multiple addresses are supported, separated by commas. |
| jdbc-url | -- | N | JDBC connection information, such as jdbc:mysql://127.0.0.1:9030. |
| table.identifier | -- | Y | Doris table name, such as db.tbl. |
| username | -- | Y | Username for accessing Doris. |
| password | -- | Y | Password for accessing Doris. |
| auto-redirect | TRUE | N | Whether to redirect StreamLoad requests. After enabling, StreamLoad writes through FE and no longer explicitly obtains BE information. |
| doris.request.retries | 3 | N | The number of retries for sending requests to Doris. |
| doris.request.connect.timeout | 30s | N | The connection timeout for sending requests to Doris. |
| doris.request.read.timeout | 30s | N | The read timeout for sending requests to Doris. |
#### Source Configuration
| Key | Default Value | Required | Comment |
|---|---|---|---|
| doris.request.query.timeout | 21600s | N | The timeout for querying Doris. The default value is 6 hours. |
| doris.request.tablet.size | 1 | N | The number of Doris Tablets corresponding to one Partition. The smaller this value is set, the more Partitions are generated, which can increase the parallelism on the Flink side but also puts more pressure on Doris. |
| doris.batch.size | 4064 | N | The maximum number of rows read from BE at one time. Increasing this value reduces the number of connections established between Flink and Doris, thereby reducing the additional overhead caused by network latency. |
| doris.exec.mem.limit | 8192mb | N | The memory limit for a single query. The default is 8GB, in bytes. |
| source.use-flight-sql | FALSE | N | Whether to use Arrow Flight SQL for reading. |
| source.flight-sql-port | - | N | The arrow_flight_sql_port of FE when using Arrow Flight SQL for reading. |
**DataStream-Specific Configuration**
| Key | Default Value | Required | Comment |
|---|---|---|---|
| doris.read.field | -- | N | The list of column names for reading Doris tables. Multiple columns should be separated by commas. |
| doris.filter.query | -- | N | The expression for filtering read data. This expression is passed through to Doris, which uses it to filter data at the source. For example, age=18. |
#### Sink Configuration
| Key | Default Value | Required | Comment |
|---|---|---|---|
| sink.label-prefix | -- | Y | The label prefix used for Stream Load import. In the 2PC scenario, it must be globally unique to guarantee Flink's EOS semantics. |
| sink.properties.* | -- | N | Import parameters for Stream Load. For example, 'sink.properties.column_separator' = ',' defines the column separator, and 'sink.properties.escape_delimiters' = 'true' means that special characters used as delimiters, such as \x01, are converted to binary 0x01. For JSON format import, set 'sink.properties.format' = 'json' and 'sink.properties.read_json_by_line' = 'true'. For detailed parameters, refer to [here](/cloud/4.x/user-guide/data-operate/import/import-way/stream-load-manual). For Group Commit mode, for example, 'sink.properties.group_commit' = 'sync_mode' sets group commit to synchronous mode. The Flink connector has supported the group commit import configuration since version 1.6.2. For detailed usage and limitations, refer to [group commit](/cloud/4.x/user-guide/data-operate/import/group-commit-manual). |
| sink.enable-delete | TRUE | N | Whether to enable deletion. This option requires the Doris table to have the batch deletion feature enabled (enabled by default in Doris 0.15+), and only supports the Unique model. |
| sink.enable-2pc | TRUE | N | Whether to enable two-phase commit (2PC). The default is true, ensuring Exactly-Once semantics. For details about two-phase commit, refer to [here](/cloud/4.x/user-guide/data-operate/import/import-way/stream-load-manual). |
| sink.buffer-size | 1MB | N | The size of the write data cache buffer, in bytes. It is not recommended to modify it; the default configuration can be used. |
| sink.buffer-count | 3 | N | The number of write data cache buffers. It is not recommended to modify it; the default configuration can be used. |
| sink.max-retries | 3 | N | The maximum number of retries after a commit failure. The default is 3. |
| sink.enable.batch-mode | FALSE | N | Whether to use batch mode to write to Doris. When enabled, the write timing does not depend on checkpoints and is controlled by sink.buffer-flush.max-rows, sink.buffer-flush.max-bytes, and sink.buffer-flush.interval. Exactly-once semantics are no longer guaranteed, but idempotency can be achieved with the Unique model. |
| sink.flush.queue-size | 2 | N | The size of the cache queue in batch mode. |
| sink.buffer-flush.max-rows | 500000 | N | The maximum number of rows written in a single batch in batch mode. |
| sink.buffer-flush.max-bytes | 100MB | N | The maximum number of bytes written in a single batch in batch mode. |
| sink.buffer-flush.interval | 10s | N | The interval for asynchronously flushing the cache in batch mode. |
| sink.ignore.update-before | TRUE | N | Whether to ignore the update-before event. Ignored by default. |
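As an illustration of the batch-mode options above, a sink table could be declared as follows; the host, table, and threshold values are placeholders, not recommendations:
CREATE TABLE doris_batch_sink (
    id INT,
    name STRING
) WITH (
    'connector' = 'doris',
    'fenodes' = '127.0.0.1:8030',
    'table.identifier' = 'db.tbl',
    'username' = 'root',
    'password' = '',
    'sink.label-prefix' = 'doris_batch',
    'sink.enable-2pc' = 'false',                -- batch mode does not guarantee exactly-once
    'sink.enable.batch-mode' = 'true',
    'sink.buffer-flush.max-rows' = '500000',
    'sink.buffer-flush.max-bytes' = '100MB',
    'sink.buffer-flush.interval' = '10s'
);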
#### Lookup Join Configuration
| Key | Default Value | Required | Comment |
|---|---|---|---|
| lookup.cache.max-rows | -1 | N | The maximum number of rows in the lookup cache. The default value is -1, which means the cache is disabled. |
| lookup.cache.ttl | 10s | N | The maximum time-to-live of the lookup cache. The default is 10 seconds. |
| lookup.max-retries | 1 | N | The number of retries after a lookup query fails. |
| lookup.jdbc.async | FALSE | N | Whether to enable asynchronous lookup. The default is false. |
| lookup.jdbc.read.batch.size | 128 | N | The maximum batch size for each query in asynchronous lookup. |
| lookup.jdbc.read.batch.queue-size | 256 | N | The size of the intermediate buffer queue during asynchronous lookup. |
| lookup.jdbc.read.thread-size | 3 | N | The number of JDBC threads for lookup in each task. |
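For example, the `dim_city` dimension table from the Lookup Join example above could enable caching and asynchronous lookup by adding these options to its WITH clause (the values are illustrative):
create table dim_city(
    `city` STRING,
    `level` INT,
    `province` STRING,
    `country` STRING
) WITH (
    'connector' = 'doris',
    'fenodes' = '127.0.0.1:8030',
    'jdbc-url' = 'jdbc:mysql://127.0.0.1:9030',
    'table.identifier' = 'dim.dim_city',
    'username' = 'root',
    'password' = '',
    'lookup.cache.max-rows' = '10000',   -- enable the lookup cache
    'lookup.cache.ttl' = '60s',
    'lookup.jdbc.async' = 'true'         -- asynchronous lookup
);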
#### Full Database Synchronization Configuration
**Syntax**
bin/flink run \
    -c org.apache.doris.flink.tools.cdc.CdcTools \
    lib/flink-doris-connector-1.16-1.6.1.jar \
    <mysql-sync-database|oracle-sync-database|postgres-sync-database|sqlserver-sync-database|db2-sync-database|mongodb-sync-database> \
    --database <doris-database-name> \
    [--job-name <flink-job-name>] \
    [--table-prefix <doris-table-prefix>] \
    [--table-suffix <doris-table-suffix>] \
    [--including-tables <table-name|name-regular-expr>] \
    [--excluding-tables <table-name|name-regular-expr>] \
    --mysql-conf <mysql-cdc-source-conf> [--mysql-conf <mysql-cdc-source-conf> ...] \
    --oracle-conf <oracle-cdc-source-conf> [--oracle-conf <oracle-cdc-source-conf> ...] \
    --postgres-conf <postgres-cdc-source-conf> [--postgres-conf <postgres-cdc-source-conf> ...] \
    --sqlserver-conf <sqlserver-cdc-source-conf> [--sqlserver-conf <sqlserver-cdc-source-conf> ...] \
    --sink-conf <doris-sink-conf> [--sink-conf <doris-sink-conf> ...] \
    [--table-conf <doris-table-conf> [--table-conf <doris-table-conf> ...]]
**Configuration**
| Key | Comment |
|---|---|
| --job-name | The name of the Flink task, optional. |
| --database | The name of the database synchronized to Doris. |
| --table-prefix | The prefix of the Doris table name, for example, --table-prefix ods_. |
| --table-suffix | The suffix of the Doris table name, similar to the prefix. |
| --including-tables | The MySQL tables that need to be synchronized. Multiple tables can be separated by \|, and regular expressions are supported. For example, --including-tables table1. |
| --excluding-tables | The tables that do not need to be synchronized. The usage is the same as --including-tables. |
| --mysql-conf | The configuration of the MySQL CDC Source, for example, --mysql-conf hostname=127.0.0.1. You can view all MySQL-CDC configurations [here](https://nightlies.apache.org/flink/flink-cdc-docs-release-3.2/docs/connectors/flink-sources/mysql-cdc/). Among them, hostname, username, password, and database-name are required. When the synchronized database and tables contain non-primary-key tables, scan.incremental.snapshot.chunk.key-column must be set, and only one non-null field can be selected. For example: scan.incremental.snapshot.chunk.key-column=database.table:column,database.table1:column...; columns of different databases and tables are separated by commas. |
| --oracle-conf | The configuration of the Oracle CDC Source, for example, --oracle-conf hostname=127.0.0.1. You can view all Oracle-CDC configurations [here](https://nightlies.apache.org/flink/flink-cdc-docs-release-3.2/docs/connectors/flink-sources/oracle-cdc/). Among them, hostname, username, password, database-name, and schema-name are required. |
| --postgres-conf | The configuration of the Postgres CDC Source, for example, --postgres-conf hostname=127.0.0.1. You can view all Postgres-CDC configurations [here](https://nightlies.apache.org/flink/flink-cdc-docs-release-3.2/docs/connectors/flink-sources/postgres-cdc/). Among them, hostname, username, password, database-name, schema-name, and slot.name are required. |
| --sqlserver-conf | The configuration of the SQLServer CDC Source, for example, --sqlserver-conf hostname=127.0.0.1. You can view all SQLServer-CDC configurations [here](https://nightlies.apache.org/flink/flink-cdc-docs-release-3.2/docs/connectors/flink-sources/sqlserver-cdc/). Among them, hostname, username, password, database-name, and schema-name are required. |
| --db2-conf | The configuration of the DB2 CDC Source, for example, --db2-conf hostname=127.0.0.1. You can view all DB2-CDC configurations [here](https://nightlies.apache.org/flink/flink-cdc-docs-release-3.2/docs/connectors/flink-sources/db2-cdc/). Among them, hostname, username, password, database-name, and schema-name are required. |
| --sink-conf | All configurations of the Doris Sink; see the General Configuration Items section above. |
| --mongodb-conf | The configuration of the MongoDB CDC Source, for example, --mongodb-conf hosts=127.0.0.1:27017. You can view all Mongo-CDC configurations [here](https://nightlies.apache.org/flink/flink-cdc-docs-release-3.2/docs/connectors/flink-sources/mongodb-cdc/). Among them, hosts, username, password, and database are required. --mongodb-conf schema.sample-percent is the configuration for automatically sampling MongoDB data to create tables in Doris; the default value is 0.2. |
| --table-conf | The configuration items of the Doris table, that is, the content included in properties (except for table-buckets, which is not a properties attribute). For example, --table-conf replication_num=1, and --table-conf table-buckets="tbl1:10,tbl2:20,a.*:30,b.*:40,.*:50" specifies the number of buckets for different tables in the order of the regular expressions. If there is no match, tables are created with BUCKETS AUTO. |
| --schema-change-mode | The mode for parsing schema changes, either debezium_structure or sql_parser. debezium_structure (the default) parses the data structure used when the upstream CDC synchronizes data and infers DDL change operations from it. sql_parser parses the DDL statements emitted by the upstream CDC to determine DDL change operations, so it is more accurate. Usage example: --schema-change-mode debezium_structure. This feature is available in versions after 24.0.0. |
| --single-sink | Whether to use a single sink to synchronize all tables. When enabled, newly created upstream tables can also be automatically identified and created in Doris. |
| --multi-to-one-origin | The configuration of the source tables when multiple upstream tables are written to the same table, for example: --multi-to-one-origin "a_.*\|b_.*"; refer to [#208](https://github.com/apache/doris-flink-connector/pull/208). |
| --multi-to-one-target | Used in combination with multi-to-one-origin; the configuration of the target tables, for example: --multi-to-one-target "a\|b". |
| --create-table-only | Whether to only synchronize the table structure. |
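To illustrate some of the options above, the MySQL example from earlier could be extended with multi-to-one mapping and per-table bucket settings; the patterns, hosts, and values below are placeholders, not recommendations:
bin/flink run \
    -Dexecution.checkpointing.interval=10s \
    -Dparallelism.default=1 \
    -c org.apache.doris.flink.tools.cdc.CdcTools \
    lib/flink-doris-connector-1.16-24.0.1.jar \
    mysql-sync-database \
    --database test_db \
    --including-tables "a_.*|b_.*" \
    --multi-to-one-origin "a_.*|b_.*" \
    --multi-to-one-target "a|b" \
    --mysql-conf hostname=127.0.0.1 \
    --mysql-conf port=3306 \
    --mysql-conf username=root \
    --mysql-conf password=123456 \
    --mysql-conf database-name=mysql_db \
    --sink-conf fenodes=127.0.0.1:8030 \
    --sink-conf username=root \
    --sink-conf password=123456 \
    --sink-conf jdbc-url=jdbc:mysql://127.0.0.1:9030 \
    --sink-conf sink.label-prefix=label \
    --table-conf replication_num=1 \
    --table-conf table-buckets="a:10,b:20,.*:30"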
### Type Mapping
| Doris Type | Flink Type |
|---|---|
| NULL_TYPE | NULL |
| BOOLEAN | BOOLEAN |
| TINYINT | TINYINT |
| SMALLINT | SMALLINT |
| INT | INT |
| BIGINT | BIGINT |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| DATE | DATE |
| DATETIME | TIMESTAMP |
| DECIMAL | DECIMAL |
| CHAR | STRING |
| LARGEINT | STRING |
| VARCHAR | STRING |
| STRING | STRING |
| DECIMALV2 | DECIMAL |
| ARRAY | ARRAY |
| MAP | STRING |
| JSON | STRING |
| VARIANT | STRING |
| IPV4 | STRING |
| IPV6 | STRING |
### Monitoring Metrics
Flink provides multiple [Metrics](https://nightlies.apache.org/flink/flink-
docs-master/docs/ops/metrics/#metrics) for monitoring the indicators of the
Flink cluster. The following are the newly added monitoring metrics for the
Flink Doris Connector.
| Name | Metric Type | Description |
|---|---|---|
| totalFlushLoadBytes | Counter | The total number of bytes that have been flushed and imported. |
| flushTotalNumberRows | Counter | The total number of rows that have been imported and processed. |
| totalFlushLoadedRows | Counter | The total number of rows that have been successfully imported. |
| totalFlushTimeMs | Counter | The total time taken for successful imports to complete. |
| totalFlushSucceededNumber | Counter | The number of imports that have completed successfully. |
| totalFlushFailedNumber | Counter | The number of imports that have failed. |
| totalFlushFilteredRows | Counter | The total number of rows with unqualified data quality. |
| totalFlushUnselectedRows | Counter | The total number of rows filtered out by the where condition. |
| beginTxnTimeMs | Histogram | The time taken to request the FE to start a transaction, in milliseconds. |
| putDataTimeMs | Histogram | The time taken to request the FE to obtain the import data execution plan. |
| readDataTimeMs | Histogram | The time taken to read data. |
| writeDataTimeMs | Histogram | The time taken to execute the write data operation. |
| commitAndPublishTimeMs | Histogram | The time taken to request the FE to commit and publish the transaction. |
| loadTimeMs | Histogram | The time taken for the import to complete. |
## Best Practices
### FlinkSQL Quickly Connects to MySQL Data via CDC
-- enable checkpoint
SET 'execution.checkpointing.interval' = '10s';
CREATE TABLE cdc_mysql_source (
id int
,name VARCHAR
,PRIMARY KEY (id) NOT ENFORCED
) WITH (
'connector' = 'mysql-cdc',
'hostname' = '127.0.0.1',
'port' = '3306',
'username' = 'root',
'password' = 'password',
'database-name' = 'database',
'table-name' = 'table'
);
-- Supports synchronizing insert/update/delete events
CREATE TABLE doris_sink (
id INT,
name STRING
)
WITH (
'connector' = 'doris',
'fenodes' = '127.0.0.1:8030',
'table.identifier' = 'database.table',
'username' = 'root',
'password' = '',
'sink.properties.format' = 'json',
'sink.properties.read_json_by_line' = 'true',
'sink.enable-delete' = 'true', -- Synchronize delete events
'sink.label-prefix' = 'doris_label'
);
insert into doris_sink select id,name from cdc_mysql_source;
### Flink Performs Partial Column Updates
CREATE TABLE doris_sink (
id INT,
name STRING,
bank STRING,
age int
)
WITH (
'connector' = 'doris',
'fenodes' = '127.0.0.1:8030',
'table.identifier' = 'database.table',
'username' = 'root',
'password' = '',
'sink.properties.format' = 'json',
'sink.properties.read_json_by_line' = 'true',
'sink.properties.columns' = 'id,name,bank,age', -- Columns that need to be updated
'sink.properties.partial_columns' = 'true' -- Enable partial column updates
);
### Flink Imports Bitmap Data
CREATE TABLE bitmap_sink (
dt int,
page string,
user_id int
)
WITH (
'connector' = 'doris',
'fenodes' = '127.0.0.1:8030',
'table.identifier' = 'test.bitmap_test',
'username' = 'root',
'password' = '',
'sink.label-prefix' = 'doris_label',
'sink.properties.columns' = 'dt,page,user_id,user_id=to_bitmap(user_id)'
)
### FlinkCDC Updates Key Columns
Generally, in a business database, a number is often used as the primary key
of a table. For example, for the Student table, the number (id) is used as the
primary key. However, as the business develops, the number corresponding to
the data may change. In this scenario, when using Flink CDC + Doris Connector
to synchronize data, the data of the primary key column in Doris can be
automatically updated.
**Principle**
The underlying collection tool of Flink CDC is Debezium. Debezium internally
uses the op field to identify corresponding operations. The values of the op
field are c, u, d, and r, corresponding to create, update, delete, and read
respectively. For the update of the primary key column, Flink CDC will send
DELETE and INSERT events downstream, and the data of the primary key column in
Doris will be automatically updated after the data is synchronized to Doris.
**Usage**
The Flink program can refer to the above CDC synchronization examples. After
successfully submitting the task, execute the statement to update the primary
key column on the MySQL side (for example, update student set id = '1002'
where id = '1001'), and then the data in Doris can be modified.
### Flink Deletes Data According to Specified Columns
Generally, messages in Kafka use specific fields to mark the operation type,
such as {"op_type":"delete",data:{...}}. For this kind of data, it is hoped to
delete the data with op_type=delete.
The DorisSink will, by default, distinguish the types of events according to
RowKind. Usually, in the case of CDC, the event type can be directly obtained,
and the hidden column `__DORIS_DELETE_SIGN__` can be assigned a value to
achieve the purpose of deletion. However, for Kafka, it is necessary to judge
according to the business logic and explicitly pass in the value of the hidden
column.
-- For example, the upstream data:{"op_type":"delete",data:{"id":1,"name":"zhangsan"}}
CREATE TABLE KAFKA_SOURCE(
data STRING,
op_type STRING
) WITH (
'connector' = 'kafka',
...
);
CREATE TABLE DORIS_SINK(
id INT,
name STRING,
__DORIS_DELETE_SIGN__ INT
) WITH (
'connector' = 'doris',
'fenodes' = '127.0.0.1:8030',
'table.identifier' = 'db.table',
'username' = 'root',
'password' = '',
'sink.enable-delete' = 'false', -- false means not to obtain the event type from RowKind
'sink.properties.columns' = 'id, name, __DORIS_DELETE_SIGN__' -- Explicitly specify the import columns of streamload
);
INSERT INTO DORIS_SINK
SELECT json_value(data,'$.id') as id,
json_value(data,'$.name') as name,
if(op_type='delete',1,0) as __DORIS_DELETE_SIGN__
from KAFKA_SOURCE;
### Flink CDC Synchronize DDL Statements
Generally, when synchronizing an upstream data source such as MySQL, adding or deleting fields upstream requires a corresponding Schema Change operation in Doris.
For this scenario, you usually need to write a program for the DataStream API
and use the JsonDebeziumSchemaSerializer serializer provided by DorisSink to
automatically perform SchemaChange. For details, please refer to
[CDCSchemaChangeExample.java](https://github.com/apache/doris-flink-
connector/blob/master/flink-doris-
connector/src/test/java/org/apache/doris/flink/example/CDCSchemaChangeExample.java)
In the whole database synchronization tool provided by the Connector, no
additional configuration is required, and the upstream DDL will be
automatically synchronized and the SchemaChange operation will be performed in
Doris.
## Frequently Asked Questions (FAQ)
1. **errCode = 2, detailMessage = Label [label_0_1] has already been used, relate to txn [19650]**
In the Exactly-Once scenario, the Flink Job must be restarted from the latest
Checkpoint/Savepoint, otherwise the above error will be reported. When
Exactly-Once is not required, this problem can also be solved by disabling 2PC
submission (sink.enable-2pc=false) or changing to a different sink.label-
prefix.
2. **errCode = 2, detailMessage = transaction [19650] not found**
This occurs during the Commit stage. The transaction ID recorded in the
checkpoint has expired on the FE side. When committing again at this time, the
above error will occur. At this point, it's impossible to start from the
checkpoint. Subsequently, you can extend the expiration time by modifying the
`streaming_label_keep_max_second` configuration in `fe.conf`. The default
expiration time is 12 hours. Since Doris 2.0, the number of labels is also limited by the `label_num_threshold` configuration in `fe.conf` (default 2000), which can be increased or set to -1 (meaning limited only by time); a sketch of these `fe.conf` entries follows.
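A minimal sketch of the corresponding `fe.conf` entries (the values are illustrative, not recommendations):
# keep transaction labels for 24 hours instead of the default 12
streaming_label_keep_max_second = 86400
# raise the per-database label limit, or use -1 to limit by time only
label_num_threshold = -1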
3. **errCode = 2, detailMessage = current running txns on db 10006 is 100, larger than limit 100**
This is because the concurrent imports into the same database exceed 100. It
can be solved by adjusting the parameter `max_running_txn_num_per_db` in
`fe.conf`. For specific details, please refer to
[max_running_txn_num_per_db](https://doris.apache.org/zh-CN/docs/dev/admin-
manual/config/fe-config/#max_running_txn_num_per_db).
Meanwhile, frequently modifying the label and restarting a task may also lead
to this error. In the 2pc scenario (for Duplicate/Aggregate models), the label
of each task needs to be unique. And when restarting from a checkpoint, the
Flink task will actively abort the transactions that have been pre-committed
successfully but not yet committed. Frequent label modifications and restarts
will result in a large number of pre-committed successful transactions that
cannot be aborted and thus occupy transactions. In the Unique model, 2pc can
also be disabled to achieve idempotent writes.
4. **tablet writer write failed, tablet_id=190958, txn_id=3505530, err=-235**
This usually occurs before Connector version 1.1.0 and is caused by too high a
writing frequency, which leads to an excessive number of versions. You can
reduce the frequency of Streamload by setting the `sink.batch.size` and
`sink.batch.interval` parameters. After Connector version 1.1.0, the default
writing timing is controlled by Checkpoint, and you can reduce the writing
frequency by increasing the Checkpoint interval.
5. **How to skip dirty data when Flink is importing?**
When Flink imports data, if there is dirty data, such as issues with field
formats or lengths, it will cause StreamLoad to report errors. At this time,
Flink will keep retrying. If you need to skip such data, you can disable the
strict mode of StreamLoad (by setting `strict_mode=false` and
`max_filter_ratio=1`) or filter the data before the Sink operator.
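Since Stream Load parameters are passed through `sink.properties.*`, a hedged sketch of a sink that relaxes strict mode looks like this; note that unqualified rows are then silently dropped, and the host and table values are placeholders:
CREATE TABLE doris_sink (
    id INT,
    name STRING
) WITH (
    'connector' = 'doris',
    'fenodes' = '127.0.0.1:8030',
    'table.identifier' = 'db.tbl',
    'username' = 'root',
    'password' = '',
    'sink.label-prefix' = 'doris_label',
    'sink.properties.strict_mode' = 'false',    -- do not fail on malformed rows
    'sink.properties.max_filter_ratio' = '1'    -- tolerate filtered rows up to 100%
);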
6. **How to configure when the network between Flink machines and BE machines is not connected?**
When Flink initiates a write to Doris, Doris redirects the write operation to BE, and the address returned is the internal network IP of BE, that is, the IP shown by the `show backends` command. If Flink cannot reach that address, the write fails with an error. In this case, you can configure the external network IPs of the BEs in `benodes`, as sketched below.
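A sketch of such a sink definition follows; the addresses are placeholders, 8040 is assumed to be the BE HTTP port, and disabling `auto-redirect` here is an assumption so that the connector writes to the configured BE addresses instead of going through FE:
CREATE TABLE doris_sink (
    id INT,
    name STRING
) WITH (
    'connector' = 'doris',
    'fenodes' = 'fe_public_ip:8030',
    'benodes' = 'be_public_ip_1:8040,be_public_ip_2:8040',   -- externally reachable BE HTTP addresses (assumed default port 8040)
    'auto-redirect' = 'false',                                -- assumption: write to BE directly instead of through FE
    'table.identifier' = 'db.tbl',
    'username' = 'root',
    'password' = '',
    'sink.label-prefix' = 'doris_label'
);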
7. **stream load error: HTTP/1.1 307 Temporary Redirect**
Flink first sends the request to FE and, upon receiving a 307, follows the redirect to BE. When FE is under FullGC, heavy load, or network delay, HttpClient by default starts sending the data if it does not receive a response within a certain period (3 seconds). Since the request body is an InputStream by default, the data cannot be replayed once the 307 response arrives, and an error is reported directly. There are three ways to solve this problem: 1. Upgrade to Connector 25.1.0 or above, which increases the default wait time; 2. Set auto-redirect=false to send requests directly to BE (not applicable to some cloud scenarios); 3. For the Unique key model, enable batch mode.
---
# Source: https://docs.velodb.io/cloud/4.x/integration/data-processing/spark-doris-connector
Version: 4.x
On this page
# Spark Doris Connector
The Spark Doris Connector supports reading data stored in Doris and writing data to Doris through Spark.
Github:
* Support reading data in batch mode from `Doris` through `RDD`, `DataFrame` and `Spark SQL`. It is recommended to use `DataFrame` or `Spark SQL`
* Support writing data to `Doris` in batch or streaming mode with DataFrame API and Spark SQL.
* You can map the `Doris` table to a `DataFrame` or `RDD`; `DataFrame` is recommended.
* Support the completion of data filtering on the `Doris` side to reduce the amount of data transmission.
## Version Compatibility
| Connector | Spark | Doris | Java | Scala |
|---|---|---|---|---|
| 25.1.0 | 3.5 - 3.1, 2.4 | 1.0+ | 8 | 2.12, 2.11 |
| 25.0.1 | 3.5 - 3.1, 2.4 | 1.0+ | 8 | 2.12, 2.11 |
| 25.0.0 | 3.5 - 3.1, 2.4 | 1.0+ | 8 | 2.12, 2.11 |
| 24.0.0 | 3.5 - 3.1, 2.4 | 1.0+ | 8 | 2.12, 2.11 |
| 1.3.2 | 3.4 - 3.1, 2.4, 2.3 | 1.0 - 2.1.6 | 8 | 2.12, 2.11 |
| 1.3.1 | 3.4 - 3.1, 2.4, 2.3 | 1.0 - 2.1.0 | 8 | 2.12, 2.11 |
| 1.3.0 | 3.4 - 3.1, 2.4, 2.3 | 1.0 - 2.1.0 | 8 | 2.12, 2.11 |
| 1.2.0 | 3.2, 3.1, 2.3 | 1.0 - 2.0.2 | 8 | 2.12, 2.11 |
| 1.1.0 | 3.2, 3.1, 2.3 | 1.0 - 1.2.8 | 8 | 2.12, 2.11 |
| 1.0.1 | 3.1, 2.3 | 0.12 - 0.15 | 8 | 2.12, 2.11 |
## How To Use
### Maven
<dependency>
    <groupId>org.apache.doris</groupId>
    <artifactId>spark-doris-connector-spark-3.5</artifactId>
    <version>25.1.0</version>
</dependency>
::: tip
Starting from version 24.0.0, the naming rules of the Doris connector package
have been adjusted:
1. No longer contains Scala version information.
2. For Spark 2.x versions, use the package named `spark-doris-connector-spark-2` uniformly, and by default only compile based on Scala 2.11 version. If you need Scala 2.12 version, please compile it yourself.
3. For Spark 3.x versions, use the package named `spark-doris-connector-spark-3.x` according to the specific Spark version. Applications based on Spark 3.0 version can use the package `spark-doris-connector-spark-3.1`.
:::
**Note**
1. Please replace the corresponding Connector version according to different Spark and Scala versions.
2. You can also download the relevant version jar package from [here](https://repo.maven.apache.org/maven2/org/apache/doris/).
### Compile
To compile, run `sh build.sh` in the source code directory and enter the Scala and Spark versions you need according to the prompts.
After successful compilation, the target jar package will be generated in the `dist` directory, such as `spark-doris-connector-spark-3.5-25.1.0.jar`. Copy this file to the `ClassPath` of `Spark` to use `Spark-Doris-Connector`. For example, if `Spark` is running in `Local` mode, put this file in the `jars/` folder. If `Spark` is running in `Yarn` cluster mode, put this file in the pre-deployment package (an alternative using `--jars` is sketched at the end of this section).
For example, upload `spark-doris-connector-spark-3.5-25.1.0.jar` to HDFS and add the jar package path on HDFS to the `spark.yarn.jars` parameter:
1. Upload `spark-doris-connector-spark-3.5-25.1.0.jar` to hdfs.
hdfs dfs -mkdir /spark-jars/
hdfs dfs -put /your_local_path/spark-doris-connector-spark-3.5-25.1.0.jar /spark-jars/
2. Add the `spark-doris-connector-spark-3.5-25.1.0.jar` dependency in the cluster.
spark.yarn.jars=hdfs:///spark-jars/spark-doris-connector-spark-3.5-25.1.0.jar
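As an alternative to copying the jar into the classpath directories above, it can also be passed at submission time; a sketch, where paths, versions, and the main class are placeholders:
spark-shell --jars /your_local_path/spark-doris-connector-spark-3.5-25.1.0.jar
spark-submit --jars /your_local_path/spark-doris-connector-spark-3.5-25.1.0.jar --class your.main.Class your-app.jar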
## Example
### Batch Read
#### RDD
import org.apache.doris.spark._
val dorisSparkRDD = sc.dorisRDD(
tableIdentifier = Some("$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME"),
cfg = Some(Map(
"doris.fenodes" -> "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT",
"doris.request.auth.user" -> "$YOUR_DORIS_USERNAME",
"doris.request.auth.password" -> "$YOUR_DORIS_PASSWORD"
))
)
dorisSparkRDD.collect()
#### DataFrame
val dorisSparkDF = spark.read.format("doris")
.option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME")
.option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT")
.option("user", "$YOUR_DORIS_USERNAME")
.option("password", "$YOUR_DORIS_PASSWORD")
.load()
dorisSparkDF.show(5)
#### Spark SQL
CREATE TEMPORARY VIEW spark_doris
USING doris
OPTIONS(
"table.identifier"="$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME",
"fenodes"="$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT",
"user"="$YOUR_DORIS_USERNAME",
"password"="$YOUR_DORIS_PASSWORD"
);
SELECT * FROM spark_doris;
#### pySpark
dorisSparkDF = spark.read.format("doris")
.option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME")
.option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT")
.option("user", "$YOUR_DORIS_USERNAME")
.option("password", "$YOUR_DORIS_PASSWORD")
.load()
// show 5 lines data
dorisSparkDF.show(5)
#### Reading via Arrow Flight SQL
Starting from version 24.0.0, data can be read via Arrow Flight SQL (Doris
version >= 2.1.0 is required).
Set `doris.read.mode` to arrow, set `doris.read.arrow-flight-sql.port` to the
Arrow Flight SQL port configured by FE.
For server configuration, refer to [High-speed data transmission link based on
Arrow Flight SQL](https://doris.apache.org/zh-CN/docs/dev/db-connect/arrow-
flight-sql-connect).
val df = spark.read.format("doris")
.option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME")
.option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT")
.option("doris.user", "$YOUR_DORIS_USERNAME")
.option("doris.password", "$YOUR_DORIS_PASSWORD")
.option("doris.read.mode", "arrow")
.option("doris.read.arrow-flight-sql.port", "12345")
.load()
df.show()
### Batch Write
#### DataFrame
val mockDataDF = List(
(3, "440403001005", "21.cn"),
(1, "4404030013005", "22.cn"),
(33, null, "23.cn")
).toDF("id", "mi_code", "mi_name")
mockDataDF.show(5)
mockDataDF.write.format("doris")
.option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME")
.option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT")
.option("user", "$YOUR_DORIS_USERNAME")
.option("password", "$YOUR_DORIS_PASSWORD")
//other options
//specify the fields to write
.option("doris.write.fields", "$YOUR_FIELDS_TO_WRITE")
// Support setting Overwrite mode to overwrite data
// .mode(SaveMode.Overwrite)
.save()
#### Spark SQL
CREATE TEMPORARY VIEW spark_doris
USING doris
OPTIONS(
"table.identifier"="$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME",
"fenodes"="$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT",
"user"="$YOUR_DORIS_USERNAME",
"password"="$YOUR_DORIS_PASSWORD"
);
INSERT INTO spark_doris VALUES ("VALUE1", "VALUE2", ...);
-- insert into select
INSERT INTO spark_doris SELECT * FROM YOUR_TABLE;
-- insert overwrite
INSERT OVERWRITE spark_doris SELECT * FROM YOUR_TABLE;
### Streaming Write
#### DataFrame
##### Write structured data
val df = spark.readStream.format("your_own_stream_source").load()
df.writeStream
.format("doris")
.option("checkpointLocation", "$YOUR_CHECKPOINT_LOCATION")
.option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME")
.option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT")
.option("user", "$YOUR_DORIS_USERNAME")
.option("password", "$YOUR_DORIS_PASSWORD")
.start()
.awaitTermination()
##### Write directly
If the first column of data in the data stream is formatted data that conforms
to the `Doris` table structure, such as CSV format data with the same column
order, or JSON format data with the same field name, it can be written
directly to `Doris` by setting the `doris.sink.streaming.passthrough` option
to `true` without converting to `DataFrame`.
Take Kafka as an example, and assume the table structure to be written is:
CREATE TABLE `t2` (
`c0` int NULL,
`c1` varchar(10) NULL,
`c2` date NULL
) ENGINE=OLAP
DUPLICATE KEY(`c0`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`c0`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);
The value of the message is `{"c0":1,"c1":"a","dt":"2024-01-01"}` in json
format.
val kafkaSource = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "$YOUR_KAFKA_SERVERS")
.option("startingOffsets", "latest")
.option("subscribe", "$YOUR_KAFKA_TOPICS")
.load()
// Select the value of the message as the first column of the DataFrame.
kafkaSource.selectExpr("CAST(value as STRING)")
.writeStream
.format("doris")
.option("checkpointLocation", "$YOUR_CHECKPOINT_LOCATION")
.option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME")
.option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT")
.option("user", "$YOUR_DORIS_USERNAME")
.option("password", "$YOUR_DORIS_PASSWORD")
// Set this option to true, and the first column will be written directly without processing.
.option("doris.sink.streaming.passthrough", "true")
.option("doris.sink.properties.format", "json")
.start()
.awaitTermination()
#### Write in JSON format
Set `doris.sink.properties.format` to json
val df = spark.readStream.format("your_own_stream_source").load()
df.write.format("doris")
.option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT")
.option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME")
.option("user", "$YOUR_DORIS_USERNAME")
.option("password", "$YOUR_DORIS_PASSWORD")
.option("doris.sink.properties.format", "json")
.save()
### Spark Doris Catalog
Since version 24.0.0, accessing Doris through the Spark Catalog is supported.
#### Catalog Config
| Key | Required | Comment |
|---|---|---|
| spark.sql.catalog.your_catalog_name | true | The class name of the catalog provider; the only valid value for Doris is `org.apache.doris.spark.catalog.DorisTableCatalog`. |
| spark.sql.catalog.your_catalog_name.doris.fenodes | true | Doris FE nodes, in the format fe_ip:fe_http_port. |
| spark.sql.catalog.your_catalog_name.doris.query.port | false | Doris FE query port; unnecessary if `spark.sql.catalog.your_catalog_name.doris.fe.auto.fetch` is set to true. |
| spark.sql.catalog.your_catalog_name.doris.user | true | Doris user. |
| spark.sql.catalog.your_catalog_name.doris.password | true | Doris password. |
| spark.sql.defaultCatalog | false | The default catalog for Spark SQL. |
tip
All connector parameters that apply to DataFrame and Spark SQL can also be set on the catalog.
For example, if you want to write data in json format, you can set the option
`spark.sql.catalog.your_catalog_name.doris.sink.properties.format` to `json`.
#### DataFrame
val conf = new SparkConf()
conf.set("spark.sql.catalog.your_catalog_name", "org.apache.doris.spark.catalog.DorisTableCatalog")
conf.set("spark.sql.catalog.your_catalog_name.doris.fenodes", "192.168.0.1:8030")
conf.set("spark.sql.catalog.your_catalog_name.doris.query.port", "9030")
conf.set("spark.sql.catalog.your_catalog_name.doris.user", "root")
conf.set("spark.sql.catalog.your_catalog_name.doris.password", "")
val spark = builder.config(conf).getOrCreate()
spark.sessionState.catalogManager.setCurrentCatalog("your_catalog_name")
// show all databases
spark.sql("show databases")
// use databases
spark.sql("use your_doris_db")
// show tables in test
spark.sql("show tables")
// query table
spark.sql("select * from your_doris_table")
// write data
spark.sql("insert into your_doris_table values(xxx)")
#### Spark SQL
Start Spark SQL CLI with necessary config.
spark-sql \
--conf "spark.sql.catalog.your_catalog_name=org.apache.doris.spark.catalog.DorisTableCatalog" \
--conf "spark.sql.catalog.your_catalog_name.doris.fenodes=192.168.0.1:8030" \
--conf "spark.sql.catalog.your_catalog_name.doris.query.port=9030" \
--conf "spark.sql.catalog.your_catalog_name.doris.user=root" \
--conf "spark.sql.catalog.your_catalog_name.doris.password=" \
--conf "spark.sql.defaultCatalog=your_catalog_name"
Execute query in Spark SQL CLI.
-- show all databases
show databases;
-- use databases
use your_doris_db;
-- show tables in test
show tables;
-- query table
select * from your_doris_table;
-- write data
insert into your_doris_table values(xxx);
insert into your_doris_table select * from your_source_table;
-- access table with full name
select * from your_catalog_name.your_doris_db.your_doris_table;
insert into your_catalog_name.your_doris_db.your_doris_table values(xxx);
insert into your_catalog_name.your_doris_db.your_doris_table select * from your_source_table;
## Configuration
### General
| Key | Default Value | Comment |
|---|---|---|
| doris.fenodes | -- | Doris FE http address; multiple addresses are supported, separated by commas. |
| doris.table.identifier | -- | Doris table identifier, e.g. db1.tbl1. |
| doris.user | -- | Doris username. |
| doris.password | Empty string | Doris password. |
| doris.request.retries | 3 | Number of retries for sending requests to Doris. |
| doris.request.connect.timeout.ms | 30000 | Connection timeout for sending requests to Doris. |
| doris.request.read.timeout.ms | 30000 | Read timeout for sending requests to Doris. |
| doris.request.query.timeout.s | 21600 | Query timeout for Doris; the default is 6 hours, and -1 means no timeout limit. |
| doris.request.tablet.size | 1 | The number of Doris Tablets corresponding to one RDD partition. The smaller this value is set, the more partitions are generated. This increases the parallelism on the Spark side but also puts more pressure on Doris. |
| doris.read.field | -- | List of column names to read from the Doris table, separated by commas. |
| doris.batch.size | 4064 | The maximum number of rows read from BE at one time. Increasing this value reduces the number of connections between Spark and Doris, thereby reducing the extra overhead caused by network latency. |
| doris.exec.mem.limit | 8589934592 | Memory limit for a single query. The default is 8GB, in bytes. |
| doris.write.fields | -- | Specifies the fields (or the order of the fields) to write to the Doris table, separated by commas. By default, all fields are written in the order of the Doris table fields. |
| doris.sink.batch.size | 500000 | Maximum number of rows in a single write to BE. |
| doris.sink.max-retries | 0 | Number of retries after a write to BE fails. Since version 1.3.0, the default value is 0, meaning no retries are performed by default. When this parameter is set greater than 0, batch-level failure retries are performed, and data of the size configured by `doris.sink.batch.size` is cached in Spark Executor memory; the memory allocation may need to be increased accordingly. |
| doris.sink.retry.interval.ms | 10000 | When retries are configured, the interval between retries, in ms. |
| doris.sink.properties.format | -- | Data format of Stream Load. Supported formats: csv, json, arrow. [More parameter details](/cloud/4.x/user-guide/data-operate/import/import-way/stream-load-manual) |
| doris.sink.properties.* | -- | Import parameters for Stream Load. For example, specify the column separator with `'doris.sink.properties.column_separator' = ','`. [More parameter details](/cloud/4.x/user-guide/data-operate/import/import-way/stream-load-manual) |
| doris.sink.task.partition.size | -- | The number of partitions corresponding to the write task. After filtering and other operations, the number of partitions in the Spark RDD may be large while each partition holds relatively few records, which increases the write frequency and wastes computing resources. The smaller this value, the lower the Doris write frequency and the lower the Doris merge pressure. It is generally used together with doris.sink.task.use.repartition. |
| doris.sink.task.use.repartition | false | Whether to use repartition mode to control the number of partitions written to Doris. The default is false, and coalesce is used (note: if there is no Spark action before the write, the whole computation has less parallelism). If set to true, repartition is used (note: you can set the final number of partitions at the cost of a shuffle). |
| doris.sink.batch.interval.ms | 0 | The interval between batch sinks, in ms. |
| doris.sink.enable-2pc | false | Whether to enable two-phase commit. When enabled, transactions are committed at the end of the job, and all pre-committed transactions are rolled back when any task fails. |
| doris.sink.auto-redirect | true | Whether to redirect StreamLoad requests. When enabled, StreamLoad writes through FE and no longer obtains BE information explicitly. |
| doris.enable.https | false | Whether to enable FE HTTPS requests. |
| doris.https.key-store-path | - | HTTPS key store path. |
| doris.https.key-store-type | JKS | HTTPS key store type. |
| doris.https.key-store-password | - | HTTPS key store password. |
| doris.read.mode | thrift | Doris read mode, either `thrift` or `arrow`. |
| doris.read.arrow-flight-sql.port | - | Arrow Flight SQL port of the Doris FE. When `doris.read.mode` is `arrow`, it is used to read data via Arrow Flight SQL. For server configuration, see [High-speed data transmission link based on Arrow Flight SQL](https://doris.apache.org/zh-CN/docs/dev/db-connect/arrow-flight-sql-connect). |
| doris.sink.label.prefix | spark-doris | The import label prefix when writing in Stream Load mode. |
| doris.thrift.max.message.size | 2147483647 | The maximum size of a message when reading data via Thrift. |
| doris.fe.auto.fetch | false | Whether to automatically obtain FE information. When set to true, all FE node information is requested from the nodes configured in `doris.fenodes`, so there is no need to configure multiple nodes or to configure `doris.read.arrow-flight-sql.port` and `doris.query.port` separately. |
| doris.read.bitmap-to-string | false | Whether to convert the Bitmap type to a string composed of array indexes when reading. For the result format, see the function definition [BITMAP_TO_STRING](/cloud/4.x/sql-manual/sql-functions/scalar-functions/bitmap-functions/bitmap-to-string). |
| doris.read.bitmap-to-base64 | false | Whether to convert the Bitmap type to a Base64-encoded string when reading. For the result format, see the function definition [BITMAP_TO_BASE64](/cloud/4.x/sql-manual/sql-functions/scalar-functions/bitmap-functions/bitmap-to-base64). |
| doris.query.port | - | Doris FE query port, used for overwrite writes and for obtaining Catalog metadata. |
### SQL & Dataframe Configuration
| Key | Default Value | Comment |
|---|---|---|
| doris.filter.query.in.max.count | 100 | In predicate pushdown, the maximum number of elements in the IN expression value list. If this number is exceeded, the IN-expression filtering is processed on the Spark side. |
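For instance, the threshold could be raised for a read whose filters use large IN lists; a sketch with an illustrative value:
spark.read.format("doris")
    .option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME")
    .option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT")
    .option("user", "$YOUR_DORIS_USERNAME")
    .option("password", "$YOUR_DORIS_PASSWORD")
    .option("doris.filter.query.in.max.count", "1000")  // push down IN lists with up to 1000 elements
    .load()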
### Structured Streaming Configuration
| Key | Default Value | Comment |
|---|---|---|
| doris.sink.streaming.passthrough | false | Write the value of the first column directly without processing. |
### RDD Configuration
| Key | Default Value | Comment |
|---|---|---|
| doris.request.auth.user | -- | Doris username. |
| doris.request.auth.password | -- | Doris password. |
| doris.filter.query | -- | Filter expression of the query, which is passed through to Doris. Doris uses this expression to filter data at the source. |
## Doris & Spark Column Type Mapping
| Doris Type | Spark Type |
|---|---|
| NULL_TYPE | DataTypes.NullType |
| BOOLEAN | DataTypes.BooleanType |
| TINYINT | DataTypes.ByteType |
| SMALLINT | DataTypes.ShortType |
| INT | DataTypes.IntegerType |
| BIGINT | DataTypes.LongType |
| FLOAT | DataTypes.FloatType |
| DOUBLE | DataTypes.DoubleType |
| DATE | DataTypes.DateType |
| DATETIME | DataTypes.TimestampType |
| DECIMAL | DecimalType |
| CHAR | DataTypes.StringType |
| LARGEINT | DecimalType |
| VARCHAR | DataTypes.StringType |
| STRING | DataTypes.StringType |
| JSON | DataTypes.StringType |
| VARIANT | DataTypes.StringType |
| TIME | DataTypes.DoubleType |
| HLL | DataTypes.StringType |
| Bitmap | DataTypes.StringType |
tip
Since version 24.0.0, the Bitmap type is returned as a string, and the default return value is the string `Read unsupported`.
## FAQ
1. **How to write Bitmap type?**
In Spark SQL, when writing data through insert into, if the target table of
doris contains data of type `BITMAP` or `HLL`, you need to set the option
`doris.ignore-type` to the corresponding type and map the columns through
`doris.write.fields`. The usage is as follows:
**BITMAP**
CREATE TEMPORARY VIEW spark_doris
USING doris
OPTIONS(
"table.identifier"="$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME",
"fenodes"="$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT",
"user"="$YOUR_DORIS_USERNAME",
"password"="$YOUR_DORIS_PASSWORD"
"doris.ignore-type"="bitmap",
"doris.write.fields"="col1,col2,col3,bitmap_col2=to_bitmap(col2),bitmap_col3=bitmap_hash(col3)"
);
**HLL**
CREATE TEMPORARY VIEW spark_doris
USING doris
OPTIONS(
"table.identifier"="$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME",
"fenodes"="$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT",
"user"="$YOUR_DORIS_USERNAME",
"password"="$YOUR_DORIS_PASSWORD"
"doris.ignore-type"="hll",
"doris.write.fields"="col1,hll_col1=hll_hash(col1)"
);
tip
Since version 24.0.0, `doris.ignore-type` has been deprecated and there is no
need to add this parameter when writing.
2. **How to use overwrite to write?**
Since version 1.3.0, overwrite mode writing is supported (only supports data
overwriting at the full table level). The specific usage is as follows:
**DataFrame**
resultDf.format("doris")
.option("doris.fenodes","$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT")
// your own options
.mode(SaveMode.Overwrite)
.save()
**SQL**
INSERT OVERWRITE your_target_table SELECT * FROM your_source_table
3. **How to read Bitmap type**
Starting from version 24.0.0, it supports reading converted Bitmap data
through Arrow Flight SQL (Doris version >= 2.1.0 is required).
**Bitmap to string**
`DataFrame` example is as follows, set `doris.read.bitmap-to-string` to true.
For the specific result format, see the option definition.
spark.read.format("doris")
.option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME")
.option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT")
.option("user", "$YOUR_DORIS_USERNAME")
.option("password", "$YOUR_DORIS_PASSWORD")
.option("doris.read.bitmap-to-string","true")
.load()
**Bitmap to base64**
`DataFrame` example is as follows, set `doris.read.bitmap-to-base64` to true.
For the specific result format, see the option definition.
spark.read.format("doris")
.option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME")
.option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT")
.option("user", "$YOUR_DORIS_USERNAME")
.option("password", "$YOUR_DORIS_PASSWORD")
.option("doris.read.bitmap-to-base64","true")
.load()
4. **An error occurs when writing in DataFrame mode:`org.apache.spark.sql.AnalysisException: TableProvider implementation doris cannot be written with ErrorIfExists mode, please use Append or Overwrite modes instead.`**
Need to add save mode to append.
resultDf.format("doris")
.option("doris.fenodes","$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT")
// your own options
.mode(SaveMode.Append)
.save()
---
# Source: https://docs.velodb.io/cloud/4.x/integration/data-source/doris-kafka-connector
Version: 4.x
On this page
# Doris Kafka Connector
[Kafka Connect](https://docs.confluent.io/platform/current/connect/index.html) is a scalable and reliable tool for transmitting data between Apache Kafka and other systems, in which connectors can be defined to move large amounts of data in and out of Kafka.
The Doris community provides the [doris-kafka-connector](https://github.com/apache/doris-kafka-connector) plug-in, which can write data from Kafka topics to Doris.
## Version Description
| Connector Version | Kafka Version | Doris Version | Java Version |
|---|---|---|---|
| 1.0.0 | 2.4+ | 2.0+ | 8 |
| 1.1.0 | 2.4+ | 2.0+ | 8 |
| 24.0.0 | 2.4+ | 2.0+ | 8 |
| 25.0.0 | 2.4+ | 2.0+ | 8 |
## Usage
### Download
[doris-kafka-connector](https://doris.apache.org/download)
Maven dependency:
<dependency>
    <groupId>org.apache.doris</groupId>
    <artifactId>doris-kafka-connector</artifactId>
    <version>25.0.0</version>
</dependency>
### Standalone mode startup
Create the plugins directory under $KAFKA_HOME and put the downloaded doris-
kafka-connector jar package into it
Configure config/connect-standalone.properties
# Modify broker address
bootstrap.servers=127.0.0.1:9092
# Modify to the created plugins directory
# Note: Please fill in the direct path to Kafka here. For example: plugin.path=/opt/kafka/plugins
plugin.path=$KAFKA_HOME/plugins
# It is recommended to increase the max.poll.interval.ms time of Kafka to more than 30 minutes, the default is 5 minutes
# Avoid Stream Load import data consumption timeout and consumers being kicked out of the consumer group
max.poll.interval.ms=1800000
consumer.max.poll.interval.ms=1800000
Configure doris-connector-sink.properties
Create doris-connector-sink.properties in the config directory and configure
the following content:
name=test-doris-sink
connector.class=org.apache.doris.kafka.connector.DorisSinkConnector
topics=topic_test
doris.topic2table.map=topic_test:test_kafka_tbl
buffer.count.records=10000
buffer.flush.time=120
buffer.size.bytes=5000000
doris.urls=10.10.10.1
doris.http.port=8030
doris.query.port=9030
doris.user=root
doris.password=
doris.database=test_db
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
Start Standalone
$KAFKA_HOME/bin/connect-standalone.sh -daemon $KAFKA_HOME/config/connect-standalone.properties $KAFKA_HOME/config/doris-connector-sink.properties
note
Note: It is generally not recommended to use standalone mode in a production
environment.
### Distributed mode startup
Create the plugins directory under $KAFKA_HOME and put the downloaded doris-
kafka-connector jar package into it
Configure config/connect-distributed.properties
# Modify kafka server address
bootstrap.servers=127.0.0.1:9092
# Modify group.id, the same cluster needs to be consistent
group.id=connect-cluster
# Modify to the created plugins directory
# Note: Please fill in the direct path to Kafka here. For example: plugin.path=/opt/kafka/plugins
plugin.path=$KAFKA_HOME/plugins
# It is recommended to increase the max.poll.interval.ms time of Kafka to more than 30 minutes, the default is 5 minutes
# Avoid Stream Load import data consumption timeout and consumers being kicked out of the consumer group
max.poll.interval.ms=1800000
consumer.max.poll.interval.ms=1800000
Start Distributed
$KAFKA_HOME/bin/connect-distributed.sh -daemon $KAFKA_HOME/config/connect-distributed.properties
Add Connector
curl -i http://127.0.0.1:8083/connectors -H "Content-Type: application/json" -X POST -d '{
"name":"test-doris-sink-cluster",
"config":{
"connector.class":"org.apache.doris.kafka.connector.DorisSinkConnector",
"topics":"topic_test",
"doris.topic2table.map": "topic_test:test_kafka_tbl",
"buffer.count.records":"10000",
"buffer.flush.time":"120",
"buffer.size.bytes":"5000000",
"doris.urls":"10.10.10.1",
"doris.user":"root",
"doris.password":"",
"doris.http.port":"8030",
"doris.query.port":"9030",
"doris.database":"test_db",
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"value.converter":"org.apache.kafka.connect.json.JsonConverter"
}
}'
Operation Connector
# View connector status
curl -i http://127.0.0.1:8083/connectors/test-doris-sink-cluster/status -X GET
# Delete connector
curl -i http://127.0.0.1:8083/connectors/test-doris-sink-cluster -X DELETE
# Pause connector
curl -i http://127.0.0.1:8083/connectors/test-doris-sink-cluster/pause -X PUT
# Restart connector
curl -i http://127.0.0.1:8083/connectors/test-doris-sink-cluster/resume -X PUT
# Restart tasks within the connector
curl -i http://127.0.0.1:8083/connectors/test-doris-sink-cluster/tasks/0/restart -X POST
Refer to: [Connect REST
Interface](https://docs.confluent.io/platform/current/connect/references/restapi.html#kconnect-
rest-interface)
note
Note that when kafka-connect is started for the first time, three topics, `config.storage.topic`, `offset.storage.topic`, and `status.storage.topic`, will be created in the Kafka cluster to record kafka-connect's shared connector configuration, offset data, and status updates. [How to Use Kafka Connect - Get Started](https://docs.confluent.io/platform/current/connect/userguide.html)
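If you want to pin these topic names explicitly, they can be set in `connect-distributed.properties`; the names below are common Kafka Connect examples, not values required by Doris:
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status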
### Access an SSL-certified Kafka cluster
Accessing an SSL-authenticated Kafka cluster through kafka-connect requires the user to provide a truststore file (client.truststore.jks) used to authenticate the Kafka broker's public key. You can add the following configuration to the `connect-distributed.properties` file:
# Connect worker
security.protocol=SSL
ssl.truststore.location=/var/ssl/private/client.truststore.jks
ssl.truststore.password=test1234
# Embedded consumer for sink connectors
consumer.security.protocol=SSL
consumer.ssl.truststore.location=/var/ssl/private/client.truststore.jks
consumer.ssl.truststore.password=test1234
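If you do not yet have a truststore, one common way to create it is to import the CA certificate that signed the broker certificates using the JDK keytool. This is only a sketch; the certificate file name, alias, and password below are assumptions and should be replaced with your own values:
# Import the CA certificate that signed the Kafka broker certificates into a new truststore
keytool -importcert -noprompt -alias CARoot -file ca-cert \
  -keystore /var/ssl/private/client.truststore.jks -storepass test1234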
For instructions on connecting kafka-connect to an SSL-authenticated Kafka cluster, please refer to: [Configure Kafka Connect](https://docs.confluent.io/5.1.2/tutorials/security_tutorial.html#configure-kconnect-long)
### Dead letter queue
By default, any error encountered during record conversion or processing will cause the connector to fail. Each connector configuration can also tolerate such errors by skipping them, optionally writing the details of each error, the failed operation, and the problematic record (with varying levels of detail) to a dead-letter queue for logging.
errors.tolerance=all
errors.deadletterqueue.topic.name=test_error_topic
errors.deadletterqueue.context.headers.enable=true
errors.deadletterqueue.topic.replication.factor=1
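Records routed to the dead-letter queue can be inspected like any other topic, for example with the console consumer. Printing the error-context headers requires Kafka 2.7 or later, and the broker address below is an assumption:
$KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9092 \
  --topic test_error_topic --from-beginning --property print.headers=true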
## Configuration items
| Key | Enum | Default Value | Required | Description |
|---|---|---|---|---|
| name | - | - | Y | Connect application name; must be unique within the Kafka Connect environment. |
| connector.class | - | - | Y | org.apache.doris.kafka.connector.DorisSinkConnector |
| topics | - | - | Y | List of subscribed topics, separated by commas, e.g. topic1,topic2 |
| doris.urls | - | - | Y | Doris FE connection address. If there are multiple, separate them with commas, e.g. 10.20.30.1,10.20.30.2,10.20.30.3 |
| doris.http.port | - | - | Y | Doris HTTP protocol port |
| doris.query.port | - | - | Y | Doris MySQL protocol port |
| doris.user | - | - | Y | Doris username |
| doris.password | - | - | Y | Doris password |
| doris.database | - | - | Y | The database to write to. It can be empty when writing to multiple databases; in that case, the specific database names need to be configured in doris.topic2table.map. |
| doris.topic2table.map | - | - | N | The mapping between topics and tables, for example: topic1:tb1,topic2:tb2. The default is empty, indicating that topic and table names correspond one to one. The format for multiple databases is topic1:db1.tbl1,topic2:db2.tbl2 |
| buffer.count.records | - | 50000 | N | The number of records each Kafka partition buffers in memory before flushing to Doris. Default is 50000 records. |
| buffer.flush.time | - | 120 | N | Buffer flush interval, in seconds. Default is 120 seconds. |
| buffer.size.bytes | - | 10485760(100MB) | N | The cumulative size of records buffered in memory for each Kafka partition, in bytes. Default is 100MB. |
| jmx | - | true | N | Whether to expose the connector's internal monitoring metrics through JMX. Please refer to: [Doris-Connector-JMX](https://github.com/apache/doris-kafka-connector/blob/master/docs/en/Doris-Connector-JMX.md) |
| enable.2pc | - | true | N | Whether to enable two-phase commit (TwoPhaseCommit) for Stream Load. Default is true. |
| enable.delete | - | false | N | Whether to delete records synchronously. Default is false. |
| label.prefix | - | ${name} | N | Stream Load label prefix when importing data. Defaults to the Connector application name. |
| auto.redirect | - | true | N | Whether to redirect Stream Load requests. When enabled, Stream Load requests are redirected through the FE to the BE where the data needs to be written, and BE information is no longer exposed. |
| sink.properties.* | - | `'sink.properties.format':'json'`, `'sink.properties.read_json_by_line':'true'` | N | Import parameters for Stream Load, for example, defining the column separator: `'sink.properties.column_separator':','`. Detailed parameter reference [here](/cloud/4.x/user-guide/data-operate/import/import-way/stream-load-manual). **Enable Group Commit**, for example, in sync_mode: `"sink.properties.group_commit":"sync_mode"`. Group Commit can be configured with three modes: `off_mode`, `sync_mode`, and `async_mode`; for specific usage, please refer to [Group-Commit](/cloud/4.x/user-guide/data-operate/import/group-commit-manual). **Enable partial column update**, for example, updating only the specified column col2: `"sink.properties.partial_columns":"true"`, `"sink.properties.columns": "col2"`. |
| delivery.guarantee | `at_least_once`, `exactly_once` | at_least_once | N | How data consistency is guaranteed when Kafka data is imported into Doris. Supports `at_least_once` and `exactly_once`; default is `at_least_once`. Doris needs to be upgraded to 2.1.0 or above to guarantee `exactly_once`. |
| converter.mode | `normal`, `debezium_ingestion` | normal | N | Type conversion mode for upstream data when the Connector consumes Kafka data. `normal` means consuming Kafka data as-is without any type conversion. `debezium_ingestion` means the upstream Kafka data was collected by a CDC (Change Data Capture) tool such as Debezium and needs special type conversion to be supported. |
| debezium.schema.evolution | `none`, `basic` | none | N | When Debezium collects an upstream database system (such as MySQL) and structural changes occur, newly added fields can be synchronized to Doris. `none` means structural changes in the upstream database are not synchronized to Doris. `basic` means synchronizing the upstream database's column-add operations. Since changing the column structure is a dangerous operation (it may lead to accidentally deleting columns of the Doris table), currently only synchronizing upstream column additions is supported. When a column is renamed, the old column remains unchanged, and the Connector adds a new column to the target table and sinks the renamed column's data into the new column. |
| database.time_zone | - | UTC | N | When `converter.mode` is not `normal`, provides a way to specify the time zone conversion for date data types (such as datetime, date, timestamp). Default is UTC. |
| avro.topic2schema.filepath | - | - | N | Parses the Avro content in a topic by reading a locally provided Avro Schema file, decoupling from the Confluent Schema Registry. This configuration needs to be used with the `key.converter` or `value.converter` prefix. For example, local Avro Schema files for the avro-user and avro-product topics are configured as follows: `"value.converter.avro.topic2schema.filepath":"avro-user:file:///opt/avro_user.avsc, avro-product:file:///opt/avro_product.avsc"`. For specific usage, please refer to [#32](https://github.com/apache/doris-kafka-connector/pull/32). |
| record.tablename.field | - | - | N | With this parameter, data from one Kafka topic can flow to multiple Doris tables. For configuration details, refer to [#58](https://github.com/apache/doris-kafka-connector/pull/58). |
| enable.combine.flush | `true`, `false` | false | N | Whether to merge the data from all partitions and write it together. Default is false. When enabled, only at_least_once semantics are guaranteed. |
| max.retries | - | 10 | N | The maximum number of times to retry on errors before failing the task. |
| retry.interval.ms | - | 6000 | N | The time in milliseconds to wait following an error before attempting a retry. |
| behavior.on.null.values | `ignore`, `fail` | ignore | N | Defines how to handle records with null values. |
For other Kafka Connect Sink common configuration items, please refer to:
[connect_configuring](https://kafka.apache.org/documentation/#connect_configuring)
## Type mapping
The Doris Kafka Connector uses logical or primitive type mapping to resolve each column's data type.
Primitive types are the simple data types represented by Kafka Connect's `Schema`. Logical types usually use the `Struct` structure to represent complex types, or represent date and time types.
| Kafka Primitive Type | Doris Type |
|---|---|
| INT8 | TINYINT |
| INT16 | SMALLINT |
| INT32 | INT |
| INT64 | BIGINT |
| FLOAT32 | FLOAT |
| FLOAT64 | DOUBLE |
| BOOLEAN | BOOLEAN |
| STRING | STRING |
| BYTES | STRING |

| Kafka Logical Type | Doris Type |
|---|---|
| org.apache.kafka.connect.data.Decimal | DECIMAL |
| org.apache.kafka.connect.data.Date | DATE |
| org.apache.kafka.connect.data.Time | STRING |
| org.apache.kafka.connect.data.Timestamp | DATETIME |

| Debezium Logical Type | Doris Type |
|---|---|
| io.debezium.time.Date | DATE |
| io.debezium.time.Time | STRING |
| io.debezium.time.MicroTime | DATETIME |
| io.debezium.time.NanoTime | DATETIME |
| io.debezium.time.ZonedTime | DATETIME |
| io.debezium.time.Timestamp | DATETIME |
| io.debezium.time.MicroTimestamp | DATETIME |
| io.debezium.time.NanoTimestamp | DATETIME |
| io.debezium.time.ZonedTimestamp | DATETIME |
| io.debezium.data.VariableScaleDecimal | DOUBLE |
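As an illustration of how the primitive mapping applies, a record produced by the JsonConverter with an embedded schema (schemas.enable=true) might look like the following. The field names reuse the sample data later in this document and are illustrative only; the int64, string, and int32 fields would map to Doris BIGINT, STRING, and INT columns according to the table above:
{
  "schema": {
    "type": "struct",
    "fields": [
      {"type": "int64", "optional": false, "field": "user_id"},
      {"type": "string", "optional": true, "field": "name"},
      {"type": "int32", "optional": true, "field": "age"}
    ]
  },
  "payload": {"user_id": 1, "name": "Emily", "age": 25}
}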
## Best Practices
### Load plain JSON data
1. Import data sample
In Kafka, there is the following sample data
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-data-topic --from-beginning
{"user_id":1,"name":"Emily","age":25}
{"user_id":2,"name":"Benjamin","age":35}
{"user_id":3,"name":"Olivia","age":28}
{"user_id":4,"name":"Alexander","age":60}
{"user_id":5,"name":"Ava","age":17}
{"user_id":6,"name":"William","age":69}
{"user_id":7,"name":"Sophia","age":32}
{"user_id":8,"name":"James","age":64}
{"user_id":9,"name":"Emma","age":37}
{"user_id":10,"name":"Liam","age":64}
2. Create the table that needs to be imported
In Doris, create the target table with the following statement:
CREATE TABLE test_db.test_kafka_connector_tbl(
user_id BIGINT NOT NULL COMMENT "user id",
name VARCHAR(20) COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 12;
3. Create an import task
On the machine where Kafka-connect is deployed, submit the following import
task through the curl command
curl -i http://127.0.0.1:8083/connectors -H "Content-Type: application/json" -X POST -d '{
"name":"test-doris-sink-cluster",
"config":{
"connector.class":"org.apache.doris.kafka.connector.DorisSinkConnector",
"tasks.max":"10",
"topics":"test-data-topic",
"doris.topic2table.map": "test-data-topic:test_kafka_connector_tbl",
"buffer.count.records":"10000",
"buffer.flush.time":"120",
"buffer.size.bytes":"5000000",
"doris.urls":"10.10.10.1",
"doris.user":"root",
"doris.password":"",
"doris.http.port":"8030",
"doris.query.port":"9030",
"doris.database":"test_db",
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"value.converter":"org.apache.kafka.connect.storage.StringConverter"
}
}'
### Load data collected by Debezium components
1. The MySQL database has the following table
CREATE TABLE test.test_user (
user_id int NOT NULL ,
name varchar(20),
age int,
PRIMARY KEY (user_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
insert into test.test_user values(1,'zhangsan',20);
insert into test.test_user values(2,'lisi',21);
insert into test.test_user values(3,'wangwu',22);
2. Create the imported table in Doris
CREATE TABLE test_db.test_user(
user_id BIGINT NOT NULL COMMENT "user id",
name VARCHAR(20) COMMENT "name",
age INT COMMENT "age"
)
UNIQUE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 12;
3. Deploy the Debezium connector for MySQL component, refer to: [Debezium connector for MySQL](https://debezium.io/documentation/reference/stable/connectors/mysql.html)
4. Create doris-kafka-connector import task
Assume that the MySQL table data collected through Debezium is in the
`mysql_debezium.test.test_user` Topic
curl -i http://127.0.0.1:8083/connectors -H "Content-Type: application/json" -X POST -d '{
"name":"test-debezium-doris-sink",
"config":{
"connector.class":"org.apache.doris.kafka.connector.DorisSinkConnector",
"tasks.max":"10",
"topics":"mysql_debezium.test.test_user",
"doris.topic2table.map": "mysql_debezium.test.test_user:test_user",
"buffer.count.records":"10000",
"buffer.flush.time":"120",
"buffer.size.bytes":"5000000",
"doris.urls":"10.10.10.1",
"doris.user":"root",
"doris.password":"",
"doris.http.port":"8030",
"doris.query.port":"9030",
"doris.database":"test_db",
"converter.mode":"debezium_ingestion",
"enable.delete":"true",
"key.converter":"org.apache.kafka.connect.json.JsonConverter",
"value.converter":"org.apache.kafka.connect.json.JsonConverter"
}
}'
### Load Avro serialized data
curl -i http://127.0.0.1:8083/connectors -H "Content-Type: application/json" -X POST -d '{
"name":"doris-avro-test",
"config":{
"connector.class":"org.apache.doris.kafka.connector.DorisSinkConnector",
"topics":"avro_topic",
"tasks.max":"10",
"doris.topic2table.map": "avro_topic:avro_tab",
"buffer.count.records":"100000",
"buffer.flush.time":"120",
"buffer.size.bytes":"10000000",
"doris.urls":"127.0.0.1",
"doris.user":"root",
"doris.password":"",
"doris.http.port":"8030",
"doris.query.port":"9030",
"doris.database":"test",
"load.model":"stream_load",
"key.converter":"io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url":"http://127.0.0.1:8081",
"value.converter":"io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url":"http://127.0.0.1:8081"
}
}'
### Load Protobuf serialized data
curl -i http://127.0.0.1:8083/connectors -H "Content-Type: application/json" -X POST -d '{
"name":"doris-protobuf-test",
"config":{
"connector.class":"org.apache.doris.kafka.connector.DorisSinkConnector",
"topics":"proto_topic",
"tasks.max":"10",
"doris.topic2table.map": "proto_topic:proto_tab",
"buffer.count.records":"100000",
"buffer.flush.time":"120",
"buffer.size.bytes":"10000000",
"doris.urls":"127.0.0.1",
"doris.user":"root",
"doris.password":"",
"doris.http.port":"8030",
"doris.query.port":"9030",
"doris.database":"test",
"load.model":"stream_load",
"key.converter":"io.confluent.connect.protobuf.ProtobufConverter",
"key.converter.schema.registry.url":"http://127.0.0.1:8081",
"value.converter":"io.confluent.connect.protobuf.ProtobufConverter",
"value.converter.schema.registry.url":"http://127.0.0.1:8081"
}
}'
### Loading Data with Kafka Connect Single Message Transforms
For example, consider data in the following format:
{
"registertime": 1513885135404,
"userid": "User_9",
"regionid": "Region_3",
"gender": "MALE"
}
To add a hard-coded column to Kafka messages, the InsertField transform can be used. Additionally, TimestampConverter can be used to convert BIGINT epoch timestamps to formatted time strings.
curl -i http://127.0.0.1:8083/connectors -H "Content-Type: application/json" -X POST -d '{
"name": "insert_field_tranform",
"config": {
"connector.class": "org.apache.doris.kafka.connector.DorisSinkConnector",
"tasks.max": "1",
"topics": "users",
"doris.topic2table.map": "users:kf_users",
"buffer.count.records": "10",
"buffer.flush.time": "11",
"buffer.size.bytes": "5000000",
"doris.urls": "127.0.0.1:8030",
"doris.user": "root",
"doris.password": "123456",
"doris.http.port": "8030",
"doris.query.port": "9030",
"doris.database": "testdb",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"transforms": "InsertField,TimestampConverter",
"transforms.InsertField.type": "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.InsertField.static.field": "repo",
"transforms.InsertField.static.value": "Apache Doris",
"transforms.TimestampConverter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.TimestampConverter.field": "registertime",
"transforms.TimestampConverter.format": "yyyy-MM-dd HH:mm:ss.SSS",
"transforms.TimestampConverter.target.type": "string"
}
}'
After InsertField and TimestampConverter transformations, the data becomes:
{
"userid": "User_9",
"regionid": "Region_3",
"gender": "MALE",
"repo": "Apache Doris",// Static field added
"registertime": "2017-12-21 03:38:55.404" // Unix timestamp converted to string
}
For more examples of Kafka Connect Single Message Transforms (SMT), please
refer to the [SMT
documentation](https://docs.confluent.io/cloud/current/connectors/transforms/overview.html).
## FAQ
**1\. The following error occurs when reading JSON type data:**
Caused by: org.apache.kafka.connect.errors.DataException: JsonConverter with schemas.enable requires "schema" and "payload" fields and may not contain additional fields. If you are trying to deserialize plain JSON data, set schemas.enable=false in your converter configuration.
at org.apache.kafka.connect.json.JsonConverter.toConnectData(JsonConverter.java:337)
at org.apache.kafka.connect.storage.Converter.toConnectData(Converter.java:91)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$4(WorkerSinkTask.java:536)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:180)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:214)
**Reason:** The `org.apache.kafka.connect.json.JsonConverter` converter with schemas enabled requires each record to contain matching "schema" and "payload" fields.
**There are two solutions; choose one (a configuration sketch follows the list):**
  1. Replace `org.apache.kafka.connect.json.JsonConverter` with `org.apache.kafka.connect.storage.StringConverter`
  2. If the startup mode is **Standalone** mode, change `value.converter.schemas.enable` or `key.converter.schemas.enable` in config/connect-standalone.properties to false; if the startup mode is **Distributed** mode, change `value.converter.schemas.enable` or `key.converter.schemas.enable` in config/connect-distributed.properties to false
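A minimal sketch of the second option for a distributed worker (only the converter-related lines of config/connect-distributed.properties are shown; adjust the converters to your setup):
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Plain JSON data has no schema envelope, so disable the schema requirement
key.converter.schemas.enable=false
value.converter.schemas.enable=false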
**2\. The consumption times out and the consumer is kicked out of the consumer group:**
org.apache.kafka.clients.consumer.CommitFailedException: Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.sendOffsetCommitRequest(ConsumerCoordinator.java:1318)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.doCommitOffsetsAsync(ConsumerCoordinator.java:1127)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsAsync(ConsumerCoordinator.java:1093)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitAsync(KafkaConsumer.java:1590)
at org.apache.kafka.connect.runtime.WorkerSinkTask.doCommitAsync(WorkerSinkTask.java:361)
at org.apache.kafka.connect.runtime.WorkerSinkTask.doCommit(WorkerSinkTask.java:376)
at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:467)
at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:381)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:221)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:206)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:204)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:259)
at org.apache.kafka.connect.runtime.isolation.Plugins.lambda$withClassLoader$1(Plugins.java:181)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
**Solution:**
Increase Kafka's `max.poll.interval.ms` according to the scenario. The default value is `300000` (5 minutes).
  * If kafka-connect is started in Standalone mode, add the `max.poll.interval.ms` and `consumer.max.poll.interval.ms` parameters to config/connect-standalone.properties and set appropriate values.
  * If it is started in Distributed mode, add the `max.poll.interval.ms` and `consumer.max.poll.interval.ms` parameters to config/connect-distributed.properties and set appropriate values.

After adding the parameters (see the sketch below), restart kafka-connect.
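For example, the additions to the worker properties file would look like this (the 30-minute value matches the recommendation earlier in this document; tune it to cover your slowest Stream Load batch):
# Allow up to 30 minutes between polls before the consumer is considered failed
max.poll.interval.ms=1800000
consumer.max.poll.interval.ms=1800000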
**3\. Doris-kafka-connector reports an error when upgrading from version 1.0.0 or 1.1.0 to 24.0.0**
org.apache.kafka.common.config.ConfigException: Topic 'connect-status' supplied via the 'status.storage.topic' property is required to have 'cleanup.policy=compact' to guarantee consistency and durability of connector and task statuses, but found the topic currently has 'cleanup.policy=delete'. Continuing would likely result in eventually losing connector and task statuses and problems restarting this Connect cluster in the future. Change the 'status.storage.topic' property in the Connect worker configurations to use a topic with 'cleanup.policy=compact'.
at org.apache.kafka.connect.util.TopicAdmin.verifyTopicCleanupPolicyOnlyCompact(TopicAdmin.java:581)
at org.apache.kafka.connect.storage.KafkaTopicBasedBackingStore.lambda$topicInitializer$0(KafkaTopicBasedBackingStore.java:47)
at org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:247)
at org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:231)
at org.apache.kafka.connect.storage.KafkaStatusBackingStore.start(KafkaStatusBackingStore.java:228)
at org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:164)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run
**Solution:** Change the cleanup policy of the `connect-configs` and `connect-status` topics to compact:
$KAFKA_HOME/bin/kafka-configs.sh --alter --entity-type topics --entity-name connect-configs --add-config cleanup.policy=compact --bootstrap-server 127.0.0.1:9092
$KAFKA_HOME/bin/kafka-configs.sh --alter --entity-type topics --entity-name connect-status --add-config cleanup.policy=compact --bootstrap-server 127.0.0.1:9092
**4\. Table schema change failed in `debezium_ingestion` converter mode**
[2025-01-07 14:26:20,474] WARN [doris-normal_test_sink-connector|task-0] Table 'test_sink' cannot be altered because schema evolution is disabled. (org.apache.doris.kafka.connector.converter.RecordService:183)
[2025-01-07 14:26:20,475] ERROR [doris-normal_test_sink-connector|task-0] WorkerSinkTask{id=doris-normal_test_sink-connector-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted. Error: Cannot alter table org.apache.doris.kafka.connector.model.TableDescriptor@67cd8027 because schema evolution is disabled (org.apache.kafka.connect.runtime.WorkerSinkTask:612)
org.apache.doris.kafka.connector.exception.SchemaChangeException: Cannot alter table org.apache.doris.kafka.connector.model.TableDescriptor@67cd8027 because schema evolution is disabled
at org.apache.doris.kafka.connector.converter.RecordService.alterTableIfNeeded(RecordService.java:186)
at org.apache.doris.kafka.connector.converter.RecordService.checkAndApplyTableChangesIfNeeded(RecordService.java:150)
at org.apache.doris.kafka.connector.converter.RecordService.processStructRecord(RecordService.java:100)
at org.apache.doris.kafka.connector.converter.RecordService.getProcessedRecord(RecordService.java:305)
at org.apache.doris.kafka.connector.writer.DorisWriter.putBuffer(DorisWriter.java:155)
at org.apache.doris.kafka.connector.writer.DorisWriter.insertRecord(DorisWriter.java:124)
at org.apache.doris.kafka.connector.writer.StreamLoadWriter.insert(StreamLoadWriter.java:151)
at org.apache.doris.kafka.connector.service.DorisDefaultSinkService.insert(DorisDefaultSinkService.java:154)
at org.apache.doris.kafka.connector.service.DorisDefaultSinkService.insert(DorisDefaultSinkService.java:135)
at org.apache.doris.kafka.connector.DorisSinkTask.put(DorisSinkTask.java:97)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:583)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:336)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:237)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:206)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:202)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:257)
at org.apache.kafka.connect.runtime.isolation.Plugins.lambda$withClassLoader$1(Plugins.java:177)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
**Solution:**
In `debezium_ingestion` converter mode, table schema changes are turned off by default. You need to set `debezium.schema.evolution` to `basic` to enable table schema changes.
Note that enabling schema changes does not keep the changed columns strictly consistent between the upstream table and the Doris table (see the `debezium.schema.evolution` parameter description for details). If you need the upstream and downstream columns to stay strictly consistent, it is best to manually add the changed columns to the Doris table and then restart the Connector task; the Connector will continue consuming from the unconsumed `offset` to maintain data consistency.
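For reference, the relevant fragment of the connector configuration would look roughly like this (a sketch; all other connector settings are omitted):
"converter.mode": "debezium_ingestion",
"debezium.schema.evolution": "basic"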
---
# Source: https://docs.velodb.io/cloud/4.x/integration/overview
Version: 4.x
# Integration Overview
VeloDB integrations fall into the **BI, Lakehouse, Observability, SQL Client, Data Source, Data Ingestion, and Data Processing** categories.
This list of VeloDB / Apache Doris integrations is continuously being updated
and is not yet complete. We welcome any contributions of relevant VeloDB /
Apache Doris integrations to help expand it. [Contact
Us](mailto:contact@velodb.io) to update the integration list.
## Lakehouse
| Name | Description | Resources |
|---|---|---|
| Apache Iceberg | Doris supports accessing Iceberg table data through various metadata services. In addition to reading data, Doris also supports writing to Iceberg tables. | [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/iceberg-catalog) |
| Apache Hudi | By connecting to the Hive Metastore, or a metadata service compatible with the Hive Metastore, Doris can automatically obtain Hudi's database and table information and perform data queries. | [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/hive-catalog) |
| Amazon Glue | Using AWS Glue Catalog to access Iceberg tables or Hive tables through CREATE CATALOG. | [Documentation](/cloud/4.x/user-guide/lakehouse/metastores/aws-glue) |
| Apache Paimon | Doris currently supports accessing Paimon table metadata through various metadata services and querying Paimon data. | [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/paimon-catalog) |
| Apache Hive | By connecting to Hive Metastore or metadata services compatible with Hive Metastore, Doris can automatically retrieve Hive database and table information for data querying. | [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/hive-catalog) |
| BigQuery | BigQuery Catalog uses the Trino Connector compatibility framework to access BigQuery tables through the BigQuery Connector. | [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/bigquery-catalog) |
| Apache Kudu | Kudu Catalog uses the Trino Connector compatibility framework to access Kudu tables through the Kudu Connector. | [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/kudu-catalog) |
| LakeSoul | Doris supports accessing and reading LakeSoul table data using metadata stored in PostgreSQL. | [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/lakesoul-catalog) |
| MaxCompute | MaxCompute is an enterprise-level SaaS (Software as a Service) cloud data warehouse on Alibaba Cloud. | [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/maxcompute-catalog) |
## Observability
| Name | Description | Resources |
|---|---|---|
| OpenTelemetry | OpenTelemetry is an open-source observability framework for collecting and exporting traces, metrics, and logs. | Documentation |
| Logstash | Logstash is a log ETL framework (collect, preprocess, send to storage systems) that supports custom output plugins to write data into storage systems. | [Documentation](/cloud/4.x/ecosystem/observability/logstash) |
| Beats | Beats is a family of lightweight data shippers from Elastic that send log and metric data to storage systems. | [Documentation](/cloud/4.x/ecosystem/observability/beats) |
| Fluentbit | Fluent Bit is a lightweight, high-performance log processor and forwarder. | [Documentation](/cloud/4.x/ecosystem/observability/fluentbit) |
## Data Processing
| Name | Description | Resources |
|---|---|---|
| Apache Spark | The Spark Doris Connector supports reading data stored in Doris and writing data to Doris through Spark. | [GitHub](https://github.com/apache/doris-spark-connector) [Documentation](/cloud/4.x/integration/data-processing/spark-doris-connector) |
| Apache Flink | The Flink Doris Connector is used to read data from and write data to a Doris cluster through Flink. | [GitHub](https://github.com/apache/doris-flink-connector) [Documentation](/cloud/4.x/integration/data-processing/flink-doris-connector) |
| dbt | The dbt-doris adapter is developed based on dbt-core and relies on the mysql-connector-python driver to convert data to Doris. | [Documentation](/cloud/4.x/integration/data-processing/dbt-doris-adapter) |
## BI
| Name | Description | Resources |
|---|---|---|
| Tableau | Interactive data visualization software focused on business intelligence. | [Documentation](/cloud/4.x/integration/bi/tableau) |
| Power BI | Microsoft Power BI is an interactive data visualization software product developed by Microsoft with a primary focus on business intelligence. | [Documentation](/cloud/4.x/integration/bi/powerbi) |
| QuickSight | Amazon QuickSight powers data-driven organizations with unified business intelligence (BI). | [Documentation](/cloud/4.x/integration/bi/quicksight) |
| Apache Superset | Apache Superset is an open-source data exploration platform. It supports a rich variety of data source connections and numerous visualization methods. | [Documentation](/cloud/4.x/integration/bi/apache-superset) |
| FineBI | FineBI supports rich data source connections and analysis and management of tables with multiple views. | [Documentation](/cloud/4.x/integration/bi/finebi) |
| SmartBI | Smartbi is a collection of software services and application connectors that can connect to a variety of data sources, including Oracle, SQL Server, MySQL, and Doris, enabling users to integrate and cleanse their data easily. | [Documentation](/cloud/4.x/integration/bi/smartbi) |
| QuickBI | Quick BI is a data warehouse-based business intelligence tool that helps enterprises set up impressive visual analyses quickly. | [Documentation](/cloud/4.x/integration/bi/quickbi) |
## SQL Client
| Name | Description | Resources |
|---|---|---|
| DBeaver | DBeaver is a cross-platform database tool for developers, database administrators, analysts and anyone who works with data. | [Documentation](/cloud/4.x/integration/sql-client/dbeaver) |
| DataGrip | DataGrip is a powerful cross-platform database tool for relational and NoSQL databases from JetBrains. | [Documentation](/cloud/4.x/integration/sql-client/datagrip) |
## Data Source
| Name | Description | Resources |
|---|---|---|
| Apache Kafka | Doris integrates with Kafka via its efficient Routine Load for real-time streaming (CSV/JSON, Exactly-Once) and the Doris Kafka Connector for advanced formats. | [GitHub](https://github.com/apache/doris-kafka-connector) [Documentation](/cloud/4.x/user-guide/data-operate/import/data-source/kafka) |
| Doris Kafka Connector | Doris integrates with Kafka via its efficient Routine Load for real-time streaming (CSV/JSON, Exactly-Once) and the Doris Kafka Connector for advanced formats. | [GitHub](https://github.com/apache/doris-kafka-connector) [Documentation](/cloud/4.x/integration/data-source/doris-kafka-connector) |
| MySQL | Doris JDBC Catalog supports connecting to MySQL databases via the standard JDBC interface. | [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/jdbc-mysql-catalog) |
| PostgreSQL | Doris JDBC Catalog supports connecting to PostgreSQL databases via the standard JDBC interface. | [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/jdbc-pg-catalog) |
| Amazon S3 | Doris supports loading S3 files using both asynchronous (S3 Load) and synchronous (TVF) methods. | [Documentation](/cloud/4.x/user-guide/data-operate/import/data-source/amazon-s3) |
| Azure | Doris supports loading Azure Storage files using both asynchronous (S3 Load) and synchronous (TVF) methods. | [Documentation](/cloud/4.x/user-guide/data-operate/import/data-source/azure-storage) |
| Google Cloud Storage | For loading files from Google Cloud Storage, Doris provides two methods: the asynchronous S3 Load and the synchronous TVF. | [Documentation](/cloud/4.x/user-guide/data-operate/import/data-source/google-cloud-storage) |
| MinIO | Doris supports loading MinIO files using both asynchronous (S3 Load) and synchronous (TVF) methods. | [Documentation](/cloud/4.x/user-guide/lakehouse/storages/minio) |
| HDFS | By connecting to Hive Metastore or metadata services compatible with Hive Metastore, Doris can automatically retrieve Hive database and table information for data querying. | [Documentation](/cloud/4.x/user-guide/lakehouse/storages/hdfs) |
## Data Ingestion
| Name | Description | Resources |
|---|---|---|
| Doris Streamloader | Doris Streamloader is a client tool designed for loading data into Apache Doris. In comparison to single-threaded load using curl, it reduces the load latency of large datasets by its concurrent loading capabilities. | [Documentation](/cloud/4.x/integration/data-ingestion/doris-streamloader) |
| Apache SeaTunnel | SeaTunnel is a very easy-to-use ultra-high-performance distributed data integration platform that supports real-time synchronization of massive data. | [Documentation](/cloud/4.x/integration/data-ingestion/seatunnel) |
| BladePipe | BladePipe is a real-time end-to-end data replication tool, moving data between 30+ databases, message queues, search engines, caching, real-time data warehouses, data lakes and more, with ultra-low latency less than 3 seconds. | [Documentation](/cloud/4.x/integration/data-ingestion/cloudcanal) |
## More
| Name | Description | Resources |
|---|---|---|
| AutoMQ | AutoMQ is a cloud-native fork of Kafka by separating storage to object storage like S3. | [Documentation](/cloud/4.x/integration/more/automq-load) |
| DataX | The DataX Doriswriter plugin supports synchronizing data from various data sources, such as MySQL, Oracle, and SQL Server, into Doris using the Stream Load method. | [Documentation](/cloud/4.x/integration/more/datax) |
| Kettle | The Kettle Doris Plugin is used to write data from other data sources to Doris through Stream Load in Kettle. | [Documentation](/cloud/4.x/integration/more/kettle) |
| Apache Kyuubi | Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on Data Warehouses and Lakehouses. | [Documentation](/cloud/4.x/integration/more/kyuubi) |
---
# Source: https://docs.velodb.io/cloud/4.x/integration/sql-client/dbeaver
Version: 4.x
# DBeaver
## Introduction
DBeaver is a cross-platform database tool for developers, database
administrators, analysts and anyone who works with data.
Apache Doris is highly compatible with the MySQL protocol. You can use
DBeaver's MySQL driver to connect to Apache Doris and query data in the
internal catalog and external catalog.
## Preconditions
DBeaver must be installed. You can visit the DBeaver website to download and install it.
## Add data source
Note
Currently verified using DBeaver version 24.0.0
1. Start DBeaver
2. Click the plus sign (**+**) icon in the upper left corner of the DBeaver window, or select **Database > New Database Connection** in the menu bar to open the **Connect to a database** interface.


3. Select the MySQL driver
In the **Select your database** window, select **MySQL**.

4. Configure Doris connection
In the **main** tab of the **Connection Settings** window, configure the
following connection information:
* Server Host: FE host IP address of the Doris cluster.
* Port: FE query port of Doris cluster, such as 9030.
* Database: The target database in the Doris cluster.
* Username: The username used to log in to the Doris cluster, such as admin.
* Password: User password used to log in to the Doris cluster.
tip
The Database field can be used to distinguish between the internal catalog and external catalogs. If only a database name is filled in, the data source connects to the internal catalog by default. If the format is catalog.db, the data source connects to the catalog specified in the Database field, and the databases and tables shown in DBeaver are those of the connected catalog. You can therefore use DBeaver's MySQL driver to create multiple Doris data sources to manage different catalogs in Doris.
Note
Managing an external catalog connected to Doris through the catalog.db form of the Database field requires Doris version 2.1.0 or above.
* internal catalog 
* external catalog 
5. Test data source connection
After filling in the connection information, click Test Connection in the lower left corner to verify that the database connection information is correct. DBeaver shows the following dialog box to confirm the connection configuration. Click OK to confirm that the configured connection information is correct, then click Finish in the lower right corner to complete the connection configuration. 
6. Connect to database
After the database connection is established, you can see the created data source connection in the database navigation panel on the left and can connect to and manage the database through DBeaver. 
## Function support
* Fully supported
* Visual viewing class
* Databases
* Tables
* Views
* Users
* Administer
* Session Manager
* System Info
* Session Variables
* Global Variables
* Engines
* Charsets
* User Privileges
* Plugin
* Operation class
* SQL editor
* SQL console
* Basic support
Basic support means these items can be opened and viewed without errors, but due to protocol compatibility issues the display may be incomplete.
* Visual viewing class
* Dashboard
* Users/user/properties
* Session Status
* Global Status
* Not supported
Not supported means that some visual operations may report errors or have not been verified when using DBeaver to manage Doris, such as visually creating database tables, changing schemas, and adding, deleting, or modifying data.
---
# Source: https://docs.velodb.io/cloud/4.x/management-guide/backup
Version: 4.x
# Backup and Restore
VeloDB Cloud supports backing up databases to object storage either
periodically or as a one-time operation, and allows users to quickly restore
data based on specified backup sets, comprehensively ensuring high
availability of data.
## Backup
### Create a Backup Plan
Click **Backup** in the left navigation bar, and click **Create Backup Plan** on the Backup page. You can choose between periodic and one-time backups as needed. Periodic and one-time backups are mutually exclusive, and updating the backup plan will overwrite the original plan.
If periodic backup is selected, you need to choose whether to enable it, the backup execution cycle, start time, backup objects, retention days, and the cluster used for backup, and then save the settings for them to take effect.

If you choose one-time backup, you need to select the start time, backup
objects, retention days, and the cluster used for backup. Similarly, you need
to save the selected settings for them to take effect.

| Parameter | Description |
|---|---|
| Backup Every | Multiple selections are allowed from Monday to Sunday, with at least one day and at most seven days. |
| Start Time | The startup time of the backup task. |
| Backup Objects | Internal Catalog: databases; External Catalog: only backs up DDL, not data. |
| Backup Retention Days | Set the retention days for backup sets; backup sets exceeding the retention days will be cleared. |
| Backup Cluster | The backup process consumes computing resources. In the case of multiple clusters, you need to specify the cluster to be used for backup operations. |
### View Backup Tasks
VeloDB Cloud will automatically execute backup tasks according to the plan you
set. View all backup tasks in the **Backup Tasks** list, including backup
status, retention days, data size, and backup start and completion times.

Click the **View Details** in the operation column to obtain detailed
information about backup task execution.

## Restore
You can select the row where the target backup set is located in the list of
backup tasks, click **Restore** in the operation column, specify the target
warehouse and cluster for the restore task, and then restore the backup.

Restore tasks will be displayed in the **Restore Tasks** list, where you can
view detailed information such as task status, data size, start and completion
time, etc.

Click the **View Details** in the operation column to obtain detailed
information about restore task execution.

---
# Source: https://docs.velodb.io/cloud/4.x/management-guide/cluster-management
Version: 4.x
# Cluster Management
In each paid warehouse, you can create multiple clusters to support different workloads, such as data writing, customer-facing reporting, user profiling, and behavior analytics.
A cluster contains only compute resources, cache resources, and cached data. All clusters in the warehouse share the stored data.
## New Cluster
To create a new cluster in a paid warehouse, you can click **Clusters** on the
navigation bar.
If a cluster already exists, you will see the **Cluster Overview** page.

Click **New Cluster** on the wizard page or **Cluster Overview** page to
create a new cluster.

| **Parameter** | **Description** |
|---|---|
| Cluster Name | Required. Must start with a letter, up to 32 characters; you can use letters (case insensitive), numbers, and _. |
| Compute | Default is minimum 4 vCPU, maximum 1024 vCPU per cluster; if you need a higher quota, please [get help](mailto:support@velodb.io) to apply. Currently, the ratio of vCPU to memory is fixed at 1:8. |
| Cache | The upper and lower limits of the cache space will vary depending on the compute size. |
| Storage | Pay as you go, no need to preset storage space. All clusters in the warehouse share the stored data. |
| Billing Method | Default is **On-Demand (Hourly)** billing, suitable for scenarios that need to be flexibly changed or deleted at any time, such as temporary test verification. |
| Auto Pause/Resume | When enabled, the compute cluster will automatically pause after a period of inactivity. It will automatically resume upon a new query request. |
Creating a new cluster will incur charges. Therefore, before creation, please ensure a sufficient available balance or enable the cloud marketplace deduction channel. Otherwise, you will see the following error prompt.

> **Note**
>
> * After confirming the creation, you can see the new cluster on the
> **Cluster overview** page. It takes about 3 minutes to complete the
> creation, and the cluster status will be changed from "**Creating** " to
> "**Running** ".
> * The SaaS model free trial clusters do not support new cluster creation.
>
## Reboot Cluster
In certain situations (such as cluster exceptions or modification of certain parameters), you may need to reboot the cluster. On the **Cluster Overview** page, find the target cluster card, click the **Reboot** operation, and confirm again. The cluster status will change to "**Rebooting**", and no other operations can be performed on the cluster in this status.

> **Note**
>
> * It takes about 3 minutes for the cluster to reboot. When it is done, the
> cluster status will be changed from "**Rebooting** " to "**Running** ".
> * The rebooting of cluster may cause business requests to experience
> crashes or delayed responses.
> * During the cluster rebooting process, VeloDB Cloud will still meter and
> charge the cluster.
>
## Pause/Resume Cluster
### Manual Pause/Resume Cluster
You may wish to save costs when the cluster is idle. On the **Cluster Overview** page, find the target cluster card. When the cluster status is "**Running**" and you have confirmed that the cluster carries no load, you can manually pause it: click the **Pause** operation and confirm again. The cluster status will change to "**Pausing**", and no other operations can be performed on the cluster at this time. VeloDB Cloud will release the cluster's computing resources while retaining the cache space and its data.


> **Note**
>
> * It takes about 3 minutes for the cluster to pause. When it is done, the
> cluster status will be changed from "**Pausing** " to "**Paused** ".
> * The cluster will not respond to business requests during the pause
> period.
> * During the cluster suspension period, VeloDB Cloud will no longer meter
> and charge for computing resource, but will still meter and charge for cache
> space.
> * Clusters containing monthly/yearly billing resources do not support the
> pause/resume function.
>
When you need the cluster to continue responding to business requests, you can manually resume the "**Paused**" cluster. On the **Cluster Overview** page, find the target cluster card, click the **Resume** operation, and confirm again. The cluster status will change to "**Resuming**", and no other operations can be performed on the cluster in this status. VeloDB Cloud will provision computing resources and mount the retained cache space and its data.


> **Note**
>
> * It takes about 3 minutes for the cluster to resume. When it is done, the
> cluster status will be changed from "**Resuming** " to "**Running** ".
> * The cluster will not respond to business requests during the resuming
> process.
> * After the cluster is resumed, it can respond to business requests, and
> VeloDB Cloud will resume metering and billing for the provisioned computing
> resources.
> * Clusters containing monthly/yearly billing resources do not support the
> pause/resume function.
>
### Auto Pause/Resume Cluster
If you want idle clusters to pause and resume automatically, click **Set Auto Start/Stop** to the right of **Started On** or in the upper right corner of the **Cluster Details** page, turn on the **Auto Start/Stop** switch, and customize the idle duration that triggers an automatic pause.

## Cluster Details
Before performing any operation on a cluster, you may first need to know its detailed information. On the **Cluster Overview** page, find the target cluster card and, if the cluster status allows, click the cluster card to enter the **Cluster Details** page.

The **Cluster Details** page includes two main content areas, basic information and on-demand billing resources, along with the corresponding operations. The details are as follows:

**Basic Information**:

| **Parameter** | **Description** |
|---|---|
| Cluster ID | The globally unique ID of the cluster. It starts with "c-", followed by 18 characters randomly combined from the 26 lowercase letters and 10 digits. |
| Cluster Name | Unique within a warehouse, with support for one-click copying and renaming. If you need to modify the cluster name, click the edit icon, enter the new cluster name in the input box that appears (a meaningful name is recommended), click the confirm icon, and confirm again. **Note**: The VeloDB Core syntax uses the cluster name, for example `USE { [catalog_name.]database_name[@cluster_name] }`. The cluster name must start with a letter, be up to 32 characters, and may use letters (case insensitive), numbers, and _. After modifying the cluster name, make sure the business uses the new cluster name or a default cluster is set for the relevant database users; otherwise the related requests will fail. |
| Created By | The user who created the cluster. Multiple users in the same organization can perform corresponding operations on warehouses and their clusters according to their privileges. |
| Created At | The time when the cluster was created. |
| Started At | The time when the cluster was last rebooted or resumed. |
| Running Time | The running time of the cluster since it was last rebooted or resumed. |
| Zone | The availability zone where the cluster is located. |
| CPU Architecture | The CPU architecture of the cluster's computing resources. **Note**: Currently, only VeloDB Cloud warehouses deployed on AWS show the CPU architecture of the cluster, which may be x86 or ARM. Core version 4.0.4 or above is required to create an ARM architecture cluster; if the core version is too low, please upgrade it. On the same specifications, the ARM architecture has a performance improvement of over 30% compared to the x86 architecture. In the SaaS model, the pricing of cluster computing resources for the ARM and x86 architectures is consistent within the same cloud platform and region; in the BYOC model, the pricing of computing resources for different CPU architectures may vary within the same cloud platform and region, depending on the cloud provider. The CPU architecture cannot be modified after the cluster is created. |

**On-Demand Resources**:

| **Parameter** | **Description** |
|---|---|
| Compute | Displays the current compute resources of the cluster. |
| Cache | Displays the current cache space of the cluster. |
| Scale Out/In | If the performance of the current cluster does not meet the business requirements, you can increase or decrease compute resources or cache space to adjust the capacity of the current cluster by clicking **Scale Out/In**. |
## Scale Cluster
### Manual Scaling
Based on your business requirements, you can click **Scale Out/In** in the
upper right corner on the **On-Demand Resources** content area of the
**Cluster Details** page, and select **Manual Scaling** to adjust the capacity
of the current cluster.

> **Note**
>
> * After confirming the scaling, you can see the cluster status be changed
> from "**Running** " to "**Scaling** " on the **Cluster Overview** page. It
> takes about 3 minutes to complete the scaling, and the cluster status will
> be changed from "**Scaling** " to "**Running** ".
> * The SaaS free trial clusters do not support scaling.
>
### Time-based Scaling
If the cluster needs to handle periodic business peaks and troughs, you can click **Scale Out/In** in the upper right corner of the **On-Demand Resources** area of the **Cluster Details** page, select **Time-based Scaling**, customize and add at least two time-based rules with different target vCPUs, and enable the time-based scaling policy.

> **Note**
>
> * The SaaS free trial clusters do not support scaling.
> * The on-demand billing cluster does not support configuring a time-based
> rule with a target vCPU of 0.
> * The time-based rule is valid and executed when the cluster is running
> normally. When the cluster is not running normally (such as pausing,
> rebooting, upgrading, etc.), it will wait for a retry, and will not be
> executed after more than 30 minutes.
> * If the current organization does not have sufficient available amount or
> open and enable the cloud marketplace deduction channel, the time-based rule
> will be considered invalid and abandoned by VeloDB Cloud.
> * The execution period of the time-based rule defaults to every day and
> does not currently support modification.
> * There should be at least an hour interval between the time-based rules,
> so a maximum of 23 time-based rules can be configured.
> * The execution time of the time-based rule cannot be repeated with
> existing time-based rules.
> * Scaling cluster may cause some requests to experience crashes or delayed
> responses.
> * When scaling in, the cache space will automatically scale in
> proportionally with the computing resource (vCPU), and cache data that
> exceeds the target cache space will be eliminated. The response time of some
> requests may experience significant delays.
>
## Delete Cluster
If the business no longer requires the current cluster, you can delete it. In
the upper right corner of the **Cluster Details** page, click **Delete
Cluster** operation and confirm again.

> **Note**
>
> * Deleting the SaaS model free trial cluster will also delete the free
> trial warehouse, storage resources, and their data.
> * Clusters containing monthly/yearly billing resources do not support
> early deletion. You need to wait until the cluster expires and be converted
> to on-demand billing by default. If you want monthly billing resources to
> expire and be converted to on-demand billing as soon as possible, you need
> to confirm that the auto renew function is not enabled, otherwise the
> cluster may not expire.
> * All resources and cached data of the cluster will be deleted by VeloDB
> Cloud, and you need to adjust the business accessing the cluster in a timely
> manner, otherwise related business requests will fail.
>
## Multi-Availability Zone Disaster Recovery
The virtual cluster provides high availability and disaster recovery
capabilities across Availability Zones by establishing an active-standby
cluster architecture. In the event of a failure in the primary Availability
Zone, the system automatically triggers a failover to ensure business
continuity. Leveraging a real-time data synchronization mechanism, it
effectively prevents service interruptions and data loss, thereby guaranteeing
high availability for your business.

Before creating a high-availability virtual cluster, two physical clusters
must be prepared. They must be in the Running state and located in different
Availability Zones.

On the Virtual Cluster page, click **New Virtual Cluster** to navigate to the
cluster configuration page.
| **Parameter** | **Description** |
|---|---|
| Virtual Cluster Name | The cluster name must start with a letter, up to 32 characters; you can use letters (case insensitive), numbers, and _. |
| Active Cluster | The cluster that is actively serving traffic. |
| Standby Cluster | The disaster recovery cluster that becomes active upon failover. Note: identical specifications are recommended. |
After the virtual cluster is successfully created, you can click on its card
on the overview page to navigate to the details page. There, you can modify
the active/standby cluster configuration or delete the virtual cluster.

---
# Source: https://docs.velodb.io/cloud/4.x/management-guide/connections
Version: 4.x
# Connections
## Private Link
Private Link can help you securely and stably access services deployed in
other VPCs through a private network in VPC environments, greatly simplifying
network architecture and avoiding security risks associated with accessing
services through the public network.
The VeloDB Cloud warehouse is created and runs in the VeloDB VPC, and application systems or clients within the user's VPC can access the VeloDB Cloud warehouse across VPCs via Private Link. Private Link includes two parts: the endpoint service and the endpoint.
When the user needs to access VeloDB in their own private network, VeloDB
Cloud will create and manage the endpoint service, and the user creates and
manages the endpoint.
When the user needs to use VeloDB to access their own private network, they
need to create an endpoint service and register it in VeloDB Cloud.
Subsequently, VeloDB Cloud will create an endpoint to connect to the user's
endpoint service.
### Access VeloDB from Your VPC

Create a connection to allow your data applications, such as reporting, profiling, and log analytics, within your private network to access the VeloDB Cloud warehouse.
> **Note** There is no additional fee on the VeloDB Cloud service side, but
> users need to pay the cloud platform for endpoint instances and traffic
> fees.
#### AWS
1. Switch to the target warehouse, click **Connections** on the navigation bar, and click **Set up Connection** to **Connect Your VPC to VeloDB** on the **Private Link** tab to create an endpoint.

2. The page displays the Endpoint Service information required for creating an endpoint. You can click **Set up one or more endpoints** to go to the cloud platform's Private Link product console and create an endpoint.

3. On the cloud platform's Private Link product console, you need to confirm that the current region is the same as the warehouse's endpoint service (limited by the cloud platform's Private Link product) and click **Create endpoint**.

> **Note** You need to sign in to AWS with the principal that has been allowed
> to access the endpoint service of VeloDB Cloud, so that you can successfully
> pass the service name verification when creating the endpoint.
4. Follow the wizard prompts to fill in the form as follows:


**Parameter** | **Description**
---|---
Name tag | Optional. Creates a tag with a key of 'Name' and a value that you specify.
Service category | Required. Select the service category. The endpoint service of the VeloDB Cloud warehouse belongs to **Endpoint services that use NLBs and GWLBs**, so select it.
Service name | Required. Copy the Service Name of the VeloDB Cloud warehouse's endpoint service (a one-click shortcut is available on the page that displays the Endpoint Service information required for creating an endpoint), paste it into the input box, and click **Verify service**.
VPC | Required. Select the VPC in which to create your endpoint.
Subnets | Required. Select the same Availability Zone as the one where the endpoint service of the VeloDB Cloud warehouse is located (limited by the cloud vendor's Private Link product), then select an appropriate subnet ID under it.
Security groups | Required. Select a preset security group. Note that the security rules should allow the protocol and port used by the VeloDB Cloud warehouse, as well as the source IP address from which the application/client connects to the VeloDB Cloud warehouse.
Tags | Optional. You can add tags associated with the resource.
5. After the endpoint is created, its status will change from "**Pending**" to "**Available**", indicating that the endpoint has successfully connected with the warehouse's endpoint service.

6. After refreshing the **Connections** page of the VeloDB Cloud warehouse, the endpoint list will display the connection information of the endpoint.


> **Note** You need to click **Find DNS Name** to open the **Endpoint
> Details** page of AWS Private Link product console, find the **DNS Name** of
> the endpoint and use it to access the VeloDB Cloud warehouse.
7. The application/client can access the VeloDB Cloud warehouse through the DNS name of the endpoint by MySQL protocol or HTTP protocol. For the specific connection method, refer to the pop-up bubble for **Connection Examples** .

> **Note**
>
> * VeloDB Cloud includes two independent account systems: One is used to
> connect to the warehouse, as described in this topic. The other one is used
> to log into the console, which is described in the [Registration and
> Login](/cloud/4.x/management-guide/user-and-organization) topic.
>
> * For first-time connection, please use the admin username and its
> password which can be initialized or reset on the **Settings** page.
>
>
#### Azure
1. Switch to the target warehouse, click **Connections** on the navigation bar, and click **New Connection** to **Access VeloDB from Your VPC** on the **Private Link** tab to create an endpoint. Firstly, you need to approve a subscription to access the endpoint service of VeloDB Cloud warehouse.


2. After approving a subscription to access the endpoint service, the page displays the Endpoint Service information required for creating an endpoint. You can click **Go to Create** to go to the cloud platform's Private Link product console and create an endpoint.

3. In the **Basics** tab of the **Create a private endpoint** page on the cloud platform's Private Link product console, you need to confirm that the current region is the same as the endpoint service of VeloDB Cloud warehouse (limited by the cloud platform's Private Link product). Follow the wizard prompts to fill in the form as follows and click **Next: Resource**.

Parameter | Category | Description
---|---|---
Subscription | Project details | Required. Select the subscription used to access the endpoint service of the VeloDB Cloud warehouse. All resources in an Azure subscription are billed together.
Resource group | Project details | Required. Select a resource group in which to create the private endpoint. If there is no suitable one, you can create a new one. A resource group is a collection of resources that share the same lifecycle, permissions, and policies.
Name | Instance details | Required. The instance name of the private endpoint to be created. You can customize it.
Network Interface Name | Instance details | Required. The network interface name of the private endpoint to be created. It is generated automatically when you enter the instance name, and you can modify it.
Region | Instance details | Required. Select the region in which to create the private endpoint. Note: the region must be the same as that of the endpoint service of the VeloDB Cloud warehouse (limited by the cloud platform's Private Link product).
4. In the **Resource** tab of the **Create a private endpoint** page, choose the connection method **Connect to an Azure resource** with a resource ID or alias and fill in the form as follows and click **Next: Virtual Network**.

Parameter | Description
---|---
Resource ID or alias | Required. When connecting to someone else's resource, they must provide you with the resource ID or alias so that you can initiate a connection request. In this case, copy the **Service Alias** value of the VeloDB Cloud warehouse's endpoint service (a one-click shortcut is available on the page that displays the Endpoint Service information required for creating an endpoint) and fill it in the input box.
Request message | Optional. This message will be sent to the resource owner (VeloDB Cloud) to assist in the connection management process. Don't include private or sensitive information.
5. In the **Virtual Network** tab of the **Create a private endpoint** page, select the virtual network and subnet in which to create the private endpoint. Follow the wizard prompts to fill in the form as follows and click **Next: DNS**.

Parameter | Category | Description
---|---|---
Virtual network | Networking | Required. Only virtual networks in the currently selected subscription and location are listed. Select the virtual network in which to create the private endpoint. If there is no suitable one, you can create a new one on the cloud platform's Virtual Network product console.
Subnet | Networking | Required. Only subnets in the currently selected virtual network are listed. Select a subnet in which to create the private endpoint. If there is no suitable one, you can create a new one on the cloud platform's Virtual Network product console.
Network policy for private endpoints | Networking | Optional. The network policy for the private endpoint to be created. Disabled by default; you can edit it.
Private IP configuration | Private IP configuration | Optional. You can choose to dynamically or statically allocate the IP address. Based on the virtual network and subnet configured above, "Dynamically allocate IP address" is selected by default.
Application security group | Application security group | Optional. Select the application security group for the private endpoint to be created. If there is no suitable one, you can create a new one.
6. In the **DNS** tab of the **Create a private endpoint** page, keep the default settings and click **Next: Tags**.
Note: To connect privately with your private endpoint, you need a DNS record. The resource must be configured to support Private DNS.

7. In the **Tags** tab of the **Create a private endpoint** page, keep the default settings and click **Next: Review + create**.
Note: If you want to categorize the private endpoint and view consolidated billing, you can configure tags for the private endpoint to be created.

8. In the **Review + create** tab of the **Create a private endpoint** page, you can review the settings for the private endpoint to be created. If some settings are not as expected, click **Previous** to go back and modify them. If everything looks correct, click **Create**.

9. After the endpoint is created, its status will change from "**Created**" to "**OK**", indicating that the endpoint has successfully connected with the endpoint service of the VeloDB Cloud warehouse.


10. After refreshing the **Connections** page of the VeloDB Cloud warehouse, the endpoint list will display the connection information of the endpoint.

11. The application/client can access the VeloDB Cloud warehouse through the endpoint's IP or DNS name via the MySQL or HTTP protocol. You can click **Find DNS Name** in the endpoint list to open the endpoint's details page and find its IP or DNS name.

12. For the specific connection method, you can hover over **Connection Examples** on the **Connections** page of the VeloDB Cloud warehouse to view the pop-up bubble.

### VeloDB Accesses Your VPC

> **Note** The endpoint instance and traffic fees generated by VeloDB's access
> to the private network are currently not charged to users.
#### AWS
1. Switch to the target warehouse, click **Connections** on the navigation bar, and click **New Connection** for **VeloDB Accesses Your VPC** on the **Private Link** tab to create a connection to your endpoint service.


2. After clicking **\+ Endpoint Service** , the page will display the **Current Region** of the warehouse and the **ARN of VeloDB**. You can click **Go to Create** to go to the cloud platform's Private Link product console and create an endpoint service.
3. Sign in to the AWS Console, select VPC-Endpoint services and switch to the same region as the current warehouse.
4. Click **Create endpoint service**.

5. On the Endpoint Service configuration page, configure the relevant parameters and click **Create**.


6. (Optional) If there is no available network load balancer, you need to click **Create Network Load Balancer** first. After the creation is completed, click the filter button to make a selection.




7. (Optional) If there is no available target group, you need to click **Create Target Group** first. After the creation is completed, click the refresh button on the right to make a selection.


8. After creating the endpoint service, add the **ARN of VeloDB** in the **Allow principals** Tab of the endpoint service.


9. Copy the **Service ID** and **Service Name** from the **Endpoint Service Details** page, and fill them in the Endpoint Service registration page of VeloDB Cloud.

10. After the registration is complete, go to the next step, specify the **Endpoint Name** of VeloDB Cloud warehouse, and click **Create Now**.


11. Accept endpoint connection request in the **Endpoint connections** Tab of the endpoint service.


12. Refresh the page and wait for the status of the endpoint of VeloDB Cloud warehouse to be changed from "pendingAcceptance" to "available", which means the connection is successful.


#### Azure
1. Switch to the target warehouse, click **Connections** on the navigation bar, and click **New Connection** for **VeloDB Accesses Your VPC** on the **Private Link** tab to create a connection to your endpoint service.

2. After clicking **\+ Endpoint Service** , the page will display the **Current Region** of the warehouse and the **Subscription ID of VeloDB**. You can click **Go to Create** to go to the cloud platform's Private Link product console and create an endpoint service (This refers to Azure private link service).

3. Sign in to the **[Azure portal](https://portal.azure.com/)** with your Azure account. In the **Basics** tab of the **Create private link service** page on the Private Link product console, you need to confirm that the region is the same as the VeloDB Cloud warehouse (limited by the cloud platform's Private Link product). Follow the wizard prompts to fill in the form as follows and click **Next: Outbound settings**.

Parameter | Category | Description
---|---|---
Subscription | Project details | Required. Select the subscription in which to create the private link service for your database or data lake. All resources in an Azure subscription are billed together.
Resource group | Project details | Required. Select a resource group in which to create the private link service. If there is no suitable one, you can create a new one. A resource group is a collection of resources that share the same lifecycle, permissions, and policies.
Name | Instance details | Required. The instance name of the private link service to be created. You can customize it.
Region | Instance details | Required. Select the Azure region in which to create the private link service. Note: the region must be the same as that of the VeloDB Cloud warehouse (limited by the cloud platform's Private Link product).
4. In the **Outbound settings** tab of the **Create private link service** page, follow the wizard prompts to fill in the form as follows and click **Next: Access Security**.

Parameter | Description
---|---
Load balancer | Required. Select a load balancer behind the private link service to load balance your database or data lake. If there is no suitable one, you can create a new one on the cloud platform's Load Balancer product console.
Load balancer frontend IP address | Required. Select the frontend IP address of the load balancer you selected above.
Source NAT Virtual network | Required.
Source NAT subnet | Required.
Enable TCP proxy V2 | Required. Leave the default of No. If your application expects a TCP proxy v2 header, select Yes.
Private IP address settings | Leave the default settings.
5. In the **Access Security** tab of the **Create private link service** page, choose **Restricted by subscription** to control who can request access to the private link service, add the **Subscription ID of VeloDB** to the access whitelist of the private link service, and choose **Yes** for auto-approve. Then click **Next: Tags**.

6. In the **Tags** tab of the **Create private link service** page, keep the default settings and click **Next: Review + create**. Note: If you want to categorize the private link service and view consolidated billing, you can configure the tag for the private link service to be created.

7. In the **Review + create** tab of the **Create private link service** page, you can review the settings for the private link service to be created. If some settings are not as expected, click **Previous** to go back and modify them. If everything looks correct, click **Create**.

8. After the private link service is created, its status will change from "**Created**" to "**OK**", indicating that the private link service is ready to be connected by the private endpoint of the VeloDB Cloud warehouse.


9. After creating the private link service, copy the **Resource ID** and **Alias** from the private link service **Details** page, and fill them in on the Endpoint Service registration page of VeloDB Cloud.



10. After the registration is complete, go to the next step, specify the **Endpoint Name** of VeloDB Cloud warehouse, and click **Create Now**.


11. Refresh the page and wait for the status of the VeloDB Cloud warehouse's endpoint to change from "**pendingAcceptance**" to "**Approve**", which means the connection is successful.


## Public Link
On the **Connections** page, switch to the **Public Link** tab to manage the
public network connection.
### Add IP Whitelist
In order to access the VeloDB Cloud warehouse via the public network, you need
to add the source public network IP address to the whitelist.
Click **IP Whitelist Management** on the right of the **Connect Warehouse**
card to add the source IP addresses or segments.


In the IP whitelist, you can add or delete IP addresses to enable or disable
their access to the warehouse.
> **Note** By default, the IP segment 0.0.0.0/0 is set, which means the
> warehouse is completely open to the public network. It is recommended to
> remove it in time after use to reduce security risks.
### Access Warehouse
After adding the source public network IP address to the whitelist, you can
click **WebUI Login** to access the VeloDB Cloud warehouse through the public
network. For the specific connection method, please refer to the **Other
Methods**.

On This Page
* Private Link
* Access VeloDB from Your VPC
* VeloDB Accesses Your VPC
* Public Link
* Add IP Whitelist
* Access Warehouse
---
# Source: https://docs.velodb.io/cloud/4.x/management-guide/console-overview
Version: 4.x
On this page
# Overview
VeloDB Cloud is a cloud-native data warehouse that runs on multiple clouds,
providing a consistent user experience and a fully managed service. It is
extremely fast, cost-effective, unified, and easy to use.
This topic gives a brief overview of the main features the VeloDB Cloud
console includes and how to navigate it. Later topics provide detailed
descriptions of the specific features.
## Main Features
* **Registration and Login**.
* **Warehouse Management** : Provides free trial, paid warehouse creation, warehouse list, etc.
* **Cluster Management** : Provides one-click creation, elastic resizing, fast upgrade, deletion, etc.
* **Connections** : Provides the connection methods of the warehouses in the private network (VPC) and the public network. The public network connection supports whitelists.
* **Metrics** : Provides metrics in dimensions such as resource usage, query, and write and supports flexible and easy-to-use alarm capability.
* **Billing Center** : Provides usage statistics for the internal parts of organizations and warehouses. Billing is based on usage statistics.
* **Others** : Including organization management, access control, notification, etc.
## Navigate VeloDB Cloud
The overall layout of VeloDB Cloud console web interface is as follows:

### Navigation Bar
Located on the left side of the web interface, the **Navigation Bar** provides
the main features for VeloDB Cloud's most crucial concept, the **Warehouse**,
including cluster management, connections, query, metrics, usage statistics,
etc.
### Warehouse Selector
Located at the top of the left navigation bar, the **Warehouse Selector**
displays all the warehouses under the current organization. You can switch
warehouses, view warehouse info, create a new warehouse, etc.
After switching to a warehouse, you can use it to experience all the features
in the left navigation bar.

### User Menu
Located at the bottom of the left navigation bar, **User Menu** provides some
management features related to users and organizations, including security,
notifications, users and roles, billings, etc.

On This Page
* Main Features
* Navigate VeloDB Cloud
* Navigation Bar
* Warehouse Selector
* User Menu
---
# Source: https://docs.velodb.io/cloud/4.x/management-guide/monitoring-overview
Version: 4.x
On this page
# Monitoring Overview
VeloDB Cloud provides monitoring and alerting so that you can track the health
and performance of your warehouse or clusters and make adjustments.
You can find the **Metrics** feature on the navigation bar, and you can
* View metrics by warehouse or cluster.
* Use **Starred** to display the metrics of interest in warehouse or different clusters together.
* View historical metric data by adjusting the time selector, and you can view metric data of the past 15 days.
* Use the auto-refresh feature to update metrics in real-time (5s).
The metrics you can use in VeloDB Cloud fall into two categories.
* **Basic Metrics** \- Basic metrics data helps you monitor physical aspects of your cluster, such as CPU usage, memory usage, and network throughput.
* **Service Metrics** \- Query performance data helps you monitor warehouse or cluster activity and performance, such as QPS, query success rates, and more. It helps to understand the specific workload of the cluster.
## Basic Metrics

Basic metrics provide physical monitoring information of the cluster by "node"
dimension.
You can determine whether the cluster is abnormal within a specified time
frame by using the cluster's basic metrics. You can also see if historical or
current queries are impacting cluster performance.
You can use the cluster basic metrics to diagnose the cause of slow queries
and take possible measures such as scaling up or scaling down the cluster
capacity, optimizing SQL statements, etc.
We provide the following cluster base metrics.
### CPU Utilization
Displays the CPU utilization percentage of all nodes. You can use this chart
to find the time of lowest cluster utilization before planning to scale the
cluster or perform other resource-consuming operations.
### Memory Usage
Displays the memory usage of all nodes. If memory usage is consistently high,
you should consider scaling up your cluster.
### Memory Utilization
Displays the memory utilization of all nodes. If memory utilization is
consistently high, you should consider scaling up your cluster.
### I/O Utilization
Displays the utilization of hard disk I/O. If I/O utilization is always
maintained at a high level, you may consider scaling out more nodes for better
query performance.
### Network Outbound Throughput
Displays the average outbound speed of nodes per second over the network in
MB/s. Queries that read data over the network are slower, and you should set
up the cache correctly to minimize network reads.
### Network Inbound Throughput
Displays the average inbound speed of nodes per second over the network in
MB/s.
### Cache Read Throughput
Displays the read throughput per second over the cache in MB/s.
### Cache Write Throughput
Displays the write throughput per second over the cache in MB/s.
### Support Range of Basic Metrics
Metrics | Warehouse | Cluster
---|---|---
CPU Utilization | Supported | Supported
Memory Usage | Supported | Supported
Memory Utilization | Supported | Supported
I/O Utilization | Supported | Supported
Network Outbound Throughput | Supported | Supported
Network Inbound Throughput | Supported | Supported
Cache Read Throughput | Not supported | Supported
Cache Write Throughput | Not supported | Supported
## Service Metrics

### Query Per Second (QPS)
Displays the number of query requests per second. The required compute
resource of a cluster can be determined based on your system's QPS during peak
time.
### Query Success Rate
Displays the percentage of successful queries out of all queries, updated
every minute. When the query success rate decreases abnormally, consider
whether there is a cluster or node failure.
### Dead Nodes
Displays the number of current cluster dead nodes.
### Average Query Runtime
Displays the average query time, updated every minute. If the average query
time rises abnormally, consider troubleshooting.
### Query 99th Latency
Displays the response time of the request at the 99th percentile (in
ascending order) during a given time period, which reflects the speed of slow
queries in the cluster.
### Cache Hit Rate
Displays the percentage of I/O operations that hit the cache in all I/O
operations. If the cache hit rate is too low, consider changing the cache
policy or scaling up the space.
### Remote Storage Read Throughput
Displays the amount of data read from remote storage per unit of time.
### Sessions
Displays the number of sessions for the current warehouse, without
distinguishing between clusters.
### Load Rows Per Second
Measures the efficiency of data write operations, indicating the rate at
which records are currently being successfully written to the database or
other data storage systems.
### Load Bytes Per Second
Displays the rate of current write tasks in terms of data size.
### Finished Load Tasks
Displays the number of load tasks completed in the recent period. A sharp
increase or decrease might indicate a business anomaly.
### Compaction Score
Indicates the merging pressure of data files. The greater the Score, the
greater the merging pressure.
### Transaction Latency
Indicates the transaction latency of the warehouse write task. The smaller the
delay, the faster the data can be queried.
### Support Range of Service Metrics
Metrics | Warehouse | Cluster
---|---|---
Query Per Second | Supported | Supported
Query Success Rate | Supported | Supported
Dead Nodes | Not supported | Supported
Average Query Time | Supported | Supported
Query 99th Latency | Supported | Supported
Cache Hit Rate | Not supported | Supported
Remote Storage Read Throughput | Not supported | Supported
Sessions | Supported | Not supported
Load Rows Per Second | Supported | Supported
Load Bytes Per Second | Supported | Supported
Finished Load Tasks | Supported | Not supported
Compaction Score | Not supported | Supported
Transaction Latency | Supported | Not supported
# Alert Overview
In addition to SMS alert notifications, VeloDB Cloud provides monitoring and
alerting services at no additional charge.
You can configure alert rules to be notified when cluster monitoring metrics
change.

## Alert Configuration
### View Alert Rules
You can view existing alerting rules and their current alerting status on the
list page.
"Red dot" means the alert rule is in effect, and "green dot" indicates the
current alert rule is not triggered.
### Enable One-Click Alert

You can click **Enable One-Click Alert** to quickly set up basic alert rules,
which will be applied to both current and future warehouses or clusters.
### New/Edit Alert Rule

You can create an alert rule by clicking **New Alert Rule** or copying an
existing one. You can also modify a current alert rule.
The alert rule configuration consists of four parts.
#### Rule Name
You can customize the rule name, which must be unique within the warehouse.
#### Cluster
You can specify the cluster for which the alert rule is in effect. When a
cluster is deleted, its alert rules are not deleted, but they become invalid.
#### Conditions
You can set one or more rules for metrics to be met and how these conditions
are combined (and, or).
#### In Last
"In Last" means the duration of time to meet the conditions. You should set
this time appropriately to balance between timeliness and accuracy of alerts.
### Channel
You can set one or more notification channels, and the alert messages will be
pushed through the channels you set respectively.
#### In-site Notification
Configuration method: Select user.
#### Email
Configuration method: Select user.
#### SMS
Configuration mode: Select user / fill in cell phone numbers.
#### WeCom
Configuration method: fill in the robot webhook.
1. On WeCom for PC, find the target WeCom group for receiving alarm notifications.
2. Right-click the WeCom group. In the window that appears, click **Add Group Bot** .
3. In the window that appears, click **Create a Bot** .
4. In the window that appears, enter a custom bot name and click **Add** .
5. Copy the webhook URL.

> **NOTE** If you need to restrict message sources, please set up IP
> whitelist. VeloDB Cloud server IP address is 3.222.235.198.
#### Lark
Configuration method: fill in the robot webhook.
To make a custom bot instantly push messages from an external system to the
group chat, you need to use a webhook to connect the group chat and your
external system. Enter your target group and click **Settings** > **BOTs** >
**Add Bot** . Select **Custom Bot** . Enter a suitable name and description
for your bot and click **Next** .

You'll then get the webhook URL.

> **NOTE** If you need to restrict message sources, please set up IP
> whitelist. VeloDB Cloud server IP address is 3.222.235.198.
#### DingTalk
Configuration method: fill in the robot webhook.
To get the DingTalk robot webhook, please see
[here](https://www.alibabacloud.com/help/en/application-real-time-monitoring-
service/latest/obtain-the-webhook-url-of-a-dingtalk-chatbot)
1. Run the DingTalk client on a PC, go to the DingTalk group to which you want to add a chatbot, and then click the Group Settings icon in the upper-right corner.
2. In the **Group Settings** panel, click **Group Assistant** .
3. In the **Group Assistant** panel, click **Add Robot** .
4. In the **ChatBot** dialog box, click the **+** icon in the **Add Robot** section. Then, click **Custom** .

5. In the **Robot details** dialog box, click **Add** .
6. In the **Add Robot** dialog box, perform the following steps:

> **NOTE** If you need to restrict message sources, please set up IP
> whitelist. VeloDB Cloud server IP address is 3.222.235.198.
7. Set a profile picture and a name for the chatbot.
8. Select **Custom Keywords** for the **Security Settings** parameter. Then, enter **alert** .
9. Read the terms of service and select **I have read and accepted _DingTalk Custom Robot Service Terms of Service_** .
10. Click **Finished** .
11. In the **Add Robot** dialog box, copy the webhook address of the DingTalk chatbot and click **Finished** .

## View Alert History
You can view the alert history and filter it.
On This Page
* Basic Metrics
* CPU Utilization
* Memory Usage
* Memory Utilization
* I/O Utilization
* Network Outbound Throughput
* Network Inbound Throughput
* Cache Read Throughput
* Cache Write Throughput
* Support Range of Basic Metrics
* Service Metrics
* Query Per Second (QPS)
* Query Success Rate
* Dead Nodes
* Average Query Runtime
* Query 99th Latency
* Cache Hit Rate
* Remote Storage Read Throughput
* Sessions
* Load Rows Per Second
* Load Bytes Per Second
* Finished Load Tasks
* Compaction Score
* Transaction Latency
* Support Range of Service Metrics
* Alert Configuration
* View Alert Rules
* Enable One-Click Alert
* New/Edit Alert Rule
* Channel
* View Alert History
---
# Source: https://docs.velodb.io/cloud/4.x/management-guide/more/amazon-aws/create-data-credential
Version: 4.x
On this page
# Create a Data Credential
VeloDB adopts a storage-compute separation architecture, where data is
typically stored in object storage. To ensure that the warehouse can access
the underlying data properly, a Data Credential must be created in advance.
The core of a Data Credential involves creating an IAM policy and an IAM role.
VeloDB will automatically attach this role to the EC2 used by the VeloDB
warehouse. Below are the detailed steps.
## Step 1: Create an S3 Bucket
First, you need to prepare an S3 Bucket. If you already have one, you can skip
this step and proceed to Step 2.
> **NOTE** The S3 bucket you use must be located in the same AWS region where
> your VeloDB warehouses are deployed. If you do not already have a bucket in
> that region, please create one before proceeding.
1. Log in to the AWS S3 Console as a user with administrator privileges.
2. Click the **Create bucket** button.
3. On the **create bucket** page, set the following options:
1. Enter a name for the bucket.
2. Select the AWS region that you will use for your VeloDB warehouse deployment.
3. Enable Bucket Versioning (recommended).
4. Click **Create bucket**.
5. Copy the bucket name to add it to the VeloDB console.
## Step 2: Create an IAM Policy
After the S3 bucket is provisioned, create an IAM policy that grants read and
write access to the bucket.
1. Log into the **[AWS IAM Console](https://console.aws.amazon.com/iam/)** as a user with administrator privileges.
2. Click the **Policies** tab in the sidebar.
3. Click the **Create policy** button.
4. In the policy editor, click the **JSON** tab.
5. Copy and paste the following access policy into the editor, replacing `<your-bucket-name>` with the name of the S3 bucket you prepared in the previous step.
{
"Version": "2012-10-17",
"Statement":
[
{
"Effect": "Allow",
"Resource": "arn:aws:s3:::",
"Action":
[
"s3:GetBucketLocation",
"s3:GetBucketVersioning",
"s3:PutBucketCORS",
"s3:ListBucket",
"s3:ListBucketVersions",
"s3:ListBucketMultipartUploads"
]
},
{
"Effect": "Allow",
"Resource": "arn:aws:s3:::/*",
"Action":
[
"s3:GetObject",
"s3:GetObjectVersion",
"s3:PutObject",
"s3:DeleteObject",
"s3:DeleteObjectVersion",
"s3:AbortMultipartUpload",
"s3:ListMultipartUploadParts"
]
},
{
"Effect": "Allow",
"Action":
[
"sts:AssumeRole"
],
"Resource": "*"
}
]
}
6. Click the **Next** button.
7. In the **Name** field, enter a policy name (e.g., VeloDBDataStorageAccess).
8. Click **Create policy**.
## Step 3: Create a Service IAM Role
1. Click the **Roles** tab in the IAM console sidebar.
2. Click **Create role**.
1. Trusted entity type: Select **AWS** service.
2. Use cases: Select **EC2**.
3. Click the **Next** button.
4. Attach Permission Policies: In the policy search box, enter the name of the policy you created in Step 2.
5. In the **role name** field, enter a role name. (e.g. **VeloDBDataStorageAccessRole**)
6. Click **Create role**.
3. Update the Role's Trust Relationships.
Now that you have created the role, you must update its trust policy to make
it self-assuming. In the IAM role you just created, go to the Trust
Relationships tab and edit the trust relationship policy as follows, replacing
the `<account-id>` and `<role-name>` values.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "ec2.amazonaws.com",
"AWS": "arn:aws:iam:::role/"
},
"Action": "sts:AssumeRole"
}
]
}
4. In the role summary, copy the **Instance Profile ARN** (format: arn:aws:iam::`<account-id>`:instance-profile/`<role-name>`) to add to the VeloDB console.
On This Page
* Step 1: Create an S3 Bucket
* Step 2: Create an IAM Policy
* Step 3: Create a Service IAM Role
---
# Source: https://docs.velodb.io/cloud/4.x/management-guide/studio
Version: 4.x
On this page
# Studio
VeloDB Cloud Studio ("Studio") is a cloud-based data development platform
provided by VeloDB. It helps users manage and explore data and can replace
tools such as Navicat.
## Main Function
* **Warehouse Login** : Use different database users to log in to the warehouse in the Studio.
* **Data query** :
* **SQL Editor** : An easy-to-use SQL query editor that supports query execution, automatic SQL saving, query profiles, historical query records, etc.
* **Log Analytics** : A user-friendly analysis tool for log scenarios, supporting SQL filtering, searching, and other functions.
* **Session Management** : Manage running SQL queries and allow viewing and terminating SQL queries.
* **Query Audit** : A one-stop historical query audit tool that can filter slow queries and view their execution.
* **Workload Management** : Supports quick creation, editing, and viewing of Workload Groups.
* **Data Management** : View and manage data in the database, currently supports viewing.
* **Privilege Management** : Manage users and roles in the database, and grant and revoke permissions to them.
* **Data Integration** : Easily connect to data in object storage on the cloud, connect to data lakes, and import sample data.
* **Import** : Supports viewing import tasks and operating on them.
## Register and Login
### Using the Studio service
In VeloDB Cloud Manager ("Manager"), each warehouse has a corresponding Studio
service. In the "Connection" module of Manager, you can find the entrance to
the Studio through a private network or a public network.
You can also save the entry address of the Studio for direct access.

### Login to Studio

You need to enter the **Username** and **Password** of the warehouse on the
login page. If you clicked the link to log in from the Manager, the warehouse
name should be pre-filled.
We will not record your login account and password, but you can use the
password-saving function that comes with your browser.
## Data
The "data" module is the basic function of Studio to manage the database, and
it mainly has two functions:
1. Check the data and its organizational form, such as database table structure, data size, table creation statement, table field information, data preview, etc.
2. Add, delete and modify database objects, including new creation, deletion, and renaming of database objects.
The Data module is organized according to how data is structured in the
database and is divided into **Catalog** - **Database** - **Table**/**View**.
### Catalog
A catalog is a collection of databases.
Catalogs are divided into internal and external catalogs. The internal
catalog contains VeloDB's own databases, while external catalogs can connect
to Hive, Iceberg, Hudi, etc., since VeloDB supports data lake features.
VeloDB Studio supports direct deletion of catalog objects.
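As a rough illustration, connecting an external catalog is ultimately done with a `CREATE CATALOG` statement; the sketch below assumes a hypothetical Hive Metastore address and only shows the general shape of the command.
```sql
-- Hypothetical example: connect an external Hive catalog.
-- The metastore URI is a placeholder, not a real endpoint.
CREATE CATALOG hive_catalog PROPERTIES (
    "type" = "hms",
    "hive.metastore.uris" = "thrift://hive-metastore-host:9083"
);

-- Switch to the new catalog and list the databases it exposes.
SWITCH hive_catalog;
SHOW DATABASES;
```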

### Database
A database is a collection of tables, views, materialized views, and
functions. The database belongs to the directory. When a directory is
selected, you can view the database under the directory and the size of the
database. At the same time, you can create, delete, and rename the database
under the page.

### Table
Table is the basic unit of VeloDB data warehouse, and table belongs to
database.
When a database is selected, you can see the tables under the database, as
well as the size of the table, creation and modification time.

When you click on a table, you can enter the table's details management page
and view the table's DDL definition, fields, indexes, and other information.

The Data Preview page is used to quickly preview a table's data; by default,
it previews the first 100 rows of the table. The "Total x data" figure is
obtained from the metadata service, so it may be delayed.

### View
A view is a visual table based on the result set of SQL statements. The view
page is roughly similar to the table page. Attributes (such as indexes,
details) that the view does not have will not be displayed. The view also
supports data preview function (the first 100 pieces of data).
### Materialized View
Materialized View is a table that pre-calculates query results and stores,
which can be used to accelerate query performance and reduce real-time
computing pressure. The Studio database page can list the materialized view
information under the database.
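For illustration, a simple synchronous materialized view could be created as in the sketch below; the `sales` table, its columns, and the table properties are hypothetical, and the exact options supported by your warehouse version may differ.
```sql
-- Hypothetical base table for the example; properties are illustrative only.
CREATE TABLE sales (
    region VARCHAR(32),
    amount BIGINT
) DUPLICATE KEY(region)
DISTRIBUTED BY HASH(region) BUCKETS 10
PROPERTIES ("replication_num" = "1");

-- A materialized view that pre-aggregates amount by region, so queries
-- grouping by region can be answered from the pre-computed result.
CREATE MATERIALIZED VIEW mv_sales_by_region AS
SELECT region, SUM(amount)
FROM sales
GROUP BY region;
```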
### Function
The Studio database page can list the function information under the database,
and supports viewing the function type, return type, creation statement and
other information.
## SQL Editor
The query result will be returned below the edit box, and the error or success
status and information returned by the query will also be displayed at the
query result.
At the same time, you can click the drop-down button on the right side of
**Run (LIMIT 1000)** and switch to **Run and Download** to download your query
results.

Session records are the history of the Tab you open in the SQL Editor. You can
click on the SQL statement in the record and copy it to the SQL Editor for
execution.

Query history is the history of the SQL statement you execute in the SQL
editor. You can click on the SQL statement in the record to view the Profile
information of the statement.
> **NOTE** There is no Query ID for non-query statements, nor for failed
> statements.

By default, query plans are enabled for queries initiated in the Studio, which
will not affect the performance of a single query. Click "Query Statement" to
enter the execution plan page.
The download button can download Profile information, including Profile
information in pure TEXT format and visual Profile images.
The Import Profile button can import Profile information in the TEXT format,
and after importing, you can visually view the Profile. This helps you
visually analyze queries initiated from other clients.

We have built-in sample query statements for some test datasets in Studio to
help you do some simple performance testing.

In the results panel, you can see the execution results of SQL statements,
including query results, execution time, number of rows, etc. You can also
search for results through the search box, or click the table header to sort
the results.

## Session Management
Session management allows administrator users to manage the use of resources
and prioritize critical queries to improve system performance and provides
detailed information about each session, such as execution time, the user who
initiated the query, and the resources being used.
You can view all currently running SQL queries and terminate any queries that
cause problems or run time exceeds expectations.
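If you prefer a SQL client over the Studio UI, roughly the same thing can be done with the MySQL-compatible statements below; the connection ID `12345` is a made-up example.
```sql
-- List the currently running sessions and their queries.
SHOW PROCESSLIST;

-- Terminate a problematic session by the connection ID shown in the list above.
-- 12345 is a placeholder value.
KILL 12345;
```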

You can check the table to display more information about running SQL queries,
such as scan size, scan number of rows, return number of rows, etc.

Click the Query ID of the session to further view the complete information of
the session, including the executing user, the FE node that received the
session, and the execution plan (Profile) of the SQL.

## Query Audit
Query audits are used to audit and analyze query history executed in the
system. It allows you to filter and identify poor performance queries to
optimize database performance.
The tool includes analytics to gain insight into the execution plan and
resource usage of each query. As a one-stop solution for tracking query
performance, discovering trends, and diagnosing problems.
You can filter historical queries and, in List Selection, select more
dimensions to assist in analysis.
Click "Query ID" to enter the query details page and view more query
information. If Profile is enabled, you can view the query profile on this
page.

## Search Analysis
Search and analysis is launched by VeloDB Studio. It is a query tool for log
analysis scenarios, which can easily search, query and count logs.
The interactive search and analysis interface is similar to the Kibana
Discover page, which optimizes in-depth experience for log retrieval and is
divided into 4 areas:
* **Input area at the top** : Select the cluster, table, time field, and query time period. The main input box supports two modes: keyword retrieval and SQL.
* **The field display and selection area on the left** : Display all fields in the current table. You can select which fields are displayed in the detailed display area on the right. Hovering over the field will show the 5 values and the proportion of the occurrence of this field. You can further filter by value. The filtering conditions are reflected in the filtering part of the input area.
* **The trend chart display and interaction area in the middle** : Display the number of logs that meet the conditions at a certain time interval. Users can select a period of time in the box on the trend chart to adjust the query time period.
* **Detailed data display and interaction area below** :: Display log details, you can click to view the details of a certain log. It supports two formats: table and JSON. The table form also supports interactive creation of filter conditions.
Click `Query > Search Analysis` and select the table as `internal_schema >
audit_log`, Studio will automatically query the fields in the table and select
the first time field.

Hover over the state field on the left to display its most frequent values
(EOF, OK, ERR) and their proportions. You can also create filter conditions
by clicking the plus (+) or minus (−) button; for example, clicking the minus
(−) button to the right of ERR adds state != ERR to the filter conditions.

In the main input box, you can query keywords in either search or SQL mode.
Search mode is supported only on tables with inverted indexes.
Under the search box, select Search, enter GET on the right, and click Query.
In search mode, this searches for logs containing the keyword GET. The GET in
the details will be highlighted, and the number of entries in the trend chart
will change accordingly.

> **NOTE** Search uses the MATCH_ANY statement, which matches any of the
> keywords, and can match any field in the log. Note that the highlighting of
> the search results will match all search keywords as much as possible, but
> due to some special characters, it does not always match the search
> keywords exactly.
You can wrap phrases in double quotes in searches, such as `"GET
/api/v1/user"`, to match the entire phrase. Phrase search uses `MATCH_PHRASE`
to match the phrase.
If more precise matching is required, you can use SQL mode.
Under the search box, select `SQL`, enter a SQL WHERE condition, and click
`Query`.
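In SQL terms, the keyword and phrase searches above roughly correspond to the `MATCH_ANY` and `MATCH_PHRASE` predicates in a WHERE condition; the sketch below assumes a hypothetical `my_logs` table with an inverted index on its `message` column.
```sql
-- Keyword search: match rows whose message contains any of the keywords.
SELECT * FROM my_logs WHERE message MATCH_ANY 'GET POST' LIMIT 100;

-- Phrase search: match the whole phrase, preserving word order.
SELECT * FROM my_logs WHERE message MATCH_PHRASE 'GET /api/v1/user' LIMIT 100;
```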

Expand log details, optionally in Table or JSON format, the Table format
supports interactive creation of filters.

Click the context search on the right to view the 10 logs before and after
this log. You can continue to add filter conditions in the context search.

VeloDB introduces a new data type, `VARIANT`, which can store semi-structured
JSON data. The `VARIANT` type is especially suitable for handling complex
nested structures that may change at any time. Studio recognizes the
`VARIANT` data type, automatically expands its hierarchy, and provides a
special filtering method.
Let's take the github_events table as an example to show how to filter fields
of the `VARIANT` data type.
In the filter conditions, we can select a field of the `VARIANT` data type
and then select its subfields for filtering.
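The same kind of filtering can be expressed directly in SQL by accessing subfields of the `VARIANT` column with bracket syntax; the sketch below assumes, purely for illustration, that the github_events table has a `VARIANT` column named `payload`.
```sql
-- Filter on a subfield of a VARIANT column; the column and key names are
-- illustrative placeholders.
SELECT *
FROM github_events
WHERE CAST(payload['action'] AS STRING) = 'opened'
LIMIT 100;
```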

## Workload Group Management
> **NOTE** Workload Group Management supports VeloDB Cloud 4.0.0 and above.
Workload Group Management supports the rapid creation, editing, and viewing
of Workload Groups. Using Workload Groups, you can manage the CPU, memory,
and I/O resources used by query and load workloads in the cluster, and
control the maximum query concurrency in the cluster.
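Behind the UI, this corresponds to workload group DDL; a minimal sketch, with illustrative property values, might look like the following (check the in-product parameter descriptions for the exact options your version supports).
```sql
-- Create a workload group with illustrative resource limits.
CREATE WORKLOAD GROUP IF NOT EXISTS g_reporting
PROPERTIES (
    "cpu_share" = "1024",       -- relative CPU weight
    "memory_limit" = "30%",     -- share of cluster memory
    "max_concurrency" = "10"    -- maximum concurrent queries
);

-- Inspect the existing workload groups.
SHOW WORKLOAD GROUPS;
```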

You can view more items in the table filter above the Workload Group list.

In the New Workload Group interface, you can click on the question mark of the
parameter, and the description of the parameter will be displayed.

## Integrations
Integrations are portals connecting VeloDB Cloud with data outside the
warehouse.
Currently, you can create two types of integrations: Stage integration
(object storage) and sample data.

### Object Storage
By creating a new object storage integration, you can establish a
**Connection** with data in object storage. Through the **Integrate + Copy
Into** command, you can **Import** the data in the object storage to the
warehouse.
When creating a new object storage integration, you need to enter the
following:
* **Integration Name** : Consistent with database object naming rules; up to 64 characters, using letters, numbers, and underscores.
* **Comments** : Comments for the integration.
* **Bucket** : The bucket you need to integrate.
* **Default file path** : The file path to be accessed in the bucket. VeloDB will only access the files under the path you fill in. If you do not fill in, the default is that the data in the entire bucket can be accessed.
* **Access Authorization** : How VeloDB is allowed to access your bucket. There are two options: Access key and cross-account authorization. We recommend cross-account authorization for better security. For guidelines on cross-account authorization, please refer to the [IAM Cross-Account Access Guide](https://docs.velodb.io/cloud/management-guide/studio#iam-cross-account-access-guide-aws). You must pass the permissions check to successfully create an integration.
* **Advanced Configuration** : Details below.

Advanced configuration is divided into **File Type** and **Import
Configuration**. These are parameters that you may use when importing
integrated data. You can set them here or specify them when importing. If you
do not set or specify them, the system will execute the integration's import
tasks with the default configuration.

* **File type** : The default type of the integrated storage file, currently supports `csv`, `json`, `orc`, `parquet`. The default is that the system infers from the filename suffix.
* **Compression method** : The default compression type of the integrated storage file, currently supports `gz`, `bz2`, `lz4`, `lzo`, `deflate`. The default is that the system infers from the filename suffix.
* **Column separator** : The default column separator of the integrated storage file, the default `\t`.
* **Line separator** : The default line separator of the integrated storage file, the default `\n`.
* **File size** : The default size limit when importing files under this integration; unlimited by default.
* **On Error** : The default error-handling method when importing files under this integration and the data quality is unqualified. There are three options: continue importing, stop importing, and continue importing as long as the proportion of error data does not exceed a certain value.
* **Strict Mode** : Strictly filter the column type conversion during the import process. Default is off.
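For reference, a Copy Into import driven by such an integration might look roughly like the sketch below; the stage name `my_s3_stage`, the target table, and the property keys are illustrative assumptions, so check the Copy Into reference for the exact names supported by your version.
```sql
-- Hypothetical example: load CSV files exposed by the object storage
-- integration (stage) into a warehouse table.
COPY INTO my_db.my_table
FROM @my_s3_stage
PROPERTIES (
    "file.type" = "csv",            -- file format; may also be inferred from the suffix
    "file.column_separator" = ",",  -- column separator
    "copy.async" = "false"          -- wait for the load to finish before returning
);
```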
### Sample Data
Creating a new sample data integration imports sample data into the database
on top of creating an object storage integration, so you need to select a
cluster to complete the creation. For the TPCH, Github Event, and SSB-FLAT
test datasets, the following sizes are available from the drop-down menu: sf1
(1GB), sf10 (10GB), and sf100 (100GB); the test warehouse can only choose sf1
(1GB).
Clickbench only offers sf100 (100GB); we recommend using a larger cluster to
import the Clickbench sample data.

You can view the import progress in the sample data details.

## Permissions
### User
Display the users in the VeloDB repository. Note that the root user will not
be displayed here.
Only users with Admin authority can add and modify other users.

You can create a new user on this page; apart from the username, all other
fields are optional. However, we strongly recommend that you set passwords
for users and restrict the hosts they can connect from for enhanced security.
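For reference, creating a user with a password and a restricted host range can also be done in SQL; the user name, host pattern, and password below are placeholders.
```sql
-- Create a user that may only connect from the 10.0.0.x range,
-- protected by a password (all values are placeholders).
CREATE USER 'report_user'@'10.0.0.%' IDENTIFIED BY 'StrongPassw0rd!';
```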

### Role
Here you can manage the roles in VeloDB and perform authorization operations
on them.
Only users with Admin permissions can add and modify roles.
VeloDB currently does not support managing a role's users from the role side,
which means you need to specify roles when creating or modifying users.


### Authorize
On the user or role details page, click a specific user or role name to
enter the permission configuration page, where you can perform
authorization/revocation operations (a SQL sketch follows the list below).
You need Admin or Grant permissions at the corresponding level to perform
authorization/revocation.
In VeloDB, permissions are divided into the following categories:
* **Global** : Permissions at the level of the entire database. With global permissions, you automatically have the corresponding permissions on all objects in the database.
* **Data** : Permissions on data resources. They can be granted at different levels; having a permission at a parent level automatically grants the corresponding permissions on its children.
* **Workload Group** : Usage permissions only.
* **Resource** : Permissions on Resources, including Grant and Usage.
* **Compute Group** : Exists in VeloDB 3.0 storage-compute separation clusters and controls Usage permissions for different compute groups.
* **Cluster** : Exists in VeloDB Cloud and controls Usage permissions for different clusters.
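As an illustration, granting and revoking a data permission in SQL looks roughly like the sketch below; the database, table, and user identities are placeholders.
```sql
-- Grant read access on one table to a user.
GRANT SELECT_PRIV ON example_db.example_table TO 'report_user'@'10.0.0.%';

-- Revoke the same permission again.
REVOKE SELECT_PRIV ON example_db.example_table FROM 'report_user'@'10.0.0.%';
```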

## Import
VeloDB Studio supports managing load tasks such as Stream Load, Routine
Load, Broker Load, and Insert Into in the connected warehouse, and currently
supports the following operations:
* Information query for load tasks
* Stop Routine Load, Broker Load, Insert Into
* Pause/Edit/Recover Routine Load
You first select a database and then view all load tasks under that database
in the load task list.
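These operations map onto the usual load-management statements; a minimal sketch, assuming a Routine Load job named `example_job` in the current database, is shown below.
```sql
-- Check the status of the load job (the job name is a placeholder).
SHOW ROUTINE LOAD FOR example_job;

-- Pause, resume, or stop the job.
PAUSE ROUTINE LOAD FOR example_job;
RESUME ROUTINE LOAD FOR example_job;
STOP ROUTINE LOAD FOR example_job;
```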

Click the load task name to view the detailed information of the load task.

## IAM Role Setup Guide (AWS)
Please use the following steps to create the role and add permissions in your
AWS console:
1. Access the **IAM** service and select **Roles** from the menu. Click on the **Create role** button.

2. Select **Custom trust policy** in the **Select trusted entity** section. Replace `<velodb-warehouse-iam-role-arn>` in the following trust policy with the actual IAM Role ARN of your VeloDB warehouse.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": ""
},
"Action": "sts:AssumeRole"
}
]
}
3. Select the permission policies you would like to attach to the role. Click on the **Next** button.

4. Configure the **Role name**, and click on the **Create role** button to finish.
5. Click on the role name in the list of roles. Copy the value of the **ARN** from the **Summary** section to provide the value in VeloDB Cloud.

On This Page
* Main Function
* Register and Login
* Using the Studio service
* Login to Studio
* Data
* Catalog
* Database
* Table
* View
* Materialized View
* Function
* SQL Editor
* Session Management
* Query Audit
* Search Analysis
* Workload Group Management
* Integrations
* Object Storage
* Sample Data
* Permissions
* User
* Role
* Authorize
* Import
* IAM Role Setup Guide (AWS)
---
# Source: https://docs.velodb.io/cloud/4.x/management-guide/usage-and-billing
Version: 4.x
On this page
# Billings
This topic describes how to manage fee deduction channels and view bills for
organization administrators.
Before applying VeloDB Cloud in a production environment, it is recommended
to link a credit card or open a cloud marketplace deduction channel to ensure
continuous operation of the service.
## Deduction Channels
VeloDB Cloud currently supports four deduction channels: credit card, cloud
marketplace, cash, and vouchers. VeloDB Cloud generates bills periodically
and deducts fees from these channels.
Click **User Menu** > **Billings** and enter the Billing Overview page to
view the overall usage of these fee deduction channels.

The following describes the use of the above deduction channels:
### Credit Card
In Billing Overview page, click **Add** on **Credit Card** to complete the
setup.

You can’t remove a credit card in the Billing Overview page, but you can
update it anytime. This helps ensure your organization always has a valid
payment method. If you need to remove your credit card, please contact VeloDB
Cloud support for help.
### Open Cloud Marketplace
#### AWS Marketplace
This topic mainly describes how to use the AWS Marketplace deduction channel.
The specific opening process is as follows:
1. In Billing Overview page, click **Subscribe** on **Cloud Marketplace** card, find **AWS Marketplace** in the drawer page, then click **Go to Subscribe** to enter the VeloDB Cloud commodity page of the AWS Marketplace.

2. Click **View purchase options** to enter the Subscription page of the AWS Marketplace.

3. Click **Subscribe**. When the page displays "You are currently subscribed to this offer", click **Set up your account** to go to the authorization page of VeloDB Cloud.

4. Log in with your VeloDB Cloud account on the authorization page.

5. Select the target organization from the organization list, and click **Confirm Authorization**. Once the authorization is successful, your AWS account will deduct the subsequent expenses.

6. Click **Check** to go back to the Billing Overview page after completing the authorization. If **AWS Marketplace** is displayed among the cloud marketplace deduction channels, you have successfully opened the cloud marketplace deduction channel.


#### GCP Marketplace
This topic mainly describes how to use the GCP Marketplace deduction channel.
> Note: The additional commission rate in GCP Marketplace is 3% of the paid
> amount.
**1\. Go to GCP Marketplace VeloDB Cloud product**
You can jump to the GCP Marketplace through the VeloDB Cloud console, or
search for VeloDB directly in Marketplace.
* Jump to the GCP Marketplace through the VeloDB Cloud console
On the **Billing Overview** page in [VeloDB Cloud
console](https://www.velodb.cloud/), click **Open** on the **Cloud Marketplace
Deduction Channel** card, find **GCP Marketplace** on the drawer page, then
click **Go to Subscribe** to jump to the VeloDB Cloud product page in GCP
Marketplace.

* Search for VeloDB directly in GCP Marketplace
You can also find the VeloDB Cloud product on [GCP
Marketplace](https://console.cloud.google.com/marketplace) by searching for
"**VeloDB** " or "**Doris** " and then enter the VeloDB Cloud product page in
GCP Marketplace.

**2\. Subscribe VeloDB Cloud**
On the VeloDB Cloud product page in GCP Marketplace, click the button
**SUBSCRIBE** to go to the order confirmation page.


On the order confirmation page, check the terms and click the button
**SUBSCRIBE**.

In the secondary confirmation dialog box, you can click the button **GO TO
PRODUCT PAGE** to view the subscription effect, or you can click the button
**MANAGE ORDERS** to view the order changes.



On the VeloDB Cloud product page in GCP Marketplace, you need to click the
button **MANAGE ON PROVIDER** to jump to VeloDB Cloud console to register as a
user and log in to complete the authorization process.


If you have already registered, you can **log in** directly.

You can log in via your mobile phone number or email and proceed to the second
step: **Authorize Organization**.

Select the target organization and click the button **Confirm Authorization**.

> Note: There may be a delay in the order status, and you need to wait about 1
> minute before authorization. 
After successful authorization, proceed to the third step: **View
Authorization Result**. Click the button **Check** to go to the **Billing
Overview** page to view the status of the cloud marketplace deduction channel
activation.


**3\. Unsubscribe from VeloDB Cloud**
On the **Billing Overview** page in [VeloDB Cloud
console](https://www.velodb.cloud/), click **Change** on the **Cloud
Marketplace Deduction Channel** card, find **GCP Marketplace** on the drawer
page, then click the button **Go to Unsubscribe** to jump to the order page in
GCP Marketplace.

On the order page in GCP Marketplace, find the target order by **Order
Number** (the status is currently "Active"), click the action column on the
right, expand the drop-down menu, and click **Cancel order**.

In the secondary confirmation dialog box, enter the **Order Number** and click
**CANCEL ORDER**.

After success, you can see that the status of the target order is changed to
"**Canceled** " and the **GCP Marketplace Deduction Channel** card in VeloDB
Cloud console is restored to the initial unsubscribed state.

> Note: There may be a delay in the order status. You need to wait about 1
> minute and refresh the Billing Overview page in [VeloDB Cloud
> console](https://www.velodb.cloud/) to see the change of the Cloud
> Marketplace Deduction Channel.
**4\. Contact Sales**
If you want to know more about the product, you can CONTACT SALES by email.

### Recharge Cash
You can pay directly into your VeloDB account. After you complete the payment,
please provide the **payment receipt**, your **organization ID**, and your
**organization name** to VeloDB.
You can contact VeloDB sales or send an email to `support@velodb.io`, and we
will recharge your account's cash balance.
You can find your organization ID and organization name in Organization
Management. For details, please refer to Organization Management.
VeloDB Bank Account Information:
- Beneficiary Name: VELODB INC
- Beneficiary Address: 1142 Juniper Ct, San Jacinto, CA 92582
- Beneficiary Bank: Citibank, N.A.
- Beneficiary Bank Address: 388 Greenwich Street New York, NY 10013
- Beneficiary Account Number: 40806519
- SWIFT Code: CITIUS33XXX
- BRANCH CODE: 930
- ABA: 021000089
### Activate Voucher
On the Billing Overview page, switch to **Vouchers**.

Click **add voucher** and input the voucher activation code issued by VeloDB
Cloud to activate the voucher.

You can also view voucher usage and voucher activation history on the Voucher
Management page.
## Bills Statements
VeloDB Cloud collects the usage information of the entire organization every
minute, deducts fees hourly, and generates hourly bills and monthly bills,
which are mainly provided to organization administrators for reconciliation
and cost analysis.
On the Billing Statements page, you can view or export the bills.

If the credit card, cash, and voucher balances are insufficient or no cloud
marketplace deduction channel can be used, VeloDB Cloud will stop the service,
but the data will be retained for 7 days.
To keep the online service running continuously, please ensure that the cash
balance is sufficient or open a cloud marketplace deduction channel.
## Usage
After creating a warehouse, VeloDB Cloud collects the usage of different
resources in a warehouse, including compute, cache and storage, to help you
analyze the cost distribution in a specific warehouse.
Click **Usage** in the navigation bar on the left to view the current
warehouse usage information.

On This Page
* Deduction Channels
* Credit Card
* Open Cloud Marketplace
* Recharge Cash
* Activate Voucher
* Bills Statements
* Usage
---
# Source: https://docs.velodb.io/cloud/4.x/management-guide/user-and-organization
Version: 4.x
On this page
# User and Organization
## Registration and Login
Click to enter the VeloDB Cloud registration and
trial page and fill in the relevant information to complete the registration.

> **Tip** VeloDB Cloud includes two independent account systems: One is used
> for logging into the console, as described in this topic. The other one is
> used to connect to the warehouse, which is described in the Connections
> topic.
If you have already registered on VeloDB Cloud, you can click Go to login
below to log in directly.

## Account Management
### Change Password
After login, click **User Menu** > **Security** to change the login password
for the VeloDB Cloud console.

Once you have successfully changed the password for the first time, you can
use the password for subsequent logins.
### Manage Multi-Factor Authentication (MFA)
Multi-factor authentication adds additional security by requiring an
Authenticator app to generate a one-time verification code for login.
When you log in, VeloDB Cloud verifies both your password and the MFA
verification code.
You can use any authenticator app from the iOS or Android app store, such as
Google Authenticator or Authy, to generate this verification code.

### Notifications
At the bottom of the left navigation bar, click **User Menu** >
**Notifications** to go to the message center.
Events related to users, organizations, authorized warehouses, cluster
operations, and alarms in the platform will generate notifications to remind
users when they are triggered.
You can filter by time range, filter unread/read messages with one click, view
messages in pages, mark all messages as read with one click, mark checked
messages as read with one click, etc.

You can switch to the **Scheduled Events** page to see scheduled events.
Scheduled events include system-initiated events (for example, the system
automatically upgrades the core version according to the policy set by the
user) and user-initiated events (for example, manually upgrading the core
version by specifying an execution time window).
Some events (such as version upgrades) may cause disconnection and other
impacts on the business. Please ensure that the business has the reconnection
mechanism.
Before the event is executed, you can modify the scheduled execution time
window or cancel the event.
## Organization Management
An organization is the billing unit, and each organization is billed
individually. We recommend that you divide organizations by cost unit; one
user can be affiliated with multiple organizations.
Multiple warehouses can be created under one organization, and the data of
different warehouses are isolated.
You can switch the current organization using **Switch Organization** in the
user menu.

### Role Management
In the lower left corner, click **User Menu** > **Access Control** > **Role
Management**.
There are three roles by default in an organization, and you can create
multiple custom roles.
| **Role** | **Manage Access Control** | **Manage Billing** | **Manage Organization** | **Manage Warehouse** |
|---|---|---|---|---|
| Organization Admin | Yes | Yes | Yes | All warehouses: Create / Edit / View / Query / Monitor |
| Warehouse Admin | No | No | No | All warehouses: Edit / View / Query / Monitor |
| Warehouse Viewer | No | No | No | All warehouses: View / Query / Monitor |
* View existing roles:

* New role:
You can specify the role name and its corresponding privileges during
creation.
Custom roles can also be deleted or edited.
The user who creates the organization is assigned the organization
administrator role by default.

### User Management
Organization administrators can invite new users to the current organization
and grant different roles.
New users can join the organization by activating the link in the invitation
email.

### MFA Settings
After enabling MFA, all organization users must complete secondary
authentication before logging in.

### Audit
After login, click **User Menu** > **Audit** to see the audit log for the
VeloDB Cloud console.
VeloDB Cloud logs historical activities at the organization level. An event
indicates a change in your VeloDB Cloud organization. You can view the logged
activities on the audit page, including the activity, time, IP, and user.

### Organization Details
Click **User Menu** > **Organization Details** to see the organization ID,
creation time, and organization name.

On This Page
* Registration and Login
* Account Management
* Change Password
* Manage Multi-Factor Authentication (MFA)
* Notifications
* Organization Management
* Role Management
* User Management
* MFA Settings
* Audit
* Organization Details
---
# Source: https://docs.velodb.io/cloud/4.x/release-notes/platform-release-notes
Version: 4.x
On this page
# Platform Release Notes
This article describes the release notes for the management and control
platform of VeloDB Cloud.
## December 2025
**New Features**
* Added a one-click alert feature to rapidly set up an alerting system, enabling timely awareness of exceptions in key monitoring items.
* Optimized the AWS Cloud BYOC template mode by upgrading authentication from AK/SK to IAM Role and supporting reuse through a credential wizard.
* Added support for visual creation of external Catalogs, lowering the barrier for multi-source data integration.
## November 2025
**New Features**
* Added Transparent Data Encryption (TDE) function, providing higher-level security protection for static data.
* Supports data backup and recovery, ensuring the reliability and continuity of business data.
* Added operational audit logs, meeting security compliance and operational traceability requirements.
* BYOC supports wizard mode, making bring-your-own-cloud cluster deployment easier and faster.
* Supports seamless data import from Confluent Cloud and Kafka, simplifying real-time data integration.
* Added credit card payment method, providing users with more convenient and flexible payment options.
* AWS Marketplace supports Private Offer/Contract/Free Trial.
**Improvements**
* Deeply integrated with SQL editor, it delivers a seamless and smooth experience for data development and management.
## August 2025
**New Features**
* Added support for Single Sign-On (SSO) via Google and Microsoft.
* MFA now supports authenticator apps (such as Google Authenticator, Microsoft Authenticator, or Authy).
**Pricing**
* Warehouse usage is now free of charge, no separate fees will be applied.
**Improvements**
* Warehouse connections are now public by default, with an optimized private endpoint configuration process and improved connection information display.
* Reorganized the management platform menu for clearer organization and personal configuration options.
* Added a new warehouse usage guide and improved guidance from usage to payment.
* Enhanced alert notifications and added alert recovery reminders.
* Supports synchronous deployment of warehouse and cluster with parallelized processes.
**New Regions**
* The SaaS model was launched in the Tokyo region of AWS.
## June 2025
**New Features**
* Add premium technical support service billing item. Customers who purchase this service will need to pay an additional fee.
* Supported monitoring and alerting of cache space utilization.
**Improvements**
* Smooth out the commission fee difference between the cloud marketplace deduction channel and the cash deduction method. Customers who use AWS Marketplace or GCP Marketplace deduction channels will have the same cost as the cash deduction method for recharging on VeloDB Cloud. By using the cloud marketplace deduction channel, there is no need to bear additional commission fees.
* Optimized Studio login prompt information.
**New Regions**
* The BYOC model was launched in the Middle East (Bahrain) region of AWS.
## May 2025
**New Features**
* Supported multi-availability-zone disaster recovery. By mounting the active and standby clusters through a virtual cluster, the service can automatically fail over to the standby cluster in another availability zone when the active cluster fails and continue to provide service. When users need to test and rehearse, they can also manually switch between the active and standby clusters. This feature requires a core version of at least 4.0.7 and a region with at least 3 availability zones.
**Improvements**
* Optimized the email notification content for warehouse code version upgrade failures.
* Optimized the BYOC warehouse core version upgrade function prompt content, reminding that after upgrading the core version, the cluster HTTP protocol port will change, and users need to add the new port to the access control whitelist to allow outgoing requests to access this port.
* The VeloDB Cloud product introduction page on AWS Marketplace had added the "Deploy on AWS" designation.
## April 2025
**New Features**
* Supported bank corporate transfer recharge.
* Supported VeloDB Professional Services purchase on AWS Marketplace.
* When creating a BYOC warehouse, a subnet segment detection step has been added. If the segment is too small to allocate the IPs, the process will be interrupted and an error message will be displayed.
**Improvements**
* Optimized the BYOC warehouse core version upgrade function.
**New Regions**
* The BYOC model was launched in the Asia Pacific (Singapore) region of AWS.
## March 2025
**New Features**
* Add basic metrics and service metrics monitoring for warehouse.
* Add alarm for warehouse metrics.
**New Regions**
* The SaaS model was launched in the Asia Pacific (Hong Kong) region of AWS.
## February 2025
**New Features**
* Supported BYOC warehouse and cluster custom tags.
**Improvements**
* Optimize error information when creating the BYOC warehouse.
**New Regions**
* The BYOC model was launched in the us-east4 region of GCP.
## January 2025
**New Features**
* Supported choosing CPU architecture when creating a new cluster in VeloDB Cloud SaaS or BYOC warehouse on AWS, default is x86, customers can choose ARM. Once a cluster is created, modifying the CPU architecture is not supported.
## December 2024
**New Features and Improvements**
* Support multi-availability-zone disaster recovery
* Azure cluster supports independent cache expansion
* When creating a warehouse, you can specify whether the table name is case sensitive
**New Cloud Platforms**
* BYOC mode was launched on Azure
## November 2024
**New Features and Improvements**
* Support GCP Marketplace
* Alert rules and alert history support paging
* Add a permanent Get Help entrance in the lower right corner
## October 2024
**New Features and Improvements**
* Optimization of the registration/login/free trial links between the official website and Cloud
* The SaaS free trial warehouse period has been increased from 7 days to 14 days
* Open personal email registration/login, and add support for mobile phone login
* Automatically create an organization when a new user registers and logs in, reducing operations
* When activating a free warehouse, verify whether the organization has been associated with an enterprise email; if not, an enterprise email must be associated first
## September 2024
**New Features and Improvements**
* BYOC warehouse usage optimization
* Optimize the process of creating a BYOC warehouse, add preparation guidance and document guidance
* Optimize the deletion of the last BYOC warehouse and clear the BYOC environment
* Optimize WebUI Link availability check, connection distinction between public and private IP
* Optimize the core version upgrade, bind upgrade Meta Service
* Optimize the minimum permission set of Amazon Web Services
* Optimize the source of Amazon Web Services security group, narrow it down to subnet CIDR
* Optimize the unified alarm link
**New Zones**
* BYOC mode was launched in the Huawei Cloud Beijing 4 region.
## August 2024
**New Regions**
* BYOC mode was launched in the US West (Oregon) region of AWS.
## July 2024
**New Features**
* Supported setting the **O &M Time Window** for each warehouse.
* Supported setting the **Patch Version Upgrade Policy** of VeloDB Core, users can choose auto upgrade or manually upgrade.
* Supported **Scheduled Events** for each warehouse. The only event type was "**Upgrade Version**", covering events where the system automatically upgrades the patch version of VeloDB Core according to the policy set by the user, and events where the user manually upgrades the version of VeloDB Core by specifying an execution time window.
* Supported **Message Center** , currently including **In-site Messages** and **Scheduled Events** list management functions.
**New Cloud Platforms**
* BYOC mode was launched on GCP.
* SaaS mode was launched on Azure.
**New Regions**
* BYOC mode was launched in the Oregon (us-west1) region of GCP.
* SaaS mode was launched in the West US 3 (Arizona) region of Azure.
**Improvements**
* In-site message function optimization, supported list management, including: filtering by time range, one-click filtering of unread/read messages, paging messages, one-click marking of all messages as read, one-click marking of checked messages as read, etc.
## June 2024
**New Features**
* Supported presenting the port information of clusters, allowing users to conveniently import data using the Stream Load method.
* Supported users to directly view the statistical results of Consumption Amount, Pretax Amount, and Arrears Amount.
* Supported _compaction score_ metric monitoring and alarm.
* Supported whitelisted personal emails registering an organization (account) on VeloDB Cloud.
**Improvements**
* Organization administrators (including organization creators) cannot modify their own roles.
* The Cash Balance, Voucher Balance, Cloud Marketplace Deduction Channel and other information layout optimization in Billing Center -> Billing Overview page.
## March 2024
**New Features**
* Supported users to individually adjust the cache space of the cluster (currently only scaling out is supported).
* The yearly billing resources of VeloDB Cloud cluster on AWS support scaling out.
**Improvements**
* Users can view monitoring information when the cluster is not running.
* When deleting the SaaS mode trial cluster, the SaaS mode trial warehouse will also be deleted.
* When users select the WeCom group, Lark group, or DingTalk group as the alert channel, they will be reminded that the VeloDB Cloud server IP address can be added in the access control whitelist of **Webhook**.
## February 2024
**New Features**
* Supported **on-demand(hourly)** , **subscription(monthly)** , and **subscription(yearly)** billing method for the paid clusters. The paid clusters can have only one of these billing methods, or a combination of [monthly + hourly] or [yearly + hourly] billing methods. Users can directly convert the on-demand(hourly) billing resources after testing and stabilization to monthly or yearly billing to save long-term ownership and use costs; they can also flexibly scale out/in the on-demand(hourly) resources at any time to cope with temporary increases and decreases in business on the basis of monthly or yearly billing resources.
**New Cloud Platforms**
* SaaS mode was launched on Alibaba Cloud.
**New Regions**
* SaaS mode was launched in the Singapore region of Alibaba Cloud.
**Improvements**
* The SaaS mode on Alibaba Cloud is officially commercialized with price.
* The SaaS mode on HUAWEI CLOUD is officially commercialized with price.
* When creating a new warehouse, the configuration parameter _region_ supports classification, corresponding to different price classifications.
## December 2023
**New Features**
* BYOC mode supported distinguishing between the free warehouse and paid warehouses, and the free warehouse can be upgraded to paid use.
* BYOC free warehouse quota limit. Each organization can only activate one free warehouse. Only one free cluster can be created in the free warehouse. The maximum computing resources are 64 vCPU. The upper and lower limits of the cache space are limited by the computing resources and vary.
**Improvements**
* Optimization of the description, graphics and hypertext links for HUAWEI CLOUD **Private Network Connection** function in SaaS mode.
## November 2023
**New Features**
* Supported customizing the cache space when creating a new cluster. The upper and lower limits of the cache space are affected by computing resources and vary.
**New Cloud Platforms**
* SaaS mode was launched on HUAWEI CLOUD.
**New Regions**
* SaaS mode was launched in the AP-Jakarta region of HUAWEI CLOUD.
**Improvements**
* The **WebUI** login entrance had been added to the warehouse function menu, making it more convenient and faster.
## October 2023
**New Features**
* A new private warehouse (**BYOC, Bring Your Own Cloud**) product mode had been added, and whitelist customers were invited to experience it for free. For customers who need to run the VeloDB data warehouse in their own cloud account and VPC, they can use this product mode. This mode of product has the same capabilities as a proprietary warehouse (**SaaS, Software as a Service**) mode, including: cloud native computing and storage separation, elastic scaling, monitoring and alarming, etc. In addition, it can also meet customers' additional needs, including: higher compliance requirements, better cloud resource discounts, and better connection with the surrounding big data ecosystem.
**New Cloud Platforms**
* BYOC mode was launched on AWS.
**New Regions**
* BYOC mode was launched in the US East (N. Virginia) region of AWS.
**Improvements**
* Overall optimization of monitoring metrics.
* Storage resource usage statistics were more accurate.
## September 2023
**New Features**
* Supported **Auto Resume** when receiving a business request when the on-demand cluster was shut down, improving the **Auto Pause/Resume** function.
* Supported the **Auto Pause** function of the SaaS free trial cluster. This function is enabled by default (disable is not supported). It will be automatically paused after being idle for 360 minutes (user-definable). Users need to manually resume it.
**Improvements**
* The functional constraints of the free trial warehouse and cluster in various states are more standardized, and usage statistics are more accurate.
* Usage information display optimization.
* Added 3 new monitoring metrics: Load Rows Per Second (Row/s), Load Bytes Per Second (MB/s), and Finished Load Tasks.
* When deleting a warehouse, the current operator's email address is displayed for receiving verification codes.
## August 2023
**New Features**
* Supported creating and modifying organizations.
* Supported new customer self-registration organizations (login is registration).
**New Regions**
* AWS Europe (Frankfurt)
**Improvements**
* The list of AWS endpoints for private network connection was optimized, and tips and links were given on where to find the Endpoint DNS Name.
* **IP Whitelist Management** optimization for public network connection.
* Quota prompts for **New Organization** , **New Warehouse** , and **New Cluster**.
* Update the content of **In-site Notifications** and **Email Notifications**.
## June 2023
**New Features**
* On-demand billing clusters support **Time-based Scaling** , which can not only meet the needs of business load scenarios with obvious peaks and lows in a day and have time-periodical regularity, but also avoid the situation that the configuration is too low to cause insufficient resources or the configuration is too high to cause resource waste.
* On-demand billing clusters supported **Manual Pause/Resume** , and **Auto Pause**. It can release computing resources while retaining cache space when the cluster has no load, reducing resource waste and saving costs. It can also quickly pull up computing resources and mount reserved cache resources and data, so that business requests can be quickly responded to.
* WebUI supports multiple tab pages, which is convenient for users to process multiple SQL queries in parallel.
**Improvements**
* WebUI space utilization optimization and database table directory tree optimization provide larger query statement/result display space.
## May 2023
**New Features**
* The cluster supported cloud disk caching; the vCPU-to-memory ratio is fixed at 1:8, and the vCPU-to-cache ratio is temporarily 1:50.
* Supported "Lake House", integrate structured or semi-structured source data such as Hive, object storage (S3), MySQL, and Elasticsearch from user data lake through public network or private network connections, and perform federated query analysis in one VeloDB Cloud data warehouse; At the same time, the style of the private network connection had been reconstructed, and two methods are supported: access to VeloDB Cloud data warehouse from the user's clients or applications and access to the user's data lake from VeloDB Cloud data warehouse.
* Supported **Multi-Factor Authentication (MFA)** , strengthen login identity authentication and sensitive operation security (related functions include: MFA policy settings, batch invite users, profile, enroll mobile phone, SMS verification, password reset, etc).
* Added 3 information cards to the **Usage** page: Latest Compute Capacity (vCPU), Latest Cache Space (GB), and Latest Storage Size (GB).
**New Regions**
* AWS Asia Pacific (Singapore)
**Improvements**
* The cluster was adjusted to the configuration of the cluster's overall resources (vCPU, memory, and cache) from the configuration of multiplying the node size and the number of nodes.
* Cloud marketplace deduction authorization process optimization (new user guidance prompts, authorized organizations directly enter the console).
* Security certification: Passed six certifications of ISO.
* WebUI login entrance optimization (prominent position, early prediction and prompts whether and how to log in).
* Optimized the **IP whitelist** for public network connections (adding the last operator information).
* Warehouse navigation and detail optimization (added zone and creator information, rearranging the overall information).
## February 2023
**New Features**
* The **Billing Center** page had been revised, and it supported **Monthly Bill** , **Hourly Bill** , **Billing Details** , and **Voucher Management**.
**New Regions**
* AWS US West (N. California)
**Improvements**
* The account system was restructured, and the permissions of VeloDB Cloud users and the database users were separated.
* The **Query** function module was independently used as a **WebUI** tool, and users need to log in to the warehouse to query data.
* The **Usage** page had been revised, and the Unit metering mechanism had been changed to vCPU-Hour and GB-Hour metering mechanisms.
* The **Billing Center** page had been revised, and the Unit billing and deduction mechanism had been changed to currency billing and deduction mechanism.
* Improved message templates for the **In-site Notification** function and **Email Notification** function, and updated related links and descriptions.
## November 2022
**New Features**
* The core version can be configured when creating a new warehouse, and in the drop-down selection box, only the latest patch version was retained for each minor version x.y.
* The **Warehouse Details** card added the core version number information. If the current version is not the highest version in the region of the cloud, there will be an upgrade reminder. Click the link icon can go to the **Settings** page to upgrade the version.
* The **Warehouse Details** added creation time information.
* The Warehouse statuses added "upgrading".
* **In-site Notification** function, adding support for notification of core version upgrade success and notification of core version upgrade failure.
* Supported the reminder card for the remaining time of the trial warehouse, which can be upgraded to paid warehouse with one click.
**Improvements**
* Adjusted the position of the core version upgrade entry, moved from the **Cluster Details** page to the **Warehouse Details** card, and can upgrade the core version of the warehouse and all clusters in it. The core version number was divided into three levels: Major, Minor, and Patch, and the format is as follows: x.y.z.
* Both the cluster card on the **Cluster Overview** page and the basic information on the **Cluster Details** page shielded the core version number, and the function operation area on the **Cluster Details** page shielded the **Version Upgrade** function.
* The **Cluster Resize** function and the **Cluster Scaling** function were integrated, and the name of the new function was unified as "**Cluster Scaling** ".
## October 2022
**New Features**
* Cluster reconstruction: the cluster was split into the warehouse service and the computing cluster.
* Supported storage-computing separation architecture, multiple computing clusters, and shared object storage data.
* Supported local disk as cluster cache.
* Supported **AWS Marketplace Deduction Channel** , AWS customers can reuse the balance of the AWS cloud account, and uniformly issue bills and Invoices from AWS.
* **In-site Notification** function, adding support for notifications of warehouse creation success, notifications of warehouse creation failure, notifications of warehouse deletion success, notifications of warehouse deletion failure, reminders that a trial warehouse is about to expire and stop service, notifications of trial warehouse expiration and suspension of service, reminders that a trial warehouse and its data will soon be deleted, notifications of trial warehouse service recovery, notifications of trial warehouse and data deletion, reminders of suspension of service of paid warehouses due to arrears of payment, notifications of suspension of service of paid warehouses due to arrears of payment, reminders that paid warehouses and their data will be deleted, notifications of paid warehouse service recovery, and notifications that paid warehouses and their data have been deleted.
* **Email Notification** function, adding support for welcome notifications when joining the organization, notifications of verification codes, reminders that a trial warehouse is about to expire and stop service, notifications of trial warehouse expiration and suspension of service, reminders that a trial warehouse and its data will soon be deleted, notifications of trial warehouse service recovery, notifications of trial warehouse and data deletion, reminders of suspension of service of paid warehouses due to arrears of payment, notifications of suspension of service of paid warehouses due to arrears of payment, reminders that paid warehouses and their data will be deleted, notifications of paid warehouse service recovery, and notifications that paid warehouses and their data have been deleted.
* The console **Login** page supported switching between the Chinese station and the international station.
**New Regions**
* AWS US West (Oregon)
**Improvements**
* For operations that would cause cost changes (including: **New Cluster** , **Cluster Resize** , and **Cluster Scaling**), added a second confirmation.
* The **Organization Management** function supported organization ID (unique identifier) and setting duplicate organization names.
* **Data Query** function was enhanced.
* The entrance position of the **Access Control** function was adjusted, and it was moved from the warehouse operation area to the user operation area.
* The console interface had been revised and optimized, and the overall layout and UI components had been unified and standardized.
## August 2022
**New Features**
* Supported SaaS mode, that is, both the cluster and the management and control platform were deployed in the VeloDB VPC.
* The **Connection** module was independent from **Cluster Management** , and supported public network connection and private network connection, and the **Private Network Connection** function supported AWS PrivateLink.
* Supported cloud disk storage.
* Added the "Trial" free trial node size for clusters.
* Supported the On-Demand billing method, and charged for the overall resources of the cluster.
* Both **In-site Notification** function and **Email Notification** function supported reminders of upcoming arrears, notifications of suspension of services due to arrears, reminders of imminent deletion of data, notifications of cluster recovery service, and notifications of cluster release and data deletion.
**Improvements**
* Console interface revision and optimization, including: **New Cluster** , **Cluster Details** , **Cluster Upgrade** , **Cluster Resize** , **Cluster Scaling** , **Cluster Deletion** , **Billing Overview** , **Billing Help** , **Purchase Units** , **Historical Orders** , etc.
* The **Metering and Billing** page was split into the **Usage** page and the **Billing Center** page. The **Usage** page remained in the navigation bar of the cluster operation area, and the entrance to the **Billing Center** page was moved to the user operation area.
* Removed the function of **AK &SK Authorization of Customer Cloud Account**.
## July 2022
**New Features**
* Supported hybrid mode, that is, the cluster was deployed in the customer VPC, and the management and control platform is deployed in the VeloDB VPC.
* Supported basic functions such as **Cluster Management** , **Data Query** , **Performance Monitoring** , **Access Control** , **AK &SK Authorization of Customer Cloud Account**, and **Metering and Billing**.
* Supported the On-Demand billing method, and only charge value-added service fees.
**New Cloud Platforms**
* AWS
**New Regions**
* AWS US East (N. Virginia)
On This Page
* December 2025
* November 2025
* August 2025
* June 2025
* May 2025
* April 2025
* March 2025
* February 2025
* January 2025
* December 2024
* November 2024
* October 2024
* September 2024
* August 2024
* July 2024
* June 2024
* March 2024
* February 2024
* December 2023
* November 2023
* October 2023
* September 2023
* August 2023
* June 2023
* May 2023
* February 2023
* November 2022
* October 2022
* August 2022
* July 2022
---
# Source: https://docs.velodb.io/cloud/4.x/security/audit-plugin
Version: 4.x
On this page
# Audit Log
Doris provides auditing capabilities for database operations, allowing the
recording of user logins, queries, and modification operations on the
database. In Doris, audit logs can be queried directly through built-in system
tables or by viewing Doris's audit log files.
## Enabling Audit Logs
The audit log plugin can be enabled or disabled at any time using the global
variable `enable_audit_plugin` (disabled by default), for example:
`set global enable_audit_plugin = true;`
Once enabled, Doris will write the audit logs to the `audit_log` table.
You can disable the audit log plugin at any time:
`set global enable_audit_plugin = false;`
After disabling, Doris will stop writing to the `audit_log` table. The already
written audit logs will remain unchanged.
## Viewing the Audit Log Table
Note
Before version 2.1.8, fields may be added to the audit log as the system
version is upgraded. After upgrading, you need to add those fields to the
`audit_log` table using the `ALTER TABLE` command, based on the fields defined
for the audit log table.
Starting from Doris version 2.1, Doris can write user behavior operations to
the audit_log table in the `__internal_schema` database by enabling the audit
log feature.
The audit log table is a dynamically partitioned table, partitioned daily by
default, retaining the most recent 30 days of data. You can adjust the
retention period of dynamic partitions by modifying the
`dynamic_partition.start` property using the `ALTER TABLE` statement.
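For example, a minimal sketch of adjusting the retention period and inspecting recent entries; the query's column names (`time`, `client_ip`, `user`, `stmt`) are illustrative and may vary between versions:
```sql
-- Retain roughly 60 days of audit data instead of the default 30 days.
ALTER TABLE __internal_schema.audit_log
SET ("dynamic_partition.start" = "-60");

-- Inspect the most recent audit entries (column names are illustrative).
SELECT `time`, `client_ip`, `user`, `stmt`
FROM __internal_schema.audit_log
ORDER BY `time` DESC
LIMIT 10;
```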
## Audit Log Files
In `fe.conf`, `LOG_DIR` defines the storage path for FE logs. All database
operations executed by this FE node are recorded in `${LOG_DIR}/fe.audit.log`.
To view all operations in the cluster, you need to traverse the audit logs of
each FE node.
## Audit Log Format
In versions before 3.0.7, the symbols `\n`, `\t`, and `\r` in statements would
be replaced with `\\n`, `\\t`, and `\\r`. These modified statements were then
stored in the `fe.audit.log` file and the `audit_log` table.
Starting from version 3.0.7, for the `fe.audit.log` file, only `\n` in
statements will be replaced with `\\n`. The `audit_log` table stores statements
in their original format.
## Audit Log Configuration
**Global Variables:**
Audit log variables can be modified using `SET [GLOBAL] <variable_name> = <value>`.
| Variable | Default Value | Description |
|---|---|---|
| `audit_plugin_max_batch_interval_sec` | 60 seconds | Maximum write interval for the audit log table. |
| `audit_plugin_max_batch_bytes` | 50MB | Maximum data volume per batch for the audit log table. |
| `audit_plugin_max_sql_length` | 4096 | Maximum length of SQL statements recorded in the audit log table. |
| `audit_plugin_load_timeout` | 600 seconds | Default timeout for audit log import jobs. |
| `audit_plugin_max_insert_stmt_length` | Int.MAX | The maximum length limit for `INSERT` statements. If larger than `audit_plugin_max_sql_length`, the value of `audit_plugin_max_sql_length` is used. This parameter is supported since 3.0.6. |
Because some `INSERT INTO ... VALUES` statements may be very long and submitted
frequently, the audit log can grow too large. Doris therefore added
`audit_plugin_max_insert_stmt_length` in version 3.0.6 to limit the audited
length of `INSERT` statements separately. This avoids audit log bloat while
ensuring that other SQL statements are still fully audited.
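For example, a minimal sketch of tuning these variables with `SET GLOBAL` (the values below are illustrative):
```sql
-- Allow longer SQL statements to be recorded in the audit log table (default 4096).
SET GLOBAL audit_plugin_max_sql_length = 8192;

-- Cap the audited length of INSERT statements separately (supported since 3.0.6).
SET GLOBAL audit_plugin_max_insert_stmt_length = 4096;

-- Flush audit batches to the table more frequently (default 60 seconds).
SET GLOBAL audit_plugin_max_batch_interval_sec = 30;
```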
**FE Configuration Items:**
FE configuration items can be modified by editing the `fe.conf` file.

| Configuration Item | Description |
|---|---|
| `skip_audit_user_list` | If you do not want operations of certain users to be recorded in the audit logs, you can modify this configuration (supported since version 3.0.1). For example, to exclude `user1` and `user2` from audit log recording: `skip_audit_user_list=user1,user2` |
On This Page
* Enabling Audit Logs
* Viewing the Audit Log Table
* Audit Log Files
* Audit Log Format
* Audit Log Configuration
---
# Source: https://docs.velodb.io/cloud/4.x/security/auth/authentication-and-authorization
Version: 4.x
On this page
# Authentication and Authorization
The Doris permission management system is modeled after the MySQL permission
management mechanism. It supports fine-grained permission control at the row
and column level, role-based access control, and also supports a whitelist
mechanism.
## Glossary
1. User Identity
Within a permission system, a user is identified as a User Identity. A User
Identity consists of two parts: `username` and `host`. The `username` is the
user's name, consisting of English letters (both uppercase and lowercase).
`host` represents the IP from which the user connection originates. User
Identity is represented as `username@'host'`, indicating `username` from
`host`.
Another representation of User Identity is `username@['domain']`, where
`domain` refers to a domain name that can be resolved into a set of IPs
through DNS. Eventually, this is represented as a set of `username@'host'`,
hence moving forward, we uniformly use `username@'host'` to denote it.
2. Privilege
Privileges apply to nodes, data directories, databases, or tables. Different
privileges represent different operation permissions.
3. Role
Doris allows the creation of custom-named roles. A role can be viewed as a
collection of privileges. Newly created users can be assigned a role,
automatically inheriting the privileges of that role. Subsequent changes to
the role's privileges will also reflect on the permissions of all users
associated with that role.
4. User Property
User properties are directly affiliated with a user, not the User Identity.
Meaning, both `user@'192.%'` and `user@['domain']` share the same set of user
properties, which belong to the user `user`, not to `user@'192.%'` or
`user@['domain']`.
User properties include but are not limited to: maximum number of user
connections, import cluster configurations, etc.
## Authentication and Authorization Framework
The process of a user logging into Apache Doris is divided into two parts:
**Authentication** and **Authorization**.
* Authentication: Identity verification is conducted based on the credentials provided by the user (such as username, client IP, password). Once verified, the individual user is mapped to a system-defined User Identity.
* Authorization: Based on the acquired User Identity, it checks whether the user has the necessary permissions for the intended operations, according to the privileges associated with that User Identity.
## Authentication
Doris supports built-in authentication schemes as well as LDAP authentication.
### Doris Built-in Authentication Scheme
Authentication is based on usernames, passwords, and other information stored
within Doris itself.
Administrators create users with the `CREATE USER` command and view all
created users with the `SHOW ALL GRANTS` command.
When a user logs in, the system verifies whether the username, password, and
client IP address are correct.
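For example, a minimal sketch of creating a user and listing existing users; the user name, host pattern, and password are illustrative:
```sql
-- Create a user that may only connect from the 192.168.% network.
CREATE USER 'jack'@'192.168.%' IDENTIFIED BY 'Jack_passwd_123';

-- List all created users and their privileges.
SHOW ALL GRANTS;
```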
#### Password Policy
Doris supports the following password policies to assist users in better
password management.
1. `PASSWORD_HISTORY`
Determines whether a user can reuse a historical password when resetting their
current password. For example, `PASSWORD_HISTORY 10` means the last 10
passwords cannot be reused as a new password. Setting `PASSWORD_HISTORY
DEFAULT` will use the value from the global variable `PASSWORD_HISTORY`. A
setting of 0 disables this feature. The default is 0.
Examples:
* Set a global variable: `SET GLOBAL password_history = 10`
* Set for a user: `ALTER USER user1@'ip' PASSWORD_HISTORY 10`
2. `PASSWORD_EXPIRE`
Sets the expiration time for the current user's password. For instance,
`PASSWORD_EXPIRE INTERVAL 10 DAY` means the password will expire after 10
days. `PASSWORD_EXPIRE NEVER` indicates the password never expires. Setting
`PASSWORD_EXPIRE DEFAULT` will use the value from the global variable
`default_password_lifetime` (in days). The default is NEVER (or 0), indicating
it does not expire.
Examples:
* Set a global variable: `SET GLOBAL default_password_lifetime = 1`
* Set for a user: `ALTER USER user1@'ip' PASSWORD_EXPIRE INTERVAL 10 DAY`
3. `FAILED_LOGIN_ATTEMPTS` and `PASSWORD_LOCK_TIME`
Configures the number of incorrect password attempts after which the user
account will be locked and sets the lock duration. For example,
`FAILED_LOGIN_ATTEMPTS 3 PASSWORD_LOCK_TIME 1 DAY` means if there are 3
incorrect logins, the account will be locked for one day. Administrators can
unlock the account using the `ALTER USER` statement.
Example:
* Set for a user: `ALTER USER user1@'ip' FAILED_LOGIN_ATTEMPTS 3 PASSWORD_LOCK_TIME 1 DAY`
4. Password Strength
This is controlled by the global variable `validate_password_policy`. The
default is `NONE/0`, which means no password strength checking. If set to
`STRONG/2`, the password must include at least three of the following:
uppercase letters, lowercase letters, numbers, and special characters, and
must be at least 8 characters long.
Example:
* `SET validate_password_policy=STRONG`
For more help, please refer to [ALTER USER](/cloud/4.x/sql-manual/sql-
statements/account-management/ALTER-USER).
## Authorization
### Permission Operations
* Create user: [CREATE USER](/cloud/4.x/sql-manual/sql-statements/account-management/CREATE-USER)
* Modify user: [ALTER USER](/cloud/4.x/sql-manual/sql-statements/account-management/ALTER-USER)
* Delete user: [DROP USER](/cloud/4.x/sql-manual/sql-statements/account-management/DROP-USER)
* Grant/Assign role: [GRANT](/cloud/4.x/sql-manual/sql-statements/account-management/GRANT-TO)
* Revoke/Withdraw role: [REVOKE](/cloud/4.x/sql-manual/sql-statements/account-management/REVOKE-FROM)
* Create role: [CREATE ROLE](/cloud/4.x/sql-manual/sql-statements/account-management/CREATE-ROLE)
* Delete role: [DROP ROLE](/cloud/4.x/sql-manual/sql-statements/account-management/DROP-ROLE)
* Modify role: [ALTER ROLE](/cloud/4.x/sql-manual/sql-statements/account-management/ALTER-ROLE)
* View current user's permissions and roles: [SHOW GRANTS](/cloud/4.x/sql-manual/sql-statements/account-management/SHOW-GRANTS)
* View all users' permissions and roles: [SHOW ALL GRANTS](/cloud/4.x/sql-manual/sql-statements/account-management/SHOW-GRANTS)
* View created roles: [SHOW ROLES](/cloud/4.x/sql-manual/sql-statements/account-management/SHOW-ROLES)
* Set user property: [SET PROPERTY](/cloud/4.x/sql-manual/sql-statements/account-management/SET-PROPERTY)
* View user property: [SHOW PROPERTY](/cloud/4.x/sql-manual/sql-statements/account-management/SHOW-PROPERTY)
* Change password: [SET PASSWORD](/cloud/4.x/sql-manual/sql-statements/account-management/SET-PASSWORD)
* View all supported privileges: [SHOW PRIVILEGES]
* View row policy: [SHOW ROW POLICY]
* Create row policy: [CREATE ROW POLICY]
### Types of Permissions
Doris currently supports the following permissions:
1. `Node_priv`
Node modification permission. Includes adding, deleting, and offlining FE, BE,
BROKER nodes.
Root users have this permission by default. Users who possess both
`Grant_priv` and `Node_priv` can grant this permission to other users.
This permission can only be granted at the Global level.
2. `Grant_priv`
Permission modification authority. Allows execution of operations including
granting, revoking, adding/deleting/modifying users/roles.
Before version 2.1.2, when granting permissions to other users/roles, the
current user only needed the respective level's `Grant_priv` permission. After
version 2.1.2, the current user also needs permission for the resource they
wish to grant.
When assigning roles to other users, Global level `Grant_priv` permission is
required.
3. `Select_priv`
Read-only permission for data directories, databases, and tables.
4. `Load_priv`
Write permission for data directories, databases, and tables. Includes Load,
Insert, Delete, etc.
5. `Alter_priv`
Alteration permissions for data directories, databases, and tables. Includes
renaming libraries/tables, adding/deleting/modifying columns, adding/deleting
partitions, etc.
6. `Create_priv`
Permission to create data directories, databases, tables, and views.
7. `Drop_priv`
Permission to delete data directories, databases, tables, and views.
8. `Usage_priv`
Usage permissions for Resources and Workload Groups.
9. `Show_view_priv`
Permission to execute `SHOW CREATE VIEW`.
### Permission Levels
#### Global Permissions
Permissions granted through the GRANT statement with `*.*.*` scope. These
permissions apply to any table within any catalog.
#### Catalog Permissions
Permissions granted through the GRANT statement with `ctl.*.*` scope. These
permissions apply to any table within the specified catalog.
#### Database Permissions
Permissions granted through the GRANT statement with `ctl.db.*` scope. These
permissions apply to any table within the specified database.
#### Table Permissions
Permissions granted through the GRANT statement with `ctl.db.tbl` scope. These
permissions apply to any column within the specified table.
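For example, a minimal sketch of granting a privilege at each of the scopes above, assuming a catalog `ctl`, database `db`, table `tbl`, and an existing user `user1@'%'`:
```sql
GRANT SELECT_PRIV ON *.*.*      TO user1@'%';  -- global: any table in any catalog
GRANT SELECT_PRIV ON ctl.*.*    TO user1@'%';  -- catalog level
GRANT SELECT_PRIV ON ctl.db.*   TO user1@'%';  -- database level
GRANT SELECT_PRIV ON ctl.db.tbl TO user1@'%';  -- table level
```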
#### Column Permissions
Column permissions are primarily used to restrict user access to certain
columns within a table. Specifically, column permissions allow administrators
to set viewing, editing, and other rights for certain columns, controlling
user access and manipulation of specific column data.
Permissions for specific columns of a table can be granted with `GRANT
Select_priv(col1,col2) ON ctl.db.tbl TO user1`.
Currently, column permissions support only `Select_priv`.
#### Row-Level Permissions
Row Policies enable administrators to define access policies based on fields
within the data, controlling which users can access which rows.
Specifically, Row Policies allow administrators to create rules that can
filter or restrict user access to rows based on actual values stored in the
data.
From version 1.2, row-level permissions can be created with the `CREATE ROW
POLICY` command.
From version 2.1.2, support for setting row-level permissions through Apache
Ranger's `Row Level Filter` is available.
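For example, a minimal sketch of a row policy; the policy name, table, user, and filter column are illustrative:
```sql
-- Restrict user1 to rows of db.tbl where region = 'east'.
CREATE ROW POLICY east_only
ON db.tbl
AS RESTRICTIVE
TO user1
USING (region = 'east');
```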
#### Usage Permissions
* Resource Permissions
Resource permissions are set specifically for Resources, unrelated to
permissions for databases or tables, and can only assign `Usage_priv` and
`Grant_priv`.
Permissions for all Resources can be granted with the `GRANT USAGE_PRIV ON
RESOURCE '%' TO user1`.
* Workload Group Permissions
Workload Group permissions are set specifically for Workload Groups, unrelated
to permissions for databases or tables, and can only assign `Usage_priv` and
`Grant_priv`.
Permissions for all Workload Groups can be granted with `GRANT USAGE_PRIV ON
WORKLOAD GROUP '%' TO user1`.
### Data Masking
Data masking is a method to protect sensitive data by modifying, replacing, or
hiding the original data, such that the masked data retains certain formats
and characteristics while no longer containing sensitive information.
For example, administrators may choose to replace part or all of the digits of
sensitive fields like credit card numbers or ID numbers with asterisks `*` or
other characters, or replace real names with pseudonyms.
From version 2.1.2, support for setting data masking policies for certain
columns through Apache Ranger's Data Masking is available, currently only
configurable via [Apache
Ranger](/cloud/4.x/security/auth/authorization/ranger).
### Doris Built-in Authorization Scheme
Doris's permission design is based on the RBAC (Role-Based Access Control)
model, where users are associated with roles, and roles are associated with
permissions. Users are indirectly linked to permissions through their roles.
When a role is deleted, users automatically lose all permissions associated
with that role.
When a user is disassociated from a role, they automatically lose all
permissions of that role.
When permissions are added to or removed from a role, the permissions of the
users associated with that role change accordingly.
┌────────┐ ┌────────┐ ┌────────┐
│ user1 ├────┬───► role1 ├────┬────► priv1 │
└────────┘ │ └────────┘ │ └────────┘
│ │
│ │
│ ┌────────┐ │
│ │ role2 ├────┤
┌────────┐ │ └────────┘ │ ┌────────┐
│ user2 ├────┘ │ ┌─► priv2 │
└────────┘ │ │ └────────┘
┌────────┐ │ │
┌──────► role3 ├────┘ │
│ └────────┘ │
│ │
│ │
┌────────┐ │ ┌────────┐ │ ┌────────┐
│ userN ├─┴──────► roleN ├───────┴─► privN │
└────────┘ └────────┘ └────────┘
As shown above:
User1 and user2 both have permission `priv1` through `role1`.
UserN has permission `priv1` through `role3`, and permissions `priv2` and
`privN` through `roleN`. Thus, userN has permissions `priv1`, `priv2`, and
`privN` simultaneously.
For ease of user operations, it is possible to directly grant permissions to a
user. Internally, a unique default role is created for each user. When
permissions are granted to a user, it is essentially granting permissions to
the user's default role.
The default role cannot be deleted, nor can it be assigned to someone else.
When a user is deleted, their default role is automatically deleted as well.
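For example, a minimal sketch of the role-based flow described above; the role, scope, and user names are illustrative:
```sql
-- Create a role and attach privileges to it.
CREATE ROLE analyst;
GRANT SELECT_PRIV ON ctl.db.* TO ROLE 'analyst';

-- A new user can inherit the role's privileges at creation time.
CREATE USER 'user1'@'%' IDENTIFIED BY 'User1_passwd' DEFAULT ROLE 'analyst';

-- An existing user can also be granted the role directly.
GRANT 'analyst' TO 'user2'@'%';
```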
### Authorization Scheme Based on Apache Ranger
Please refer to [Authorization Scheme Based on Apache
Ranger](/cloud/4.x/security/auth/authorization/ranger).
## Common Questions
### Explanation of Permissions
1. Users with ADMIN privileges or GRANT privileges at the GLOBAL level can perform the following operations:
* CREATE USER
* DROP USER
* ALTER USER
* SHOW GRANTS
* CREATE ROLE
* DROP ROLE
* ALTER ROLE
* SHOW ROLES
* SHOW PROPERTY FOR USER
2. GRANT/REVOKE
* Users with ADMIN privileges can grant or revoke permissions for any user.
* Users with ADMIN or GLOBAL level GRANT privileges can assign roles to users.
* Users who have the corresponding level of GRANT privilege and the permissions to be assigned can distribute those permissions to users/roles.
3. SET PASSWORD
* Users with ADMIN privileges or GLOBAL level GRANT privileges can set passwords for non-root users.
* Ordinary users can set the password for their corresponding User Identity. Their corresponding User Identity can be viewed with the `SELECT CURRENT_USER()` command.
* ROOT users can change their own password.
### Additional Information
1. When Doris is initialized, the following users and roles are automatically created:
* operator role: This role has `Node_priv` and `Admin_priv`, i.e., all permissions in Doris.
* admin role: This role has `Admin_priv`, i.e., all permissions except for node changes.
* root@'%': root user, allowed to log in from any node, with the operator role.
* admin@'%': admin user, allowed to log in from any node, with the admin role.
2. Deleting the default created users and roles, or altering their permissions, is not supported.
* Deleting the users root@'%' and admin@'%' is not supported, but creating and deleting root@'xxx' and admin@'xxx' users (where xxx refers to any host except %) is allowed (Doris treats these users as regular users).
* Revoking the default roles of root@'%' and admin@'%' is not supported.
* Deleting the roles operator and admin is not supported.
* Modifying the permissions of the roles operator and admin is not supported.
3. There is only one user with the operator role, which is Root. There can be multiple users with the admin role.
4. Some potentially conflicting operations are explained as follows:
1. Domain and IP conflict:
Suppose the following user is created:
`CREATE USER user1@['domain'];`
And granted:
`GRANT SELECT_PRIV ON *.* TO user1@['domain']`
This domain is resolved to two IPs: ip1 and ip2.
Suppose later, we grant a separate permission to `user1@'ip1'`:
`GRANT ALTER_PRIV ON *.* TO user1@'ip1';`
Then `user1@'ip1'` will have permissions for both Select_priv and Alter_priv.
And when we change the permissions for `user1@['domain']` again, `user1@'ip1'`
will not follow the change.
2. Duplicate IP conflict:
Suppose the following users are created:
`CREATE USER user1@'%' IDENTIFIED BY "12345";`
`CREATE USER user1@'192.%' IDENTIFIED BY "abcde";`
In terms of priority, `'192.%'` takes precedence over `'%'`, so when user
`user1` from machine `192.168.1.1` tries to log into Doris using password
`'12345'`, access will be denied.
5. Forgotten Password
If you forget the password and cannot log into Doris, you can add
`skip_localhost_auth_check=true` to the FE's config file and restart the FE,
thus logging into Doris as root without a password from the local machine.
After logging in, you can reset the password using the `SET PASSWORD` command.
6. No user can reset the root user's password except for the root user themselves.
7. `Admin_priv` permissions can only be granted or revoked at the GLOBAL level.
8. `current_user()` and `user()`
Users can view their `current_user` and `user` by executing `SELECT
current_user()` and `SELECT user()` respectively. Here, `current_user`
indicates the identity the user authenticated with, while `user` is the actual
User Identity at the moment.
For example:
Suppose `user1@'192.%'` is created, and then user `user1` logs in from
`192.168.10.1`, then the `current_user` would be `user1@'192.%'`, and `user`
would be `user1@'192.168.10.1'`.
All permissions are granted to a specific `current_user`, and the real user
has all the permissions of the corresponding `current_user`.
## Best Practices
Here are some examples of use cases for the Doris permission system.
1. Scenario 1
Users of the Doris cluster are divided into administrators (Admin),
development engineers (RD), and users (Client). Administrators have all
permissions over the entire cluster, primarily responsible for cluster setup
and node management. Development engineers are responsible for business
modeling, including creating databases and tables, importing, and modifying
data. Users access different databases and tables to retrieve data.
In this scenario, administrators can be granted ADMIN or GRANT privileges. RDs
can be granted CREATE, DROP, ALTER, LOAD, and SELECT permissions for any or
specific databases and tables. Clients can be granted SELECT permissions for
any or specific databases and tables. Additionally, different roles can be
created to simplify the authorization process for multiple users.
2. Scenario 2
A cluster may contain multiple businesses, each potentially using one or more
datasets. Each business needs to manage its users. In this scenario, an
administrative user can create a user with DATABASE-level GRANT privileges for
each database. This user can only authorize users for the specified database.
3. Blacklist
Doris itself does not support a blacklist, only a whitelist, but a blacklist
can be simulated. Suppose a user named `user1@'192.%'` is created, allowing
users from `192.*` to log in. To prohibit the user coming from `192.168.10.1`
from logging in, create another user `user1@'192.168.10.1'` with a new
password. Since `192.168.10.1` has a higher priority than `192.%`, the user
from `192.168.10.1` will no longer be able to log in with the old password.
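The following is a minimal sketch of the role-based setup described in Scenario 1 and the database-level administrator described in Scenario 2. All role names, user names, passwords, and the database name `example_db` are illustrative, not prescribed values:
-- Scenario 1: a role for development engineers and a read-only role for data consumers
CREATE ROLE rd_role;
GRANT CREATE_PRIV, DROP_PRIV, ALTER_PRIV, LOAD_PRIV, SELECT_PRIV ON example_db.* TO ROLE 'rd_role';
CREATE ROLE client_role;
GRANT SELECT_PRIV ON example_db.* TO ROLE 'client_role';
-- Attach the roles when creating users
CREATE USER 'rd_user'@'%' IDENTIFIED BY 'rd_password' DEFAULT ROLE 'rd_role';
CREATE USER 'client_user'@'%' IDENTIFIED BY 'client_password' DEFAULT ROLE 'client_role';
-- Scenario 2: a per-database administrator who can only authorize users within example_db
CREATE USER 'db_admin'@'%' IDENTIFIED BY 'admin_password';
GRANT GRANT_PRIV, SELECT_PRIV ON example_db.* TO 'db_admin'@'%';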
---
# Source: https://docs.velodb.io/cloud/4.x/security/encryption/encryption-function
Version: 4.x
# Encryption and Masking Function
Doris provides the following built-in encryption and masking functions. For
detailed usage, please refer to the SQL manual.
* [AES_ENCRYPT](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/aes-encrypt)
* [AES_DECRYPT](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/aes-decrypt)
* [SM4_ENCRYPT](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/sm4-encrypt)
* [SM4_DECRYPT](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/sm4-decrypt)
* [MD5](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/md5)
* [MD5SUM](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/md5sum)
* [SM3](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/sm3)
* [SM3SUM](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/sm3sum)
* [SHA](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/sha)
* [SHA2](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/sha2)
* [DIGITAL_MASKING](/cloud/4.x/sql-manual/sql-functions/scalar-functions/string-functions/digital-masking)
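A minimal usage sketch of a few of these functions is shown below; the literal values and the key are illustrative only, and defaults such as the AES block encryption mode follow the SQL manual:
-- Round-trip encryption and decryption with an example key
SELECT AES_DECRYPT(AES_ENCRYPT('sensitive text', 'example_key'), 'example_key');
-- One-way digests
SELECT MD5('sensitive text'), SM3('sensitive text'), SHA2('sensitive text', 256);
-- Mask the middle digits of a number, such as a phone number
SELECT DIGITAL_MASKING(13812345678);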
---
# Source: https://docs.velodb.io/cloud/4.x/security/integrations/aws-authentication-and-authorization
Version: 4.x
# AWS authentication and authorization
Doris supports accessing AWS service resources through two authentication
methods: `IAM User` and `Assumed Role`. This article explains how to
configure security credentials for both methods and use Doris features to
interact with AWS services.
# Authentication Methods Overview
## IAM User Authentication
Doris enables access to external data sources by configuring `AWS IAM User`
credentials (i.e., `access_key` and `secret_key`). The detailed configuration
steps are below (for more information, refer to the AWS documentation on [IAM
users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users.html)):
### Step1 Create an IAM User and configure policies
1. Log in to the `AWS Console` and create an `IAM User`

2. Enter the IAM User name and attach policies directly

3. Define AWS resource policies in the policy editor; below are read/write policy templates for accessing an S3 bucket

S3 read policy template (applies to Doris features requiring read/list
access, e.g., S3 Load, TVF, External Catalog)
**Notes:**
1. **Replace `your-bucket` and `your-prefix` with actual values.**
2. **Avoid adding extra `/` separators.**
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::your-bucket/your-prefix/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::your-bucket"
        }
    ]
}
S3 write policy template (applies to Doris features requiring read/write
access, e.g., Export, Storage Vault, Repository)
**Notes:**
1. **Replace `your-bucket` and `your-prefix` with actual values.**
2. **Avoid adding extra `/` separators.**
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:DeleteObject",
                "s3:DeleteObjectVersion",
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": "arn:aws:s3:::your-bucket/your-prefix/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:GetBucketVersioning",
                "s3:GetLifecycleConfiguration"
            ],
            "Resource": "arn:aws:s3:::your-bucket"
        }
    ]
}
4. After successfully creating the IAM User, create an access/secret key pair

### Step2 Use Doris features with the access/secret key pair via SQL
After completing all configurations in Step 1, you will obtain `access_key`
and `secret_key`. Use these credentials to access Doris features as shown in
the following examples:
#### S3 Load
LOAD LABEL s3_load_2022_04_01
(
DATA INFILE("s3://your_bucket_name/s3load_example.csv")
INTO TABLE test_s3load
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(user_id, name, age)
)
WITH S3
(
"provider" = "S3",
"s3.endpoint" = "s3.us-east-1.amazonaws.com",
"s3.region" = "us-east-1",
"s3.access_key" = "",
"s3.secret_key" = ""
)
PROPERTIES
(
"timeout" = "3600"
);
#### TVF
SELECT * FROM S3 (
'uri' = 's3://your_bucket/path/to/tvf_test/test.parquet',
'format' = 'parquet',
's3.endpoint' = 's3.us-east-1.amazonaws.com',
's3.region' = 'us-east-1',
"s3.access_key" = "",
"s3.secret_key"=""
)
#### External Catalog
CREATE CATALOG iceberg_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 's3://your_bucket/dir/key',
's3.endpoint' = 's3.us-east-1.amazonaws.com',
's3.region' = 'us-east-1',
"s3.access_key" = "",
"s3.secret_key"=""
);
#### Storage Vault
CREATE STORAGE VAULT IF NOT EXISTS s3_demo_vault
PROPERTIES (
"type" = "S3",
"s3.endpoint" = "s3.us-east-1.amazonaws.com",
"s3.region" = "us-east-1",
"s3.bucket" = "",
"s3.access_key" = "",
"s3.secret_key"="",
"s3.root.path" = "s3_demo_vault_prefix",
"provider" = "S3",
"use_path_style" = "false"
);
#### Export
EXPORT TABLE s3_test TO "s3://your_bucket/a/b/c"
PROPERTIES (
"column_separator"="\\x07",
"line_delimiter" = "\\x07"
) WITH S3 (
"s3.endpoint" = "s3.us-east-1.amazonaws.com",
"s3.region" = "us-east-1",
"s3.access_key" = "",
"s3.secret_key"="",
)
#### Repository
CREATE REPOSITORY `s3_repo`
WITH S3
ON LOCATION "s3://your_bucket/s3_repo"
PROPERTIES
(
"s3.endpoint" = "s3.us-east-1.amazonaws.com",
"s3.region" = "us-east-1",
"s3.access_key" = "",
"s3.secret_key"=""
);
#### Resource
CREATE RESOURCE "remote_s3"
PROPERTIES
(
"s3.endpoint" = "s3.us-east-1.amazonaws.com",
"s3.region" = "us-east-1",
"s3.bucket" = "",
"s3.access_key" = "",
"s3.secret_key"=""
);
You can specify different IAM User credentials (`access_key` and `secret_key`)
across different business logic to implement access control for external data.
## Assumed Role Authentication
Assumed Role allows accessing external data sources by assuming an AWS IAM
Role(for details, refer to AWS documentation [assume
role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_manage-
assume.html)), the following diagram illustrates the configuration workflow:

Terminology:
`Source Account`: The AWS account initiating the Assume Role action (where
Doris FE/BE EC2 instances reside);
`Target Account`: The AWS account owning the target S3 bucket;
`ec2_role`: A role created in the source account, attached to EC2 instances
running Doris FE/BE;
`bucket_role`: A role created in the target account with permissions to access
the target bucket;
**Notes:**
1. **The source and target accounts can be the same AWS account;**
2. **Ensure that all EC2 instances on which Doris FE/BE is deployed have `ec2_role` attached, especially during scaling operations.**
More detailed configuration steps are as follows:
### Step1 Prerequisites
1. Ensure the source account has created an `ec2_role` and attached it to all `EC2 instances` running Doris FE/BE;
2. Ensure the target account has created a `bucket_role` and corresponding bucket;
After attaching `ec2_role` to `EC2 instances`, you can find the `role_arn` as
shown below:

### Step2 Configure Permissions for Source Account IAM Role (EC2 Instance Role)
1. Log in to the [AWS IAM Console](https://us-east-1.console.aws.amazon.com/iamv2/home#/home), navigate to `Access management` > `Roles`;
2. Find the EC2 instance role and click its name;
3. On the role details page, go to the `Permissions` tab, click `Add permissions`, then select `Create inline policy`;
4. In the `Specify permissions` section, switch to the `JSON` tab, paste the following policy, and click `Review policy`:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["sts:AssumeRole"],
            "Resource": "*"
        }
    ]
}
### Step3 Configure Trust Policy and Permissions for Target Account IAM Role
1. Log in to the [AWS IAM Console](https://us-east-1.console.aws.amazon.com/iamv2/home#/home), navigate to `Access management` > `Roles`, find the target role (`bucket_role`), and click its name;
2. Go to the `Trust relationships` tab, click `Edit trust policy`, and paste the following JSON (replace the `AWS` principal value with your EC2 instance role ARN). Click `Update policy`.

**Note: The `ExternalId` in the `Condition` section is an optional string
parameter used to distinguish scenarios where multiple source users need to
assume the same role. If configured, include it in the corresponding Doris SQL
statements. For a detailed explanation of ExternalId, refer to the [AWS
doc](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_common-scenarios_third-party.html).**
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": ""
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "1001"
                }
            }
        }
    ]
}
3. On the role details page, go to the `Permissions` tab, click `Add permissions`, then select `Create inline policy`. In the `JSON` tab, paste one of the following policies based on your requirements;

S3 read policy template (applies to Doris features requiring read/list
access, e.g., S3 Load, TVF, External Catalog)
**Notes:**
1. **Replace `your-bucket` and `your-prefix` with actual values.**
2. **Avoid adding extra `/` separators.**
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::your-bucket/your-prefix/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::your-bucket"
        }
    ]
}
S3 write policy template (applies to Doris features requiring read/write
access, e.g., Export, Storage Vault, Repository)
**Notes:**
1. **Replace `your-bucket` and `your-prefix` with actual values.**
2. **Avoid adding extra `/` separators.**
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:DeleteObject",
                "s3:DeleteObjectVersion",
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": "arn:aws:s3:::your-bucket/your-prefix/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::your-bucket"
        }
    ]
}
### Step4 Use Doris features with Assumed Role via SQL, using the `role_arn` and `external_id` fields
After completing the above configurations, obtain the target account's
`role_arn` and `external_id` (if applicable). Use these parameters in Doris
SQL statements as shown below:
Common key parameters:
"s3.role_arn" = "",
"s3.external_id" = "" -- optional parameter
#### S3 Load
LOAD LABEL s3_load_2022_04_01
(
DATA INFILE("s3://your_bucket_name/s3load_example.csv")
INTO TABLE test_s3load
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(user_id, name, age)
)
WITH S3
(
"provider" = "S3",
"s3.endpoint" = "s3.us-east-1.amazonaws.com",
"s3.region" = "us-east-1",
"s3.role_arn" = "",
"s3.external_id" = "" -- option parameter
)
PROPERTIES
(
"timeout" = "3600"
);
#### TVF
SELECT * FROM S3 (
"uri" = "s3://your_bucket/path/to/tvf_test/test.parquet",
"format" = "parquet",
"s3.endpoint" = "s3.us-east-1.amazonaws.com",
"s3.region" = "us-east-1",
"s3.role_arn" = "",
"s3.external_id" = "" -- option parameter
)
#### External Catalog
CREATE CATALOG iceberg_catalog PROPERTIES (
"type" = "iceberg",
"iceberg.catalog.type" = "hadoop",
"warehouse" = "s3://your_bucket/dir/key",
"s3.endpoint" = "s3.us-east-1.amazonaws.com",
"s3.region" = "us-east-1",
"s3.role_arn" = "",
"s3.external_id" = "" -- option parameter
);
#### Storage Vault
CREATE STORAGE VAULT IF NOT EXISTS s3_demo_vault
PROPERTIES (
"type" = "S3",
"s3.endpoint" = "s3.us-east-1.amazonaws.com",
"s3.region" = "us-east-1",
"s3.bucket" = "",
"s3.role_arn" = "",
"s3.external_id" = "", -- option parameter
"s3.root.path" = "s3_demo_vault_prefix",
"provider" = "S3",
"use_path_style" = "false"
);
#### Export
EXPORT TABLE s3_test TO "s3://your_bucket/a/b/c"
PROPERTIES (
"column_separator"="\\x07",
"line_delimiter" = "\\x07"
) WITH S3 (
"s3.endpoint" = "s3.us-east-1.amazonaws.com",
"s3.region" = "us-east-1",
"s3.role_arn" = "",
"s3.external_id" = ""
)
#### Repository
CREATE REPOSITORY `s3_repo`
WITH S3
ON LOCATION "s3://your_bucket/s3_repo"
PROPERTIES
(
"s3.endpoint" = "s3.us-east-1.amazonaws.com",
"s3.region" = "us-east-1",
"s3.role_arn" = "",
"s3.external_id" = ""
);
#### Resource
CREATE RESOURCE "remote_s3"
PROPERTIES
(
"s3.endpoint" = "s3.us-east-1.amazonaws.com",
"s3.region" = "us-east-1",
"s3.bucket" = "",
"s3.role_arn" = "",
"s3.external_id" = ""
);
---
# Source: https://docs.velodb.io/cloud/4.x/security/privacy-compliance/security-features
Version: 4.x
# Security Features
VeloDB Cloud provides a complete security mechanism to ensure the security of
customer data and services, such as isolation, authentication, authorization,
encryption, auditing, disaster recovery, etc.
## Product Architecture
The VeloDB Cloud cloud-native data warehouse is built around three key
concepts: organization, warehouse, and cluster. As the cornerstone of the
product design, they provide independent, isolated, elastic, and scalable
services that help enterprises quickly and safely build a foundation for big
data analytics.

* **Organization** : An organization represents an enterprise or a relatively independent group. After registering VeloDB Cloud, users use the service as an organization. Organization is the billing settlement object in VeloDB Cloud. The billing, resources and data between different organizations are isolated from each other.
* **Warehouse** : A warehouse is a logical concept, which includes computing and storage resources. Each organization can create multiple warehouses to meet the data analysis needs of different businesses, such as orders, advertising, logistics and other businesses. Similarly, the resources and data between different warehouses are also isolated from each other, which can be used to meet the security needs within the organization.
* **Cluster** : A cluster is a computing resource in a warehouse, which contains one or more computing nodes and can be elastically expanded and reduced. A warehouse can contain multiple clusters, which share the underlying data. Different clusters can meet different workloads, such as statistical reports, interactive analysis, etc., and the workloads between multiple clusters do not interfere with each other.
From a technical perspective, the core technical architecture of VeloDB Cloud
is divided into three layers:

**Service Layer**
* Manager: Responsible for the management of computing and storage resources. When a user creates a warehouse, the Manager is responsible for creating a storage bucket; when a user creates a cluster, the Manager is responsible for creating computing resources.
* Metadata: Stores metadata such as organizations, users, warehouses, clusters, and database tables.
* Security: Responsible for security policy settings, using the principle of least privilege.
**Compute Layer**
* Data warehouse: It is a logical concept, including physical objects such as warehouse metadata, clusters, and data storage.
* Cluster: The cluster only contains computing resources and cached data. Multiple clusters of the same warehouse share data storage.
**Storage Layer**
* Object storage: The data in the warehouse is stored in the object storage on the cloud service in the form of files.
## Security level
VeloDB Cloud provides complete and full-link data security features from the
dimensions of resource isolation, authentication, data transmission and
storage:
* Resource isolation: Storage and computing between organizations are isolated from each other.
* Identity authentication: Prove the identity of the visitor (user or application).
* Access control: Set user access rights to data to ensure that users can control data permissions in a fine-grained manner.
* Data protection: Storage and transmission encryption ensure that data will not be leaked through physical disks and network monitoring, and support data disaster recovery protection.
* Network security: Public network whitelist, private network links, inter-organization security groups, and optional independent VPC ensure the security of network connections.
* Security audit: Transparent and complete audit of operations in the console and warehouse.
* Application security: VeloDB cloud service has the ability to prevent attacks.
### Resource isolation
**SaaS deployment**
VeloDB Cloud ensures complete isolation of data between different
organizations through storage and computing isolation:
Data storage:
1. Each organization uses a separate object storage bucket in each cloud service region, and the bucket is set to private access and uses STS authentication.
2. Each warehouse is assigned a cloud service subaccount, and the storage permission of each warehouse is only granted to this subaccount.
3. Cache data is only stored locally in the cluster, and different warehouses cannot access each other's caches.
Computing resources:
1. Clusters are not used across warehouses; a cluster only belongs to one warehouse.
2. Each organization's clusters set strict firewall rules through security groups to ensure that clusters of different organizations cannot connect to each other.
**BYOC deployment**
In the VeloDB Cloud BYOC deployment form, data storage and computing resources
are completely retained in your own VPC, and data does not leave your VPC,
ensuring the security and compliance of data and computing.
Data storage:
Data is stored entirely in your own VPC and never leaves it.
Computing resources:
1. Computing resources are entirely within your own cloud resource pool, providing data warehouse services.
2. A warehouse can contain multiple clusters, which share underlying data. Different clusters can serve different workloads, such as statistical reports and interactive analysis, and the workloads of multiple clusters do not interfere with each other.
### Identity Authentication
Any access to the VeloDB Cloud control plane or data plane requires identity
authentication, which is mainly used to confirm the identity of the visitor.
VeloDB Cloud ensures the reliability of authentication through the following
mechanisms:
* Control plane
* Support multi-factor authentication (MFA), and improve security protection capabilities through combined authentication methods such as email password and mobile phone verification code.
* Data plane
* Connect using the MySQL authentication protocol.
* HTTP protocol data interaction requires identity authentication, and the authentication method is consistent with the MySQL protocol.
* Support IP blacklist and whitelist mechanism for identity authentication.
* Password policy
* Prevent setting weak passwords, and use strong passwords.
* Prevent brute force password cracking.
* User passwords are encrypted and stored.
### Access Control
VeloDB has three levels of access control entities: organization, user, and
user in warehouse. An organization is a billing unit, and the same organization
shares the bill. A user is used for management operations, such as creating and
deleting warehouses and clusters. A user in a warehouse is used for data
operations and can operate on database tables, similar to a MySQL user.
**RBAC permission control**
Multiple warehouses can be created under an organization, and the data of each
warehouse is isolated. Organization administrators can assign different roles
to users in the organization and, through roles, control users' permissions to
create/delete/edit/view/query/monitor warehouses. For details, please refer to
VeloDB Cloud User Management. Users in a warehouse follow a permission
management mechanism modeled on MySQL, which provides fine-grained permission
control at the table level and role-based access control, and supports a
whitelist mechanism. For details, please refer to VeloDB Cloud Permission
Management.
**Row-level security**
Administrators can perform fine-grained permission control on qualifying rows,
for example allowing only a certain user to access rows that match a
condition. This is useful when multiple users have different permissions for
different rows of the same table.
Syntax (see the row policy documentation for details):
CREATE ROW POLICY {NAME} ON {TABLE}
AS {RESTRICTIVE|PERMISSIVE} TO {USER} USING {PREDICATE};
Example: Create a policy named test_row_policy_1, which prohibits user1 from
accessing rows in table1 where the col1 column value is equal to 1 or 2.
CREATE ROW POLICY test_row_policy_1 ON db1.table1
AS RESTRICTIVE TO user1 USING (col1 in (1, 2));
Create a policy named test_row_policy_1, which allows user1 to access rows in
table1 where the col1 column value is equal to 1 or 2.
CREATE ROW POLICY test_row_policy_1 ON db1.table1
AS PERMISSIVE TO user1 USING (col1 in (1, 2));
**Column-level security**
Administrators can implement column-level permission control through views.
For example, if a user should not have access to a column, a view that does
not contain this column can be created for that user.
Syntax (the following only shows the basic syntax; please refer to the
detailed CREATE VIEW syntax):
CREATE VIEW {name} {view_column_list}
AS
SELECT {table_column_list} FROM {src_table}
Example: Authorize user user1 to read columns id and name of table t1:
create view view2 (id, name) as select id, name from t1;
grant SELECT_PRIV on view2 to user1;
**Data masking**
VeloDB provides a convenient mask function that can mask numbers and strings.
Users can use the mask function to create a view, and then manage the view
permissions through the access control of users in the warehouse, thereby
implementing data masking for users.
Syntax Description
VARCHAR mask(VARCHAR str[, VARCHAR upper[, VARCHAR lower[, VARCHAR number]]])
Example: Returns a masked version of str. By default, uppercase letters are
converted to "X", lowercase letters are converted to "x", and numbers are
converted to "n". For example, mask("abcd-EFGH-8765-4321") results in
xxxx-XXXX-nnnn-nnnn. You can override the characters used in the mask by
providing additional parameters: the second parameter controls the mask
character for uppercase letters, the third parameter controls lowercase
letters, and the fourth parameter controls numbers. For example,
mask("abcd-EFGH-8765-4321", "U", "l", "#") results in llll-UUUU-####-####.
// table test
+-----------+
| name |
+-----------+
| abc123EFG |
| NULL |
| 456AbCdEf |
+-----------+
mysql> select mask(name) from test;
+--------------+
| mask(`name`) |
+--------------+
| xxxnnnXXX |
| NULL |
| nnnXxXxXx |
+--------------+
mysql> select mask(name, '*', '#', '$') from test;
+-----------------------------+
| mask(`name`, '*', '#', '$') |
+-----------------------------+
| ###$$$*** |
| NULL |
| $$$*#*#*# |
+-----------------------------+
### Data Protection
**Storage Encryption**
* Use storage encryption of cloud service object storage to ensure that valid data cannot be directly obtained from object storage or physical disk.
* Use cloud service disk encryption to ensure that valid data in cache cannot be directly obtained from disk.
* Use the encryption function provided by VeloDB to ensure that valid data cannot be directly obtained from object storage, physical disk, and cache disk.
* VeloDB key rotation protection: Each customer uses an independent key, rotates the key periodically, and accesses objects through a secure temporary authorization mechanism (STS or pre-signature mechanism) to avoid the risk of key leakage.
* Use RSA encryption algorithm to encrypt data
**Transmission Encryption**
* MySQL and JDBC protocol access supports TLS encrypted transmission, including mutual (two-way) TLS verification.
* HTTPS secure transmission for data interaction.
**Disaster Recovery Protection**
* Data and metadata storage adopts a multi-availability zone storage architecture to ensure that data can be disaster-tolerant across availability zones.
* Versioning is enabled by default in object storage to ensure multi-version redundancy of objects at the application level.
* Routine metadata backup to provide disaster recovery capabilities.
* Routine metadata and data checks to ensure data correctness and reliability.
* Support Warehouse-level TimeTravel (to be released soon).
* Cross-region replication CCR.
### Network security
Under the principle of least privilege, VeloDB strictly restricts the network
security rules of VPC, including:
* External network access must go through the gateway.
* Operation and maintenance must go through VPN.
* Organizational isolation.
The VeloDB warehouse provides two network connection methods: public network
and private network connection:
* Public network: Only IPs in the whitelist can access, which can effectively avoid excessive public network permissions.
* Private network connection: Users can access VeloDB through private network connection in VPC. Private network connection can ensure that only one-way connection and only the set VPC can be connected, which effectively limits the access source.
### Security Audit
There is a complete audit mechanism for the control operations of the console
and the access operations of the warehouse kernel. Customers can obtain the
corresponding audit information through the cloud product console.
### Application Security
VeloDB uses security products such as cloud firewall, Web Application Firewall
(WAF), and database audit to ensure the security of cloud service
applications.
---
# Source: https://docs.velodb.io/cloud/4.x/security/security-overview
Version: 4.x
# Security Overview
Doris provides the following mechanisms to manage data security:
**Authentication:** Doris supports both username/password and LDAP
authentication methods.
* **Built-in Authentication:** Doris includes a built-in username/password authentication method, allowing customization of password policies.
* **LDAP Authentication:** Doris can centrally manage user credentials through LDAP services, simplifying access control and enhancing system security.
**Permission Management:** Doris supports role-based access control (RBAC) and
can integrate with Ranger to achieve centralized permission management.
* **Role-Based Access Control (RBAC):** Doris can restrict users' access to and operations on database resources based on their roles and permissions.
* **Ranger Permission Management:** By integrating with Ranger, Doris enables centralized permission management, allowing administrators to set fine-grained access control policies for different users and groups.
**Audit and Logging:** Doris can enable audit logs to record all user actions,
including logins, queries, data modifications, and more, facilitating post-
audit and issue tracking.
**Data Encryption and Masking:** Doris supports encryption and masking of data
within tables to prevent unauthorized access and data leakage.
**Data Transmission Encryption:** Doris supports SSL encryption protocols to
ensure secure data transmission between clients and Doris servers, preventing
data from being intercepted or tampered with during transfer.
**Fine-Grained Access Control:** Doris allows configuring data row and column
access permissions based on rules to control user access at a granular level.
**JAVA-UDF Security:** Doris supports user-defined functions (UDFs), so root
administrators need to review the implementation of user UDFs to ensure that
the operations in their logic are safe and to prevent high-risk actions in
UDFs, such as data deletion and system disruption.
**Third-Party Packages:** When using Doris features like JDBC Catalog or UDFs,
administrators must ensure that any third-party packages are from trusted and
secure sources. To reduce security risks, it is recommended to use
dependencies only from official or reputable community sources.
---
# Source: https://docs.velodb.io/cloud/4.x/sql-manual/basic-element/sql-data-types/data-type-overview
Version: 4.x
# Overview
## Numeric Types
Doris supports the following numeric data types:
### BOOLEAN
There are two possible values: 0 represents false, and 1 represents true.
For more information, please refer to [BOOLEAN](/cloud/4.x/sql-manual/basic-element/sql-data-types/numeric/BOOLEAN).
### Integer
All are signed integers. The differences among the INT types are the number of
bytes occupied and the range of values they can represent:
* **[TINYINT](/cloud/4.x/sql-manual/basic-element/sql-data-types/numeric/TINYINT)** : 1 byte, [-128, 127]
* **[SMALLINT](/cloud/4.x/sql-manual/basic-element/sql-data-types/numeric/SMALLINT)** : 2 bytes, [-32768, 32767]
* **[INT](/cloud/4.x/sql-manual/basic-element/sql-data-types/numeric/INT)** : 4 bytes, [-2147483648, 2147483647]
* **[BIGINT](/cloud/4.x/sql-manual/basic-element/sql-data-types/numeric/BIGINT)** : 8 bytes, [-9223372036854775808, 9223372036854775807]
* **[LARGEINT](/cloud/4.x/sql-manual/basic-element/sql-data-types/numeric/LARGEINT)** : 16 bytes, [-2^127, 2^127 - 1]
### Floating-point
Including the imprecise floating-point types [FLOAT](/cloud/4.x/sql-manual/basic-element/sql-data-types/numeric/FLOAT)
and [DOUBLE](/cloud/4.x/sql-manual/basic-element/sql-data-types/numeric/DOUBLE),
corresponding to the `float` and `double` types in common programming languages.
### Fixed-point
The precise fixed-point type [DECIMAL](/cloud/4.x/sql-manual/basic-
element/sql-data-types/numeric/DECIMAL), used in financial and other cases
that require strict accuracy.
## Date Types
Date types include DATE, TIME, and DATETIME. The DATE type only stores the
date, accurate to the day. The DATETIME type stores the date and time,
accurate to the microsecond. The TIME type only stores the time of day; it
**is not yet supported as a column type in table storage and can only be used
during queries**.
To perform calculations on date/time types or convert them to numeric types,
please use functions such as [TIME_TO_SEC](/cloud/4.x/sql-manual/sql-functions/scalar-functions/date-time-functions/time-to-sec),
[DATEDIFF](/cloud/4.x/sql-manual/sql-functions/scalar-functions/date-time-functions/datediff), and
[UNIX_TIMESTAMP](/cloud/4.x/sql-manual/sql-functions/scalar-functions/date-time-functions/unix-timestamp).
The result of casting them directly to numeric types is not guaranteed.
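For example (the literal values below are illustrative):
SELECT TIME_TO_SEC('16:32:18');               -- seconds elapsed since 00:00:00
SELECT DATEDIFF('2024-03-10', '2024-03-01');  -- difference in days
SELECT UNIX_TIMESTAMP('2024-03-01 08:00:00'); -- seconds since the Unix epoch (timezone dependent)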
For more information refer to [DATE](/cloud/4.x/sql-manual/basic-element/sql-
data-types/date-time/DATE), [TIME](/cloud/4.x/sql-manual/basic-element/sql-
data-types/date-time/TIME) and [DATETIME](/cloud/4.x/sql-manual/basic-
element/sql-data-types/date-time/DATETIME) documents.
## String Types
Doris supports both fixed-length and variable-length strings, including:
* **[CHAR(M)](/cloud/4.x/sql-manual/basic-element/sql-data-types/string-type/CHAR)** : A fixed-length string, where M is the byte length. The range for M is [1, 255].
* **[VARCHAR(M)](/cloud/4.x/sql-manual/basic-element/sql-data-types/string-type/VARCHAR)** : A variable-length string, where M is the maximum length. The range for M is [1, 65533].
* **[STRING](/cloud/4.x/sql-manual/basic-element/sql-data-types/string-type/STRING)** : A variable-length string with a default maximum length of 1,048,576 bytes (1 MB). This maximum length can be increased up to 2,147,483,643 bytes (2 GB) by configuring the `string_type_length_soft_limit_bytes` setting.
## Semi-Structured Types
Doris supports different semi-structured data types for JSON data processing,
each tailored to different use cases.
* **[ARRAY](/cloud/4.x/sql-manual/basic-element/sql-data-types/semi-structured/ARRAY)** / **[MAP](/cloud/4.x/sql-manual/basic-element/sql-data-types/semi-structured/MAP)** / **[STRUCT](/cloud/4.x/sql-manual/basic-element/sql-data-types/semi-structured/STRUCT)** : They support nested data and fixed schema, making them well-suited for analytical workloads such as user behavior and profile analysis, as well as querying data lake formats like Parquet. Due to the fixed schema, there is no overhead for dynamic schema inference, resulting in high write and analysis performance.
* **[VARIANT](/cloud/4.x/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT)** : It supports nested data and flexible schema. It is well-suited for analytical workloads such as log, trace, and IoT data analysis. It can accommodate any legal JSON data, which will be automatically expanded into sub-columns in a columnar storage format. This approach enables high compression rate in storage and high performance in data aggregation, filtering, and sorting.
* **[JSON](/cloud/4.x/sql-manual/basic-element/sql-data-types/semi-structured/JSON)** : It supports nested data and flexible schema. It is optimized for high-concurrency point query use cases. The flexible schema allows for ingesting any legal JSON data, which will be stored in a binary format. Extracting fields from this binary JSON format is more than 2X faster than using regular JSON strings.
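As a minimal sketch of the VARIANT workflow (the table name, schema, properties, and JSON payload below are illustrative and may need adjusting for your cluster):
CREATE TABLE IF NOT EXISTS event_log (
    id BIGINT,
    payload VARIANT
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES ("replication_num" = "1");
INSERT INTO event_log VALUES (1, '{"level": "ERROR", "http": {"status": 500}}');
-- Sub-columns are accessed with bracket syntax and can be cast to a concrete type
SELECT payload['level'], CAST(payload['http']['status'] AS INT) FROM event_log;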
## Aggregation Types
The aggregation data types store aggregation results or intermediate results
during aggregation. They are used for accelerating aggregation-heavy queries.
* **[BITMAP](/cloud/4.x/sql-manual/basic-element/sql-data-types/aggregate/BITMAP)** : It is used for exact deduplication, such as in (UV) statistics and audience segmentation. It works in conjunction with BITMAP functions like `bitmap_union`, `bitmap_union_count`, `bitmap_hash`, and `bitmap_hash64`.
* **[HLL](/cloud/4.x/sql-manual/basic-element/sql-data-types/aggregate/HLL)** : It is used for approximate deduplication and provides better performance than `COUNT DISTINCT`. It works in conjunction with HLL functions like `hll_union_agg`, `hll_raw_agg`, `hll_cardinality`, and `hll_hash`.
* **[QUANTILE_STATE](/cloud/4.x/sql-manual/basic-element/sql-data-types/aggregate/QUANTILE-STATE)** : It is used for approximate percentile calculations and offers better performance than the `PERCENTILE` function. It works with functions like `QUANTILE_PERCENT`, `QUANTILE_UNION`, and `TO_QUANTILE_STATE`.
* **[AGG_STATE](/cloud/4.x/sql-manual/basic-element/sql-data-types/aggregate/AGG-STATE)** : It is used to accelerate aggregations, utilized in combination with aggregation function combinators like state/merge/union.
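As a rough sketch of the deduplication use case (assuming a hypothetical `events` table with a BIGINT `user_id` column):
SELECT
    bitmap_union_count(to_bitmap(user_id)) AS exact_uv,   -- exact distinct count via BITMAP
    hll_union_agg(hll_hash(user_id))       AS approx_uv   -- approximate distinct count via HLL
FROM events;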
## IP Types
IP data types store IP addresses in a binary format, which is faster and more
space-efficient for querying compared to storing them as strings. There are
two supported IP data types:
* **[IPv4](/cloud/4.x/sql-manual/basic-element/sql-data-types/ip/IPV4)** : It stores IPv4 addresses as a 4-byte binary value. It is used in conjunction with the `ipv4_*` family of functions.
* **[IPv6](/cloud/4.x/sql-manual/basic-element/sql-data-types/ip/IPV6)** : It stores IPv6 addresses as a 16-byte binary value. It is used in conjunction with the `ipv6_*` family of functions.
---
# Source: https://docs.velodb.io/cloud/4.x/sql-manual/sql-functions/scalar-functions/numeric-functions/abs
Version: 4.x
# ABS
## Description
Returns the absolute value of `x`
## Syntax
ABS(x)
## Parameters
Parameter | Description
---|---
`x` | The value for which the absolute value is to be calculated
## Return Value
The absolute value of parameter `x`.
## Example
select abs(-2);
+---------+
| abs(-2) |
+---------+
| 2 |
+---------+
select abs(3.254655654);
+------------------+
| abs(3.254655654) |
+------------------+
| 3.254655654 |
+------------------+
select abs(-3254654236547654354654767);
+---------------------------------+
| abs(-3254654236547654354654767) |
+---------------------------------+
| 3254654236547654354654767 |
+---------------------------------+
---
# Source: https://docs.velodb.io/cloud/4.x/sql-manual/sql-functions/scalar-functions/numeric-functions/acos
Version: 4.x
# ACOS
## Description
Returns the arc cosine of `x`, or `NULL` if `x` is not in the range `-1` to
`1`.
## Syntax
ACOS(x)
## Parameters
Parameter | Description
---|---
`x` | The value for which the acos value is to be calculated
## Return Value
The acos value of parameter `x`.
## Example
select acos(1);
+-----------+
| acos(1.0) |
+-----------+
| 0 |
+-----------+
select acos(0);
+--------------------+
| acos(0.0) |
+--------------------+
| 1.5707963267948966 |
+--------------------+
select acos(-2);
+------------+
| acos(-2.0) |
+------------+
| nan |
+------------+
---
# Source: https://docs.velodb.io/cloud/4.x/sql-manual/sql-functions/scalar-functions/string-functions/concat
Version: 4.x
# CONCAT
## Description
Concatenates multiple strings. Special cases:
* If any of the parameter values is NULL, the result returned is NULL
## Syntax
CONCAT(str [, str ...])
## Parameters
Parameter | Description
---|---
`str` | The strings to be concatenated
## Return value
The string obtained by concatenating the parameters. Special cases:
* If any of the parameter values is NULL, the result returned is NULL
## Example
SELECT CONCAT("a", "b"),CONCAT("a", "b", "c"),CONCAT("a", null, "c")
+------------------+-----------------------+------------------------+
| concat('a', 'b') | concat('a', 'b', 'c') | concat('a', NULL, 'c') |
+------------------+-----------------------+------------------------+
| ab | abc | NULL |
+------------------+-----------------------+------------------------+
---
# Source: https://docs.velodb.io/cloud/4.x/sql-manual/sql-functions/scalar-functions/string-functions/length
Version: 4.x
# LENGTH
## Description
Returns the number of bytes in a string.
## Syntax
LENGTH(str)
## Parameters
Parameter | Description
---|---
`str` | The string whose byte length is to be calculated
## Return Value
The number of bytes in the string `str`.
## Example
SELECT LENGTH("abc"),length("中国")
+---------------+------------------+
| length('abc') | length('中国') |
+---------------+------------------+
| 3 | 6 |
+---------------+------------------+
---
# Source: https://docs.velodb.io/cloud/4.x/sql-manual/sql-statements/data-query/SELECT
Version: 4.x
# SELECT
## Description
This document mainly introduces the use of the SELECT syntax.
Syntax:
SELECT
[hint_statement, ...]
[ALL | DISTINCT | DISTINCTROW | ALL EXCEPT ( col_name1 [, col_name2, col_name3, ...] )]
select_expr [, select_expr ...]
[FROM table_references
[PARTITION partition_list]
[TABLET tabletid_list]
[TABLESAMPLE sample_value [ROWS | PERCENT]
[REPEATABLE pos_seek]]
[WHERE where_condition]
[GROUP BY [GROUPING SETS | ROLLUP | CUBE] {col_name | expr | position}]
[HAVING where_condition]
[ORDER BY {col_name | expr | position}
[ASC | DESC], ...]
[LIMIT {[offset,] row_count | row_count OFFSET offset}]
[INTO OUTFILE 'file_name']
1. **Syntax Description:**
1. `select_expr, ...`: the columns retrieved and displayed in the result; when using an alias, AS is optional.
2. `table_references`: the target tables being retrieved (one or more tables, including temporary tables generated by subqueries).
3. `where_condition`: the retrieval condition (expression). If there is a WHERE clause, the condition filters the row data. where_condition is an expression that evaluates to true for each row to be selected. Without the WHERE clause, the statement selects all rows. In WHERE expressions, you can use any MySQL-supported functions and operators except aggregate functions.
4. `ALL | DISTINCT`: controls deduplication of the result set; ALL returns all rows, while DISTINCT/DISTINCTROW removes duplicate rows. The default is ALL.
5. `ALL EXCEPT`: Filter on the full (all) result set, except specifies the name of one or more columns to be excluded from the full result set. All matching column names will be ignored in the output.
This feature is supported since the Apache Doris 1.2 version
6. `INTO OUTFILE 'file_name'`: saves the result to a new file (which must not already exist); the difference lies in the save format.
7. `GROUP BY ... HAVING`: groups the result set and, when HAVING appears, filters the grouped results. `GROUPING SETS`, `ROLLUP`, and `CUBE` are extensions of GROUP BY; please refer to [GROUPING SETS DESIGN](https://doris.apache.org/community/design/grouping_sets_design) for details.
8. `Order by`: Sort the final result, Order by sorts the result set by comparing the size of one or more columns.
Order by is a time-consuming and resource-intensive operation, because all
data needs to be sent to 1 node before it can be sorted, and the sorting
operation requires more memory than the non-sorting operation.
If you need to return the top N sorted results, you need to use the LIMIT
clause; in order to limit memory usage, if the user does not specify the LIMIT
clause, the first 65535 sorted results are returned by default.
9. `LIMIT n`: limits the number of rows in the output result; `LIMIT m,n` means skip the first m rows and output n records. You should use `ORDER BY` before you use `LIMIT m,n`, otherwise the data may be inconsistent each time the query is executed.
10. The `Having` clause does not filter the row data in the table, but filters the results produced by the aggregate function.
Typically `having` is used with aggregate functions (eg :`COUNT(), SUM(),
AVG(), MIN(), MAX()`) and `group by` clauses.
11. SELECT supports explicit partition selection using PARTITION containing a list of partitions or subpartitions (or both) following the name of the table in `table_reference`
12. `[TABLET tids] TABLESAMPLE n [ROWS | PERCENT] [REPEATABLE seek]`: Limit the number of rows read from the table in the FROM clause, select a number of Tablets pseudo-randomly from the table according to the specified number of rows or percentages, and specify the number of seeds in REPEATABLE to return the selected samples again. In addition, you can also manually specify the TableID, Note that this can only be used for OLAP tables.
13. `hint_statement`: a hint placed before the select list can be used to influence the behavior of the optimizer in order to obtain the desired execution plan. For details, refer to the join hint documentation.
**Syntax constraints:**
1. SELECT can also be used to retrieve calculated rows without referencing any table.
2. All clauses must be ordered strictly according to the above format, and a HAVING clause must be placed after the GROUP BY clause and before the ORDER BY clause.
3. The alias keyword AS is optional. Aliases can be used in GROUP BY, ORDER BY, and HAVING.
4. Where clause: The WHERE statement is executed to determine which rows should be included in the GROUP BY section, and HAVING is used to determine which rows in the result set should be used.
5. The HAVING clause can refer to aggregate functions such as COUNT, SUM, MAX, MIN, and AVG, while the WHERE clause cannot; the WHERE clause can refer to other, non-aggregate functions. Column aliases cannot be used in the WHERE clause to define conditions.
6. GROUP BY followed by WITH ROLLUP can aggregate the results at one or more additional levels.
**Join query:**
Doris supports JOIN syntax
JOIN
table_references:
table_reference [, table_reference] …
table_reference:
table_factor
| join_table
table_factor:
tbl_name [[AS] alias]
[{USE|IGNORE|FORCE} INDEX (key_list)]
| ( table_references )
| { OJ table_reference LEFT OUTER JOIN table_reference
ON conditional_expr }
join_table:
table_reference [INNER | CROSS] JOIN table_factor [join_condition]
| table_reference LEFT [OUTER] JOIN table_reference join_condition
| table_reference NATURAL [LEFT [OUTER]] JOIN table_factor
| table_reference RIGHT [OUTER] JOIN table_reference join_condition
| table_reference NATURAL [RIGHT [OUTER]] JOIN table_factor
join_condition:
ON conditional_expr
**UNION Grammar:**
SELECT ...
UNION [ALL| DISTINCT] SELECT ......
[UNION [ALL| DISTINCT] SELECT ...]
`UNION` is used to combine the results of multiple `SELECT` statements into a
single result set.
The column names in the first `SELECT` statement are used as the column names
in the returned results. The selected columns listed in the corresponding
position of each `SELECT` statement should have the same data type. (For
example, the first column selected by the first statement should be of the
same type as the first column selected by other statements.)
The default behavior of `UNION` is to remove duplicate rows from the result.
The optional `DISTINCT` keyword has no effect other than the default, since it
also specifies duplicate row removal. With the optional `ALL` keyword, no
duplicate row removal occurs, and the result includes all matching rows in all
`SELECT` statements
**WITH statement** :
To specify common table expressions, use the `WITH` clause with one or more
comma-separated subclauses. Each subclause provides a subquery that generates a
result set and associates a name with the subquery. The following example
defines CTEs named `cte1` and `cte2` in a `WITH` clause, and refers to them in
the top-level `SELECT` that follows the `WITH` clause:
WITH
cte1 AS (SELECT a,b FROM table1),
cte2 AS (SELECT c,d FROM table2)
SELECT b,d FROM cte1 JOIN cte2
WHERE cte1.a = cte2.c;
In a statement containing the `WITH` clause, each CTE name can be referenced
to access the corresponding CTE result set.
CTE names can be referenced in other CTEs, allowing CTEs to be defined based
on other CTEs.
Recursive CTE is currently not supported.
## Example
1. Query the names of students whose ages are 18, 20, 25
select Name from student where age in (18,20,25);
2. ALL EXCEPT Example
-- Query all information except the students' age
select * except(age) from student;
3. GROUP BY Example
--Query the tb_book table, group by type, and find the average price of each type of book,
select type,avg(price) from tb_book group by type;
4. DISTINCT Use
--Query the tb_book table to remove duplicate type data
select distinct type from tb_book;
5. ORDER BY Example
Sort query results in ascending (default) or descending (DESC) order.
Ascending NULL is first, descending NULL is last
--Query all records in the tb_book table, sort them in descending order by id, and display three records
select * from tb_book order by id desc limit 3;
6. LIKE fuzzy query
Can realize fuzzy query, it has two wildcards: `%` and `_`, `%` can match one
or more characters, `_` can match one character
--Find all books whose second character is h
select * from tb_book where name like('_h%');
7. LIMIT limits the number of result rows
--1. Display 3 records in descending order
select * from tb_book order by price desc limit 3;
--2. Display 4 records from id=1
select * from tb_book where id limit 1,4;
8. CONCAT join multiple columns
--Combine name and price into a new string output
select id,concat(name,":",price) as info,type from tb_book;
9. Using functions and expressions
--Calculate the total price of various books in the tb_book table
select sum(price) as total,type from tb_book group by type;
--20% off price
select *,(price * 0.8) as "20%" from tb_book;
10. UNION Example
(SELECT a FROM t1 WHERE a = 10 AND B = 1 ORDER BY a LIMIT 10)
UNION
(SELECT a FROM t2 WHERE a = 11 AND B = 2 ORDER BY a LIMIT 10);
11. WITH clause example
WITH cte AS
(
SELECT 1 AS col1, 2 AS col2
UNION ALL
SELECT 3, 4
)
SELECT col1, col2 FROM cte;
12. JOIN Example
SELECT * FROM t1 LEFT JOIN (t2, t3, t4)
ON (t2.a = t1.a AND t3.b = t1.b AND t4.c = t1.c)
Equivalent to
SELECT * FROM t1 LEFT JOIN (t2 CROSS JOIN t3 CROSS JOIN t4)
ON (t2.a = t1.a AND t3.b = t1.b AND t4.c = t1.c)
13. INNER JOIN
SELECT t1.name, t2.salary
FROM employee AS t1 INNER JOIN info AS t2 ON t1.name = t2.name;
SELECT t1.name, t2.salary
FROM employee t1 INNER JOIN info t2 ON t1.name = t2.name;
14. LEFT JOIN
SELECT left_tbl.*
FROM left_tbl LEFT JOIN right_tbl ON left_tbl.id = right_tbl.id
WHERE right_tbl.id IS NULL;
15. RIGHT JOIN
mysql> SELECT * FROM t1 RIGHT JOIN t2 ON (t1.a = t2.a);
+------+------+------+------+
| a | b | a | c |
+------+------+------+------+
| 2 | y | 2 | z |
| NULL | NULL | 3 | w |
+------+------+------+------+
16. TABLESAMPLE
--Pseudo-randomly sample 1000 rows in t1. Note that several Tablets are actually selected according to the statistics of the table, and the total number of selected Tablet rows may be greater than 1000, so if you want to explicitly return 1000 rows, you need to add Limit.
SELECT * FROM t1 TABLET(10001) TABLESAMPLE(1000 ROWS) REPEATABLE 2 limit 1000;
## Keywords
SELECT
## Best Practice
1. Some additional knowledge about the SELECT clause
* An alias can be specified for select_expr using AS alias_name. Aliases are used as column names in expressions and can be used in GROUP BY, ORDER BY or HAVING clauses. The AS keyword is a good habit to use when specifying aliases for columns.
* table_references after FROM indicates one or more tables participating in the query. If more than one table is listed, a JOIN operation is performed. And for each specified table, you can define an alias for it
* The selected column after SELECT can be referenced in ORDER BY and GROUP BY by column name, column alias, or an integer (starting from 1) representing the column position
SELECT college, region, seed FROM tournament
ORDER BY region, seed;
SELECT college, region AS r, seed AS s FROM tournament
ORDER BY r, s;
SELECT college, region, seed FROM tournament
ORDER BY 2, 3;
* If ORDER BY appears in a subquery and also applies to the outer query, the outermost ORDER BY takes precedence.
* If GROUP BY is used, the grouped columns are automatically sorted in ascending order (as if there was an ORDER BY statement followed by the same columns). If you want to avoid the overhead of GROUP BY due to automatic sorting, adding ORDER BY NULL can solve it:
SELECT a, COUNT(b) FROM test_table GROUP BY a ORDER BY NULL;
* When sorting columns in a SELECT using ORDER BY or GROUP BY, the server sorts values using only the initial number of bytes indicated by the max_sort_length system variable.
* The HAVING clause is generally applied last, just before the result set is returned to the MySQL client, and is not optimized (while LIMIT is applied after HAVING).
The SQL standard requires that HAVING must refer to a column in the GROUP BY
list or a column used in an aggregate function. However, MySQL extends this by
allowing HAVING to refer to columns in the SELECT clause list, as well as
columns from outer subqueries.
A warning is generated if the column referenced by HAVING is ambiguous. In the
following statement, col2 is ambiguous:
SELECT COUNT(col1) AS col2 FROM t GROUP BY col2 HAVING col2 = 2;
* Remember not to use HAVING where WHERE should be used. HAVING is paired with GROUP BY.
* The HAVING clause can refer to aggregate functions, while WHERE cannot.
SELECT user, MAX(salary) FROM users
GROUP BY user HAVING MAX(salary) > 10;
* The LIMIT clause can be used to constrain the number of rows returned by a SELECT statement. LIMIT can have one or two arguments, both of which must be non-negative integers.
/*Retrieve 6~15 rows in the result set*/
SELECT * FROM tbl LIMIT 5,10;
/*Then if you want to retrieve all rows after a certain offset is set, you can set a very large constant for the second parameter. The following query fetches all data from row 96 onwards */
SELECT * FROM tbl LIMIT 95,18446744073709551615;
/*If LIMIT has only one parameter, the parameter specifies the number of rows that should be retrieved, and the offset defaults to 0, that is, starting from the first row*/
* SELECT...INTO allows query results to be written to a file
2. Modifiers of the SELECT keyword
* deduplication
The ALL and DISTINCT modifiers specify whether to deduplicate rows in the
result set (deduplication applies to whole rows, not a single column).
ALL is the default modifier, that is, all rows that meet the requirements are
to be retrieved.
DISTINCT removes duplicate rows.
3. The main advantage of subqueries
* Subqueries allow structured queries so that each part of a statement can be isolated.
* Some operations require complex unions and associations. Subqueries provide other ways to perform these operations
4. Speed up queries
* Use Doris's partition and bucket as data filtering conditions as much as possible to reduce the scope of data scanning
* Make full use of Doris's prefix index fields as data filter conditions to speed up query speed
5. UNION
* Using only the UNION keyword has the same effect as using UNION DISTINCT. Since the deduplication work is memory-intensive, a query using UNION ALL will be faster and consume less memory. If users want to perform ORDER BY and LIMIT operations on the returned result set, they need to put the UNION operation in a subquery, then SELECT FROM the subquery, and finally place the ORDER BY and LIMIT outside the subquery.
select * from (select age from student_01 union all select age from student_02) as t1
order by age limit 4;
+-------------+
| age |
+-------------+
| 18 |
| 19 |
| 20 |
| 21 |
+-------------+
4 rows in set (0.01 sec)
6. JOIN
* In the inner join condition, in addition to supporting equal-valued joins, it also supports unequal-valued joins. For performance reasons, it is recommended to use equal-valued joins.
* Other joins only support equivalent joins
---
# Source: https://docs.velodb.io/cloud/4.x/use-cases/observability/overview
Version: 4.x
# Overview
## What Is Observability?
Observability refers to the ability to infer a system's internal state through
its external output data. An observability platform collects, stores, and
visualizes three core data: Logs, Traces, and Metrics. This helps teams gain a
comprehensive understanding of the operational status of distributed systems,
supports resource optimization, fault prediction, root cause analysis,
improves system reliability, and enhances user experience.
## Why Observability Is Becoming Increasingly Important
Observability platforms have several critical use cases that are vital for
improving system stability, optimizing operations efficiency, and enabling
business innovation.
1. **Fault Diagnosis and Root Cause Analysis** : Real-time monitoring, anomaly detection, and tracing capabilities enable quick identification and analysis of faults. For example, in the financial industry, combining observability with transaction tracing and AI technologies can shorten recovery time and ensure business continuity. It also supports chaos engineering to simulate failure scenarios and validate system fault tolerance.
2. **Performance Optimization and Resource Planning** : Analyzing system resource utilization and response times helps identify performance bottlenecks and dynamically adjust configurations (e.g., load balancing, auto-scaling). Historical data can be used to predict resource needs, optimize cloud resource allocation, and reduce costs.
3. **Business Decision Support** : Correlating IT performance data with business outcomes (such as user retention rates and transaction volumes) helps formulate business strategies. For instance, analyzing user experience metrics can guide product feature improvements.
4. **Security and Compliance Monitoring** : Detects abnormal behaviors (e.g., zero-day attacks) and triggers automated responses to enhance system security. At the same time, log auditing ensures compliance with regulatory requirements.
5. **DevOps Collaboration** : During canary releases, traffic tagging enables tracking of new version behavior. Combined with call chain analysis, it informs release progression and helps developers optimize code performance, reducing production incidents.
**The growing importance of observability in recent years is mainly driven by
two factors:**
1. **Increasing Complexity of Business and IT Systems** : With the development of cloud computing and microservices, business systems are becoming increasingly complex. For example, a GenAI application request might involve dozens of services such as App, service gateway, authentication service, billing service, RAG engine, Agent engine, vector database, business database, distributed cache, message queue, and large model APIs. Traditional methods like checking server status via SSH and analyzing logs are no longer effective in such complex environments. Observability platforms unify Log, Trace, and Metric data collection and storage, providing centralized visualization and rapid issue investigation.
2. **Higher Requirements for Business Reliability** : System failures have increasingly high impacts on user experience. Therefore, the efficiency of fault detection and recovery has become more critical. Observability provides full data visibility and panoramic analytics, allowing teams to quickly locate root causes, reduce downtime, and ensure service availability. Moreover, with global data analytics and forecasting, potential resource bottlenecks can be identified early, preventing failures before they occur.
## How to Choose an Observability Solution
Observability data has several characteristics, and addressing the challenges
of massive data storage and analysis is key to any observability solution.
1. **High Storage Volume and Cost Sensitivity** : Observability data, especially Logs and Traces, are typically enormous in volume and generated continuously. In medium-to-large enterprises, daily data generation often reaches terabytes or even petabytes. To meet business or regulatory requirements, data must often be stored for months or even years, leading to storage volumes reaching the PB or EB scale and resulting in significant storage costs. Over time, the value of this data diminishes, making cost efficiency increasingly important.
2. **High Throughput Writes with Real-Time Requirements** : Handling daily ingestion of TB or PB-scale data often requires write throughput ranging from 1–10 GB/s or millions to tens of millions of records per second. Simultaneously, due to the need for real-time troubleshooting and security investigations, platforms must support sub-second write latencies to ensure real-time data availability.
3. **Real-Time Analysis and Full-Text Search Capabilities** : Logs and Traces contain large amounts of textual data. Quickly searching for keywords and phrases is essential. Traditional full-scan and string-matching approaches often fail to deliver real-time performance at this scale, especially under high-throughput, low-latency ingestion conditions. Thus, building inverted indexes tailored for text becomes crucial for achieving sub-second query responsiveness.
4. **Dynamic Data Schema and Frequent Expansion Needs** : Logs originally existed as unstructured free-text logs but evolved into semi-structured JSON formats. Producers frequently modify JSON fields, making schema flexibility essential. Traditional databases and data warehouses struggle to handle such dynamic schemas efficiently, while datalake systems offer storage flexibility but fall short in real-time analytical performance.
5. **Integration with Multiple Data Sources and Analysis Tools** : There are many observability ecosystem tools for data collection and visualization. The storage and analysis engine must integrate seamlessly with these diverse tools.
Given options like Elasticsearch, ClickHouse, Doris, and logging services
provided by Cloud vendors, how should one choose? Here are the key evaluation
criteria:
### 1\. **Performance: Includes Write and Query Performance**
Since observability is often used in urgent situations like troubleshooting,
queries must respond quickly—especially for textual content in Logs and
Traces, which require real-time full-text search to support iterative
exploration. Additionally, users must be able to query near real-time
data—queries limited to data from hours or minutes ago are insufficient; fresh
data from the past few seconds is needed.
* **Elasticsearch** is known for inverted indexing and full-text search, offering sub-second retrieval. However, it struggles with high-throughput writes, often rejecting writes or experiencing high latency during peak loads. Its aggregation and statistical analysis performance is also relatively weak.
* **Cloud Logging Services** provide sufficient performance through rich resources but come with higher costs.
* **ClickHouse** delivers high write throughput and high aggregation query performance using columnar storage and vectorized execution. However, its full-text search is several times slower than Elasticsearch and Doris, and the feature is still experimental and not recommended for production use.
* **Doris** , leveraging columnar storage and vectorized execution, optimizes inverted indexing for observability scenarios. It offers better performance than Elasticsearch, with ~5x faster writes and ~2x faster queries. Aggregation performance is up to 6–21x better than Elasticsearch.
### 2\. **Cost: Includes Storage and Compute Costs**
Observability data volumes are huge, especially Logs and Traces. Medium-to-
large enterprises generate TBs or even PBs of data daily. Due to business or
regulatory needs, data must be retained for months or years, pushing storage
requirements into the PB or even EB range. Compared to business-critical data,
observability data has lower value density, and its value decreases over time,
making cost sensitivity critical. Additionally, processing massive volumes of
data incurs substantial compute costs.
* **Elasticsearch** suffers from high costs. Its storage model combines row-based raw data, inverted indexes, and docvalue columnar storage, with typical compression ratios around 1.5:1. High CPU overhead from JVM and index construction further increases compute costs.
* **Doris** includes numerous optimizations for observability scenarios. Compared to Elasticsearch, it reduces total cost by 50–80%. These include simplified inverted indexing, columnar storage with ZSTD compression (5:1–10:1), cold-hot tiered storage, single-replica writes, time-series compaction to reduce write amplification, and vectorized index building.
* **ClickHouse** uses columnar storage and vectorized engines, delivering lower storage and write costs.
* **Cloud Logging Services** are as expensive as Elasticsearch.
### 3\. **Openness: Includes Open Source and Multi-Cloud Neutrality**
When selecting an observability platform, consider openness, including whether
it's open source and multi-cloud neutral.
* **Elasticsearch** is an open-source project maintained by Elastic, available on multiple clouds. Its ELK ecosystem is self-contained and difficult to integrate with other ecosystems; e.g., Kibana only supports Elasticsearch and is hard to extend.
* **Doris** is an Apache Top-Level open-source project, supported by major global cloud providers. It integrates well with OpenTelemetry, Grafana, and ELK, maintaining openness and neutrality.
* **ClickHouse** is an open-source project maintained by ClickHouse Inc., available across clouds. While it supports OpenTelemetry and Grafana, its acquisition of an observability company raises concerns about future neutrality.
* **Cloud Logging Services** are tied to their respective clouds, not open source, and differ between vendors, limiting consistent experiences and migration flexibility.
### 4\. **Ease of Use: Includes Manageability and Usability**
Due to the volume of data, observability platforms usually adopt distributed
architectures. Ease of deployment, scaling, upgrades, and other management
tasks significantly affects scalability. The interface provided by the system
determines developer efficiency and user experience.
* **Elasticsearch** 's Kibana web UI is very user-friendly and manageable. However, its DSL query language is complex and hard to learn, posing integration and development challenges.
* **Doris** provides an interactive analysis interface similar to Kibana and integrates natively with Grafana and Kibana (coming soon). Its SQL is standard and MySQL-compatible, making it developer- and analyst-friendly. Doris has a simple architecture that is easy to deploy and maintain, supports online scaling without service interruption and automatic load balancing, and includes a visual Cluster Manager.
* **ClickHouse** provides SQL interfaces but uses its own syntax. Maintenance is challenging due to exposed concepts like local tables vs. distributed tables and lack of automatic rebalancing during scaling. Typically, developing a custom cluster management system is required.
* **Cloud Logging Services** offer SaaS convenience—users don't manage infrastructure and enjoy ease of use.
Based on the above analysis, **Doris** achieves high-performance ingestion and
queries while keeping costs low. Its SQL interface is easy to use, and its
architecture is simple to maintain and scale. It also ensures consistent
experiences across multiple clouds, making it an optimal choice for building
an observability platform.
## Observability Solution Based on Doris
### System Architecture
Apache Doris is a modern data warehouse with an MPP distributed architecture,
integrating vectorized execution engines, CBO optimizers, advanced indexing,
and materialized views. It supports ultra-fast querying and analysis on large-
scale real-time datasets, delivering an exceptional analytical experience.
Through continuous technical innovation, Doris has achieved top rankings in
authoritative benchmarks such as ClickBench (single table), TPC-H, and TPC-DS
(multi tables).
For observability scenarios, Doris introduces inverted indexing and ultra-fast
full-text search capabilities, achieving optimized write performance and
storage efficiency. This allows users to build high-performance, low-cost, and
open observability platforms based on Doris.
A Doris-based observability platform consists of three core components:
* **Data Collection and Preprocessing** : Supports various observability data collection tools, including OpenTelemetry and ELK ecosystem tools like Logstash and Filebeat. Log, Trace, and Metric data are ingested into Doris via HTTP APIs.
* **Data Storage and Analysis Engine** : Doris provides unified, high-performance, low-cost storage for observability data and exposes powerful search and analysis capabilities via SQL interfaces.
* **Query Analysis and Visualization** : Integrates with popular observability visualization tools such as Grafana and Kibana (from the ELK stack), offering intuitive interfaces for searching, analyzing, alerting, and achieving real-time monitoring and rapid response.

### Key Features and Advantages
#### **High Performance**
* **High Throughput, Low Latency Writes** : Supports stable ingestion of PB-scale (10GB/s) Log, Trace, and Metric data daily with sub-second latency.
* **High-Performance Inverted Index and Full-Text Search** : Supports inverted indexing and full-text search, delivering sub-second response times for common log keyword searches—3–10x faster than ClickHouse.
* **High-Performance Aggregation Analysis** : Utilizing MPP distributed architecture and vectorized pipeline execution engines, Doris excels in trend analysis and alerting in observability scenarios, leading globally in ClickBench tests.
#### **Low Cost**
* **High Compression Ratio and Low-Cost Storage** : Supports PB-scale storage with compression ratios of 5:1 – 10:1 (including indexes), reducing storage costs by 50–80% compared to Elasticsearch. Cold data can be offloaded to S3/HDFS, cutting storage costs by another 50%.
* **Low-Cost Writes** : Consumes 70% less CPU than Elasticsearch for the same write throughput.
#### **Flexible Schema**
* **Schema Changes at the Top Level** : Users can use Light Schema Change to add or drop columns or indexes (ADD/DROP COLUMN/INDEX), and schema modifications can be completed in seconds. When designing an observability platform, users only need to consider which fields and indexes are needed at the current stage (see the sketch after this list).
* **Internal Field Changes** : A semi-structured data type called VARIANT is specially designed for scalable JSON data. It can automatically identify field names and types within JSON, and further split frequently occurring fields into columnar storage, improving compression ratio and analytical performance. Compared to Elasticsearch’s Dynamic Mapping, VARIANT allows changes in the data type of a single field.
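As a hedged sketch of the two mechanisms above (the app_logs and app_events tables and their columns are hypothetical):
-- Light Schema Change: add a column and an inverted index to an existing log table
ALTER TABLE app_logs ADD COLUMN trace_id VARCHAR(64);
ALTER TABLE app_logs ADD INDEX idx_message (message) USING INVERTED;
-- VARIANT column for scalable JSON payloads
CREATE TABLE app_events (
    ts      DATETIME,
    payload VARIANT
)
DUPLICATE KEY(ts)
DISTRIBUTED BY RANDOM BUCKETS 10;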
#### **User-Friendly**
* **Standard SQL Interface** : Doris supports standard SQL and is compatible with MySQL protocols and syntax, making it accessible to engineers and analysts.
* **Integration with Observability Ecosystems** : Compatible with OpenTelemetry and ELK ecosystems, supporting Grafana and Kibana (coming soon) visualization tools for seamless data collection and analysis.
* **Easy Operations** : Supports online scaling, automatic load balancing, and visual management via Cluster Manager.
#### **Openness**
* **Open Source** : Apache Doris is a top-level open-source project adopted by over 5000 companies worldwide, supporting OpenTelemetry, Grafana, and other observability ecosystems.
* **Multi-Cloud Neutral** : Major cloud providers offer Doris SaaS services, ensuring consistent experiences across clouds.
### Demo & Screenshots
We demonstrate the Doris-based observability platform using a comprehensive
[demo](https://github.com/apache/doris-opentelemetry-demo) from the
OpenTelemetry community.
The observed business system simulates an e-commerce website composed of frontend,
authentication, cart, payment, logistics, advertising, recommendation, risk
control, and more than ten modules, reflecting a high level of system
complexity, thus presenting significant challenges for observability data
collection, storage, and analysis.
The Load Generator tool sends continuous requests to the entry service,
generating vast volumes of observability data (Logs, Traces, Metrics). These
data are collected using OpenTelemetry SDKs in various languages, sent to the
OpenTelemetry Collector, preprocessed by Processors, and finally written into
Doris via the OpenTelemetry Doris Exporter. Observability visualization tools
such as Grafana connect to Doris through the MySQL interface, providing
visualized query and analysis capabilities.
[Demo video](https://youtu.be/LrR4SNyAlg8)
Grafana connects to Doris via MySQL datasource, offering unified visualization
and analysis of Logs, Traces, and Metrics, including cross-analysis between
Logs and Traces.
* **Log** 
* **Trace** 
* **Metrics** 
While Grafana's log visualization and analysis capabilities are relatively
basic compared to Kibana, third-party vendors have implemented Kibana-like
Discover features. These will soon be integrated into Grafana's Doris
datasource, enhancing unified observability visualization. Future enhancements
will include Elasticsearch protocol compatibility, enabling native Kibana
connections to Doris. For ELK users, replacing Elasticsearch with Doris
maintains existing logging and visualization habits while significantly
reducing costs and improving efficiency.

On This Page
* What Is Observability?
* Why Observability Is Becoming Increasingly Important
* How to Choose an Observability Solution
* 1\. **Performance: Includes Write and Query Performance**
* 2\. **Cost: Includes Storage and Compute Costs**
* 3\. **Openness: Includes Open Source and Multi-Cloud Neutrality**
* 4\. **Ease of Use: Includes Manageability and Usability**
* Observability Solution Based on Doris
* System Architecture
* Key Features and Advantages
* Demo & Screenshots
---
# Source: https://docs.velodb.io/cloud/4.x/user-guide/admin-manual/system-tables/overview
Version: 4.x
On this page
# Overview
Apache Doris cluster has multiple built-in system databases to store metadata
information about the Doris system itself.
### information_schema
All tables under the `information_schema` database are virtual tables and do
not have physical entities. These system tables contain metadata about the
Doris cluster and all its database objects, including databases, tables,
columns, permissions, etc. They also include functional status information
like Workload Group, Task, etc.
There is an `information_schema` database under each Catalog, containing
metadata only for the corresponding Catalog's databases and tables.
All tables in the `information_schema` database are read-only, and users
cannot modify, drop, or create tables in this database.
By default, all users have read permissions for all tables in this database,
but the query results will vary based on the user's actual permission. For
example, if User A only has permissions for `db1.table1`, querying the
`information_schema.tables` table will only return information related to
`db1.table1`.
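For example, a minimal sketch of listing the tables visible to the current user in one database (the database name is illustrative):
-- Only objects the current user has permissions on are returned
SELECT table_schema, table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'db1';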
### mysql
All tables under the `mysql` database are virtual tables and do not have
physical entities. These system tables contain information such as permissions
and are mainly used for MySQL ecosystem compatibility.
There is a `mysql` database under each Catalog, but the content of tables is
identical.
All tables in the `mysql` database are read-only, and users cannot modify,
delete, or create tables in this database.
### __internal_schema
All tables under the `__internal_schema` database are actual tables in Doris,
stored similarly to user-created data tables. When a Doris cluster is created,
all system tables under this database are automatically created.
By default, common users have read-only permissions for tables in this
database. However, once granted, they can modify, delete, or create tables
under this database.
On This Page
* information_schema
* mysql
* __internal_schema
---
# Source: https://docs.velodb.io/cloud/4.x/user-guide/compute-storage-decoupled/before-deployment
Version: 4.x
On this page
# Doris Compute-Storage Decoupled Deployment Preparation
## 1\. Overview
This document describes the deployment preparation work for the Apache Doris
compute-storage decoupled mode. The decoupled architecture aims to improve
system scalability and performance, suitable for large-scale data processing
scenarios.
## 2\. Architecture Components
The Doris compute-storage decoupled architecture consists of three main
modules:
1. **Frontend (FE)** : Handles user requests and manages metadata.
2. **Backend (BE)** : Stateless compute nodes that execute query tasks.
3. **Meta Service (MS)** : Manages metadata operations and data recovery.
## 3\. System Requirements
### 3.1 Hardware Requirements
* Minimum configuration: 3 servers
* Recommended configuration: 5 or more servers
### 3.2 Software Dependencies
* FoundationDB (FDB) version 7.1.38 or higher
* OpenJDK 17
## 4\. Deployment Planning
### 4.1 Testing Environment Deployment
Deploy all modules on a single machine, not suitable for production
environments.
### 4.2 Production Deployment
* Deploy FDB on 3 or more machines
* Deploy FE and Meta Service on 3 or more machines
* Deploy BE on 3 or more machines
If the machines have high specifications, FDB, FE, and Meta Service can be deployed
on the same machines, but they should not share disks.
## 5\. Installation Steps
### 5.1 Install FoundationDB
This section provides a step-by-step guide to configure, deploy, and start the
FoundationDB (FDB) service using the provided scripts `fdb_vars.sh` and
`fdb_ctl.sh`. You can download [doris tools](http://apache-doris-releases.oss-
accelerate.aliyuncs.com/apache-doris-3.0.2-tools.tar.gz) and get `fdb_vars.sh`
and `fdb_ctl.sh` from `fdb` directory.
#### 5.1.1 Machine Requirements
Typically, at least 3 machines equipped with SSDs are required to form a
FoundationDB cluster with dual data replicas and allow for single machine
failures. If SSDs are not available, at least standard cloud disks or local
disks with a standard POSIX-compliant file system must be used for data
storage. Otherwise, FoundationDB may fail to operate properly - for instance,
storage solutions like JuiceFS should not be used as the underlying storage
for FoundationDB.
tip
If only for development/testing purposes, a single machine is sufficient.
#### 5.1.2 `fdb_vars.sh` Configuration
##### Required Custom Settings
| Parameter | Description | Type | Example | Notes |
|---|---|---|---|---|
| `DATA_DIRS` | Specify the data directory for FoundationDB storage | Comma-separated list of absolute paths | `/mnt/foundationdb/data1,/mnt/foundationdb/data2,/mnt/foundationdb/data3` | Ensure directories are created before running the script; SSD and separate directories are recommended for production environments |
| `FDB_CLUSTER_IPS` | Define cluster IPs | String (comma-separated IP addresses) | `172.200.0.2,172.200.0.3,172.200.0.4` | At least 3 IP addresses for production clusters; the first IP will be used as the coordinator; for high availability, place machines in different racks |
| `FDB_HOME` | Define the main directory for FoundationDB | Absolute path | `/fdbhome` | Default path is /fdbhome; ensure this path is absolute |
| `FDB_CLUSTER_ID` | Define the cluster ID | String | `SAQESzbh` | Each cluster ID must be unique; can be generated using `mktemp -u XXXXXXXX` |
| `FDB_CLUSTER_DESC` | Define the description of the FDB cluster | String | `dorisfdb` | It is recommended to change this to something meaningful for the deployment |
##### Optional Custom Settings
| Parameter | Description | Type | Example | Notes |
|---|---|---|---|---|
| `MEMORY_LIMIT_GB` | Define the memory limit for FDB processes in GB | Integer | `MEMORY_LIMIT_GB=16` | Adjust this value based on available memory resources and FDB process requirements |
| `CPU_CORES_LIMIT` | Define the CPU core limit for FDB processes | Integer | `CPU_CORES_LIMIT=8` | Set this value based on the number of available CPU cores and FDB process requirements |
#### 5.1.3 Deploy FDB Cluster
After configuring the environment with `fdb_vars.sh`, you can deploy the FDB
cluster on each node using the `fdb_ctl.sh` script.
./fdb_ctl.sh deploy
This command initiates the deployment process of the FDB cluster.
#### 5.1.4 Start FDB Service
Once the FDB cluster is deployed, you can start the FDB service on each node
using the `fdb_ctl.sh` script.
./fdb_ctl.sh start
This command starts the FDB service, making the cluster operational and
obtaining the FDB cluster connection string, which can be used for configuring
the MetaService.
### 5.2 Install OpenJDK 17
1. Download [OpenJDK 17](https://download.java.net/java/GA/jdk17.0.1/2a2082e5a09d4267845be086888add4f/12/GPL/openjdk-17.0.1_linux-x64_bin.tar.gz)
2. Extract and set the environment variable JAVA_HOME.
## 6\. Next Steps
After completing the above preparations, please refer to the following
documents to continue the deployment:
1. [Deployment](/cloud/4.x/user-guide/compute-storage-decoupled/compilation-and-deployment)
2. [Managing Compute Group](/cloud/4.x/user-guide/compute-storage-decoupled/managing-compute-cluster)
3. [Managing Storage Vault](/cloud/4.x/user-guide/compute-storage-decoupled/managing-storage-vault)
## 7\. Notes
* Ensure time synchronization across all nodes
* Regularly back up FoundationDB data
* Adjust FoundationDB and Doris configuration parameters based on actual load
* Use standard cloud disks or local disks with a POSIX-compliant file system for data storage; otherwise, FoundationDB may not function properly.
* For example, storage solutions like JuiceFS should not be used as FoundationDB's storage backend.
## 8\. References
* [FoundationDB Official Documentation](https://apple.github.io/foundationdb/index.html)
* [Apache Doris Official Website](https://doris.apache.org/)
On This Page
* 1\. Overview
* 2\. Architecture Components
* 3\. System Requirements
* 3.1 Hardware Requirements
* 3.2 Software Dependencies
* 4\. Deployment Planning
* 4.1 Testing Environment Deployment
* 4.2 Production Deployment
* 5\. Installation Steps
* 5.1 Install FoundationDB
* 5.1.4 Start FDB Service
* 5.2 Install OpenJDK 17
* 6\. Next Steps
* 7\. Notes
* 8\. References
---
# Source: https://docs.velodb.io/cloud/4.x/user-guide/data-operate/export/export-overview
Version: 4.x
On this page
# Export Overview
The data export function is used to write the query result set or Doris table
data into the specified storage system in the specified file format.
The differences between the export function and the data backup function are
as follows:
| | Data Export | Data Backup |
|---|---|---|
| Final Storage Location | HDFS, Object Storage, Local File System | HDFS, Object Storage |
| Data Format | Open file formats such as Parquet, ORC, CSV | Doris internal storage format |
| Execution Speed | Moderate (requires reading data and converting to the target data format) | Fast (no parsing and conversion required, directly upload Doris data files) |
| Flexibility | Can flexibly define the data to be exported through SQL statements | Only supports table-level full backup |
| Use Cases | Result set download, data exchange between different systems | Data backup, data migration between Doris clusters |
## Choosing Export Methods
Doris provides three different data export methods:
* **SELECT INTO OUTFILE** : Supports the export of any SQL result set.
* **EXPORT** : Supports the export of partial or full table data.
* **MySQL DUMP** : Compatible with the MySQL dump command for data export.
The similarities and differences between the three export methods are as
follows:
| | SELECT INTO OUTFILE | EXPORT | MySQL DUMP |
|---|---|---|---|
| Synchronous/Asynchronous | Synchronous | Asynchronous (submit EXPORT tasks and check task progress via the SHOW EXPORT command) | Synchronous |
| Supports any SQL | Yes | No | No |
| Export specific partitions | Yes | Yes | No |
| Export specific tablets | Yes | No | No |
| Concurrent export | Supported with high concurrency (depends on whether the SQL statement has operators such as ORDER BY that need to be processed on a single node) | Supported with high concurrency (supports tablet-level concurrent export) | Not supported, single-threaded export only |
| Supported export data formats | Parquet, ORC, CSV | Parquet, ORC, CSV | MySQL Dump proprietary format |
| Supports exporting external tables | Yes | Partially supported | No |
| Supports exporting views | Yes | Yes | Yes |
| Supported export locations | S3, HDFS | S3, HDFS | LOCAL |
### SELECT INTO OUTFILE
Suitable for the following scenarios:
* Data needs to be exported after complex calculations, such as filtering, aggregation, joins, etc.
* Suitable for scenarios that require synchronous tasks.
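A hedged sketch of exporting an aggregated result set to S3 in Parquet format (database, table, bucket, endpoint, and credentials are placeholders):
SELECT city, COUNT(*) AS order_cnt
FROM demo.orders
WHERE order_date >= '2024-01-01'
GROUP BY city
INTO OUTFILE "s3://my-bucket/export/orders_"
FORMAT AS PARQUET
PROPERTIES (
    "s3.endpoint" = "https://s3.example.com",
    "s3.region" = "us-east-1",
    "s3.access_key" = "your_ak",
    "s3.secret_key" = "your_sk"
);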
### EXPORT
Suitable for the following scenarios:
* Large-scale single table export, with simple filtering conditions.
* Scenarios that require asynchronous task submission.
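A hedged sketch of asynchronously exporting one partition of a table to HDFS (table, partition, and HDFS addresses are placeholders):
EXPORT TABLE demo.orders PARTITION (p202401)
TO "hdfs://namenode:8020/export/orders/"
PROPERTIES ("format" = "parquet")
WITH HDFS (
    "fs.defaultFS" = "hdfs://namenode:8020",
    "hadoop.username" = "hadoop"
);
-- Check the progress of the asynchronous task
SHOW EXPORT;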
### MySQL Dump
Suitable for the following scenarios:
* Compatible with the MySQL ecosystem, requires exporting both table structure and data.
* Only for development testing or scenarios with very small data volumes.
## Export File Column Type Mapping
Parquet and ORC file formats have their own data types. Doris's export
function can automatically map Doris's data types to the corresponding data
types in Parquet and ORC file formats. The CSV format does not have types, all
data is output as text.
The following table shows the mapping between Doris data types and Parquet,
ORC file format data types:
* ORC
| Doris Type | Orc Type |
|---|---|
| boolean | boolean |
| tinyint | tinyint |
| smallint | smallint |
| int | int |
| bigint | bigint |
| largeInt | string |
| date | string |
| datev2 | string |
| datetime | string |
| datetimev2 | timestamp |
| float | float |
| double | double |
| char / varchar / string | string |
| decimal | decimal |
| struct | struct |
| map | map |
| array | array |
| json | string |
| variant | string |
| bitmap | binary |
| quantile_state | binary |
| hll | binary |
* Parquet
When Doris is exported to the Parquet file format, the Doris memory data is
first converted to the Arrow memory data format, and then written out to the
Parquet file format by Arrow.
| Doris Type | Arrow Type | Parquet Physical Type | Parquet Logical Type |
|---|---|---|---|
| boolean | boolean | BOOLEAN | |
| tinyint | int8 | INT32 | INT_8 |
| smallint | int16 | INT32 | INT_16 |
| int | int32 | INT32 | INT_32 |
| bigint | int64 | INT64 | INT_64 |
| largeInt | utf8 | BYTE_ARRAY | UTF8 |
| date | utf8 | BYTE_ARRAY | UTF8 |
| datev2 | date32 | INT32 | DATE |
| datetime | utf8 | BYTE_ARRAY | UTF8 |
| datetimev2 | timestamp | INT96/INT64 | TIMESTAMP(MICROS/MILLIS/SECONDS) |
| float | float32 | FLOAT | |
| double | float64 | DOUBLE | |
| char / varchar / string | utf8 | BYTE_ARRAY | UTF8 |
| decimal | decimal128 | FIXED_LEN_BYTE_ARRAY | DECIMAL(scale, precision) |
| struct | struct | | Parquet Group |
| map | map | | Parquet Map |
| array | list | | Parquet List |
| json | utf8 | BYTE_ARRAY | UTF8 |
| variant | utf8 | BYTE_ARRAY | UTF8 |
| bitmap | binary | BYTE_ARRAY | |
| quantile_state | binary | BYTE_ARRAY | |
| hll | binary | BYTE_ARRAY | |
> Note: In versions 2.1.11 and 3.0.7, you can specify the
> `parquet.enable_int96_timestamps` property to determine whether Doris's
> datetimev2 type uses Parquet's INT96 storage or INT64. INT96 is used by
> default. However, INT96 has been deprecated in the Parquet standard and is
> only used for compatibility with some older systems (such as versions before
> Hive 4.0).
On This Page
* Choosing Export Methods
* SELECT INTO OUTFILE
* EXPORT
* MySQL Dump
* Export File Column Type Mapping
---
# Source: https://docs.velodb.io/cloud/4.x/user-guide/data-operate/import/load-manual
Version: 4.x
On this page
# Loading Overview
Apache Doris offers various methods for importing and integrating data,
allowing you to import data from various sources into the database. These
methods can be categorized into four types:
* **Real-Time Writing** : Data is written into Doris tables in real-time via HTTP or JDBC, suitable for scenarios requiring immediate analysis and querying.
* For small amounts of data written at low frequency (for example, once every 5 minutes), you can use [JDBC INSERT](/cloud/4.x/user-guide/data-operate/import/import-way/insert-into-manual).
* For higher concurrency or frequency (more than 20 concurrent writes or multiple writes per minute), you can enable [Group Commit](/cloud/4.x/user-guide/data-operate/import/group-commit-manual) and use JDBC INSERT or Stream Load.
* For high throughput, you can use [Stream Load](/cloud/4.x/user-guide/data-operate/import/import-way/stream-load-manual) via HTTP.
* **Streaming Synchronization** : Real-time data streams (e.g., Flink, Kafka, transactional databases) are imported into Doris tables, ideal for real-time analysis and querying.
* You can use Flink Doris Connector to write Flink’s real-time data streams into Doris.
* You can use [Routine Load](/cloud/4.x/user-guide/data-operate/import/import-way/routine-load-manual) or Doris Kafka Connector for Kafka’s real-time data streams. Routine Load pulls data from Kafka to Doris and supports CSV and JSON formats, while Kafka Connector writes data to Doris, supporting Avro, JSON, CSV, and Protobuf formats.
* You can use Flink CDC or Datax to write transactional database CDC data streams into Doris.
* **Batch Import** : Data is batch-loaded from external storage systems (e.g., Object Storage, HDFS, local files, NAS) into Doris tables, suitable for non-real-time data import needs.
* You can use [Broker Load](/cloud/4.x/user-guide/data-operate/import/import-way/broker-load-manual) to write files from Object Storage and HDFS into Doris.
* You can use [INSERT INTO SELECT](/cloud/4.x/user-guide/data-operate/import/import-way/insert-into-manual) to synchronously load files from Object Storage, HDFS, and NAS into Doris, and you can perform the operation asynchronously using a [JOB](/cloud/4.x/user-guide/admin-manual/workload-management/job-scheduler).
* You can use [Stream Load](/cloud/4.x/user-guide/data-operate/import/import-way/stream-load-manual) or Doris Streamloader to write local files into Doris.
* **External Data Source Integration** : Query and partially import data from external sources (e.g., Hive, JDBC, Iceberg) into Doris tables.
* You can create a [Catalog](/cloud/4.x/user-guide/lakehouse/lakehouse-overview) to read data from external sources and use [INSERT INTO SELECT](/cloud/4.x/user-guide/data-operate/import/import-way/insert-into-manual) to synchronize this data into Doris, with asynchronous execution via [JOB](/cloud/4.x/user-guide/admin-manual/workload-management/job-scheduler) (see the sketch after this list).
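A hedged sketch of the catalog-based path above (the catalog properties, database, and table names are placeholders):
-- Create a catalog pointing to an external Hive Metastore
CREATE CATALOG hive_catalog PROPERTIES (
    "type" = "hms",
    "hive.metastore.uris" = "thrift://hms-host:9083"
);
-- Synchronize part of the external table into an internal Doris table
INSERT INTO internal.demo.orders
SELECT * FROM hive_catalog.sales_db.orders
WHERE order_date >= '2024-01-01';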
Each import method in Doris is an implicit transaction by default. For more
information on transactions, refer to [Transactions](/cloud/4.x/user-
guide/data-operate/transaction).
### Quick Overview of Import Methods
Doris import process mainly involves various aspects such as data sources,
data formats, import methods, error handling, data transformation, and
transactions. You can quickly browse the scenarios suitable for each import
method and the supported file formats in the table below.
| Import Method | Use Case | Supported File Formats | Import Mode |
|---|---|---|---|
| [Stream Load](/cloud/4.x/user-guide/data-operate/import/import-way/stream-load-manual) | Importing local files or pushing data from applications via HTTP. | csv, json, parquet, orc | Synchronous |
| [Broker Load](/cloud/4.x/user-guide/data-operate/import/import-way/broker-load-manual) | Importing from object storage, HDFS, etc. | csv, json, parquet, orc | Asynchronous |
| [INSERT INTO VALUES](/cloud/4.x/user-guide/data-operate/import/import-way/insert-into-manual) | Writing data via JDBC. | SQL | Synchronous |
| [INSERT INTO SELECT](/cloud/4.x/user-guide/data-operate/import/import-way/insert-into-manual) | Importing from an external source like a table in a catalog or files in Object Storage, HDFS. | SQL | Synchronous, Asynchronous via Job |
| [Routine Load](/cloud/4.x/user-guide/data-operate/import/import-way/routine-load-manual) | Real-time import from Kafka. | csv, json | Asynchronous |
| [MySQL Load](/cloud/4.x/user-guide/data-operate/import/import-way/mysql-load-manual) | Importing from local files. | csv | Synchronous |
| [Group Commit](/cloud/4.x/user-guide/data-operate/import/group-commit-manual) | Writing with high frequency. | Depending on the import method used | - |
On This Page
* Quick Overview of Import Methods
---
# Source: https://docs.velodb.io/cloud/4.x/user-guide/data-operate/update/update-overview
Version: 4.x
On this page
# Data Update Overview
In today's data-driven decision-making landscape, data "freshness" has become
a core competitive advantage for enterprises to stand out in fierce market
competition. Traditional T+1 data processing models, due to their inherent
latency, can no longer meet the stringent real-time requirements of modern
business. Whether it's achieving millisecond-level synchronization between
business databases and data warehouses, dynamically adjusting operational
strategies, or correcting erroneous data within seconds to ensure decision
accuracy, robust real-time data update capabilities are crucial.
Apache Doris, as a modern real-time analytical database, has one of its core
design goals to provide ultimate data freshness. Through its powerful data
models and flexible update mechanisms, it successfully compresses data
analysis latency from day-level and hour-level to second-level, providing a
solid foundation for users to build real-time, agile business decision-making
loops.
This document serves as an official guide that systematically explains Apache
Doris's data update capabilities, covering its core principles, diverse update
and deletion methods, typical application scenarios, and performance best
practices under different deployment modes, aiming to help you comprehensively
master and efficiently utilize Doris's data update functionality.
## 1\. Core Concepts: Table Models and Update Mechanisms
In Doris, the **Data Model** of a data table determines its data organization
and update behavior. To support different business scenarios, Doris provides
three table models: Unique Key Model, Aggregate Key Model, and Duplicate Key
Model. Among these, **the Unique Key Model is the core for implementing
complex, high-frequency data updates**.
### 1.1. Table Model Overview
| **Table Model** | **Key Features** | **Update Capability** | **Use Cases** |
|---|---|---|---|
| **Unique Key Model** | Built for real-time updates. Each data row is identified by a unique Primary Key, supporting row-level UPSERT (Update/Insert) and partial column updates. | Strongest, supports all update and deletion methods. | Order status updates, real-time user tag computation, CDC data synchronization, and other scenarios requiring frequent, real-time changes. |
| **Aggregate Key Model** | Pre-aggregates data based on specified Key columns. For rows with the same Key, Value columns are merged according to defined aggregation functions (such as SUM, MAX, MIN, REPLACE). | Limited, supports REPLACE-style updates and deletions based on Key columns. | Scenarios requiring real-time summary statistics, such as real-time reports, advertisement click statistics, etc. |
| **Duplicate Key Model** | Data only supports append-only writes, without any deduplication or aggregation operations. Even identical data rows are retained. | Limited, only supports conditional deletion through DELETE statements. | Log collection, user behavior tracking, and other scenarios that only need appending without updates. |
### 1.2. Data Update Methods
Doris provides two major categories of data update methods: **updating through
data load** and **updating through DML statements**.
#### 1.2.1. Updating Through Load (UPSERT)
This is Doris's **recommended high-performance, high-concurrency** update
method, primarily targeting the **Unique Key Model**. All load methods (Stream
Load, Broker Load, Routine Load, `INSERT INTO`) naturally support `UPSERT`
semantics. When new data is loaded, if its primary key already exists, Doris
will overwrite the old row data with the new row data; if the primary key
doesn't exist, it will insert a new row.
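A minimal sketch of UPSERT semantics on a Unique Key table (table and column names are illustrative; every load method behaves the same way):
CREATE TABLE order_status_demo (
    order_id    BIGINT,
    status_name STRING
)
UNIQUE KEY(order_id)
DISTRIBUTED BY HASH(order_id);
INSERT INTO order_status_demo VALUES (1001, 'Paid');
-- Same primary key: the existing row is overwritten instead of duplicated
INSERT INTO order_status_demo VALUES (1001, 'Shipped');
-- Returns a single row: (1001, 'Shipped')
SELECT * FROM order_status_demo;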

#### 1.2.2. Updating Through `UPDATE` DML Statements
Doris supports standard SQL `UPDATE` statements, allowing users to update data
based on conditions specified in the `WHERE` clause. This method is very
flexible and supports complex update logic, such as cross-table join updates.

-- Simple update
UPDATE user_profiles SET age = age + 1 WHERE user_id = 1;
-- Cross-table join update
UPDATE sales_records t1
SET t1.user_name = t2.name
FROM user_profiles t2
WHERE t1.user_id = t2.user_id;
**Note** : The execution process of `UPDATE` statements involves first
scanning data that meets the conditions, then rewriting the updated data back
to the table. It's suitable for low-frequency, batch update tasks. **High-
concurrency operations on** **`UPDATE`** **statements are not recommended**
because concurrent `UPDATE` operations involving the same primary keys cannot
guarantee data isolation.
#### 1.2.3. Updating Through `INSERT INTO SELECT` DML Statements
Since Doris provides UPSERT semantics by default, using `INSERT INTO SELECT`
can also achieve similar update effects as `UPDATE`.
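For example, a sketch that refreshes the `user_profiles` table used above from a hypothetical staging table with the same schema:
-- Rows whose user_id already exists are overwritten; new user_id values are inserted
INSERT INTO user_profiles
SELECT user_id, name, age, last_login
FROM user_profiles_staging;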
### 1.3. Data Deletion Methods
Similar to updates, Doris also supports deleting data through both load and
DML statements.
#### 1.3.1. Mark Deletion Through Load
This is an efficient batch deletion method, primarily used for the **Unique
Key Model**. Users can add a special hidden column `DORIS_DELETE_SIGN` when
loading data. When the value of this column for a row is `1` or `true`, Doris
will mark the corresponding data row with that primary key as deleted (the
principle of delete sign will be explained in detail later).
// Use Stream Load to delete the row with user_id = 2
// curl --location-trusted -u user:passwd -H "columns:user_id, __DORIS_DELETE_SIGN__" -T delete.json http://fe_host:8030/api/db_name/table_name/_stream_load
// delete.json content
[
{"user_id": 2, "__DORIS_DELETE_SIGN__": "1"}
]
#### 1.3.2. Deletion Through `DELETE` DML Statements
Doris supports standard SQL `DELETE` statements that can delete data based on
`WHERE` conditions.
* **Unique Key Model** : `DELETE` statements will rewrite the primary keys of rows meeting the conditions with deletion marks. Therefore, its performance is proportional to the amount of data to be deleted. The execution principle of `DELETE` statements on Unique Key Models is very similar to `UPDATE` statements, first reading the data to be deleted through queries, then writing it once more with deletion marks. Compared to `UPDATE` statements, DELETE statements only need to write Key columns and deletion mark columns, making them relatively lighter.
* **Duplicate/Aggregate Models** : `DELETE` statements are implemented by recording a delete predicate. During queries, this predicate serves as a runtime filter to filter out deleted data. Therefore, `DELETE` operations themselves are very fast, almost independent of the amount of deleted data. However, note that **high-frequency** **`DELETE`** **operations on Duplicate/Aggregate Models will accumulate many runtime filters, severely affecting subsequent query performance**.
DELETE FROM user_profiles WHERE last_login < '2022-01-01';
The following table provides a brief summary of using DML statements for
deletion:
| | **Unique Key Model** | **Aggregate Model** | **Duplicate Model** |
|---|---|---|---|
| Implementation | Delete Sign | Delete Predicate | Delete Predicate |
| Limitations | None | Delete conditions only for Key columns | None |
| Deletion Performance | Moderate | Fast | Fast |
## 2\. Deep Dive into Unique Key Model: Principles and Implementation
The Unique Key Model is the cornerstone of Doris's high-performance real-time
updates. Understanding its internal working principles is crucial for fully
leveraging its performance.
### 2.1. Merge-on-Write (MoW) vs. Merge-on-Read (MoR)
The Unique Key Model has two data merging strategies: Merge-on-Write (MoW) and
Merge-on-Read (MoR). **Since Doris 2.1, MoW has become the default and
recommended implementation**.
| **Feature** | **Merge-on-Write (MoW)** | **Merge-on-Read (MoR, legacy)** |
|---|---|---|
| **Core Concept** | Completes data deduplication and merging during data writing, ensuring only one latest record per primary key in storage. | Retains multiple versions during data writing, performs real-time merging during queries to return the latest version. |
| **Query Performance** | Extremely high. No additional merge operations needed during queries; performance approaches that of non-updated detail tables. | Poor. Requires data merging during queries, taking about 3-10 times longer than MoW and consuming more CPU and memory. |
| **Write Performance** | Has merge overhead during writing, with some performance loss compared to MoR (about 10-20% for small batches, 30-50% for large batches). | Fast writing speed, approaching detail tables. |
| **Resource Consumption** | Consumes more CPU and memory during writing and background Compaction. | Consumes more CPU and memory during queries. |
| **Use Cases** | Most real-time update scenarios. Especially suitable for read-heavy, write-light businesses, providing ultimate query analysis performance. | Suitable for write-heavy, read-light scenarios, but no longer the mainstream recommendation. |
The MoW mechanism trades a small cost during the writing phase for tremendous
improvement in query performance, perfectly aligning with the OLAP system's
"read-heavy, write-light" characteristics.
### 2.2. Conditional Updates (Sequence Column)
In distributed systems, out-of-order data arrival is a common problem. For
example, an order status changes sequentially to "Paid" and "Shipped", but due
to network delays, data representing "Shipped" might arrive at Doris before
data representing "Paid".
To solve this problem, Doris introduces the **Sequence Column** mechanism.
Users can specify a column (usually a timestamp or version number) as the
Sequence column when creating tables. When processing data with the same
primary key, Doris will compare their Sequence column values and **always
retain the row with the largest Sequence value** , thus ensuring eventual
consistency even when data arrives out of order.
CREATE TABLE order_status (
order_id BIGINT,
status_name STRING,
update_time DATETIME
)
UNIQUE KEY(order_id)
DISTRIBUTED BY HASH(order_id)
PROPERTIES (
"function_column.sequence_col" = "update_time" -- Specify update_time as Sequence column
);
-- 1. Write "Shipped" record (larger update_time)
-- {"order_id": 1001, "status_name": "Shipped", "update_time": "2023-10-26 12:00:00"}
-- 2. Write "Paid" record (smaller update_time, arrives later)
-- {"order_id": 1001, "status_name": "Paid", "update_time": "2023-10-26 11:00:00"}
-- Final query result, retains record with largest update_time
-- order_id: 1001, status_name: "Shipped", update_time: "2023-10-26 12:00:00"
### 2.3. Deletion Mechanism (`DORIS_DELETE_SIGN`) Workflow
The working principle of `DORIS_DELETE_SIGN` can be summarized as "logical
marking, background cleanup".
1. **Execute Deletion** : When users delete data through load or `DELETE` statements, Doris doesn't immediately remove data from physical files. Instead, it writes a new record for the primary key to be deleted, with the `DORIS_DELETE_SIGN` column marked as `1`.
2. **Query Filtering** : When users query data, Doris automatically adds a filter condition `WHERE DORIS_DELETE_SIGN = 0` to the query plan, thus hiding all data marked for deletion from query results.
3. **Background Compaction** : Doris's background Compaction process periodically scans data. When it finds a primary key with both normal records and deletion mark records, it will physically remove both records during the merge process, eventually freeing storage space.
This mechanism ensures quick response to deletion operations while
asynchronously completing physical cleanup through background tasks, avoiding
performance impact on online business.
The following diagram shows how `DORIS_DELETE_SIGN` works:

### 2.4 Partial Column Update
Starting from version 2.0, Doris supports powerful partial column update
capabilities on Unique Key Models (MoW). When loading data, users only need to
provide the primary key and columns to be updated; unprovided columns will
maintain their original values unchanged. This greatly simplifies ETL
processes for scenarios like wide table joining and real-time tag updates.
To enable this functionality, you need to enable Merge-on-Write (MoW) mode
when creating Unique Key Model tables and set the
`enable_unique_key_partial_update` property to `true`, or configure the
`"partial_columns"` parameter during data load.
CREATE TABLE user_profiles (
user_id BIGINT,
name STRING,
age INT,
last_login DATETIME
)
UNIQUE KEY(user_id)
DISTRIBUTED BY HASH(user_id)
PROPERTIES (
"enable_unique_key_partial_update" = "true"
);
-- Initial data
-- user_id: 1, name: 'Alice', age: 30, last_login: '2023-10-01 10:00:00'
-- load partial update data through Stream Load, only updating age and last_login
-- {"user_id": 1, "age": 31, "last_login": "2023-10-26 18:00:00"}
-- Updated data
-- user_id: 1, name: 'Alice', age: 31, last_login: '2023-10-26 18:00:00'
**Partial Column Update Principle Overview**
Unlike traditional OLTP databases, Doris's partial column update is not in-
place data update. To achieve better write throughput and query performance in
Doris, partial column updates in Unique Key Models adopt a **"complete the missing
fields at load time, then write the full row"** implementation approach.
Therefore, using Doris's partial column update has **"read amplification"**
and **"write amplification"** effects. For example, updating 10 fields in a
100-column wide table requires Doris to complete the missing 90 fields during
the write process. Assuming each field has similar size, a 1MB 10-field update
will generate approximately 9MB of data reading (completing missing fields)
and 10MB of data writing (writing the complete row to new files) in the Doris
system, resulting in about 9x read amplification and 10x write amplification.
**Partial Column Update Performance Recommendations**
Due to read and write amplification in partial column updates, and since Doris
is a columnar storage system, the data reading process may generate
significant random I/O, requiring high random read IOPS from storage. Since
traditional mechanical disks have significant bottlenecks in random I/O, **if
you want to use partial column update functionality for high-frequency writes,
SSD drives are recommended, preferably NVMe interface** , which can provide
the best random I/O support.
Additionally, **if the table is very wide, enabling row storage is also
recommended to reduce random I/O**. After enabling row storage, Doris will
store an additional copy of row-based data alongside columnar storage. Since
row-based data stores each row continuously, it can read entire rows with a
single I/O operation (columnar storage requires N I/O operations to read all
missing fields, such as the previous example of a 100-column wide table
updating 10 columns, requiring 90 I/O operations per row to read all fields).
## 3\. Typical Application Scenarios
Doris's powerful data update capabilities enable it to handle various
demanding real-time analysis scenarios.
### 3.1. CDC Real-time Data Synchronization
Capturing change data (Binlog) from upstream business databases (such as
MySQL, PostgreSQL, Oracle) through tools like Flink CDC and writing it in
real-time to Doris Unique Key Model tables is the most classic scenario for
building real-time data warehouses.
* **Whole Database Synchronization** : Flink Doris Connector internally integrates Flink CDC, enabling automated, end-to-end whole database synchronization from upstream databases to Doris without manual table creation and field mapping configuration.
* **Ensuring Consistency** : Utilizes the Unique Key Model's `UPSERT` capability to handle upstream `INSERT` and `UPDATE` operations, uses `DORIS_DELETE_SIGN` to handle `DELETE` operations, and combines with Sequence columns (such as timestamps in Binlog) to handle out-of-order data, perfectly replicating upstream database states and achieving millisecond-level data synchronization latency.

### 3.2. Real-time Wide Table Joining
In many analytical scenarios, data from different business systems needs to be
joined into user-wide tables or product-wide tables. Traditional approaches
use offline ETL tasks (such as Spark or Hive) for periodic (T+1) joining,
which has poor real-time performance and high maintenance costs.
Alternatively, using Flink for real-time wide table join calculations and
writing joined data to databases typically requires significant computational
resources.
Using Doris's **partial column update** capability can greatly simplify this
process:
1. Create a Unique Key Model wide table in Doris.
2. Write data streams from different sources (such as user basic information, user behavior data, transaction data, etc.) to this wide table in real-time through Stream Load or Routine Load.
3. Each data stream only updates its relevant fields. For example, user behavior data streams only update `page_view_count`, `last_login_time`, and other fields; transaction data streams only update `total_orders`, `total_amount`, and other fields.
This approach not only transforms wide table construction from offline ETL to
real-time stream processing, greatly improving data freshness, but also
reduces I/O overhead by only writing changed columns, improving write
performance.
## 4\. Best Practices
Following these best practices can help you use Doris's data update
functionality more stably and efficiently.
### 4.1. General Performance Practices
1. **Prioritize load Updates** : For high-frequency, large-volume update operations, prioritize load methods like Stream Load and Routine Load over `UPDATE` DML statements.
2. **Batch Writes** : Avoid using `INSERT INTO` statements for individual high-frequency writes (such as > 100 TPS), as each `INSERT` incurs transaction overhead. If necessary, consider enabling Group Commit functionality to merge multiple small batch commits into one large transaction.
3. **Use High-frequency** **`DELETE`** **Carefully** : On Duplicate and Aggregate models, avoid high-frequency `DELETE` operations to prevent query performance degradation.
4. **Use** **`TRUNCATE PARTITION`** **for Partition Data Deletion** : If you need to delete an entire partition's data, use `TRUNCATE PARTITION`, which is much more efficient than `DELETE` (see the sketch after this list).
5. **Execute** **`UPDATE`** **Serially** : Avoid concurrent execution of `UPDATE` tasks that might affect the same data rows.
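For instance, a sketch of clearing one daily partition (table and partition names are placeholders):
-- Much cheaper than: DELETE FROM user_events WHERE dt = '2024-01-01';
TRUNCATE TABLE user_events PARTITION (p20240101);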
### 4.2. Unique Key Model Practices in Compute-Storage Separation Architecture
Doris 3.0 introduces an advanced compute-storage separation architecture,
bringing ultimate elasticity and lower costs. In this architecture, since BE
nodes are stateless, a global state needs to be maintained through MetaService
during the Merge-on-Write process to resolve write-write conflicts between
load/compaction/schema change operations. The MoW implementation of Unique Key
Models relies on a distributed table lock based on Meta Service to ensure
write operation consistency, as shown in the following diagram:

High-frequency loads and Compaction lead to frequent competition for table
locks, so special attention should be paid to the following points:
1. **Control Single Table load Frequency** : It's recommended to control the load frequency of a single Unique Key table to within **60 times/second**. This can be achieved by batching and adjusting load concurrency.
2. **Reasonable Partition and Bucket Design** (see the sketch after this list):
1. **Partitions** : Using time partitioning (such as by day or hour) ensures that single loads only update a few partitions, reducing the scope of lock competition.
2. **Buckets** : The number of buckets (Tablet count) should be reasonably set based on data volume, typically between 8-64. Too many Tablets will intensify lock competition.
3. **Adjust Compaction Strategy** : In scenarios with very high write pressure, Compaction strategies can be appropriately adjusted to reduce Compaction frequency, thereby reducing lock conflicts between Compaction and load tasks.
4. **Upgrade to Latest Version** : The Doris community is continuously optimizing Unique Key Model performance under compute-storage separation architecture. For example, the upcoming 3.1 release significantly optimizes the distributed table lock implementation. **Always recommend using the latest stable version** for optimal performance.
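A hedged sketch of a time-partitioned Unique Key table with a moderate bucket count, following the recommendations above (all names and values are illustrative):
CREATE TABLE order_status_cloud (
    order_id    BIGINT,
    dt          DATE,
    status_name STRING,
    update_time DATETIME
)
UNIQUE KEY(order_id, dt)
PARTITION BY RANGE(dt) (
    PARTITION p20240101 VALUES LESS THAN ('2024-01-02'),
    PARTITION p20240102 VALUES LESS THAN ('2024-01-03')
)
DISTRIBUTED BY HASH(order_id) BUCKETS 16;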
## Conclusion
Apache Doris, with its powerful, flexible, and efficient data update
capabilities centered on the Unique Key Model, truly breaks through the
bottleneck of traditional OLAP systems in terms of data freshness. Whether
through high-performance loads implementing `UPSERT` and partial column
updates, or using Sequence columns to ensure consistency of out-of-order data,
Doris provides complete solutions for building end-to-end real-time analytical
applications.
By deeply understanding its core principles, mastering the applicable
scenarios for different update methods, and following the best practices
provided in this document, you will be able to fully unleash Doris's
potential, making real-time data truly become a powerful engine driving
business growth.
On This Page
* 1\. Core Concepts: Table Models and Update Mechanisms
* 1.1. Table Model Overview
* 1.2. Data Update Methods
* 1.3. Data Deletion Methods
* 2\. Deep Dive into Unique Key Model: Principles and Implementation
* 2.1. Merge-on-Write (MoW) vs. Merge-on-Read (MoR)
* 2.2. Conditional Updates (Sequence Column)
* 2.3. Deletion Mechanism (`DORIS_DELETE_SIGN`) Workflow
* 2.4 Partial Column Update
* 3\. Typical Application Scenarios
* 3.1. CDC Real-time Data Synchronization
* 3.2. Real-time Wide Table Joining
* 4\. Best Practices
* 4.1. General Performance Practices
* 4.2. Unique Key Model Practices in Compute-Storage Separation Architecture
* Conclusion
---
# Source: https://docs.velodb.io/cloud/4.x/user-guide/db-connect/database-connect
Version: 4.x
On this page
# Connecting by MySQL Protocol
Apache Doris adopts the MySQL network connection protocol. It is compatible
with command-line tools, JDBC/ODBC drivers, and various visualization tools
within the MySQL ecosystem. Additionally, Apache Doris comes with a built-in,
easy-to-use Web UI. This guide is about how to connect to Doris using MySQL
Client, MySQL JDBC Connector, DBeaver, and the built-in Doris Web UI.
## MySQL Client
Download MySQL Client from the [MySQL official
website](https://dev.mysql.com/downloads/mysql/) for Linux. Currently, Doris
is primarily compatible with MySQL 5.7 and later clients.
Extract the downloaded MySQL client. In the `bin/` directory, find the `mysql`
command-line tool. Execute the following command to connect to Doris:
# FE_IP represents the listening address of the FE node, while FE_QUERY_PORT represents the port of the MySQL protocol service of the FE. This corresponds to the query_port parameter in fe.conf and it defaults to 9030.
mysql -h FE_IP -P FE_QUERY_PORT -u USER_NAME
After login, the following message will be displayed.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 236
Server version: 5.7.99 Doris version doris-2.0.3-rc06-37d31a5
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
## MySQL JDBC Connector
Download the corresponding JDBC Connector from the official MySQL website.
Example of connection code:
String user = "user_name";
String password = "user_password";
String newUrl = "jdbc:mysql://FE_IP:FE_PORT/demo?useUnicode=true&characterEncoding=utf8&useTimezone=true&serverTimezone=Asia/Shanghai&useSSL=false&allowPublicKeyRetrieval=true";
try {
Connection myCon = DriverManager.getConnection(newUrl, user, password);
Statement stmt = myCon.createStatement();
ResultSet result = stmt.executeQuery("show databases");
ResultSetMetaData metaData = result.getMetaData();
int columnCount = metaData.getColumnCount();
while (result.next()) {
for (int i = 1; i <= columnCount; i++) {
System.out.println(result.getObject(i));
}
}
} catch (SQLException e) {
log.error("get JDBC connection exception.", e);
}
If you need to set session variables when establishing the connection, you can
use the following format:
jdbc:mysql://FE_IP:FE_PORT/demo?sessionVariables=key1=val1,key2=val2
## DBeaver
Create a MySQL connection to Apache Doris:

Query in DBeaver:

## Built-in Web UI of Doris
Doris FE has a built-in Web UI. It allows users to run SQL queries and
view other related information without installing the MySQL client.
To access the Web UI, simply enter the URL in a web browser:
http://fe_ip:fe_port, for example, `http://172.20.63.118:8030`. This will open
the built-in Web console of Doris.
The built-in Web console is primarily intended for use by the root account of
the cluster. By default, the root account password is empty after
installation.

For example, you can execute the following command in the Playground to add a
BE node.
ALTER SYSTEM ADD BACKEND "be_host_ip:heartbeat_service_port";

tip
To successfully execute statements that are not tied to a specific
database or table in the Playground, you must first select any
database from the left-hand database panel. This limitation will be removed
later.
The current built-in web console cannot execute SET type SQL statements.
Therefore, the web console does not support statements like SET PASSWORD FOR
'user' = PASSWORD('user_password').
On This Page
* MySQL Client
* MySQL JDBC Connector
* DBeaver
* Built-in Web UI of Doris
---
# Source: https://docs.velodb.io/cloud/4.x/user-guide/lakehouse/lakehouse-overview
Version: 4.x
On this page
# Lakehouse Overview
**The lakehouse is a modern big data solution that combines the advantages of
data lakes and data warehouses.** It integrates the low cost and high
scalability of data lakes with the high performance and strong data governance
capabilities of data warehouses, enabling efficient, secure, and quality-
controlled storage, processing, and analysis of diverse data in the big data
era. Through standardized open data formats and metadata management, it
unifies **real-time** and **historical** data, **batch processing**, and
**stream processing**, gradually becoming the new standard for enterprise big
data solutions.
## Doris Lakehouse Solution
Doris provides an excellent lakehouse solution for users through an extensible
connector framework, a compute-storage decoupled architecture, a high-
performance data processing engine, and data ecosystem openness.

### Flexible Data Access
Doris supports access to mainstream data systems and data formats through an
extensible connector framework and provides unified, SQL-based data analysis
capabilities, allowing users to easily perform cross-platform data queries and
analysis without moving existing data. For details, refer to the [Catalog
Overview](/cloud/4.x/user-guide/lakehouse/catalog-overview).
### Data Source Connectors
Whether it's Hive, Iceberg, Hudi, Paimon, or database systems supporting the
JDBC protocol, Doris can easily connect and efficiently access data.
For lakehouse systems, Doris can obtain the structure and distribution
information of data tables from metadata services such as Hive Metastore, AWS
Glue, and Unity Catalog, perform reasonable query planning, and utilize the
MPP architecture for distributed computing.
For details, refer to each catalog document, such as the [Iceberg
Catalog](/cloud/4.x/user-guide/lakehouse/catalogs/iceberg-catalog).
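For illustration, a minimal sketch of creating a catalog that points at a Hive Metastore; the thrift address is a placeholder, and Glue or Unity Catalog use different properties:
CREATE CATALOG hive_catalog PROPERTIES (
    "type" = "hms",
    "hive.metastore.uris" = "thrift://127.0.0.1:9083"
);
-- the external tables can then be queried like any other catalog
SELECT * FROM hive_catalog.sales_db.sales LIMIT 10;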
#### Extensible Connector Framework
Doris provides a good extensibility framework to help developers quickly
connect to unique data sources within enterprises, achieving fast data
interoperability.
Doris defines three standard levels, Catalog, Database, and Table, allowing
developers to easily map them to the levels of the target data source. Doris also
provides standard interfaces for accessing metadata services and storage
services, and developers only need to implement the corresponding interfaces
to complete a data source connection.
Doris is compatible with the Trino Connector plugin, allowing the Trino plugin
package to be directly deployed to the Doris cluster, and with minimal
configuration, the corresponding data source can be accessed. Doris has
already completed connections to data sources such as [Kudu](/cloud/4.x/user-
guide/lakehouse/catalogs/kudu-catalog), [BigQuery](/cloud/4.x/user-
guide/lakehouse/catalogs/bigquery-catalog), and [Delta Lake](/cloud/4.x/user-
guide/lakehouse/catalogs/delta-lake-catalog). You can also [adapt new plugins
yourself](https://doris.apache.org/community/how-to-contribute/trino-
connector-developer-guide).
#### Convenient Cross-Source Data Processing
Doris supports creating multiple data catalogs at runtime and using SQL to
perform federated queries across these data sources. For example, users can
join fact table data in Hive with dimension table data in MySQL:
SELECT h.id, m.name
FROM hive.db.hive_table h JOIN mysql.db.mysql_table m
ON h.id = m.id;
Combined with Doris's built-in [job scheduling](/cloud/4.x/user-guide/admin-
manual/workload-management/job-scheduler) capabilities, you can also create
scheduled tasks to further simplify system complexity. For example, users can
schedule the above query as a routine task that runs every hour and writes
each result into an Iceberg table:
CREATE JOB schedule_load
ON SCHEDULE EVERY 1 HOUR DO
INSERT INTO iceberg.db.ice_table
SELECT h.id, m.name
FROM hive.db.hive_table h JOIN mysql.db.mysql_table m
ON h.id = m.id;
### High-Performance Data Processing
As an analytical data warehouse, Doris has made numerous optimizations in
lakehouse data processing and computation and provides rich query acceleration
features:
* Execution Engine
The Doris execution engine is based on the MPP execution framework and
Pipeline data processing model, capable of quickly processing massive data in
a multi-machine, multi-core distributed environment. Thanks to fully
vectorized execution operators, Doris leads in computing performance in
standard benchmark datasets like TPC-DS.
* Query Optimizer
Doris can automatically optimize and process complex SQL requests through the
query optimizer. The query optimizer deeply optimizes various complex SQL
operators such as multi-table joins, aggregation, sorting, and pagination,
fully utilizing cost models and relational algebra transformations to
automatically obtain better or optimal logical and physical execution plans,
greatly reducing the difficulty of writing SQL and improving usability and
performance.
* Data Cache and IO Optimization
Access to external data sources is usually network access, which can have high
latency and poor stability. Apache Doris provides rich caching mechanisms and
has made numerous optimizations in cache types, timeliness, and strategies,
fully utilizing memory and local high-speed disks to enhance the analysis
performance of hot data. Additionally, Doris has made targeted optimizations
for network IO characteristics such as high throughput, low IOPS, and high
latency, providing external data source access performance comparable to local
data.
* Materialized Views and Transparent Acceleration
Doris provides rich materialized view update strategies, supporting full and
partition-level incremental refresh to reduce construction costs and improve
timeliness. In addition to manual refresh, Doris also supports scheduled
refresh and data-driven refresh, further reducing maintenance costs and
improving data consistency. Materialized views also have transparent
acceleration capabilities, allowing the query optimizer to automatically route
to appropriate materialized views for seamless query acceleration.
Additionally, Doris's materialized views use high-performance storage formats,
providing efficient data access capabilities through column storage,
compression, and intelligent indexing technologies, serving as an alternative
to data caching and improving query efficiency.
As shown below, on a 1 TB TPC-DS standard test set based on the Iceberg table
format, Doris's total execution time for the 99 queries is only 1/3 of Trino's.

In actual user scenarios, Doris reduces average query latency by 20% and 95th
percentile latency by 50% compared to Presto while using half the resources,
significantly reducing resource costs while enhancing user experience.

### Convenient Service Migration
In the process of integrating multiple data sources and achieving lakehouse
transformation, migrating SQL queries to Doris is a challenge due to
differences in SQL dialects across systems in terms of syntax and function
support. Without a suitable migration plan, the business side may need
significant modifications to adapt to the new system's SQL syntax.
To address this issue, Doris provides a [SQL Dialect Conversion
Service](/cloud/4.x/user-guide/lakehouse/sql-convertor/sql-convertor-
overview), allowing users to query data directly with the SQL dialects of
other systems. The conversion service translates these SQL dialects into Doris
SQL, greatly reducing migration costs. Currently, Doris supports SQL
dialect conversion for common query engines such as Presto/Trino, Hive,
PostgreSQL, and ClickHouse, achieving over 99% compatibility in some
actual user scenarios.
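As an illustrative sketch, the dialect can typically be switched per session via the `sql_dialect` variable, assuming the SQL dialect conversion service has been configured for the cluster; the variable name and supported values should be verified against the SQL Convertor document:
-- treat queries in this session as Presto/Trino SQL and convert them to Doris SQL
SET sql_dialect = 'presto';
SELECT approx_distinct(user_id) FROM hive_catalog.sales_db.sales;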
### Modern Deployment Architecture
Since version 3.0, Doris supports a cloud-native compute-storage separation
architecture. This architecture, with its low cost and high elasticity,
effectively improves resource utilization and enables independent scaling of
compute and storage.

The above diagram shows the system architecture of Doris's compute-storage
separation, decoupling compute and storage. Compute nodes no longer store
primary data, and the underlying shared storage layer (HDFS and object
storage) serves as the unified primary data storage space, supporting
independent scaling of compute and storage resources. The compute-storage
separation architecture brings significant advantages to the lakehouse
solution:
* **Low-Cost Storage** : Storage and compute resources can be independently scaled, allowing enterprises to increase storage capacity without increasing compute resources. Additionally, by using cloud object storage, enterprises can enjoy lower storage costs and higher availability, while still using local high-speed disks for caching relatively low-proportion hot data.
* **Single Source of Truth** : All data is stored in a unified storage layer, allowing the same data to be accessed and processed by different compute clusters, ensuring data consistency and integrity, and reducing the complexity of data synchronization and duplicate storage.
* **Workload Diversity** : Users can dynamically allocate compute resources based on different workload needs, supporting various application scenarios such as batch processing, real-time analysis, and machine learning. By separating storage and compute, enterprises can more flexibly optimize resource usage, ensuring efficient operation under different loads.
In addition, under the compute-storage coupled architecture, [elastic
compute nodes](/cloud/4.x/user-guide/lakehouse/compute-node) can still be
used to provide elastic computing capabilities in lakehouse data query
scenarios.
### Openness
Doris not only supports access to open lake table formats but also has good
openness for its own stored data. Doris provides an open storage API and
[implements a high-speed data link based on the Arrow Flight SQL
protocol](/cloud/4.x/user-guide/db-connect/arrow-flight-sql-connect), offering
the speed advantages of Arrow Flight and the ease of use of JDBC/ODBC. Based
on this interface, users can access data stored in Doris using
the ADBC clients for Python/Java/Spark/Flink.
Compared to open file formats, the open storage API abstracts the specific
implementation of the underlying file format, allowing Doris to accelerate
data access through advanced features in its storage format, such as rich
indexing mechanisms. Additionally, upper-layer compute engines do not need to
adapt to changes or new features in the underlying storage format, allowing
all supported compute engines to simultaneously benefit from new features.
## Lakehouse Best Practices
In the lakehouse solution, Doris is mainly used for **lakehouse query
acceleration** , **multi-source federated analysis** , and **lakehouse data
processing**.
### Lakehouse Query Acceleration
In this scenario, Doris acts as a **compute engine** , accelerating query
analysis on lakehouse data.

#### Cache Acceleration
For lakehouse systems like Hive and Iceberg, users can configure local disk
caching. Local disk caching automatically stores the data files involved in
queries in local cache directories and manages cache eviction using the LRU
strategy. For details, refer to the [Data Cache](/cloud/4.x/user-guide/lakehouse/data-cache)
document.
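A rough sketch of what enabling the cache can look like; the parameter and variable names below are assumptions to verify against the Data Cache document and your Doris version:
-- be.conf (assumed parameter names)
-- enable_file_cache = true
-- file_cache_path = [{"path": "/mnt/disk1/file_cache", "total_size": 107374182400}]
-- per-session switch (assumed variable name)
SET enable_file_cache = true;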
#### Materialized Views and Transparent Rewrite
Doris supports creating materialized views for external data sources.
Materialized views store pre-computed results as Doris internal table formats
based on SQL definition statements. Additionally, Doris's query optimizer
supports a transparent rewrite algorithm based on the SPJG (SELECT-PROJECT-
JOIN-GROUP-BY) pattern. This algorithm can analyze the structure information
of SQL, automatically find suitable materialized views for transparent
rewrite, and select the optimal materialized view to respond to query SQL.
This feature can significantly improve query performance by reducing runtime
computation. It also allows access to data in materialized views through
transparent rewrite without business awareness. For details, refer to the
[Materialized Views](/cloud/4.x/user-guide/query-acceleration/materialized-
view/async-materialized-view/overview) document.
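For illustration, a minimal sketch of an asynchronous materialized view built on an external Hive table; the catalog, database, table, and column names are hypothetical, and the full syntax is in the Materialized Views document:
CREATE MATERIALIZED VIEW hive_sales_mv
BUILD IMMEDIATE
REFRESH AUTO ON SCHEDULE EVERY 1 HOUR
DISTRIBUTED BY HASH(store_id) BUCKETS 8
AS
SELECT store_id, SUM(amount) AS total_amount
FROM hive_catalog.sales_db.sales
GROUP BY store_id;
Queries that aggregate `amount` by `store_id` on the base Hive table can then be transparently rewritten to read from `hive_sales_mv`.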
### Multi-Source Federated Analysis
Doris can act as a **unified SQL query engine** , connecting different data
sources for federated analysis, solving data silos.

Users can dynamically create multiple catalogs in Doris to connect different
data sources. They can use SQL statements to perform arbitrary join queries on
data from different data sources. For details, refer to the [Catalog
Overview](/cloud/4.x/user-guide/lakehouse/catalog-overview).
### Lakehouse Data Processing
In this scenario, **Doris acts as a data processing engine** , processing
lakehouse data.

#### Task Scheduling
Doris introduces the Job Scheduler feature, enabling efficient and flexible
task scheduling, reducing dependency on external systems. Combined with data
source connectors, users can achieve periodic processing and storage of
external data. For details, refer to the [Job Scheduler](/cloud/4.x/user-
guide/admin-manual/workload-management/job-scheduler).
#### Data Modeling
Users typically use data lakes to store raw data and perform layered data
processing on this basis, making different layers of data available to
different business needs. Doris's materialized view feature supports creating
materialized views for external data sources and supports further processing
based on materialized views, reducing system complexity and improving data
processing efficiency.
#### Data Write-Back
The data write-back feature forms a closed loop of Doris's lakehouse data
processing capabilities. Users can directly create databases and tables in
external data sources through Doris and write data. Currently, JDBC, Hive, and
Iceberg data sources are supported, with more data sources to be added in the
future. For details, refer to the documentation of the corresponding data
source.
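A minimal write-back sketch against an Iceberg catalog; the catalog, database, and table names are hypothetical, and the set of supported table properties depends on the data source:
SWITCH iceberg_catalog;
CREATE DATABASE IF NOT EXISTS ice_db;
CREATE TABLE ice_db.sales_summary (
    store_id INT,
    total_amount DECIMAL(18, 2)
);
-- write processed results from an internal Doris table back to the lake
INSERT INTO ice_db.sales_summary
SELECT store_id, SUM(amount)
FROM internal.sales_db.sales
GROUP BY store_id;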
On This Page
* Doris Lakehouse Solution
* Flexible Data Access
* Data Source Connectors
* High-Performance Data Processing
* Convenient Service Migration
* Modern Deployment Architecture
* Openness
* Lakehouse Best Practices
* Lakehouse Query Acceleration
* Multi-Source Federated Analysis
* Lakehouse Data Processing
---
# Source: https://docs.velodb.io/cloud/4.x/user-guide/query-acceleration/performance-tuning-overview/tuning-overview
Version: 4.x
# Tuning Overview
Query performance tuning is a systematic process that requires multi-level and
multi-dimensional adjustments to the database system. Below is an overview of
the tuning process and methodology:
1. Firstly, business personnel and database administrators (DBAs) need to have a comprehensive understanding of the database system being used, including the hardware utilized by the business system, the scale of the cluster, the version of the database software being used, as well as the features provided by the specific software version.
2. Secondly, an effective performance diagnostic tool is a necessary prerequisite for identifying performance issues. Only by efficiently and quickly locating problematic SQL queries or slow SQL queries can subsequent specific performance tuning processes be carried out.
3. After entering the performance tuning phase, a range of commonly used performance analysis tools are indispensable. These include specialized tools provided by the currently running database system, as well as general tools at the operating system level.
4. With these tools in place, specialized tools can be used to obtain detailed information about SQL queries running on the current database system, aiding in the identification of performance bottlenecks. Meanwhile, general tools can serve as auxiliary analysis methods to assist in locating issues.
In summary, performance tuning requires evaluating the current system's
performance status from a holistic perspective. Firstly, it is necessary to
identify business SQL queries with performance issues, then utilize analysis
tools to discover performance bottlenecks, and finally implement specific
tuning operations.
Based on the aforementioned tuning process and methodology, Apache Doris
provides corresponding tools at each of these levels. The following sections
will introduce the performance [diagnostic tools](/cloud/4.x/user-guide/query-
acceleration/performance-tuning-overview/diagnostic-tools), [analysis
tools](/cloud/4.x/user-guide/query-acceleration/performance-tuning-
overview/analysis-tools), and [tuning process](/cloud/4.x/user-guide/query-
acceleration/performance-tuning-overview/tuning-process) respectively.
---
# Source: https://docs.velodb.io/cloud/4.x/user-guide/query-data/mysql-compatibility
Version: 4.x
On this page
# MySQL Compatibility
Doris is highly compatible with MySQL syntax and supports standard SQL.
However, there are several differences between Doris and MySQL, as outlined
below.
## Data Types
### Numeric Types
Type| MySQL| Doris
---|---|---
Boolean| Supported. Range: 0 represents false, 1 represents true| Supported. Keyword: BOOLEAN. Range: 0 represents false, 1 represents true
Bit| Supported. Range: 1 to 64| Not supported
Tinyint| Supported. Supports signed and unsigned. Range: signed -128 to 127, unsigned 0 to 255| Supported. Only signed. Range: -128 to 127
Smallint| Supported. Supports signed and unsigned. Range: signed -2^15 to 2^15-1, unsigned 0 to 2^16-1| Supported. Only signed. Range: -32768 to 32767
Mediumint| Supported. Supports signed and unsigned. Range: signed -2^23 to 2^23-1, unsigned 0 to 2^24-1| Not supported
Int| Supported. Supports signed and unsigned. Range: signed -2^31 to 2^31-1, unsigned 0 to 2^32-1| Supported. Only signed. Range: -2147483648 to 2147483647
Bigint| Supported. Supports signed and unsigned. Range: signed -2^63 to 2^63-1, unsigned 0 to 2^64-1| Supported. Only signed. Range: -2^63 to 2^63-1
Largeint| Not supported| Supported. Only signed. Range: -2^127 to 2^127-1
Decimal| Supported. Signed and unsigned (unsigned deprecated after 8.0.17). Default: Decimal(10, 0)| Supported. Only signed. Default: Decimal(9, 0)
Float/Double| Supported. Signed and unsigned (unsigned deprecated after 8.0.17)| Supported. Only signed
### Date Types
Type| MySQL| Doris
---|---|---
Date| Supported. Range: ['1000-01-01', '9999-12-31']. Format: YYYY-MM-DD| Supported. Range: ['0000-01-01', '9999-12-31']. Format: YYYY-MM-DD
DateTime| Supported. DATETIME([P]), where P is an optional precision. Range: '1000-01-01 00:00:00.000000' to '9999-12-31 23:59:59.999999'. Format: YYYY-MM-DD hh:mm:ss[.fraction]| Supported. DATETIME([P]), where P is an optional precision. Range: ['0000-01-01 00:00:00[.000000]', '9999-12-31 23:59:59[.999999]']. Format: YYYY-MM-DD hh:mm:ss[.fraction]
Timestamp| Supported. TIMESTAMP[(P)], where P is an optional precision. Range: ['1970-01-01 00:00:01.000000' UTC, '2038-01-19 03:14:07.999999' UTC]. Format: YYYY-MM-DD hh:mm:ss[.fraction]| Not supported
Time| Supported. TIME[(P)]. Range: ['-838:59:59.000000', '838:59:59.000000']. Format: hh:mm:ss[.fraction]| Not supported
Year| Supported. Range: 1901 to 2155, or 0000. Format: yyyy| Not supported
### String Types
Type| MySQL| Doris
---|---|---
Char| Supported. CHAR[(M)], where M is the character length; if omitted, the default length is 1. Fixed-length. Range: [0, 255] bytes| Supported. CHAR[(M)], where M is the byte length. Variable-length. Range: [1, 255]
Varchar| Supported. VARCHAR(M), where M is the character length. Range: [0, 65535] bytes| Supported. VARCHAR(M), where M is the byte length. Range: [1, 65533]
String| Not supported| Supported. 1,048,576 bytes (1 MB), can be increased to 2,147,483,643 bytes (2 GB)
Binary| Supported. Similar to Char| Not supported
Varbinary| Supported. Similar to Varchar| Not supported
Blob| Supported. TinyBlob, Blob, MediumBlob, LongBlob| Not supported
Text| Supported. TinyText, Text, MediumText, LongText| Not supported
Enum| Supported. Supports up to 65,535 elements| Not supported
Set| Supported. Supports up to 64 elements| Not supported
### JSON Type
Type| MySQL| Doris
---|---|---
JSON| Supported| Supported
### Doris unique data type
Doris has several unique data types. Here are the details:
* **HyperLogLog**
HLL (HyperLogLog) is a data type that cannot be used as a key column. In an
aggregate model table, the corresponding aggregation type for HLL is
HLL_UNION. The length and default value do not need to be specified. The
length is controlled internally based on the data aggregation level. HLL
columns can only be queried or used with `HLL_UNION_AGG`, `HLL_RAW_AGG`,
`HLL_CARDINALITY`, `HLL_HASH`, and other related functions.
HLL is used for approximate deduplication and performs better than COUNT
DISTINCT when dealing with large amounts of data. The typical error rate of
HLL is around 1%, sometimes reaching up to 2%. (A usage sketch for HLL and
BITMAP follows this list.)
* **Bitmap**
Bitmap is a data type that cannot be used as a key column. In aggregate model
table, the corresponding aggregation type for BITMAP is BITMAP_UNION. Similar
to HLL, the length and default values do not need to be specified, and the
length is controlled internally based on the data aggregation level. Bitmap
columns can only be queried or used with functions like `BITMAP_UNION_COUNT`,
`BITMAP_UNION`, `BITMAP_HASH`, `BITMAP_HASH64` and others.
Using BITMAP in traditional scenarios may impact loading speed, but it
generally performs better than Count Distinct when dealing with large amounts
of data. Please note that in real-time scenarios, using BITMAP without a
global dictionary and with bitmap_hash() function may introduce an error of
around 0.1%. If this error is not acceptable, you can use bitmap_hash64
instead.
* **QUANTILE_STATE**
QUANTILE_STATE is a data type that cannot be used as a key column. In an
aggregate model table, the corresponding aggregation type for QUANTILE_STATE
is QUANTILE_UNION. The length and default value do not need to be specified,
and the length is controlled internally based on the data aggregation level.
QUANTILE_STATE columns can only be queried or used with functions like
`QUANTILE_PERCENT`, `QUANTILE_UNION`, `TO_QUANTILE_STATE` and others.
QUANTILE_STATE is used for calculating approximate quantile values. During
import, it performs pre-aggregation on the same key with different values.
When the number of values does not exceed 2048, it stores all the data in
detail. When the number of values exceeds 2048, it uses the TDigest algorithm
to aggregate (cluster) the data and save the centroids of the clusters.
* **Array**
Array is a data type in Doris that represents an array composed of elements of
type T. It cannot be used as a key column.
* **MAP**
MAP is a data type in Doris that represents a map composed of elements of
types K and V.
* **STRUCT**
A structure (STRUCT) is composed of multiple fields. It can also be identified
as a collection of multiple columns.
* field_name: The identifier of the field, which must be unique.
* field_type: The type of field.
* **Agg_State**
AGG_STATE is a data type in Doris that cannot be used as a key column. During
table creation, the signature of the aggregation function needs to be
declared.
The length and default value do not need to be specified, and the actual
storage size depends on the implementation of the function.
AGG_STATE can only be used in combination with the [STATE](/cloud/4.x/sql-
manual/sql-functions/combinators/state) / [MERGE](/cloud/4.x/sql-manual/sql-
functions/combinators/merge) / [UNION](/cloud/4.x/sql-manual/sql-
functions/combinators/union) aggregate function combinators.
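To make the HLL and BITMAP types above more concrete, here is a minimal sketch of an Aggregate Key table that uses both; the table and column names are hypothetical:
CREATE TABLE page_uv (
    page_id INT,
    visit_date DATE,
    uv_hll HLL HLL_UNION,
    uv_bitmap BITMAP BITMAP_UNION
)
AGGREGATE KEY(page_id, visit_date)
DISTRIBUTED BY HASH(page_id) BUCKETS 8;
-- load values through the conversion functions
INSERT INTO page_uv VALUES (1, '2025-01-01', HLL_HASH(1001), TO_BITMAP(1001));
-- approximate vs. exact distinct counts
SELECT page_id,
       HLL_UNION_AGG(uv_hll) AS approx_uv,
       BITMAP_UNION_COUNT(uv_bitmap) AS exact_uv
FROM page_uv
GROUP BY page_id;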
## Syntax
### DDL
#### 01 Create Table Syntax in Doris
CREATE TABLE [IF NOT EXISTS] [database.]table
(
column_definition_list
[, index_definition_list]
)
[engine_type]
[keys_type]
[table_comment]
[partition_info]
distribution_desc
[rollup_list]
[properties]
[extra_properties]
#### 02 Differences with MySQL
Parameter| Differences from MySQL
---|---
Column_definition_list| - Field list definition: the basic syntax is similar to MySQL but includes an additional operation for aggregate types. - The aggregate type operation primarily supports Aggregate. - When creating a table, MySQL allows adding constraints like indexes (e.g., Primary Key, Unique Key) after the field list definition, while Doris supports these constraints and computations by defining data models.
Index_definition_list| - Index list definition: the basic syntax is similar to MySQL, supporting bitmap indexes, inverted indexes, and N-Gram indexes, but Bloom filter indexes are set through properties. - MySQL supports B+Tree and Hash indexes.
Engine_type| - Table engine type: optional. - The currently supported table engine is mainly the OLAP native engine. - MySQL supports storage engines such as InnoDB, MyISAM, etc.
Keys_type| - Data model: optional. - Supported types include: 1) DUPLICATE KEY (default): the specified columns are sort columns. 2) AGGREGATE KEY: the specified columns are dimension columns. 3) UNIQUE KEY: the specified columns are primary key columns. - MySQL does not have the concept of a data model.
Table_comment| Table comment.
Partition_info| - Partitioning algorithm: optional. Doris-supported partitioning algorithms include: LESS THAN (only defines the upper bound of a partition; the lower bound is determined by the upper bound of the previous partition), FIXED RANGE (defines left-closed, right-open intervals for partitions), and MULTI RANGE (creates multiple RANGE partitions in bulk with left-closed, right-open intervals, setting the time unit and step; time units support years, months, days, weeks, and hours). - MySQL supports algorithms such as Hash, Range, List, and Key, and also supports subpartitions, with only Hash and Key supported for subpartitions.
Distribution_desc| - Bucketing algorithm: required. Includes: 1) Hash bucketing syntax: DISTRIBUTED BY HASH (k1[,k2 ...]) [BUCKETS num\|auto], which uses the specified key columns for hash bucketing. 2) Random bucketing syntax: DISTRIBUTED BY RANDOM [BUCKETS num\|auto], which uses random numbers for bucketing. - MySQL does not have a bucketing algorithm.
Rollup_list| - Multiple synchronous materialized views can be created while creating the table. - Syntax: `rollup_name (col1[, col2, ...]) [DUPLICATE KEY(col1[, col2, ...])][PROPERTIES("key" = "value")]` - MySQL does not support this.
Properties| Table properties: they differ from MySQL's table properties, and the syntax for defining table properties also differs from MySQL.
#### 03 CREATE INDEX
CREATE INDEX [IF NOT EXISTS] index_name ON table_name (column [, ...]) [USING BITMAP];
* Doris currently supports Bitmap, Inverted, and N-Gram indexes; BloomFilter indexes are supported as well, but they are set through a separate syntax (table properties). See the example after this list.
* MySQL supports index algorithms such as B+Tree and Hash.
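For example, on a hypothetical table; the inverted variant assumes the `USING INVERTED` clause is available in your version:
CREATE INDEX idx_city ON example_tbl (city) USING BITMAP;
CREATE INDEX idx_remark ON example_tbl (remark) USING INVERTED;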
#### 04 CREATE VIEW
CREATE VIEW [IF NOT EXISTS]
[db_name.]view_name
(column1[ COMMENT "col comment"][, column2, ...])
AS query_stmt
CREATE MATERIALIZED VIEW (IF NOT EXISTS)? mvName=multipartIdentifier
(LEFT_PAREN cols=simpleColumnDefs RIGHT_PAREN)? buildMode?
(REFRESH refreshMethod? refreshTrigger?)?
(KEY keys=identifierList)?
(COMMENT STRING_LITERAL)?
(PARTITION BY LEFT_PAREN partitionKey = identifier RIGHT_PAREN)?
(DISTRIBUTED BY (HASH hashKeys=identifierList | RANDOM) (BUCKETS (INTEGER_VALUE | AUTO))?)?
propertyClause?
AS query
* The basic syntax is consistent with MySQL.
* Doris supports logical views and two types of materialized views: synchronous materialized views and asynchronous materialized views.
* MySQL does not support asynchronous materialized views.
#### 05 ALTER TABLE / ALTER INDEX
The syntax of Doris ALTER is basically the same as that of MySQL.
### DROP TABLE / DROP INDEX
The syntax of Doris DROP is basically the same as MySQL.
### DML
#### INSERT
INSERT INTO table_name
[ PARTITION (p1, ...) ]
[ WITH LABEL label]
[ (column [, ...]) ]
[ [ hint [, ...] ] ]
{ VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query }
The Doris INSERT syntax is basically the same as MySQL.
#### UPDATE
UPDATE target_table [table_alias]
SET assignment_list
WHERE condition
assignment_list:
assignment [, assignment] ...
assignment:
col_name = value
value:
{expr | DEFAULT}
The Doris UPDATE syntax is basically the same as MySQL's, but note that the
**`WHERE` condition is required.**
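For example, on a hypothetical Unique Key table named `orders`:
-- the WHERE clause is mandatory in Doris
UPDATE orders SET status = 'shipped' WHERE order_id = 1001;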
#### Delete
DELETE FROM table_name [table_alias]
[PARTITION partition_name | PARTITIONS (partition_name [, partition_name])]
WHERE column_name op { value | value_list } [ AND column_name op { value | value_list } ...];
This syntax can only specify filter predicates.
DELETE FROM table_name [table_alias]
[PARTITION partition_name | PARTITIONS (partition_name [, partition_name])]
[USING additional_tables]
WHERE condition
This syntax can only be used on the UNIQUE KEY model table.
The DELETE syntax in Doris is basically the same as in MySQL. However, since
Doris is an analytical database, deletions should not be performed too frequently.
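Two hedged examples, one for each form; the table, partition, and column names are hypothetical:
-- predicate-based deletion
DELETE FROM orders PARTITION p20250101 WHERE status = 'cancelled';
-- deletion with USING, only on the Unique Key model
DELETE FROM orders USING returns r WHERE orders.order_id = r.order_id;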
#### SELECT
SELECT
[hint_statement, ...]
[ALL | DISTINCT]
select_expr [, select_expr ...]
[EXCEPT ( col_name1 [, col_name2, col_name3, ...] )]
[FROM table_references
[PARTITION partition_list]
[TABLET tabletid_list]
[TABLESAMPLE sample_value [ROWS | PERCENT]
[REPEATABLE pos_seek]]
[WHERE where_condition]
[GROUP BY [GROUPING SETS | ROLLUP | CUBE] {col_name | expr | position}]
[HAVING where_condition]
[ORDER BY {col_name | expr | position} [ASC | DESC], ...]
[LIMIT {[offset_count,] row_count | row_count OFFSET offset_count}]
[INTO OUTFILE 'file_name']
The Doris SELECT syntax is basically the same as MySQL.
## SQL Function
Doris functions cover most MySQL functions.
On This Page
* Data Types
* Numeric Types
* Date Types
* String Types
* JSON Type
* Doris unique data type
* Syntax
* DDL
* DROP TABLE / DROP INDEX
* DML
* SQL Function
---
# Source: https://docs.velodb.io/cloud/4.x/user-guide/studio/overview
Version: 4.x
On this page
# Introduce VeloDB Studio
VeloDB Studio is a GUI tool tailored for Apache Doris and its compatible
databases to simplify data development and management.
VeloDB Studio comes in two versions, Server and Desktop:
* The Server version is built into VeloDB Cloud and VeloDB Enterprise to provide enterprise-level user services.
* The Desktop version is a desktop application that can be installed directly on your computer and supports Mac and Windows, with Linux support planned.
## Core features
### SQL Editor
A SQL editor designed specifically for Apache Doris, supporting syntax
highlighting, auto-completion, formatting, and other features to improve
SQL writing efficiency.
### Log Retrieval and Visual Analysis
Provides log search and visualization. You can use Apache Doris together with
Studio's log search capabilities to replace Elasticsearch and Kibana Discover
for log storage, querying, and visualization, achieving roughly a 10x cost
reduction and more efficient data analysis.
### Query Audit
Query Audit is used to audit and analyze the query history executed in Doris. It
allows you to filter slow queries or filter by user, host, SQL statement, and
so on, to meet audit needs.
### Permission Management
Visually manage Apache Doris user rights to ensure the security of database
data and operations, and meet the needs of enterprise-level applications.
### Multiple connections and SSH tunnels (Desktop version only)
Supports multiple database connections and provides SSH tunneling, which
allows users to remotely manage Doris databases over secure channels and
improves compatibility across network environments.
## VeloDB Studio Server Version
info
The Server version supports Chrome 90 or later browsers; using the latest
browser version is recommended.
The VeloDB Studio Server version is built into VeloDB Cloud and Enterprise and
is provided to enterprise users as a web application.
### Features of VeloDB Studio Server Edition
**1\. Deep integration** : The Server version is deeply integrated into VeloDB
Cloud and Enterprise, with features that adapt to the Manager version in use.
**2\. Network isolation** : The Enterprise edition of Studio is deployed in your
enterprise network environment, and the VeloDB Cloud edition of Studio is
deployed in your VPC, providing a secure network environment.
**3\. Higher quality and stability** : The Server version focuses more on
stability and has stricter quality requirements for new features.
**4\. Security updates** : The Server version provides faster security
updates and vulnerability responses; fixes for vulnerabilities and security
issues are updated and delivered separately.
**5\. Enterprise-level support** : The team provides professional technical
support and faster handling of feature requests, and issues in the Server
version are always responded to as soon as possible.
**6\. Team collaboration** : The Server version is better suited to team
collaboration: multiple users can share one Studio through the same access
address, and you can also embed Studio into your enterprise management system.
## VeloDB Studio Desktop Version
### Why launch desktop application?
In the past, we provided the web-based Studio WebUI in VeloDB Enterprise
Manager, VeloDB Cloud, and Alibaba Cloud.
However, these versions must be deployed on a server or fully hosted in the
cloud. They are designed for the VeloDB kernel, require logging in to a
management system account, require payment, require complex network
permissions, require administrator permission to update, and so on, which
makes them better suited to enterprise users and brings a lot of
inconvenience to ordinary users.
To make things easier for Apache Doris users, we have launched the VeloDB
Studio Desktop version, a GUI designed and developed specifically for Apache
Doris. It has the following main advantages:
### Features of VeloDB Studio Desktop Edition
info
The Mac version only supports 64-bit macOS 13.0 (Ventura) and above.
info
The Windows version only supports 64-bit Windows 10 and above; Windows 8, 8.1,
and Windows Server 2012 are not supported.
**1\. No server deployment required**
* You don't need to find a server on which to deploy Studio separately. Just download the VeloDB Studio Desktop installation package and double-click to use it.
* You don't need to log in to another account; just open the app and enter the connection information to connect to your Doris database.
**2\. Completely free**
* Unlike other versions, VeloDB Studio Desktop is permanently free; no license purchase or payment is required.
**3\. Design for Apache Doris**
* Other versions of Studio are built for the VeloDB kernel, so Apache Doris either cannot be used with them or has limited compatibility.
* VeloDB Studio Desktop is designed for Apache Doris and supports Apache Doris as well as compatible databases derived from it.
**4\. Better user experience**
* More convenient: the desktop application lives on your computer. There is no need to open a browser, enter an address, or log in to an account; just double-click the icon to open it.
* More efficient: the desktop version supports a richer shortcut-key system and smoother window management. Your connections are saved on your computer, so multiple connections can be kept without re-entering connection information every time.
**5\. Native tools that replace Navicat and DBeaver**
* Stronger management capabilities: unlike query-focused tools such as Navicat and DBeaver, VeloDB Studio supports more Apache Doris features, including session management, log retrieval, permission management, and query auditing.
* Better user support and response: the VeloDB Studio team can respond faster to your feature requests and problem feedback and launch new Apache Doris-based features faster.
On This Page
* Core features
* SQL Editor
* Log Retrieval and Visual Analysis
* Query Audit
* Permission Management
* Multiple connections and SSH tunnels (Desktop version only)
* VeloDB Studio Server Version
* Features of VeloDB Studio Server Edition
* VeloDB Studio Desktop Version
* Why launch desktop application?
* Features of VeloDB Studio Desktop Edition
---
# Source: https://docs.velodb.io/cloud/4.x/user-guide/table-design/overview
Version: 4.x
On this page
# Overview
## Creating tables
Users can use the CREATE TABLE statement to create a table in Doris. You can
also use the CREATE TABLE LIKE or CREATE TABLE AS clause to derive the table
definition from another table.
## Table name
In Doris, table names are case-sensitive by default. You can configure
lower_case_table_names to make them case-insensitive during the initial
cluster setup. The default maximum length for table names is 64 bytes, but you
can change this by configuring table_name_length_limit; it is not recommended
to set this value too high. For the syntax for creating tables, please refer to
CREATE TABLE. [Dynamic partitions](/cloud/4.x/user-guide/table-design/data-
partitioning/dynamic-partitioning) can have these properties set individually.
## Table property
In Doris, the CREATE TABLE statement can specify table properties, including:
* **buckets** : Determines the distribution of data within the table.
* **storage_medium** : Controls the storage method for data, such as using HDD, SSD, or remote shared storage.
* **replication_num** : Controls the number of data replicas to ensure redundancy and reliability.
* **storage_policy** : Controls the migration strategy for cold and hot data separation storage.
These properties apply to partitions, meaning that once a partition is
created, it will have its own properties. Modifying table properties will only
affect partitions created in the future and will not affect existing
partitions. For more information about table properties, refer to ALTER TABLE
PROPERTY.
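For example, a minimal sketch that sets a few of these properties at table creation time; the table, columns, and property values are illustrative only:
CREATE TABLE example_tbl (
    id BIGINT,
    created_at DATETIME,
    payload STRING
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10
PROPERTIES (
    "replication_num" = "3",
    "storage_medium" = "SSD"
);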
## Notes
1. **Choose an appropriate data model** : The data model cannot be changed, so you need to select an appropriate [data model](/cloud/4.x/user-guide/table-design/data-model/overview) when creating the table.
2. **Choose an appropriate number of buckets** : The number of buckets in an already created partition cannot be modified. You can modify the number of buckets by [replacing the partition](/cloud/4.x/user-guide/data-operate/delete/table-temp-partition), or you can modify the number of buckets for partitions that have not yet been created in dynamic partitions.
3. **Column addition operations** : Adding or removing VALUE columns is a lightweight operation that can be completed in seconds. Adding or removing KEY columns or modifying data types is a heavyweight operation, and the completion time depends on the amount of data. For large datasets, it is recommended to avoid adding or removing KEY columns or modifying data types.
4. **Optimize storage strategy** : You can use tiered storage to store cold data on HDD or S3/HDFS.
On This Page
* Creating tables
* Table name
* Table property
* Notes