# Velodb > "Timeout expired. The timeout period elapsed prior to completion of the --- # Source: https://docs.velodb.io/cloud/4.x/best-practice/bi-faq Version: 4.x On this page # BI FAQ ## Power BI​ ### Q1. An error occurs when you use JDBC to pull data into Desktop Power BI. "Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding".​ Usually, this is Power BI pulling the time timeout of the data source. When filling in the data source server and database, click the advanced option, which has a timeout time, set the time higher. ### Q2. When the 2.1.x version uses JDBC to connect to Power BI, an error occurs "An error happened while reading data from the provider: the given key was not present in the dictionary".​ Run "show collation" in the database first. Generally, only utf8mb4_900_bin is displayed, and the charset is utf8mb4. The main reason for this error is that ID 33 needs to be found when connecting to Power BI. That is, rows with 33ids in the table need to be upgraded to version 2.1.5 or later. ### Q3. Connection Doris Times error "Reading data from the provider times error index and count must refer to the location within the string".​ The cause of the problem is that global parameters are loaded during the connection process, and the SQL column names and values are the same SELECT @@max_allowed_packet as max_allowed_packet, @@character_set_client ,@@character_set_connection , @@license,@@sql_mode ,@@lower_case_table_names , @@autocommit ; The new optimizer can be turned off in the current version or upgraded to version 2.0.7 or 2.1.6 or later. ### Q4. JDBC connection version 2.1.x error message "Character set 'utf8mb3' is not supported by.net.Framework".​ This problem is easily encountered in version 2.1.x. If this problem occurs, you need to upgrade the JDBC Driver to 8.0.32. ## Tableau​ ### Q1. Version 2.0.x reports that Tableau cannot connect to the data source with error code 37CE01A3.​ Turn off the new optimizer in the current version or upgrade to 2.0.7 or later ### Q2. SSL connection error:protocol version mismatch Failed to connect to the MySQL server​ The cause of this error is that SSL authentication is enabled on Doris, but SSL connections are not used during the connection. You need to disable the enable_ssl variable in fe.conf. ### Q3. Connection error Unsupported command(COM_STMT_PREPARED)​ The MySQL driver version is improperly installed. Install the MySQL 5.1.x connection driver instead. On This Page * Power BI * Q1. An error occurs when you use JDBC to pull data into Desktop Power BI. "Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding". * Q2. When the 2.1.x version uses JDBC to connect to Power BI, an error occurs "An error happened while reading data from the provider: the given key was not present in the dictionary". * Q3. Connection Doris Times error "Reading data from the provider times error index and count must refer to the location within the string". * Q4. JDBC connection version 2.1.x error message "Character set 'utf8mb3' is not supported by.net.Framework". * Tableau * Q1. Version 2.0.x reports that Tableau cannot connect to the data source with error code 37CE01A3. * Q2. SSL connection error version mismatch Failed to connect to the MySQL server * Q3. 
### Q4. JDBC connection to version 2.1.x reports the error "Character set 'utf8mb3' is not supported by .NET Framework".

This problem is easily encountered in version 2.1.x. If it occurs, upgrade the JDBC driver to 8.0.32.

## Tableau

### Q1. Version 2.0.x reports that Tableau cannot connect to the data source, with error code 37CE01A3.

Turn off the new optimizer in the current version, or upgrade to 2.0.7 or later.

### Q2. SSL connection error: "protocol version mismatch. Failed to connect to the MySQL server"

The cause of this error is that SSL authentication is enabled on Doris, but the connection does not use SSL. Disable SSL by turning off `enable_ssl` in fe.conf.

### Q3. Connection error "Unsupported command(COM_STMT_PREPARED)"

The MySQL driver version is improperly installed. Install the MySQL 5.1.x connector driver instead.

---

# Source: https://docs.velodb.io/cloud/4.x/best-practice/data-faq
Version: 4.x

# Data Operation Error

This document records common problems of data operation during the use of Doris. It will be updated from time to time.

### Q1. Stream Load is sent to FE's public network address to import data, but is redirected to an intranet IP?

When the connection target of Stream Load is the HTTP port of FE, FE will only randomly select a BE node to perform an HTTP 307 redirect, so the user's request is actually sent to a BE chosen by FE. The redirect returns the IP of the BE, that is, the intranet IP. So if you send the request to the public IP of FE, it is very likely that you cannot connect, because the request is redirected to an internal network address.

The usual way is to make sure you can access the intranet IP addresses, or to set up a load balancer in front of all BE nodes and send the Stream Load request directly to the load balancer, which will transparently forward the request to a BE node.

### Q2. Does Doris support changing column names?

Since version 1.2.0, column names can be modified when the `"light_schema_change"="true"` table property is enabled. Before version 1.2.0, or when the `"light_schema_change"="true"` property is not enabled, modifying column names is not supported. The reasons are as follows:

Doris supports modifying the database name, table name, partition name, materialized view (Rollup) name, as well as the column type, comment, default value, etc. But unfortunately, modifying column names is not supported in those cases. For some historical reasons, column names are currently written directly into the data files, and Doris also looks up the corresponding column by name during queries. Therefore, modifying a column name is not just a simple metadata change; it also involves rewriting data, which is a very heavy operation.

We do not rule out supporting lightweight column-name modification through some compatibility mechanism in the future.
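As a concrete illustration of the above, the sketch below creates a table with light schema change enabled and then renames a column. The database, table, and column names are placeholders, and the connection details are assumptions; adapt them to your cluster.

```bash
#!/usr/bin/env bash
# Assumed FE connection details.
FE_HOST=127.0.0.1
FE_QUERY_PORT=9030

mysql -h "$FE_HOST" -P "$FE_QUERY_PORT" -uroot <<'SQL'
CREATE DATABASE IF NOT EXISTS example_db;

-- Example table with light schema change enabled.
CREATE TABLE IF NOT EXISTS example_db.rename_demo (
    k1 INT,
    old_col VARCHAR(32)
) DUPLICATE KEY(k1)
DISTRIBUTED BY HASH(k1) BUCKETS 1
PROPERTIES (
    "replication_num" = "1",
    "light_schema_change" = "true"
);

-- Renaming a column only works when light_schema_change is enabled.
ALTER TABLE example_db.rename_demo RENAME COLUMN old_col new_col;
SQL
```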
### Q3. Does a Unique Key model table support creating a materialized view?

No, it does not.

The Unique Key model is business-friendly: thanks to its deduplication by primary key, it can easily stay in sync with business databases whose data changes frequently. Therefore, many users first consider the Unique Key model when loading data into Doris.

But unfortunately, tables of the Unique Key model cannot have materialized views. The reason is that the essence of a materialized view is to "pre-compute" the data, so that the computed results are returned directly at query time to speed up queries. In a materialized view, the "pre-computed" data is usually aggregated metrics such as sum and count. When the data changes, for example through an update or delete, the pre-computed data has already lost the detail information and cannot be updated synchronously. For example, a sum value of 5 may come from 1+4 or 2+3. Because the detail information is lost, we cannot tell how the sum was computed, and thus cannot apply the update.

### Q4. tablet writer write failed, tablet_id=27306172, txn_id=28573520, err=-235 or -238

This error usually occurs during data import operations. The -235 error means that the number of data versions of the corresponding tablet exceeds the maximum limit (default 500, controlled by the BE parameter `max_tablet_version_num`), and subsequent writes will be rejected. For example, the error above means that the data versions of tablet 27306172 exceed the limit.

This error is usually caused by an import frequency that is too high, greater than the compaction speed of the backend, causing versions to pile up and eventually exceed the limit. At this point, you can first run `show tablet 27306172`, then execute the `show proc` statement given in the result to check the status of each replica of the tablet. The `versionCount` column in the result is the number of versions. If you find that a replica has too many versions, reduce the import frequency or stop importing and observe whether the version count drops. If the version count does not decrease after imports are stopped, go to the corresponding BE node, open the be.INFO log, search for the tablet id and the `compaction` keyword, and check whether compaction is running normally. For compaction tuning, refer to the Apache Doris official WeChat account article: [Doris Best Practices - Compaction Tuning (3)](https://mp.weixin.qq.com/s/cZmXEsNPeRMLHp379kc2aA)

The -238 error usually occurs when a single batch of imported data is too large, resulting in too many segment files for one tablet (default 200, controlled by the BE parameter `max_segment_num_per_rowset`). In this case, it is recommended to reduce the amount of data imported in one batch, or to appropriately increase the BE configuration value. Since version 2.0, you can also enable the segment compaction feature to reduce the number of segment files by setting `enable_segcompaction=true` in the BE config.

### Q5. tablet 110309738 has few replicas: 1, alive backends: [10003]

This error can occur during a query or an import. It usually means that a replica of the corresponding tablet is abnormal.

First check whether a BE node is down with the `show backends` command: for example, the `isAlive` field is false, or `LastStartTime` is a recent time (indicating a recent restart). If a BE is down, go to that node and check the be.out log. If the BE went down abnormally, the exception stack is usually printed in be.out to help troubleshoot the problem. If there is no error stack in be.out, you can use the Linux command `dmesg -T` to check whether the process was killed by the system because of OOM.

If no BE node is down, run `show tablet 110309738`, then execute the `show proc` statement given in the result to check the status of each replica of the tablet for further investigation.
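The checks above can be scripted. This is a minimal sketch, assuming the FE host, query port, and root account shown below; the tablet id comes from the error message, and the OOM grep pattern is only an illustration.

```bash
#!/usr/bin/env bash
# Assumed FE connection details; the tablet id is taken from the error message.
FE_HOST=127.0.0.1
FE_QUERY_PORT=9030
TABLET_ID=110309738

# 1. Check whether any BE is down (look at the isAlive / LastStartTime columns).
mysql -h "$FE_HOST" -P "$FE_QUERY_PORT" -uroot -e "SHOW BACKENDS\G"

# 2. Locate the tablet; the output contains a ready-made SHOW PROC statement
#    that lists every replica of this tablet.
mysql -h "$FE_HOST" -P "$FE_QUERY_PORT" -uroot -e "SHOW TABLET $TABLET_ID\G"

# 3. On the suspect BE host, check whether the process was killed by the OOM killer.
dmesg -T | grep -i -E 'killed process|out of memory'
```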
### Q6. Calling Stream Load from a Java program to import data may result in a Broken Pipe error when a batch of data is large.

Apart from Broken Pipe, some other strange errors may occur.

This situation usually occurs after enabling httpv2. httpv2 is an HTTP service implemented with Spring Boot and uses Tomcat as the default built-in container. However, Tomcat has some problems handling 307 redirects, so the built-in container was later changed to Jetty. In addition, the Apache HttpClient used in the Java program needs to be version 4.5.13 or later; earlier versions also have problems handling redirects.

So this problem can be solved in two ways:

1. Disable httpv2

   Add `enable_http_server_v2=false` to fe.conf and restart FE. However, the new web UI and some new interfaces based on httpv2 can then no longer be used (normal imports and queries are not affected).

2. Upgrade

   Upgrading to Doris 0.15 or later fixes this issue.

### Q7. Error -214 is reported when importing and querying

When performing operations such as import and query, you may encounter the following error:

```
failed to initialize storage reader. tablet=63416.1050661139.aa4d304e7a7aff9c-f0fa7579928c85a0, res=-214, backend=192.168.100.10
```

A -214 error means that the data version of the corresponding tablet is missing. For example, the error above indicates that the data version of the replica of tablet 63416 on the BE at 192.168.100.10 is missing. (There may be other similar error codes, which can be checked and repaired in the same way.)

Typically, if your data has multiple replicas, the system will automatically repair the problematic replicas. You can troubleshoot with the following steps:

First, check the status of each replica of the tablet by executing `show tablet 63416` and then executing the `show proc xxx` statement given in the result. Usually we care about the `Version` column. Normally, the Version of all replicas of a tablet should be the same, and the same as the VisibleVersion of the corresponding partition.

You can view the partition version with `show partitions from tblx` (the partition corresponding to the tablet can be obtained from the `show tablet` statement). At the same time, you can open the URL in the CompactionStatus column of the `show proc` result in a browser to see more detailed version information and check which versions are missing.

If the replicas are not repaired automatically for a long time, use the `show proc "/cluster_balance"` statement to view the tablet repair and scheduling tasks currently being executed by the system. It may be that a large number of tablets are waiting to be scheduled, resulting in a longer repair time. You can follow the records in `pending_tablets` and `running_tablets`.

Further, you can use the `admin repair` statement to give a table or partition priority for repair. For details, see `help admin repair`.

If it still cannot be repaired, then in the multi-replica case, use the `admin set replica status` command to force the problematic replica offline. For details, see the example of setting the replica status to bad in `help admin set replica status`. (After a replica is set to bad, it will no longer be accessed and will be repaired automatically later. Before doing this, make sure the other replicas are normal.)
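The repair flow described above can be walked through with the statements below. This is a sketch, not a definitive procedure: the database/table/partition names and the backend_id are placeholders, and the tablet id is only the one from the example error; check `help admin repair` and `help admin set replica status` for the exact options in your version.

```bash
#!/usr/bin/env bash
# Assumed FE connection details; table, partition, and backend ids are placeholders.
FE_HOST=127.0.0.1
FE_QUERY_PORT=9030

# Inspect the replicas of the tablet and the partition version.
mysql -h "$FE_HOST" -P "$FE_QUERY_PORT" -uroot -e "SHOW TABLET 63416\G"
mysql -h "$FE_HOST" -P "$FE_QUERY_PORT" -uroot -e "SHOW PARTITIONS FROM example_db.tblx\G"

# Check the repair/scheduling queue if the replica is not repaired for a long time.
mysql -h "$FE_HOST" -P "$FE_QUERY_PORT" -uroot -e "SHOW PROC '/cluster_balance'\G"

# Ask the system to repair this table (or a single partition) with priority.
mysql -h "$FE_HOST" -P "$FE_QUERY_PORT" -uroot -e \
  "ADMIN REPAIR TABLE example_db.tblx PARTITION (p1);"

# Last resort with multiple replicas: mark the bad replica offline so it is
# rebuilt automatically. Make sure the other replicas are healthy first.
mysql -h "$FE_HOST" -P "$FE_QUERY_PORT" -uroot -e \
  "ADMIN SET REPLICA STATUS PROPERTIES ('tablet_id' = '63416', 'backend_id' = '10003', 'status' = 'bad');"
```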
### Q8. Not connected to 192.168.100.1:8060 yet, server_id=384

We may encounter this error when importing or querying. If you check the corresponding BE log, you may also find similar errors.

This is an RPC error, and there are usually two possibilities: 1. the corresponding BE node is down; 2. RPC congestion or other errors.

If the BE node is down, you need to check the specific reason for the downtime. Only RPC congestion is discussed here.

One case is OVERCROWDED, which means that the RPC source has a large amount of unsent data exceeding the threshold. BE has two related parameters:

1. `brpc_socket_max_unwritten_bytes`: the default value is 1GB. If the unsent data exceeds this value, an error is reported. This value can be increased appropriately to avoid OVERCROWDED errors. (But this treats the symptom, not the root cause; there is still congestion in essence.)
2. `tablet_writer_ignore_eovercrowded`: default is false. If set to true, Doris will ignore OVERCROWDED errors during import. This parameter mainly avoids import failures and improves import stability.

The second case is that the RPC packet size exceeds max_body_size. This may occur if the query involves very large String or Bitmap values. It can be circumvented by increasing the following BE parameter:

* `brpc_max_body_size`: default 3GB.

### Q9. [Broker Load] org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe

`org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe` appears during import.

The cause may be that, when importing data from external storage (such as HDFS), there are too many files in the directory, so listing the directory takes longer than the Broker RPC timeout, which defaults to 10 seconds. In this case the timeout needs to be increased appropriately.

Modify the `fe.conf` configuration file to add the following parameter (the default is 10000, i.e. 10 seconds; increase it as appropriate):

```
broker_timeout_ms = 10000
```

After adding the parameter, restart the FE service.

### Q10. [Routine Load] ReasonOfStateChanged: ErrorReason{code=errCode = 104, msg='be 10004 abort task with reason: fetch failed due to requested offset not available on the broker: Broker: Offset out of range'}

The cause of this problem is that Kafka's log cleanup policy defaults to 7 days. When a Routine Load task is suspended for some reason and is not resumed for a long time, the offset recorded by the job may already have been cleaned up by Kafka by the time the task is resumed, which triggers this error.

This can be solved with `ALTER ROUTINE LOAD`: look up the smallest available offset in Kafka, modify the offset with the ALTER ROUTINE LOAD command, and then resume the task.

```sql
ALTER ROUTINE LOAD FOR db.tb
FROM kafka
(
    "kafka_partitions" = "0",
    "kafka_offsets" = "xxx",
    "property.group.id" = "xxx"
);
```

### Q11. ERROR 1105 (HY000): errCode = 2, detailMessage = (192.168.90.91)[CANCELLED][INTERNAL_ERROR]error setting certificate verify locations: CAfile: /etc/ssl/certs/ca-certificates.crt CApath: none

```
yum install -y ca-certificates
ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-certificates.crt
```

### Q12. create partition failed. partition numbers will exceed limit variable max_auto_partition_num

To prevent accidentally creating too many partitions when importing data into auto-partitioned tables, the FE configuration item `max_auto_partition_num` controls the maximum number of partitions that can be created automatically for such tables. If you really need to create more partitions, modify this config item on the FE Master node.
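A minimal sketch of adjusting this config item, assuming you are connected to the FE Master; the example value 5000 is arbitrary, and if the item is not runtime-mutable in your version, add it to fe.conf on the FE Master and restart instead.

```bash
#!/usr/bin/env bash
# Assumed FE Master connection details.
FE_HOST=127.0.0.1
FE_QUERY_PORT=9030

# Check the current value.
mysql -h "$FE_HOST" -P "$FE_QUERY_PORT" -uroot -e \
  "ADMIN SHOW FRONTEND CONFIG LIKE 'max_auto_partition_num';"

# Raise the limit at runtime if the item is mutable in your version; otherwise
# set max_auto_partition_num in fe.conf on the FE Master and restart it.
mysql -h "$FE_HOST" -P "$FE_QUERY_PORT" -uroot -e \
  "ADMIN SET FRONTEND CONFIG ('max_auto_partition_num' = '5000');"
```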
---

# Source: https://docs.velodb.io/cloud/4.x/best-practice/lakehouse-faq
Version: 4.x

# Data Lakehouse FAQ

## Certificate Issues

1. When querying, an error `curl 77: Problem with the SSL CA cert.` occurs.

   This indicates that the current system certificate is too old and needs to be updated locally.

   * You can download the latest CA certificate from `https://curl.haxx.se/docs/caextract.html`.
   * Place the downloaded `cacert-xxx.pem` into the `/etc/ssl/certs/` directory, for example: `sudo cp cacert-xxx.pem /etc/ssl/certs/ca-certificates.crt`.

2. When querying, an error occurs: `ERROR 1105 (HY000): errCode = 2, detailMessage = (x.x.x.x)[CANCELLED][INTERNAL_ERROR]error setting certificate verify locations: CAfile: /etc/ssl/certs/ca-certificates.crt CApath: none`.

   ```
   yum install -y ca-certificates
   ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-certificates.crt
   ```

## Kerberos

1. When connecting to a Hive Metastore authenticated with Kerberos, an error `GSS initiate failed` is encountered.

   This is usually due to incorrect Kerberos authentication information. You can troubleshoot by following these steps:

   1. In versions prior to 1.2.1, the libhdfs3 library that Doris depends on did not enable gsasl. Please update to version 1.2.2 or later.
   2. Ensure that the correct keytab and principal are set for each component, and verify that the keytab file exists on all FE and BE nodes.
      * `hadoop.kerberos.keytab`/`hadoop.kerberos.principal`: used for Hadoop HDFS access; fill in the corresponding values for HDFS.
      * `hive.metastore.kerberos.principal`: used for the Hive Metastore.
   3. Try replacing the IP in the principal with a domain name (do not use the default `_HOST` placeholder).
   4. Ensure that the `/etc/krb5.conf` file exists on all FE and BE nodes.

   The keytab and principal can be verified on each node as shown in the sketch below.
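This is a minimal check, assuming a keytab path, principal, and realm that are placeholders; use the values configured in your catalog.

```bash
#!/usr/bin/env bash
# Run on every FE and BE node. The keytab path and principal below are placeholders.
KEYTAB=/etc/security/keytabs/doris.keytab
PRINCIPAL=doris/hadoop01.example.com@EXAMPLE.COM

# List the principals contained in the keytab; the principal configured in the
# catalog must appear here.
klist -kt "$KEYTAB"

# Try to obtain a ticket with the keytab; a failure here usually explains the
# "GSS initiate failed" error.
kinit -kt "$KEYTAB" "$PRINCIPAL"
klist
```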
2. When connecting to a Hive database through the Hive Catalog, an error occurs: `RemoteException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`.

   If the error occurs during the query while `show databases` and `show tables` work fine, follow these two steps:

   * Place core-site.xml and hdfs-site.xml in the fe/conf and be/conf directories.
   * Execute Kerberos kinit on the BE node, restart BE, and then run the query.

   When encountering the error `GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos Ticket)` while querying a table configured with Kerberos, restarting the FE and BE nodes usually resolves the issue.

   * Before restarting all nodes, configure `-Djavax.security.auth.useSubjectCredsOnly=false` in the JAVA_OPTS parameter in `"${DORIS_HOME}/be/conf/be.conf"`, so that JAAS credentials are obtained through the underlying mechanism rather than the application.
   * Refer to [JAAS Troubleshooting](https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/Troubleshooting.html) for solutions to common JAAS errors.

   To resolve the error `Unable to obtain password from user` when configuring Kerberos in the Catalog:

   * Ensure the principal used is listed in the keytab by checking with `klist -kt your.keytab`.
   * Verify the catalog configuration for missing settings such as `yarn.resourcemanager.principal`.
   * If the above checks are fine, it may be that the JDK installed by the system's package manager does not support certain encryption algorithms. Consider installing a JDK manually and setting the `JAVA_HOME` environment variable.
   * Kerberos typically uses AES-256 for encryption. For Oracle JDK, JCE must be installed. Some distributions of OpenJDK automatically provide unlimited-strength JCE, eliminating the need for a separate installation.
   * JCE versions correspond to JDK versions; download the appropriate JCE zip package for your JDK version and extract it to the `$JAVA_HOME/jre/lib/security` directory:
     * JDK6: [JCE6](http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html)
     * JDK7: [JCE7](http://www.oracle.com/technetwork/java/embedded/embedded-se/downloads/jce-7-download-432124.html)
     * JDK8: [JCE8](http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html)

   When encountering the error `java.security.InvalidKeyException: Illegal key size` while accessing HDFS with KMS, upgrade the JDK to Java 8u162 or later, or install the corresponding JCE Unlimited Strength Jurisdiction Policy Files.

   If configuring Kerberos in the Catalog results in the error `SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]`, place the `core-site.xml` file in the `"${DORIS_HOME}/be/conf"` directory.

   If accessing HDFS results in the error `No common protection layer between client and server`, ensure that the `hadoop.rpc.protection` properties on the client and server are consistent, and that authentication is set to Kerberos:

   ```
   hadoop.security.authentication = kerberos
   ```

   When using Broker Load with Kerberos configured and encountering the error `Cannot locate default realm.`: add the configuration item `-Djava.security.krb5.conf=/your-path` to the `JAVA_OPTS` in the `start_broker.sh` script.

3. When using Kerberos configuration in the Catalog, the `hadoop.username` property cannot be used at the same time.

4. Accessing Kerberos with JDK 17

   When running Doris with JDK 17 and accessing Kerberos services, you may encounter access failures due to the use of deprecated encryption algorithms. Add the `allow_weak_crypto=true` property in krb5.conf, or upgrade the encryption algorithms used by Kerberos.

## JDBC Catalog

1. Error connecting to SQL Server via JDBC Catalog: `unable to find valid certification path to requested target`

   Add the `trustServerCertificate=true` option to the `jdbc_url`.

2. Connecting to a MySQL database via JDBC Catalog results in garbled Chinese characters or incorrect Chinese query conditions

   Add `useUnicode=true&characterEncoding=utf-8` to the `jdbc_url`.

   > Note: Starting from version 1.2.3, when connecting to a MySQL database via JDBC Catalog, these parameters are added automatically.

   A hedged example of a JDBC Catalog definition with these URL options is sketched below.
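This is only a sketch under assumptions: the FE address, MySQL host, database, account, and driver file name are placeholders; the `jdbc_url` options are the ones mentioned in the items above and in item 3 below.

```bash
#!/usr/bin/env bash
# Assumed FE connection details.
FE_HOST=127.0.0.1
FE_QUERY_PORT=9030

mysql -h "$FE_HOST" -P "$FE_QUERY_PORT" -uroot <<'SQL'
-- useUnicode/characterEncoding avoid garbled Chinese characters on older versions;
-- useSSL=true (or trustServerCertificate=true for SQL Server) addresses the SSL errors above.
CREATE CATALOG mysql_jdbc PROPERTIES (
    "type" = "jdbc",
    "user" = "mysql_user",
    "password" = "mysql_password",
    "jdbc_url" = "jdbc:mysql://mysql-host:3306/test_db?useUnicode=true&characterEncoding=utf-8&useSSL=true",
    "driver_url" = "mysql-connector-java-8.0.25.jar",
    "driver_class" = "com.mysql.cj.jdbc.Driver"
);
SQL
```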
3. Error connecting to a MySQL database via JDBC Catalog: `Establishing SSL connection without server's identity verification is not recommended`

   Add `useSSL=true` to the `jdbc_url`.

4. When synchronizing MySQL data to Doris using JDBC Catalog, date data synchronization errors occur.

   Verify that the MySQL version matches the MySQL driver package; for example, MySQL 8 and above requires the driver `com.mysql.cj.jdbc.Driver`.

## Hive Catalog

1. Accessing an Iceberg or Hive table through the Hive Catalog reports an error: `failed to get schema` or `Storage schema reading not supported`

   You can try the following methods:

   * Put the `iceberg` runtime-related jar package in the lib/ directory of Hive.
   * Configure in `hive-site.xml`:

     ```
     metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader
     ```

     After the configuration is completed, restart the Hive Metastore.
   * Add `"get_schema_from_table" = "true"` to the Catalog properties. This parameter is supported since versions 2.1.10 and 3.0.6.

2. Error connecting to the Hive Catalog: `Caused by: java.lang.NullPointerException`

   If fe.log contains the following stack trace:

   ```
   Caused by: java.lang.NullPointerException
       at org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook.getFilteredObjects(AuthorizationMetaStoreFilterHook.java:78) ~[hive-exec-3.1.3-core.jar:3.1.3]
       at org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook.filterDatabases(AuthorizationMetaStoreFilterHook.java:55) ~[hive-exec-3.1.3-core.jar:3.1.3]
       at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1548) ~[doris-fe.jar:3.1.3]
       at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1542) ~[doris-fe.jar:3.1.3]
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_181]
   ```

   Try adding `"metastore.filter.hook" = "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl"` in the `create catalog` statement to resolve it.

3. If, after creating the Hive Catalog, `show tables` works fine but querying results in `java.net.UnknownHostException: xxxxx`

   Add the following to the catalog's PROPERTIES:

   ```
   'fs.defaultFS' = 'hdfs://'
   ```

4. Tables in ORC format in Hive 1.x may have system-generated column names such as `_col0`, `_col1`, `_col2` in the underlying ORC file schema. In this case, add `hive.version` as 1.x.x in the catalog configuration so the columns map to the column names of the Hive table.

   ```
   CREATE CATALOG hive PROPERTIES (
       'hive.version' = '1.x.x'
   );
   ```

5. When querying table data using the Catalog and encountering Hive Metastore errors such as `Invalid method name`, set the `hive.version` parameter.

6. When querying a table in ORC format, if the FE reports `Could not obtain block` or `Caused by: java.lang.NoSuchFieldError: types`, it may be because the FE accesses HDFS to retrieve file information and perform file splitting by default, and in some cases the FE cannot access HDFS. This can be resolved by adding the following parameter: `"hive.exec.orc.split.strategy" = "BI"`. Other options are HYBRID (default) and ETL.

7. In Hive, you can see the partition field values of a Hudi table, but in Doris, you cannot.

   Doris and Hive currently query Hudi in different ways. In Doris, you need to add the partition fields to the avsc file structure of the Hudi table. If they are not added, Doris will query with partition_val being empty (even if `hoodie.datasource.hive_sync.partition_fields=partition_val` is set).
{ "type": "record", "name": "record", "fields": [{ "name": "partition_val", "type": [ "null", "string" ], "doc": "Preset partition field, empty string when not partitioned", "default": null }, { "name": "name", "type": "string", "doc": "Name" }, { "name": "create_time", "type": "string", "doc": "Creation time" } ] } 8. When querying a Hive external table, if you encounter the error `java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found`, search for `hadoop-lzo-*.jar` in the Hadoop environment, place it in the `"${DORIS_HOME}/fe/lib/"` directory, and restart the FE. Starting from version 2.0.2, you can place this file in the `custom_lib/` directory of the FE (if it does not exist, create it manually) to prevent file loss when upgrading the cluster due to the lib directory being replaced. 9. When creating a Hive table specifying the serde as `org.apache.hadoop.hive.contrib.serde2.MultiDelimitserDe`, and encountering the error `storage schema reading not supported` when accessing the table, add the following configuration to the hive-site.xml file and restart the HMS service: metastore.storage.schema.reader.impl org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader 10. Error: `java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty`. The complete error message in the FE log is as follows: org.apache.doris.common.UserException: errCode = 2, detailMessage = S3 list path failed. path=s3://bucket/part-*,msg=errors while get file status listStatus on s3://bucket: com.amazonaws.SdkClientException: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty org.apache.doris.common.UserException: errCode = 2, detailMessage = S3 list path exception. path=s3://bucket/part-*, err: errCode = 2, detailMessage = S3 list path failed. 
    ```
    org.apache.doris.common.UserException: errCode = 2, detailMessage = S3 list path failed. path=s3://bucket/part-*,msg=errors while get file status listStatus on s3://bucket: com.amazonaws.SdkClientException: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
    org.apache.doris.common.UserException: errCode = 2, detailMessage = S3 list path exception. path=s3://bucket/part-*, err: errCode = 2, detailMessage = S3 list path failed. path=s3://bucket/part-*,msg=errors while get file status listStatus on s3://bucket: com.amazonaws.SdkClientException: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
    org.apache.hadoop.fs.s3a.AWSClientIOException: listStatus on s3://bucket: com.amazonaws.SdkClientException: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
    Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
    Caused by: javax.net.ssl.SSLException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
    Caused by: java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
    Caused by: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
    ```

    Try updating the CA certificate on the FE node using `update-ca-trust` (CentOS/RockyLinux), and then restart the FE process.

11. BE error: `java.lang.InternalError`

    If you see an error similar to the following in `be.INFO`:

    ```
    W20240506 15:19:57.553396 266457 jni-util.cpp:259] java.lang.InternalError
        at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.init(Native Method)
        at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.<init>(ZlibDecompressor.java:114)
        at org.apache.hadoop.io.compress.GzipCodec$GzipZlibDecompressor.<init>(GzipCodec.java:229)
        at org.apache.hadoop.io.compress.GzipCodec.createDecompressor(GzipCodec.java:188)
        at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:183)
        at org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.<init>(CodecFactory.java:99)
        at org.apache.parquet.hadoop.CodecFactory.createDecompressor(CodecFactory.java:223)
        at org.apache.parquet.hadoop.CodecFactory.getDecompressor(CodecFactory.java:212)
        at org.apache.parquet.hadoop.CodecFactory.getDecompressor(CodecFactory.java:43)
    ```

    It is because the Doris built-in `libz.a` conflicts with the system environment's `libz.so`. To resolve this issue, first execute `export LD_LIBRARY_PATH=/path/to/be/lib:$LD_LIBRARY_PATH`, and then restart the BE process.

12. When inserting data into Hive, an error occurs: `HiveAccessControlException Permission denied: user [user_a] does not have [UPDATE] privilege on [database/table]`.

    After data is inserted, the corresponding statistics need to be updated, and that update requires the alter privilege. Therefore, the alter privilege needs to be granted to this user in Ranger.

## HDFS

1. When accessing HDFS 3.x, if you encounter the error `java.lang.VerifyError: xxx`: in versions prior to 1.2.1, Doris depends on Hadoop 2.8. Update the Hadoop dependency to 2.10.2, or upgrade Doris to 1.2.2 or later.

2. Using Hedged Read to optimize slow HDFS reads.

   In some cases, high load on HDFS may lead to long read times for replicas on a particular HDFS node, slowing down the overall query. The HDFS client provides the Hedged Read feature for this.
   This feature starts another read thread to read the same data when a read request has not returned within a certain threshold, and whichever result returns first is used.

   Note: this feature may increase the load on the HDFS cluster, so use it judiciously.

   You can enable this feature in the catalog properties:

   ```
   create catalog regression properties (
       'type'='hms',
       'hive.metastore.uris' = 'thrift://172.21.16.47:7004',
       'dfs.client.hedged.read.threadpool.size' = '128',
       'dfs.client.hedged.read.threshold.millis' = "500"
   );
   ```

   `dfs.client.hedged.read.threadpool.size` is the number of threads used for Hedged Read, shared by one HDFS client. Typically, for one HDFS cluster, the BE nodes share one HDFS client.

   `dfs.client.hedged.read.threshold.millis` is the read threshold in milliseconds. When a read request exceeds this threshold without returning, a Hedged Read is triggered.

   When enabled, you can see the related counters in the Query Profile:

   * `TotalHedgedRead`: number of times a Hedged Read was initiated.
   * `HedgedReadWins`: number of Hedged Reads that won (were initiated and returned faster than the original request).

   Note that these values are cumulative for a single HDFS client, not per query; the same HDFS client is reused by multiple queries.

3. `Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider`

   In the start scripts of FE and BE, the environment variable `HADOOP_CONF_DIR` is added to the CLASSPATH. If `HADOOP_CONF_DIR` is set incorrectly, such as pointing to a non-existent or wrong path, the wrong xxx-site.xml files may be loaded and incorrect information read. Check whether `HADOOP_CONF_DIR` is configured correctly, or remove this environment variable.

4. `BlockMissingException: Could not obtain block: BP-XXXXXXXXX No live nodes contain current block`

   Possible solutions include:

   * Use `hdfs fsck file -files -blocks -locations` to check whether the file is healthy.
   * Check connectivity to the datanodes with `telnet`.
   * Check the datanode logs.

   If you encounter the following error:

   `org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to read expected SASL data transfer protection handshake from client at /XXX.XXX.XXX.XXX:XXXXX. Perhaps the client is running an older version of Hadoop which does not support SASL data transfer protection`

   it means that HDFS has enabled encrypted transmission but the client has not, which causes the error. Use either of the following solutions:

   * Copy hdfs-site.xml and core-site.xml to the be/conf and fe/conf directories. (Recommended)
   * In hdfs-site.xml, find the `dfs.data.transfer.protection` configuration and set this parameter in the catalog, as sketched below.
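A minimal sketch of the second option, assuming a Hive Metastore catalog; the metastore address and the protection value are placeholders, and the value must match `dfs.data.transfer.protection` in your hdfs-site.xml.

```bash
#!/usr/bin/env bash
# Assumed FE connection details.
FE_HOST=127.0.0.1
FE_QUERY_PORT=9030

mysql -h "$FE_HOST" -P "$FE_QUERY_PORT" -uroot <<'SQL'
CREATE CATALOG hive_secure PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://hms-host:9083',
    -- Use the value configured in your cluster's hdfs-site.xml.
    'dfs.data.transfer.protection' = 'integrity'
);
SQL
```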
## DLF Catalog

1. When using the DLF Catalog, if `Invalid address` occurs while BE reads JindoFS data, add the domain names appearing in the logs to the IP mapping in `/etc/hosts`.

2. If there is no permission to read data, use the `hadoop.username` property to specify a user with permission.

3. The metadata in the DLF Catalog should be consistent with DLF. When metadata is managed with DLF, newly imported partitions in Hive may not be synchronized by DLF, leading to inconsistencies between DLF and Hive metadata. To address this, ensure that Hive metadata is fully synchronized by DLF.

## Other Issues

1. Query results contain garbled characters after mapping a Binary type to Doris

   Doris does not natively support the Binary type, so Binary types from data lakes or databases are usually mapped to Doris with the String type. The String type can only display printable characters. If you need to query the content of Binary data, you can use the `TO_BASE64()` function to convert it to Base64 encoding before further processing.

2. Analyzing Parquet files

   When querying Parquet files, because Parquet files generated by different systems may differ, for example in the number of RowGroups and in index values, it is sometimes necessary to inspect the metadata of a Parquet file for problem identification or performance analysis. Here is a tool to help users analyze Parquet files more conveniently:

   1. Download and unzip [Apache Parquet Cli 1.14.0](https://github.com/morningman/tools/releases/download/apache-parquet-cli-1.14.0/apache-parquet-cli-1.14.0.tar.xz)
   2. Download the Parquet file to be analyzed to your local machine, assuming the path is `/path/to/file.parquet`
   3. Use the following command to analyze the metadata of the Parquet file: `./parquet-tools meta /path/to/file.parquet`
   4. For more functionality, refer to the [Apache Parquet Cli documentation](https://github.com/apache/parquet-java/tree/apache-parquet-1.14.0/parquet-cli)

---

# Source: https://docs.velodb.io/cloud/4.x/best-practice/load-faq
Version: 4.x

# Load FAQ

## General Load FAQ

### Error "[DATA_QUALITY_ERROR] Encountered unqualified data"

**Problem Description**: Data quality error during loading.

**Solution**:

* Stream Load and Insert Into operations return an error URL, while for Broker Load you can find the error URL through the `SHOW LOAD` command.
* Use a browser or the curl command to access the error URL and view the specific data quality error reasons; see the sketch after this list.
* Use the `strict_mode` and `max_filter_ratio` parameters to control the acceptable error rate.
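A minimal sketch of looking up and fetching the error URL for a Broker Load job; the FE address, database, and label are placeholders, and the URL itself must be copied from the `SHOW LOAD` output.

```bash
#!/usr/bin/env bash
# Assumed FE connection details and load label -- adjust to your environment.
FE_HOST=127.0.0.1
FE_QUERY_PORT=9030
LOAD_LABEL=broker_load_example_label

# For Broker Load, find the error URL in the SHOW LOAD output (URL column).
mysql -h "$FE_HOST" -P "$FE_QUERY_PORT" -uroot -e \
  "SHOW LOAD FROM example_db WHERE LABEL = '$LOAD_LABEL'\G"

# Fetch the reported error URL (copied from the output above) to see the
# concrete rows and reasons that were filtered out.
curl "http://<error_url_from_show_load_output>"
```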
### Extra "\r" in the last column of CSV file​ **Problem Description** : Usually caused by Windows line endings. **Solution** : Specify the correct line delimiter: `-H "line_delimiter:\r\n"` ### CSV data with quotes imported as null​ **Problem Description** : CSV data with quotes becomes null after import. **Solution** : Use the `trim_double_quotes` parameter to remove double quotes around fields. ## Stream Load​ ### Reasons for Slow Loading​ * Bottlenecks in CPU, IO, memory, or network card resources. * Slow network between client machine and BE machines, can be initially diagnosed through ping latency from client to BE machines. * Webserver thread count bottleneck, too many concurrent Stream Loads on a single BE (exceeding be.conf webserver_num_workers configuration) may cause thread count bottleneck. * Memtable Flush thread count bottleneck, check BE metrics doris_be_flush_thread_pool_queue_size to see if queuing is severe. Can be resolved by increasing the be.conf flush_thread_num_per_store parameter. ### Handling Special Characters in Column Names​ When column names contain special characters, use single quotes with backticks to specify the columns parameter: curl --location-trusted -u root:"" \ -H 'columns:`@coltime`,colint,colvar' \ -T a.csv \ -H "column_separator:," \ http://127.0.0.1:8030/api/db/loadtest/_stream_load ## Routine Load​ ### Major Bug Fixes​ Issue Description| Trigger Conditions| Impact Scope| Temporary Solution| Affected Versions| Fixed Versions| Fix PR| When at least one job times out while connecting to Kafka, it affects the import of other jobs, slowing down global Routine Load imports.| At least one job times out while connecting to Kafka.| Shared-nothing and shared-storage| Stop or manually pause the job to resolve the issue.| <2.1.9 <3.0.5| 2.1.9 3.0.5| [#47530](https://github.com/apache/doris/pull/47530)| User data may be lost after restarting the FE Master.| The job's offset is set to OFFSET_END, and the FE is restarted.| Shared-storage| Change the consumption mode to OFFSET_BEGINNING.| 3.0.2-3.0.4| 3.0.5| [#46149](https://github.com/apache/doris/pull/46149)| A large number of small transactions are generated during import, causing compaction to fail and resulting in continuous -235 errors.| Doris consumes data too quickly, or Kafka data flow is in small batches.| Shared-nothing and shared-storage| Pause the Routine Load job and execute the following command: `ALTER ROUTINE LOAD FOR jobname FROM kafka ("property.enable.partition.eof" = "false");`| <2.1.8 <3.0.4| 2.1.8 3.0.4| [#45528](https://github.com/apache/doris/pull/45528), [#44949](https://github.com/apache/doris/pull/44949), [#39975](https://github.com/apache/doris/pull/39975)| Kafka third-party library destructor hangs, causing data consumption to fail.| Kafka topic deletion (possibly other conditions).| Shared-nothing and shared-storage| Restart all BE nodes.| <2.1.8 <3.0.4| 2.1.8 3.0.4| [#44913](https://github.com/apache/doris/pull/44913)| Routine Load scheduling hangs.| Timeout occurs when FE aborts a transaction in Meta Service.| Shared- storage| Restart the FE node.| <3.0.2| 3.0.2| [#41267](https://github.com/apache/doris/pull/41267)| Routine Load restart issue.| Restarting BE nodes.| Shared-nothing and shared-storage| Manually resume the job.| <2.1.7 <3.0.2| 2.1.7 3.0.2| [#41134](https://github.com/apache/doris/pull/41134) ---|---|---|---|---|---|--- ### Default Configuration Optimizations​ Optimization Content| Applied Versions| Corresponding PR| Increased the timeout duration for Routine Load.| 
### Default Configuration Optimizations

| Optimization Content | Applied Versions | Corresponding PR |
|---|---|---|
| Increased the timeout duration for Routine Load. | 2.1.7, 3.0.3 | [#42042](https://github.com/apache/doris/pull/42042), [#40818](https://github.com/apache/doris/pull/40818) |
| Adjusted the default value of `max_batch_interval`. | 2.1.8, 3.0.3 | [#42491](https://github.com/apache/doris/pull/42491) |
| Removed the restriction on `max_batch_interval`. | 2.1.5, 3.0.0 | [#29071](https://github.com/apache/doris/pull/29071) |
| Adjusted the default values of `max_batch_rows` and `max_batch_size`. | 2.1.5, 3.0.0 | [#36632](https://github.com/apache/doris/pull/36632) |

### Observability Optimizations

| Optimization Content | Applied Versions | Corresponding PR |
|---|---|---|
| Added observability-related metrics. | 3.0.5 | [#48209](https://github.com/apache/doris/pull/48209), [#48171](https://github.com/apache/doris/pull/48171), [#48963](https://github.com/apache/doris/pull/48963) |

### Error "failed to get latest offset"

**Problem Description**: Routine Load cannot get the latest Kafka offset.

**Common Causes**:

* Usually due to network connectivity issues with Kafka. Verify by pinging or using telnet to test the Kafka domain name.
* A timeout caused by a third-party library bug, with the error: `java.util.concurrent.TimeoutException: Waited X seconds`

### Error "failed to get partition meta: Local:'Broker transport failure"

**Problem Description**: Routine Load cannot get the Kafka topic partition metadata.

**Common Causes**:

* Usually due to network connectivity issues with Kafka. Verify by pinging or using telnet to test the Kafka domain name.
* If domain names are used, try configuring the domain name mapping in /etc/hosts.

### Error "Broker: Offset out of range"

**Problem Description**: The consumed offset does not exist in Kafka, possibly because it has been cleaned up by Kafka.

**Solution**:

* Specify a new offset for consumption, for example, set the offset to OFFSET_BEGINNING, as sketched below.
* Set appropriate Kafka log cleanup parameters based on the import speed: log.retention.hours, log.retention.bytes, etc.
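A minimal sketch of resetting the offset, assuming placeholder database, job, and partition values; adjust them to your own Routine Load job.

```bash
#!/usr/bin/env bash
# Assumed FE connection details; database, job name, and partition list are placeholders.
FE_HOST=127.0.0.1
FE_QUERY_PORT=9030

mysql -h "$FE_HOST" -P "$FE_QUERY_PORT" -uroot <<'SQL'
-- The job must be paused before it can be altered.
PAUSE ROUTINE LOAD FOR example_db.example_job;

-- Restart consumption from the earliest offset still available in Kafka.
ALTER ROUTINE LOAD FOR example_db.example_job
FROM kafka
(
    "kafka_partitions" = "0",
    "kafka_offsets" = "OFFSET_BEGINNING"
);

RESUME ROUTINE LOAD FOR example_db.example_job;
SQL
```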
---

# Source: https://docs.velodb.io/cloud/4.x/best-practice/sql-faq
Version: 4.x

# SQL Error

### Q1. The information shown by `show backends/frontends` is incomplete

After executing statements such as `show backends/frontends`, some columns may be incomplete in the results. For example, disk capacity information is missing from the `show backends` result.

Usually this problem occurs when the cluster has multiple FEs. If users connect to a non-Master FE node to execute these statements, they will see incomplete information, because some information exists only on the Master FE node, for example BE disk usage. Complete information is therefore only available from a direct connection to the Master FE.

Of course, users can also execute `set forward_to_master=true;` before executing these statements. After this session variable is set to true, subsequent information-viewing statements are automatically forwarded to the Master FE to obtain the results. In this way, the complete result can be obtained no matter which FE the user is connected to.

### Q2. invalid cluster id: xxxx

This error may appear in the results of the `show backends` or `show frontends` commands, usually in the error message column of an FE or BE node. The meaning of this error is that after the Master FE sends heartbeat information to the node, the node finds that the cluster id carried in the heartbeat is different from the cluster id stored locally, and therefore refuses to respond to the heartbeat.

The Master FE of Doris actively sends heartbeats to each FE and BE node and carries a cluster_id in the heartbeat. The cluster_id is the unique cluster ID generated by the Master FE when the cluster is initialized. When an FE or BE receives heartbeat information for the first time, it saves the cluster_id locally as a file. For FE, the file is in the image/ directory of the metadata directory; BE has a cluster_id file in each of its data directories. After that, each time the node receives a heartbeat, it compares the local cluster_id with the one in the heartbeat, and refuses to respond if they are inconsistent.

This mechanism is a node authentication mechanism that prevents the node from accepting false heartbeat messages sent by nodes outside the cluster.

To recover from this error, first make sure that all the nodes belong to the correct cluster. After that, for an FE node, you can try to modify the cluster_id value in the image/VERSION file in the metadata directory and restart the FE. For a BE node, you can delete the cluster_id files in all data directories and restart the BE.

### Q3. Unique Key model query results are inconsistent

In some cases, when a user queries a Unique Key model table with the same SQL, the results of multiple queries may be inconsistent, usually alternating between two or three variants.

This may be because, within the same batch of imported data, there are rows with the same key but different values, which leads to inconsistent results between replicas due to the nondeterministic order in which the rows overwrite each other.

For example, the table is defined as (k1, v1), and a batch of imported data is:

```
1, "abc"
1, "def"
```

Then the result of replica 1 may be `1, "abc"`, while the result of replica 2 may be `1, "def"`, so the query results are inconsistent.

To ensure a unique ordering of data between different replicas, you can use the [Sequence Column](/cloud/4.x/user-guide/data-modification/update/update-of-unique-model) feature.

### Q4. Querying bitmap/hll type data returns NULL

In version 1.1.x, when vectorization is enabled and a bitmap type field in the queried table returns a NULL result:

1. First set `set return_object_data_as_binary=true;`
2. Turn off vectorization: `set enable_vectorized_engine=false;`
3. Turn off the SQL cache: `set [global] enable_sql_cache = false;`

This is because of how the bitmap/hll types behave in the vectorized execution engine: when the input is all NULL, the output result is also NULL instead of 0.
### Q5. Error when accessing object storage: curl 77: Problem with the SSL CA cert

If the `curl 77: Problem with the SSL CA cert` error appears in the be.INFO log, you can try to solve it in the following way:

1. Download the certificate `cacert.pem`.
2. Copy the certificate to the specified location: `sudo cp /tmp/cacert.pem /etc/ssl/certs/ca-certificates.crt`
3. Restart the BE node.

### Q6. Import error: "Message": "[INTERNAL_ERROR]single replica load is disabled on BE."

1. Make sure the `enable_single_replica_load` parameter in be.conf is set to true.
2. Restart the BE node.

---

# Source: https://docs.velodb.io/cloud/4.x/ecosystem/observability/logstash
Version: 4.x

# Logstash Doris output plugin

## Introduction

Logstash is a log ETL framework (collect, preprocess, send to storage systems) that supports custom output plugins to write data into storage systems. The Logstash Doris output plugin is a plugin for outputting data to Doris.

The Logstash Doris output plugin calls the [Doris Stream Load](/cloud/4.x/user-guide/data-operate/import/import-way/stream-load-manual) HTTP interface to write data into Doris in real time, offering capabilities such as multi-threaded concurrency, failure retries, custom Stream Load formats and parameters, and write speed logging.

Using the Logstash Doris output plugin mainly involves three steps:

1. Install the plugin into Logstash
2. Configure the Doris output address and other parameters
3. Start Logstash to write data into Doris in real time

## Installation

### Obtaining the Plugin

You can download the plugin from the official website or compile it from the source code yourself.

* Download from the official website
* Installation package without dependencies
* Compile from source code

```
cd extension/logstash/
gem build logstash-output-doris.gemspec
```

### Installing the Plugin

* Standard Installation

`${LOGSTASH_HOME}` is the installation directory of Logstash. Run the `bin/logstash-plugin` command under it to install the plugin.

```
${LOGSTASH_HOME}/bin/logstash-plugin install logstash-output-doris-1.2.0.gem

Validating logstash-output-doris-1.2.0.gem
Installing logstash-output-doris
Installation successful
```

The standard installation mode automatically installs the ruby modules that the plugin depends on. In environments without network access, it will get stuck and cannot complete. In such cases, you can download the zip installation package that bundles the dependencies for a completely offline installation, noting that you need to use `file://` to specify the local file system.
* Offline Installation

```
${LOGSTASH_HOME}/bin/logstash-plugin install file:///tmp/logstash-output-doris-1.2.0.zip

Installing file: logstash-output-doris-1.2.0.zip
Resolving dependencies.........................
Install successful
```

## Configuration

The configuration for the Logstash Doris output plugin is as follows:

| Configuration | Description |
|---|---|
| `http_hosts` | Stream Load HTTP address, formatted as a string array with one or more elements, each element being host:port. For example: ["http://fe1:8030", "http://fe2:8030"] |
| `user` | Doris username; this user needs import permissions for the corresponding Doris database and table |
| `password` | Password for the Doris user |
| `db` | The Doris database name to write into |
| `table` | The Doris table name to write into |
| `label_prefix` | Doris Stream Load label prefix; the final generated label is `{label_prefix}_{db}_{table}_{yyyymmdd_hhmmss}_{uuid}`, the default value is logstash |
| `headers` | Doris Stream Load headers parameter; the syntax is a ruby map, for example: headers => { "format" => "json", "read_json_by_line" => "true" } |
| `mapping` | Mapping from Logstash fields to Doris table fields; refer to the usage examples in the following sections |
| `message_only` | A special form of mapping that only outputs the Logstash @message field to Doris, default is false |
| `max_retries` | Number of retries for Doris Stream Load requests on failure, default is -1 for infinite retries to ensure data reliability |
| `log_request` | Whether to output Doris Stream Load request and response metadata in logs for troubleshooting, default is false |
| `log_speed_interval` | Time interval for logging the write speed, in seconds, default is 10; setting it to 0 disables this type of logging |

## Usage Example

### TEXT Log Collection Example

This example demonstrates TEXT log collection using Doris FE logs as an example.

**1\. Data**

FE log files are typically located in the fe/log/fe.log file under the Doris installation directory. They are typical Java program logs, including fields such as timestamp, log level, thread name, code location, and log content. They contain not only normal logs but also exception logs with stacktraces, which span multiple lines. Log collection and storage need to combine the main log and the stacktrace into a single log entry.

```
2024-07-08 21:18:01,432 INFO (Statistics Job Appender|61) [StatisticsJobAppender.runAfterCatalogReady():70] Stats table not available, skip
2024-07-08 21:18:53,710 WARN (STATS_FETCH-0|208) [StmtExecutor.executeInternalQuery():3332] Failed to run internal SQL: OriginStatement{originStmt='SELECT * FROM __internal_schema.column_statistics WHERE part_id is NULL ORDER BY update_time DESC LIMIT 500000', idx=0}
org.apache.doris.common.UserException: errCode = 2, detailMessage = tablet 10031 has no queryable replicas. err: replica 10032's backend 10008 does not exist or not alive
        at org.apache.doris.planner.OlapScanNode.addScanRangeLocations(OlapScanNode.java:931) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.planner.OlapScanNode.computeTabletInfo(OlapScanNode.java:1197) ~[doris-fe.jar:1.2-SNAPSHOT]
```
CREATE TABLE `doris_log` ( `log_time` datetime NULL COMMENT 'log content time', `collect_time` datetime NULL COMMENT 'log agent collect time', `host` text NULL COMMENT 'hostname or ip', `path` text NULL COMMENT 'log file path', `type` text NULL COMMENT 'log type', `level` text NULL COMMENT 'log level', `thread` text NULL COMMENT 'log thread', `position` text NULL COMMENT 'log code position', `message` text NULL COMMENT 'log message', INDEX idx_host (`host`) USING INVERTED COMMENT '', INDEX idx_path (`path`) USING INVERTED COMMENT '', INDEX idx_type (`type`) USING INVERTED COMMENT '', INDEX idx_level (`level`) USING INVERTED COMMENT '', INDEX idx_thread (`thread`) USING INVERTED COMMENT '', INDEX idx_position (`position`) USING INVERTED COMMENT '', INDEX idx_message (`message`) USING INVERTED PROPERTIES("parser" = "unicode", "support_phrase" = "true") COMMENT '' ) ENGINE=OLAP DUPLICATE KEY(`log_time`) COMMENT 'OLAP' PARTITION BY RANGE(`log_time`) () DISTRIBUTED BY RANDOM BUCKETS 10 PROPERTIES ( "replication_num" = "1", "dynamic_partition.enable" = "true", "dynamic_partition.time_unit" = "DAY", "dynamic_partition.start" = "-7", "dynamic_partition.end" = "1", "dynamic_partition.prefix" = "p", "dynamic_partition.buckets" = "10", "dynamic_partition.create_history_partition" = "true", "compaction_policy" = "time_series" ); **3\. Logstash Configuration** Logstash mainly has two types of configuration files: one for the entire Logstash system and another for a specific log collection. The configuration file for the entire Logstash system is usually located at config/logstash.yml. To improve performance when writing to Doris, it is necessary to modify the batch size and batch delay. For logs with an average size of a few hundred bytes per line, a batch size of 1,000,000 lines and a batch delay of 10 seconds are recommended. pipeline.batch.size: 1000000 pipeline.batch.delay: 10000 The configuration file for a specific log collection, such as logstash_doris_log.conf, mainly consists of three parts corresponding to the various stages of ETL: 1. Input is responsible for reading the raw data. 2. Filter is responsible for data transformation. 3. Output is responsible for sending the data to the output destination. # 1. input is responsible for reading raw data # File input is an input plugin that reads the log file at the configured path. It uses the multiline codec to concatenate lines that do not start with a timestamp to the end of the previous line, achieving the effect of merging stacktraces with the main log. File input saves the log content in the @message field, and there are also some metadata fields such as host, log.file.path. Here, we manually add a field named type through add_field, with its value set to fe.log. input { file { path => "/mnt/disk2/xiaokang/opt/doris_master/fe/log/fe.log" add_field => {"type" => "fe.log"} codec => multiline { # valid line starts with timestamp pattern => "^%{TIMESTAMP_ISO8601} " # any line not starting with a timestamp should be merged with the previous line negate => true what => "previous" } } } # 2. filter section is responsible for data transformation # grok is a commonly used data transformation plugin that has some built-in patterns, such as TIMESTAMP_ISO8601 for parsing timestamps, and also supports writing regular expressions to extract fields. filter { grok { match => { # parse log_time, level, thread, position fields from message "message" => "%{TIMESTAMP_ISO8601:log_time} (?<level>[A-Z]+) \((?<thread>[^\[]*)\) \[(?<position>[^\]]*)\]" } } } # 3.
output section is responsible for data output # Doris output sends data to Doris using the Stream Load HTTP interface. The data format for Stream Load is specified as JSON through the headers parameter, and the mapping parameter specifies the mapping from Logstash fields to JSON fields. Since headers specify "format" => "json", Stream Load will automatically parse the JSON fields and write them into the corresponding fields of the Doris table. output { doris { http_hosts => ["http://localhost:8630"] user => "root" password => "" db => "log_db" table => "doris_log" headers => { "format" => "json" "read_json_by_line" => "true" "load_to_single_tablet" => "true" } mapping => { "log_time" => "%{log_time}" "collect_time" => "%{@timestamp}" "host" => "%{[host][name]}" "path" => "%{[log][file][path]}" "type" => "%{type}" "level" => "%{level}" "thread" => "%{thread}" "position" => "%{position}" "message" => "%{message}" } log_request => true } } **4\. Running Logstash** ${LOGSTASH_HOME}/bin/logstash -f config/logstash_doris_log.conf # When log_request is set to true, the log will output the request parameters and response results of each Stream Load. [2024-07-08T22:35:34,772][INFO ][logstash.outputs.doris ][main][e44d2a24f17d764647ce56f5fed24b9bbf08d3020c7fddcc3298800daface80a] doris stream load response: { "TxnId": 45464, "Label": "logstash_log_db_doris_log_20240708_223532_539_6c20a0d1-dcab-4b8e-9bc0-76b46a929bd1", "Comment": "", "TwoPhaseCommit": "false", "Status": "Success", "Message": "OK", "NumberTotalRows": 452, "NumberLoadedRows": 452, "NumberFilteredRows": 0, "NumberUnselectedRows": 0, "LoadBytes": 277230, "LoadTimeMs": 1797, "BeginTxnTimeMs": 0, "StreamLoadPutTimeMs": 18, "ReadDataTimeMs": 9, "WriteDataTimeMs": 1758, "CommitAndPublishTimeMs": 18 } # By default, speed information is logged every 10 seconds, including the amount of data since startup (in MB and ROWS), the total speed (in MB/s and R/s), and the speed in the last 10 seconds. [2024-07-08T22:35:38,285][INFO ][logstash.outputs.doris ][main] total 11 MB 18978 ROWS, total speed 0 MB/s 632 R/s, last 10 seconds speed 1 MB/s 1897 R/s ### JSON Log Collection Example​ This example demonstrates JSON log collection using data from the GitHub events archive. **1\. Data** The GitHub events archive contains archived data of GitHub user actions, formatted as JSON. It can be downloaded from [here](https://data.gharchive.org/), for example, the data for January 1, 2024, at 3 PM. wget https://data.gharchive.org/2024-01-01-15.json.gz Below is a sample of the data. Normally, each piece of data is on a single line, but for ease of display, it has been formatted here. { "id": "37066529221", "type": "PushEvent", "actor": { "id": 46139131, "login": "Bard89", "display_login": "Bard89", "gravatar_id": "", "url": "https://api.github.com/users/Bard89", "avatar_url": "https://avatars.githubusercontent.com/u/46139131?" }, "repo": { "id": 780125623, "name": "Bard89/talk-to-me", "url": "https://api.github.com/repos/Bard89/talk-to-me" }, "payload": { "repository_id": 780125623, "push_id": 17799451992, "size": 1, "distinct_size": 1, "ref": "refs/heads/add_mvcs", "head": "f03baa2de66f88f5f1754ce3fa30972667f87e81", "before": "85e6544ede4ae3f132fe2f5f1ce0ce35a3169d21" }, "public": true, "created_at": "2024-04-01T23:00:00Z" } **2\. 
Table Creation** CREATE DATABASE log_db; USE log_db; CREATE TABLE github_events ( `created_at` DATETIME, `id` BIGINT, `type` TEXT, `public` BOOLEAN, `actor.id` BIGINT, `actor.login` TEXT, `actor.display_login` TEXT, `actor.gravatar_id` TEXT, `actor.url` TEXT, `actor.avatar_url` TEXT, `repo.id` BIGINT, `repo.name` TEXT, `repo.url` TEXT, `payload` TEXT, `host` TEXT, `path` TEXT, INDEX `idx_id` (`id`) USING INVERTED, INDEX `idx_type` (`type`) USING INVERTED, INDEX `idx_actor.id` (`actor.id`) USING INVERTED, INDEX `idx_actor.login` (`actor.login`) USING INVERTED, INDEX `idx_repo.id` (`repo.id`) USING INVERTED, INDEX `idx_repo.name` (`repo.name`) USING INVERTED, INDEX `idx_host` (`host`) USING INVERTED, INDEX `idx_path` (`path`) USING INVERTED, INDEX `idx_payload` (`payload`) USING INVERTED PROPERTIES("parser" = "unicode", "support_phrase" = "true") ) ENGINE = OLAP DUPLICATE KEY(`created_at`) PARTITION BY RANGE(`created_at`) () DISTRIBUTED BY RANDOM BUCKETS 10 PROPERTIES ( "replication_num" = "1", "compaction_policy" = "time_series", "enable_single_replica_compaction" = "true", "dynamic_partition.enable" = "true", "dynamic_partition.create_history_partition" = "true", "dynamic_partition.time_unit" = "DAY", "dynamic_partition.start" = "-30", "dynamic_partition.end" = "1", "dynamic_partition.prefix" = "p", "dynamic_partition.buckets" = "10", "dynamic_partition.replication_num" = "1" ); **3\. Logstash Configuration** The configuration file differs from the previous TEXT log collection in the following aspects: 1. The codec parameter for file input is json. Logstash will parse each line of text as JSON format and use the parsed fields for subsequent processing. 2. No filter plugin is used because no additional processing or transformation is needed. input { file { path => "/tmp/github_events/2024-04-01-23.json" codec => json } } output { doris { http_hosts => ["http://fe1:8630", "http://fe2:8630", "http://fe3:8630"] user => "root" password => "" db => "log_db" table => "github_events" headers => { "format" => "json" "read_json_by_line" => "true" "load_to_single_tablet" => "true" } mapping => { "created_at" => "%{created_at}" "id" => "%{id}" "type" => "%{type}" "public" => "%{public}" "actor.id" => "%{[actor][id]}" "actor.login" => "%{[actor][login]}" "actor.display_login" => "%{[actor][display_login]}" "actor.gravatar_id" => "%{[actor][gravatar_id]}" "actor.url" => "%{[actor][url]}" "actor.avatar_url" => "%{[actor][avatar_url]}" "repo.id" => "%{[repo][id]}" "repo.name" => "%{[repo][name]}" "repo.url" => "%{[repo][url]}" "payload" => "%{[payload]}" "host" => "%{[host][name]}" "path" => "%{[log][file][path]}" } log_request => true } } **4\. Running Logstash** ${LOGSTASH_HOME}/bin/logstash -f logstash_github_events.conf On This Page * Introduction * Installation * Obtaining the Plugin * Installing the Plugin * Configuration * Usage Example * TEXT Log Collection Example * JSON Log Collection Example --- # Source: https://docs.velodb.io/cloud/4.x/getting-started/overview Version: 4.x On this page # Introduction VeloDB Cloud is a new generation of multi-cloud native real-time data warehouse based on Apache Doris, focusing on meeting the real-time analysis needs of enterprise-level big data, and providing customers with extremely cost-effective, easy-to-use data analysis services. VeloDB Cloud is publicly available to customers. 
If customers want to deploy VeloDB data warehouse to AWS (Amazon Web Services), Microsoft Azure, GCP (Google Cloud Platform), please visit and log in to [VeloDB Cloud](https://www.velodb.cloud/passport/login). ## Key Features​ * **Extreme Performance** : In terms of storage, VeloDB Cloud adopts efficient columnar storage and data indexing; in terms of computing, VeloDB Cloud relies on the MPP distributed computing architecture and the vectorized execution engine optimized for X64 and ARM64; VeloDB Cloud is at the global leading level in the ClickBench public performance evaluation. * **Cost-Effective** : VeloDB Cloud adopts a cloud-native architecture that separates storage and computing, and is designed and developed based on cloud infrastructure. In terms of storage, shared object storage achieves extremely low cost; in terms of computing, VeloDB Cloud supports on-demand scaling and start-stop to maximize resource utilization. * **Easy-to-Use** : One-click deployment, out-of-the-box; supports MySQL-compatible network connection protocols; provides integrated connectors with Kafka/Flink/Spark/DBT; has a powerful and easy-to-use visual operation and maintenance management console and data development tools. * **Single-Unified** : On a single product, multiple analytical workloads can be run. Supports real-time/interactive/batch computing types, structured/semi-structured data types, and federated analysis of external data lakes (such as Hive, Iceberg, Hudi, etc.) and databases (such as MySQL, Elasticsearch, etc.). * **Open** : Based on the open source Apache Doris research and development, VeloDB Cloud continue to contribute innovations to the open source community. VeloDB Cloud is fully compatible with the Apache Doris syntax protocol, and can freely migrate data with Apache Doris. Continue to be compatible and mutually certified with domestic and foreign ecological products and tools. Open cooperation with cloud platforms at home and abroad, the product runs on multiple clouds, providing a consistent user experience. * **Safe and Stable** : In terms of data security, VeloDB Cloud provides complete authority control, data encryption, backup and recovery mechanisms; in terms of operation and maintenance management, VeloDB Cloud provides comprehensive observability metrics collection and visual management of data warehouse service; in terms of technical support, VeloDB Cloud has a complete ticketing management system and remote assistance platform, providing multiple levels of expert support services. ## Key Concepts​ ![key concepts of velodb](/assets/images/key-concepts-of- velodb-64d2c1b34cbd3b005929b9844fd088f7.jpeg) Key Concepts of VeloDB Cloud * **Organization** : An organization represents an enterprise or a relatively independent group, and users can use the service as an organization after registering with VeloDB Cloud. Organizations are billing and settlement objects in VeloDB Cloud, and billing, resources, and data between different organizations are isolated from each other. * **Warehouse** : A warehouse is a logical concept that includes computing and storage resources. Each organization can create multiple warehouses to meet the data analysis needs of different businesses, such as orders, advertising, logistics and other businesses. Similarly, resources and data between different warehouses are also isolated from each other, which can be used to meet the security requirements within the organization. 
* **Cluster** : A cluster is a computing resource in the warehouse, including one or more computing nodes, which can be elastically scaled. A warehouse can contain multiple clusters, which share the underlying data. Different clusters can meet different workloads, such as statistical reports, interactive analysis, etc., and the workloads between multiple clusters do not interfere with each other. * **Storage** : Use a mature and stable object storage system to store the full amount of data, and support multi-computing cluster shared storage, which brings extremely low storage cost, high data reliability and almost unlimited storage capacity to the data warehouse, and greatly simplifies the implementation complexity of the upper computing cluster. ## Product Architecture​ ![velodb cloud architecture](/assets/images/velodb-cloud- architecture-401e06237d303ed1bd91a3b21b6065fa.jpeg) Cloud-Native Storage and Computing Separation Architecture * **Cloud Service Layer** : The cloud service layer is a collection of supporting services provided by VeloDB Cloud, including: authentication, access control, cloud infrastructure management, metadata management, query parsing and optimization, etc., expressed in the form of a "warehouse". Warehouses are isolated from each other. * **Computing Cluster Layer** : The computing layer is decoupled from the storage layer, supporting flexible elastic scaling and smooth upgrades. The computing layer consists of several computing clusters. Multiple computing clusters share storage, and workloads are isolated between multiple clusters. Each cluster contains one or more computing nodes. Computing nodes use high-speed hard disks to build hot data caches (Cache), and avoid unnecessary cold data reading through leading query optimizers and rich indexing technologies, which significantly optimizes the problem of high response delay of object storage, providing customers with the ultimate data analysis performance. * **Shared Storage Layer** : The bottom layer of VeloDB Cloud uses cheap, highly available, and nearly infinitely scalable object storage as the shared storage layer, and is based on object storage for deep optimization design, which can help customers reduce the cost of data analysis by multiples, and easily support PB-level data analysis needs. The unified standard and maturity of object storage in different cloud environments also strengthens the consistent use experience of VeloDB Cloud in multiple clouds. ## Application Scenario​ * **High Concurrent Real-time Reporting and Analysis** : Use VeloDB Cloud to process online high-concurrency reports to obtain real-time, fast, stable, and highly available services. It supports real-time data writing, sub-second query response, and high-concurrency point queries to meet the high-availability deployment requirements of clusters. * **User Portrait and Behavior Analysis** : Based on VeloDB Cloud, build user CDP (Customer Data management Platform) data warehouse platform layering, support millisecond-level column addition and dynamic tables to flexibly respond to business changes, support rich behavior analysis functions to simplify development and improve efficiency, and support high-level orthogonal bitmaps to achieve second-level circle people in portrait scenes. 
* **Log Storage and Analysis** : Integrating the VeloDB Cloud data warehouse into the logging system to realize real-time log query, low-cost storage, and efficient processing, reduce the overall cost of the enterprise log system, and improve the performance and reliability of the log system. * **Lake Warehouse Integration and Federated Analysis** : Unified integration of data lakes, databases, and data warehouses into a single platform, relying on the data federation query acceleration capability of VeloDB Cloud, provides high-performance business intelligence reports, Adhoc analysis, and incremental ETL/ELT data processing services. ## Relationship to Apache Doris​ VeloDB Inc ("**VeloDB** ") is a commercial company with products based on Apache Doris. VeloDB was founded in May 2023 by the founding team of Apache Doris. VeloDB is an important driving force of Apache Doris. It has 7 PMC members and 20 Committers, and has led the release of a series of core versions of Apache Doris. VeloDB vigorously promotes the open source Apache Doris, the technology benefits open source users and developers, and launches commercial products based on Apache Doris, the business empowers commercial customers, and the two-wheel drive achieves healthy growth of open source and business. VeloDB Cloud is a new generation of multi-cloud native real-time data warehouse built by VeloDB based on Apache Doris. Compared with Apache Doris, VeloDB Cloud has the following main differences: * The core version is more mature and stable, with more enterprise-level features and cloud-native features. * Provides a built-in visualized operation and maintenance management console and data development tools, no need users to install and deploy, out-of-the-box, minimalist operation and maintenance and management. On This Page * Key Features * Key Concepts * Product Architecture * Application Scenario * Relationship to Apache Doris --- # Source: https://docs.velodb.io/cloud/4.x/getting-started/quick-start Version: 4.x On this page # Getting Started ## New User Registration and Organization Creation​ ### Register and Login​ Click to enter the VeloDB Cloud registration and trial page and fill in the relevant information to complete the registration. ![user register](/assets/images/user- register-f4bf7408e0671addc942afba8a108c54.png) > **Tip** VeloDB Cloud includes two independent account systems: One is used > for logging into the console, as described in this topic. The other one is > used to connect to the warehouse, which is described in the Connections > topic. ### Change Password​ After login, click **User Menu** > **User Center** to change the login password for the VeloDB Cloud console. ![change user password](/assets/images/change-user- password-9531a1cd03598f385130a912a4035ef2.png) Once you have successfully changed the password for the first time, you can use the password for subsequent logins. ## Warehouse and Cluster Creation​ In VeloDB Cloud, the warehouse is a logical concept that includes physical objects such as warehouse metadata, clusters, and data storage. Under each organization, you can create multiple warehouses to meet the needs of different business systems, and the resources and data between these warehouses are isolated. ### Create Warehouse​ A wizard page will be displayed if the organization does not have a warehouse. You can create the first warehouse following the prompts. 
![create warehouse](/assets/images/create- warehouse-1556c66eb56b8e952c634f5852ed8361.jpg) You can use a ​free-tier warehouse​ or directly ​purchase a paid warehouse​ based on your analytical requirements. > **Tip:** > > 1. For more information about SaaS and BYOC, see [Overview of > Warehouses](/cloud/4.x/management-guide/warehouse-management/). > 2. If you need to activate a free BYOC, please refer to [Create a BYOC > Warehouse](/cloud/4.x/management-guide/warehouse-management/create-byoc- > warehouse). > ### Create Cluster​ If you have activated the trial warehouse, you will see a trial cluster in that warehouse. In the trial warehouse, you may try the features by importing small amounts of data. You may not create paid clusters under the trial warehouse. If you are happy with the trial experience, you can upgrade the trial warehouse to a paid one, and then you can create paid clusters under the paid warehouse. ## Change Warehouse Password​ The username and password are required when connecting to a warehouse. VeloDB Cloud initializes the username ('admin') and password for you. You can change the password on the **Settings** page. ![change warehouse password](/assets/images/change-warehouse- password-8396e74cfced892bcbd51ac6faa52764.png) > **Warning** The password only supports uppercase letters, lowercase letters, > numbers and special characters ~!@#$%^&*()_+|<>,.?/:;'[]", need to > contain at least 3 of them, length 8-20 characters. ## Connect to Warehouse​ Click **Query** in the left navigation bar, open the login page, enter the username and password, and enter the WebUI interface after completing the login. ![webui query](/assets/images/webui- query-8183389b1bfccb8dd4715c6a996dbdca.png) ### Create Database​ Execute the following statement in the query editor: create database demo; ### Create Data Table​ Execute the following statement in the query editor: use demo; create table mytable ( k1 TINYINT, k2 DECIMAL(10, 2) DEFAULT "10.05", k3 CHAR(10) COMMENT "string column", k4 INT NOT NULL DEFAULT "1" COMMENT "int column" ) COMMENT "my first table" DISTRIBUTED BY HASH(k1) BUCKETS 1; You can see the fields of mytable through desc mytable. ### Insert Data​ Execute the following statement in the query editor: INSERT INTO mytable (k1, k2, k3, k4) VALUES (1, 0.14, 'a1', 20), (2, 1.04, 'b2', 21), (3, 3.14, 'c3', 22), (4, 4.35, 'd4', 23); ### Query Data​ The table creation and data import are completed above, and the query can be performed below. select * from mytable; ![webui query result](/assets/images/webui-query- result-956dff83857ffeb18421f3758085ac44.png) ## (OPTIONAL)Connect to Warehouse Using MySQL Client​ ### IP Whitelist Management​ On the **Connections** page, switch to the **Public Link** tab to manage IP whitelist. Click **Add IP Whitelist** to add new IP addresses. ![public link ip whitelist](/assets/images/public-link-ip- whitelist-40498cfa0943672f009fb66c31d213d5.png) In the IP whitelist, users can add or delete IP addresses to enable or disable their access to the warehouse. ### MySQL Client​ You may download MySQL Client from the official website of MySQL. Here we provide a Linux-free version of [MySQL Client](https://doris-build-hk.oss-cn- hongkong.aliyuncs.com/mysql-client/mysql-5.7.22-linux- glibc2.12-x86_64.tar.gz). If you need MySQL Client for Mac and Windows, please go to the MySQL official website. Currently, VeloDB is compatible with MySQL Client 5.7 and above. 
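As a reference, here is a minimal shell sketch for downloading and unpacking the Linux client linked above and checking its version; the extracted directory name is assumed to mirror the tarball name and may differ in practice:

```shell
# Download the standalone MySQL Client tarball for Linux
wget https://doris-build-hk.oss-cn-hongkong.aliyuncs.com/mysql-client/mysql-5.7.22-linux-glibc2.12-x86_64.tar.gz

# Unpack it; the mysql command line tool lives under bin/ of the extracted directory
tar xzf mysql-5.7.22-linux-glibc2.12-x86_64.tar.gz

# Verify the client version (5.7 or above is required)
./mysql-5.7.22-linux-glibc2.12-x86_64/bin/mysql --version
```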
You may read details about connections by clicking "Connections" on the target warehouse on the VeloDB Cloud console. > Note: > > 1. The warehouse supports public network connection and private network > (PrivateLink) connection. Different connection methods require different > connection information. > > 2. The public network connection is open by default, and the IP whitelist > is also open to the public by default. If you no longer need to connect to > the warehouse from the public network, please close it. > > 3. For the first connection, please use the user admin and its password. > You can initialize or reset it in the **Setting** page on VeloDB Cloud > console. > > Supposing that you are connecting to a warehouse using the following public link: ![public link connection info](/assets/images/public-link-connection- info-4fb06e922bb490fb95fe191255fdfea0.png) Download MySQL Client and unzip the file, find the `mysql` command line tool under the `bin/` directory. Execute the folowing command to connect to VeloDB. mysql -h 34.199.74.195 -P 33641 -u admin -p After login, if you see the following snippet, that usually means that your Client IP address has not been added to the connection whitelist on the console. ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2 If the following is displayed, that means the connection succeeds. Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 119952 Server version: 5.7.37 VeloDB Core version: 3.0.4 Copyright (c) 2000, 2022, Oracle and/or its affiliates. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. mysql> ### Create Database and Table​ #### Create Database​ create database demo; #### Create Table​ use demo; create table mytable ( k1 TINYINT, k2 DECIMAL(10, 2) DEFAULT "10.05", k3 CHAR(10) COMMENT "string column", k4 INT NOT NULL DEFAULT "1" COMMENT "int column" ) COMMENT "my first table" DISTRIBUTED BY HASH(k1) BUCKETS 1; You may check details of `mytable` via `desc mytable`. #### Load Data​ Save the following sample data in the local data.csv: 1,0.14,a1,20 2,1.04,b2,21 3,3.14,c3,22 4,4.35,d4,23 **Upload data via HTTP protocol** : curl -u admin:admin_123 -H "fileName:dir1/data.csv" -T data.csv -L '34.199.74.195:39173/copy/upload' You can call and upload multiple files be repeating this command. **Load data by the copy into command:** * * * curl -u admin:admin_123 -H "Content-Type: application/json" '34.199.74.195:39173/copy/query' -d '{"sql": "copy into demo.mytable from @~(\"dir1/data.csv\") PROPERTIES (\"file.column_separator\"=\",\", \"copy.async\"=\"false\")"}' `dir1/data.csv` refers to the file uploaded in the previous step. Wildcard and glob pattern matching are supported here. The service side can automatically identify general formats such as csv. `file.column_separator=","` specifies comma as the separator in the csv format. Since the copy into command is submitted asychronously by default, `"copy.async"="false"` is specified here to implement synchronous submission. That is, the command will only return after the data are loaded successfully. If you see the following response, that means the data are successfully loaded. 
{ "msg": "success", "code": 0, "data": { "result": { "msg": "", "loadedRows": "4", "id": "d33e62f655c4a1a-9827d5561adfb93d", "state": "FINISHED", "type": "", "filterRows": "0", "unselectRows": "0", "url": null }, "time": 5007, "type": "result_set" }, "count": 0 } ### Query Data​ After table creation and data loading, you may execute queries on the data. mysql> use demo; Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Database changed mysql> select * from mytable; +------+------+------+------+ | k1 | k2 | k3 | k4 | +------+------+------+------+ | 1 | 0.14 | a1 | 20 | | 2 | 1.04 | b2 | 21 | | 3 | 3.14 | c3 | 22 | | 4 | 4.35 | d4 | 23 | +------+------+------+------+ 4 rows in set (0.15 sec) On This Page * New User Registration and Organization Creation * Register and Login * Change Password * Warehouse and Cluster Creation * Create Warehouse * Create Cluster * Change Warehouse Password * Connect to Warehouse * Create Database * Create Data Table * Insert Data * Query Data * (OPTIONAL)Connect to Warehouse Using MySQL Client * IP Whitelist Management * MySQL Client * Create Database and Table * Query Data --- # Source: https://docs.velodb.io/cloud/4.x/integration/bi/tableau Version: 4.x On this page # Tableau VeloDB provides an official Tableau connector. This connector accesses data based on the MySQL JDBC Driver. The connector has been tested by the [TDVT framework](https://tableau.github.io/connector-plugin-sdk/docs/tdvt) with a 100% pass rate. With this connector, Tableau can integrate Doris databases and tables as data sources. To enable this, follow the setup guide below: * Install Tableau and the Doris connector * Configure an Doris data source in Tableau * Build visualizations in Tableau * Connection and usage tips * Summary ## Install Tableau and Doris connector​ 1. Download and install [Tableau desktop](https://www.tableau.com/products/desktop/download). 2. Get the [tableau-doris](https://velodb-bi-connector-1316291683.cos.ap-hongkong.myqcloud.com/Tableau/latest/doris_jdbc-latest.taco) custom connector connector (doris_jdbc-***.taco). 3. Get [MySQL JDBC](https://velodb-bi-connector-1316291683.cos.ap-hongkong.myqcloud.com/Tableau/latest/mysql-connector-j-8.3.0.jar) (version 8.3.0). 4. Locations to place the Connector and JDBC driver MacOS: * Refer to this path: `~/Documents/My Tableau Repository/Connectors`, place the `doris_jdbc-latest.taco` custom connector file (if the path does not exist, create it manually as needed). * JDBC driver jar placement path: `~/Library/Tableau/Drivers` Windows: Assume `tableau_path` is the Tableau installation directory on Windows, typically defaults to: `tableau_path = C:\Program Files\Tableau` * Refer to this path: `%tableau_path%``\Connectors\`, place the `doris_jdbc-latest.taco` custom connector file (if the path does not exist, create it manually as needed). * JDBC driver jar placement path: `%tableau_path%\Drivers\` Next, you can configure a Doris data source in Tableau and start building data visualizations! ## Configure a Doris data source in Tableau​ Now that you have installed and set up the **JDBC and Connector** drivers, let's look at how to define a data source in Tableau that connects to the tpch database in Doris. 1. 
Gather your connection details To connect to Doris via JDBC, you need the following information:

| Parameter | Meaning | Example |
|---|---|---|
| Server | Database host | 127.0.1.28 |
| Port | Database MySQL port | 9030 |
| Catalog | Doris Catalog, used when querying external tables and data lakes, set in Advanced | internal |
| Database | Database name | tpch |
| Authentication | Choose database authentication method: Username / Username and Password | Username and Password |
| Username | Username | testuser |
| Password | Password | |
| Init SQL Statement | Initial SQL statement | `select * from database.table` |

2. Launch Tableau. (If you were already running it before placing the connector, please restart.) 3. From the left menu, click **More** under the **To a Server** section. In the list of available connectors, search for **Doris JDBC by VeloDB** : ![find connector](/assets/images/p01-51ff170c4e64fdb565b3739fb9964ed7.png) 4. Click **Doris by VeloDB** , and the following dialog will pop up: ![dialog](/assets/images/p02-51cceaf33c7abcfc831e5c6120e57ffe.png) 5. Enter the corresponding connection information as prompted in the dialog. 6. Optional advanced configuration: * You can enter preset SQL in Initial SQL to define the data source ![Initial SQL](/assets/images/p03-6b98cf42d8ddb837ad995e6ea63f9942.png) * In Advanced, you can use Catalog to access data lake data sources; the default value is internal. ![Catalog](/assets/images/p04-04824d8428fc9aca1e053736942e4c07.png) 7. After completing the above input fields, click the **Sign In** button, and you should see a new Tableau workbook: ![Sign In](/assets/images/p05-38f3602cb3ab3e8fa651354588a3ba0e.png) Next, you can build some visualizations in Tableau! ## Build visualizations in Tableau We choose TPC-H data as the data source; refer to [this document](/cloud/4.x/benchmark/tpch) for how to build the Doris TPC-H data set. Now that we have configured the Doris data source in Tableau, let's visualize the data. 1. Drag the customer table and orders table to the workbook, and select Custkey as the join field between them below ![table join](/assets/images/p06-d1b528f68114eb2fde4aad0ceaf3bd18.png) 2. Drag the nation table to the workbook and select Nationkey as the join field with the customer table ![table join2](/assets/images/p07-24bb6a439ce004ed73e6395adaa1aa95.png) 3. Now that you have associated the customer table, orders table and nation table as a data source, you can use this relationship to handle questions about the data. Select the `Sheet 1` tab at the bottom of the workbook to enter the workspace. ![Sheet 1](/assets/images/p08-d6fd4938780682cc370a95e6a8523412.png) 4. Suppose you want to know the number of users per year. Drag OrderDate from orders to the `Columns` area (horizontal field), and then drag customer(count) from customer to `Rows`. Tableau will generate the following line chart: ![chart1](/assets/images/p09-6bc04160ad3ab558e4b56ad1a047af6d.png) A simple line chart is complete. Note that this dataset is generated automatically by the tpch script with default rules, so it is not actual data; it is only intended to test availability and is not for reference. 5. Suppose you want to know the average order amount (USD) by region (country) and year: * Click the `New Worksheet` tab to create a new sheet * Drag Name from the nation table to `Rows` * Drag OrderDate from the orders table to `Columns` You should see the following: ![chart2](/assets/images/p10-8f6459fde1be9fe8e526a9175f1b6ca8.png) 6.
Note: The `Abc` value is just a placeholder value, because you have not defined aggregation logic for that mark, so you need to drag a measure onto the table. Drag Totalprice from the orders table to the middle of the table. Note that the default calculation is to perform a SUM on Totalprices: ![SUM on Totalprices](/assets/images/p11-4b072e8fd5d782d4eff69ae7e4dcab9c.png) 7. Click `SUM` and change `Measure` to `Average`. ![sum](/assets/images/p12-81b04874e067a449d46fbde1c53b7c8b.png) 8. From the same dropdown menu, select `Format ` and change `Numbers` to `Currency (Standard)`: ![us](/assets/images/p13-cafbce614996c96995213a90508a455e.png) 9. Get a table that meets expectations: ![chart2](/assets/images/p14-68d31326b6c59c85bd8808fc735357be.png) So far, Tableau has been successfully connected to Doris, and data analysis and visualization dashboard production has been achieved. ## Connection and usage tips​ **Performance optimization** * According to actual needs, reasonably create doris databases and tables, partition and bucket by time, which can effectively reduce predicate filtering and most data transmission * Appropriate data pre-aggregation can be done by creating materialized views on the Doris side. * Set a reasonable refresh plan to balance the computing resource consumption of refresh and the timeliness of dashboard data **Security configuration** * It is recommended to use VPC private connections to avoid security risks introduced by public network access. * Configure security groups to restrict access. * Enable access methods such as SSL/TLS connections. * Refine Doris user account roles and access permissions to avoid excessive delegation of permissions. On This Page * Install Tableau and Doris connector * Configure a Doris data source in Tableau * Build visualizations in Tableau * Connection and usage tips --- # Source: https://docs.velodb.io/cloud/4.x/integration/data-processing/flink-doris-connector Version: 4.x On this page # Flink Doris Connector The [Flink Doris Connector](https://github.com/apache/doris-flink-connector) is used to read from and write data to a Doris cluster through Flink. It also integrates [FlinkCDC](https://nightlies.apache.org/flink/flink-cdc-docs- release-3.2/docs/connectors/flink-sources/overview/), which allows for more convenient full database synchronization with upstream databases such as MySQL. Using the Flink Connector, you can perform the following operations: * **Read data from Doris** : Flink Connector supports parallel reading from BE, improving data retrieval efficiency. * **Write data to Doris** : After batching in Flink, data is imported into Doris in bulk using Stream Load. * **Perform dimension table joins with Lookup Join** : Batching and asynchronous queries accelerate dimension table joins. * **Full database synchronization** : Using Flink CDC, you can synchronize entire databases such as MySQL, Oracle, and PostgreSQL, including automatic table creation and DDL operations. 
## Version Description

| Connector Version | Flink Version | Doris Version | Java Version | Scala Version |
|---|---|---|---|---|
| 1.0.3 | 1.11,1.12,1.13,1.14 | 0.15+ | 8 | 2.11,2.12 |
| 1.1.1 | 1.14 | 1.0+ | 8 | 2.11,2.12 |
| 1.2.1 | 1.15 | 1.0+ | 8 | - |
| 1.3.0 | 1.16 | 1.0+ | 8 | - |
| 1.4.0 | 1.15,1.16,1.17 | 1.0+ | 8 | - |
| 1.5.2 | 1.15,1.16,1.17,1.18 | 1.0+ | 8 | - |
| 1.6.1 | 1.15,1.16,1.17,1.18,1.19 | 1.0+ | 8 | - |
| 24.0.1 | 1.15,1.16,1.17,1.18,1.19,1.20 | 1.0+ | 8 | - |
| 24.1.0 | 1.15,1.16,1.17,1.18,1.19,1.20 | 1.0+ | 8 | - |
| 25.0.0 | 1.15,1.16,1.17,1.18,1.19,1.20 | 1.0+ | 8 | - |
| 25.1.0 | 1.15,1.16,1.17,1.18,1.19,1.20 | 1.0+ | 8 | - |

## Usage The Flink Doris Connector can be used in two ways: via Jar or Maven. #### Jar You can download the corresponding version of the Flink Doris Connector Jar file [here](https://doris.apache.org/download#doris-ecosystem), then copy this file to the `classpath` of your `Flink` setup to use the `Flink-Doris-Connector`. For a `Standalone` mode Flink deployment, place this file under the `lib/` folder. For a Flink cluster running in `Yarn` mode, place the file into the pre-deployment package. #### Maven To use it with Maven, simply add the following dependency to your Pom file:

<dependency>
  <groupId>org.apache.doris</groupId>
  <artifactId>flink-doris-connector-${flink.version}</artifactId>
  <version>${connector.version}</version>
</dependency>

For example:

<dependency>
  <groupId>org.apache.doris</groupId>
  <artifactId>flink-doris-connector-1.16</artifactId>
  <version>25.1.0</version>
</dependency>

## Working Principles ### Reading Data from Doris ![FlinkConnectorPrinciples-JDBC-Doris](/assets/images/FlinkConnectorPrinciples-JDBC-Doris-7726ceb2bfe36b6d1b4e0446381d0e83.png) When reading data, Flink Doris Connector offers higher performance compared to Flink JDBC Connector and is recommended for use: * **Flink JDBC Connector** : Although Doris is compatible with the MySQL protocol, using Flink JDBC Connector for reading and writing to a Doris cluster is not recommended. This approach results in serial read/write operations on a single FE node, creating a bottleneck and affecting performance. * **Flink Doris Connector** : Starting from Doris 2.1, ADBC is the default protocol for Flink Doris Connector. The reading process follows these steps: a. Flink Doris Connector first retrieves Tablet ID information from FE based on the query plan. b. It generates the query statement: `SELECT * FROM tbs TABLET(id1, id2, id3)`. c. The query is then executed through the ADBC port of FE. d. Data is returned directly from BE, bypassing FE to eliminate the single-point bottleneck. ### Writing Data to Doris When using Flink Doris Connector for data writing, batch processing is performed in Flink's memory before bulk import via Stream Load. Doris Flink Connector provides two batching modes, with Flink Checkpoint-based streaming writes as the default:

| | Streaming Write | Batch Write |
|---|---|---|
| **Trigger Condition** | Relies on Flink Checkpoints and follows Flink's checkpoint cycle to write to Doris | Periodic submission based on connector-defined time or data volume thresholds |
| **Consistency** | Exactly-Once | At-Least-Once; Exactly-Once can be ensured with the primary key model |
| **Latency** | Limited by the Flink checkpoint interval, generally higher | Independent batch mechanism with flexible adjustment |
| **Fault Tolerance & Recovery** | Fully consistent with Flink state recovery | Relies on external deduplication logic (e.g., Doris primary key deduplication) |

## Quick Start #### Preparation #### Flink Cluster Deployment Taking a Standalone cluster as an example: 1. Download the Flink installation package, e.g., [Flink 1.18.1](https://archive.apache.org/dist/flink/flink-1.18.1/flink-1.18.1-bin-scala_2.12.tgz); 2.
After extraction, place the Flink Doris Connector package in `/lib`; 3. Navigate to the `` directory and run `bin/start-cluster.sh` to start the Flink cluster; 4. You can verify if the Flink cluster started successfully using the `jps` command. #### Initialize Doris Tables​ Run the following statements to create Doris tables: CREATE DATABASE test; CREATE TABLE test.student ( `id` INT, `name` VARCHAR(256), `age` INT ) UNIQUE KEY(`id`) DISTRIBUTED BY HASH(`id`) BUCKETS 1 PROPERTIES ( "replication_allocation" = "tag.location.default: 3" ); INSERT INTO test.student values(1,"James",18); INSERT INTO test.student values(2,"Emily",28); CREATE TABLE test.student_trans ( `id` INT, `name` VARCHAR(256), `age` INT ) UNIQUE KEY(`id`) DISTRIBUTED BY HASH(`id`) BUCKETS 1 PROPERTIES ( "replication_allocation" = "tag.location.default: 3" ); #### Run FlinkSQL Task​ **Start FlinkSQL Client** bin/sql-client.sh **Run FlinkSQL** CREATE TABLE Student ( id STRING, name STRING, age INT ) WITH ( 'connector' = 'doris', 'fenodes' = '127.0.0.1:8030', 'table.identifier' = 'test.student', 'username' = 'root', 'password' = '' ); CREATE TABLE StudentTrans ( id STRING, name STRING, age INT ) WITH ( 'connector' = 'doris', 'fenodes' = '127.0.0.1:8030', 'table.identifier' = 'test.student_trans', 'username' = 'root', 'password' = '', 'sink.label-prefix' = 'doris_label' ); INSERT INTO StudentTrans SELECT id, concat('prefix_',name), age+1 FROM Student; #### Query Data​ mysql> select * from test.student_trans; +------+--------------+------+ | id | name | age | +------+--------------+------+ | 1 | prefix_James | 19 | | 2 | prefix_Emily | 29 | +------+--------------+------+ 2 rows in set (0.02 sec) ## Scenarios and Operations​ ### Reading Data from Doris​ When Flink reads data from Doris, the Doris Source is currently a bounded stream and does not support continuous reading in a CDC manner. Data can be read from Doris using Thrift or ArrowFlightSQL (supported from version 24.0.0 onward). Starting from version 2.1, ArrowFlightSQL is the recommended approach. * **Thrift** : Data is read by calling the BE's Thrift interface. For detailed steps, refer to [Reading Data via Thrift Interface](https://github.com/apache/doris/blob/master/samples/doris-demo/doris-source-demo/README.md). * **ArrowFlightSQL** : Based on Doris 2.1, this method allows high-speed reading of large volumes of data using the Arrow Flight SQL protocol. For more information, refer to [High-speed Data Transfer via Arrow Flight SQL](https://doris.apache.org/docs/dev/db-connect/arrow-flight-sql-connect/). #### Using FlinkSQL to Read Data​ ##### Thrift Method​ CREATE TABLE student ( id INT, name STRING, age INT ) WITH ( 'connector' = 'doris', 'fenodes' = '127.0.0.1:8030', -- Fe的host:HttpPort 'table.identifier' = 'test.student', 'username' = 'root', 'password' = '' ); SELECT * FROM student; ##### ArrowFlightSQL​ CREATE TABLE student ( id INT, name STRING, age INT ) WITH ( 'connector' = 'doris', 'fenodes' = '{fe.conf:http_port}', 'table.identifier' = 'test.student', 'source.use-flight-sql' = 'true', 'source.flight-sql-port' = '{fe.conf:arrow_flight_sql_port}', 'username' = 'root', 'password' = '' ); SELECT * FROM student; #### Using DataStream API to Read Data​ When using the DataStream API to read data, you need to include the dependencies in your program's POM file in advance, as described in the "Usage" section. 
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); DorisOptions option = DorisOptions.builder() .setFenodes("127.0.0.1:8030") .setTableIdentifier("test.student") .setUsername("root") .setPassword("") .build(); DorisReadOptions readOptions = DorisReadOptions.builder().build(); DorisSource> dorisSource = DorisSource.>builder() .setDorisOptions(option) .setDorisReadOptions(readOptions) .setDeserializer(new SimpleListDeserializationSchema()) .build(); env.fromSource(dorisSource, WatermarkStrategy.noWatermarks(), "doris source").print(); env.execute("Doris Source Test"); For the complete code, refer to:[DorisSourceDataStream.java](https://github.com/apache/doris-flink- connector/blob/master/flink-doris- connector/src/test/java/org/apache/doris/flink/example/DorisSourceDataStream.java) ### Writing Data to Doris​ Flink writes data to Doris using the Stream Load method, supporting both streaming and batch-insertion modes. Difference Between Streaming and Batch-insertion Starting from Connector 1.5.0, batch-insertion is supported. Batch-insertion does not rely on Checkpoints; it buffers data in memory and controls the writing timing based on batch parameters. Streaming insertion requires Checkpoints to be enabled, continuously writing upstream data to Doris during the entire Checkpoint period, without keeping data in memory continuously. #### Using FlinkSQL to Write Data​ For testing, Flink's [Datagen](https://nightlies.apache.org/flink/flink-docs- master/docs/connectors/table/datagen/) is used to simulate the continuously generated upstream data. -- enable checkpoint SET 'execution.checkpointing.interval' = '30s'; CREATE TABLE student_source ( id INT, name STRING, age INT ) WITH ( 'connector' = 'datagen', 'rows-per-second' = '1', 'fields.name.length' = '20', 'fields.id.min' = '1', 'fields.id.max' = '100000', 'fields.age.min' = '3', 'fields.age.max' = '30' ); -- doris sink CREATE TABLE student_sink ( id INT, name STRING, age INT ) WITH ( 'connector' = 'doris', 'fenodes' = '10.16.10.6:28737', 'table.identifier' = 'test.student', 'username' = 'root', 'password' = 'password', 'sink.label-prefix' = 'doris_label' --'sink.enable.batch-mode' = 'true' Adding this configuration enables batch writing ); INSERT INTO student_sink SELECT * FROM student_source; #### Using DataStream API to Write Data​ When using the DataStream API to write data, different serialization methods can be used to serialize the upstream data before writing it to the Doris table. info The Connector already contains the HttpClient4.5.13 version. If you reference HttpClient separately in your project, you need to ensure that the versions are consistent. ##### Standard String Format​ When the upstream data is in CSV or JSON format, you can directly use the `SimpleStringSerializer` to serialize the data. 
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); env.enableCheckpointing(30000); DorisSink.Builder builder = DorisSink.builder(); DorisOptions dorisOptions = DorisOptions.builder() .setFenodes("10.16.10.6:28737") .setTableIdentifier("test.student") .setUsername("root") .setPassword("") .build(); Properties properties = new Properties(); // When the upstream data is in json format, the following configuration needs to be enabled properties.setProperty("read_json_by_line", "true"); properties.setProperty("format", "json"); // When writing csv data from the upstream, the following configurations need to be enabled //properties.setProperty("format", "csv"); //properties.setProperty("column_separator", ","); DorisExecutionOptions executionOptions = DorisExecutionOptions.builder() .setLabelPrefix("label-doris") .setDeletable(false) //.setBatchMode(true) Enable batch writing .setStreamLoadProp(properties) .build(); builder.setDorisReadOptions(DorisReadOptions.builder().build()) .setDorisExecutionOptions(executionOptions) .setSerializer(new SimpleStringSerializer()) .setDorisOptions(dorisOptions); List data = new ArrayList<>(); data.add("{\"id\":3,\"name\":\"Michael\",\"age\":28}"); data.add("{\"id\":4,\"name\":\"David\",\"age\":38}"); env.fromCollection(data).sinkTo(builder.build()); env.execute("doris test"); For the complete code, refer to:[DorisSinkExample.java](https://github.com/apache/doris-flink- connector/blob/master/flink-doris- connector/src/test/java/org/apache/doris/flink/example/DorisSinkExample.java) ##### RowData Format​ RowData is the internal format of Flink. If the upstream data is in RowData format, you need to use the `RowDataSerializer` to serialize the data. StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); env.enableCheckpointing(10000); env.setParallelism(1); DorisSink.Builder builder = DorisSink.builder(); Properties properties = new Properties(); properties.setProperty("column_separator", ","); properties.setProperty("line_delimiter", "\n"); properties.setProperty("format", "csv"); // When writing json data from the upstream, the following configuration needs to be enabled // properties.setProperty("read_json_by_line", "true"); // properties.setProperty("format", "json"); DorisOptions.Builder dorisBuilder = DorisOptions.builder(); dorisBuilder .setFenodes("10.16.10.6:28737") .setTableIdentifier("test.student") .setUsername("root") .setPassword(""); DorisExecutionOptions.Builder executionBuilder = DorisExecutionOptions.builder(); executionBuilder.setLabelPrefix(UUID.randomUUID().toString()).setDeletable(false).setStreamLoadProp(properties); // flink rowdata‘s schema String[] fields = {"id","name", "age"}; DataType[] types = {DataTypes.INT(), DataTypes.VARCHAR(256), DataTypes.INT()}; builder.setDorisExecutionOptions(executionBuilder.build()) .setSerializer( RowDataSerializer.builder() // serialize according to rowdata .setType(LoadConstants.CSV) .setFieldDelimiter(",") .setFieldNames(fields) .setFieldType(types) .build()) .setDorisOptions(dorisBuilder.build()); // mock rowdata source DataStream source = env.fromElements("") .flatMap( new FlatMapFunction() { @Override public void flatMap(String s, Collector out) throws Exception { GenericRowData genericRowData = new GenericRowData(3); genericRowData.setField(0, 1); genericRowData.setField(1, StringData.fromString("Michael")); genericRowData.setField(2, 18); out.collect(genericRowData); GenericRowData genericRowData2 = new GenericRowData(3); 
genericRowData2.setField(0, 2); genericRowData2.setField(1, StringData.fromString("David")); genericRowData2.setField(2, 38); out.collect(genericRowData2); } }); source.sinkTo(builder.build()); env.execute("doris test"); For the complete code, refer to:[DorisSinkExampleRowData.java](https://github.com/apache/doris-flink- connector/blob/master/flink-doris- connector/src/test/java/org/apache/doris/flink/example/DorisSinkExampleRowData.java) ##### Debezium Format​ For upstream data in Debezium format, such as data from FlinkCDC or Debezium format in Kafka, you can use the `JsonDebeziumSchemaSerializer` to serialize the data. // enable checkpoint env.enableCheckpointing(10000); Properties props = new Properties(); props.setProperty("format", "json"); props.setProperty("read_json_by_line", "true"); DorisOptions dorisOptions = DorisOptions.builder() .setFenodes("127.0.0.1:8030") .setTableIdentifier("test.student") .setUsername("root") .setPassword("").build(); DorisExecutionOptions.Builder executionBuilder = DorisExecutionOptions.builder(); executionBuilder.setLabelPrefix("label-prefix") .setStreamLoadProp(props) .setDeletable(true); DorisSink.Builder builder = DorisSink.builder(); builder.setDorisReadOptions(DorisReadOptions.builder().build()) .setDorisExecutionOptions(executionBuilder.build()) .setDorisOptions(dorisOptions) .setSerializer(JsonDebeziumSchemaSerializer.builder().setDorisOptions(dorisOptions).build()); env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL Source") .sinkTo(builder.build()); For the complete code, refer to:[CDCSchemaChangeExample.java](https://github.com/apache/doris-flink- connector/blob/master/flink-doris- connector/src/test/java/org/apache/doris/flink/example/CDCSchemaChangeExample.java) ##### Multi-table Write Format​ Currently, DorisSink supports synchronizing multiple tables with a single Sink. You need to pass both the data and the database/table information to the Sink, and serialize it using the `RecordWithMetaSerializer`. StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); env.setParallelism(1); DorisSink.Builder builder = DorisSink.builder(); Properties properties = new Properties(); properties.setProperty("column_separator", ","); properties.setProperty("line_delimiter", "\n"); properties.setProperty("format", "csv"); DorisOptions.Builder dorisBuilder = DorisOptions.builder(); dorisBuilder .setFenodes("10.16.10.6:28737") .setTableIdentifier("") .setUsername("root") .setPassword(""); DorisExecutionOptions.Builder executionBuilder = DorisExecutionOptions.builder(); executionBuilder .setLabelPrefix("label-doris") .setStreamLoadProp(properties) .setDeletable(false) .setBatchMode(true); builder.setDorisReadOptions(DorisReadOptions.builder().build()) .setDorisExecutionOptions(executionBuilder.build()) .setDorisOptions(dorisBuilder.build()) .setSerializer(new RecordWithMetaSerializer()); RecordWithMeta record = new RecordWithMeta("test", "student_1", "1,David,18"); RecordWithMeta record1 = new RecordWithMeta("test", "student_2", "1,Jack,28"); env.fromCollection(Arrays.asList(record, record1)).sinkTo(builder.build()); For the complete code, refer to:[DorisSinkMultiTableExample.java](https://github.com/apache/doris-flink- connector/blob/master/flink-doris- connector/src/test/java/org/apache/doris/flink/example/DorisSinkMultiTableExample.java) ### Lookup Join​ Using Lookup Join can optimize dimension table joins in Flink. 
When using Flink JDBC Connector for dimension table joins, the following issues may arise: * Flink JDBC Connector uses a synchronous query mode, meaning that after upstream data (e.g., from Kafka) sends a record, it immediately queries the Doris dimension table. This results in high query latency under high-concurrency scenarios. * Queries executed via JDBC are typically point lookups per record, whereas Doris recommends batch queries for better efficiency. Using [Lookup Join](https://nightlies.apache.org/flink/flink-docs- release-1.20/docs/dev/table/sql/queries/joins/#lookup-join) for dimension table joins in Flink Doris Connector provides the following advantages: * **Batch caching of upstream data** , avoiding the high latency and database load caused by per-record queries. * **Asynchronous execution of join queries** , improving data throughput and reducing the query load on Doris. CREATE TABLE fact_table ( `id` BIGINT, `name` STRING, `city` STRING, `process_time` as proctime() ) WITH ( 'connector' = 'kafka', ... ); create table dim_city( `city` STRING, `level` INT , `province` STRING, `country` STRING ) WITH ( 'connector' = 'doris', 'fenodes' = '127.0.0.1:8030', 'jdbc-url' = 'jdbc:mysql://127.0.0.1:9030', 'table.identifier' = 'dim.dim_city', 'username' = 'root', 'password' = '' ); SELECT a.id, a.name, a.city, c.province, c.country,c.level FROM fact_table a LEFT JOIN dim_city FOR SYSTEM_TIME AS OF a.process_time AS c ON a.city = c.city ### Full Database Synchronization​ The Flink Doris Connector integrates **Flink CDC** ([Flink CDC Documentation](https://nightlies.apache.org/flink/flink-cdc-docs- release-3.2/docs/connectors/flink-sources/overview/)), making it easier to synchronize relational databases like MySQL to Doris. This integration also includes automatic table creation, schema changes, etc. Supported databases for synchronization include: MySQL, Oracle, PostgreSQL, SQLServer, MongoDB, and DB2. Note 1. When using full database synchronization, you need to add the corresponding Flink CDC dependencies in the `$FLINK_HOME/lib` directory (Fat Jar), such as **flink-sql-connector-mysql-cdc-${version}.jar** , **flink-sql-connector-oracle-cdc-${version}.jar**. FlinkCDC version 3.1 and later is not compatible with previous versions. You can download the dependencies from the following links: [FlinkCDC 3.x](https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-mysql-cdc/), [FlinkCDC 2.x](https://repo.maven.apache.org/maven2/com/ververica/flink-sql-connector-mysql-cdc/). 2. For versions after Connector 24.0.0, the required Flink CDC version must be 3.1 or higher. You can download it [here](https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-mysql-cdc/). If Flink CDC is used to synchronize MySQL and Oracle, you must also add the relevant JDBC drivers under `$FLINK_HOME/lib`. 
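As a rough illustration of the note above, a minimal shell sketch of preparing `$FLINK_HOME/lib` before submitting a whole-database sync job; the jar file names and versions below are only examples, so pick the ones matching your Flink, Connector, and Flink CDC versions:

```shell
# Flink Doris Connector jar (example version; see the version table above)
cp flink-doris-connector-1.18-25.1.0.jar $FLINK_HOME/lib/

# Flink CDC fat jar for the source database, e.g. MySQL (3.1+ is required for Connector 24.0.0 and later)
cp flink-sql-connector-mysql-cdc-3.1.1.jar $FLINK_HOME/lib/

# For MySQL/Oracle sources, the corresponding JDBC driver is also required
cp mysql-connector-j-8.3.0.jar $FLINK_HOME/lib/

# Restart the cluster so the new jars are picked up
$FLINK_HOME/bin/stop-cluster.sh && $FLINK_HOME/bin/start-cluster.sh
```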
#### MySQL Whole Database Synchronization​ After starting the Flink cluster, you can directly run the following command: bin/flink run \ -Dexecution.checkpointing.interval=10s \ -Dparallelism.default=1 \ -c org.apache.doris.flink.tools.cdc.CdcTools \ lib/flink-doris-connector-1.16-24.0.1.jar \ mysql-sync-database \ --database test_db \ --mysql-conf hostname=127.0.0.1 \ --mysql-conf port=3306 \ --mysql-conf username=root \ --mysql-conf password=123456 \ --mysql-conf database-name=mysql_db \ --including-tables "tbl1|test.*" \ --sink-conf fenodes=127.0.0.1:8030 \ --sink-conf username=root \ --sink-conf password=123456 \ --sink-conf jdbc-url=jdbc:mysql://127.0.0.1:9030 \ --sink-conf sink.label-prefix=label \ --table-conf replication_num=1 #### Oracle Whole Database Synchronization​ bin/flink run \ -Dexecution.checkpointing.interval=10s \ -Dparallelism.default=1 \ -c org.apache.doris.flink.tools.cdc.CdcTools \ ./lib/flink-doris-connector-1.16-24.0.1.jar \ oracle-sync-database \ --database test_db \ --oracle-conf hostname=127.0.0.1 \ --oracle-conf port=1521 \ --oracle-conf username=admin \ --oracle-conf password="password" \ --oracle-conf database-name=XE \ --oracle-conf schema-name=ADMIN \ --including-tables "tbl1|tbl2" \ --sink-conf fenodes=127.0.0.1:8030 \ --sink-conf username=root \ --sink-conf password=\ --sink-conf jdbc-url=jdbc:mysql://127.0.0.1:9030 \ --sink-conf sink.label-prefix=label \ --table-conf replication_num=1 #### PostgreSQL Whole Database Synchronization​ /bin/flink run \ -Dexecution.checkpointing.interval=10s \ -Dparallelism.default=1\ -c org.apache.doris.flink.tools.cdc.CdcTools \ ./lib/flink-doris-connector-1.16-24.0.1.jar \ postgres-sync-database \ --database db1\ --postgres-conf hostname=127.0.0.1 \ --postgres-conf port=5432 \ --postgres-conf username=postgres \ --postgres-conf password="123456" \ --postgres-conf database-name=postgres \ --postgres-conf schema-name=public \ --postgres-conf slot.name=test \ --postgres-conf decoding.plugin.name=pgoutput \ --including-tables "tbl1|tbl2" \ --sink-conf fenodes=127.0.0.1:8030 \ --sink-conf username=root \ --sink-conf password=\ --sink-conf jdbc-url=jdbc:mysql://127.0.0.1:9030 \ --sink-conf sink.label-prefix=label \ --table-conf replication_num=1 #### SQLServer Whole Database Synchronization​ /bin/flink run \ -Dexecution.checkpointing.interval=10s \ -Dparallelism.default=1 \ -c org.apache.doris.flink.tools.cdc.CdcTools \ ./lib/flink-doris-connector-1.16-24.0.1.jar \ sqlserver-sync-database \ --database db1\ --sqlserver-conf hostname=127.0.0.1 \ --sqlserver-conf port=1433 \ --sqlserver-conf username=sa \ --sqlserver-conf password="123456" \ --sqlserver-conf database-name=CDC_DB \ --sqlserver-conf schema-name=dbo \ --including-tables "tbl1|tbl2" \ --sink-conf fenodes=127.0.0.1:8030 \ --sink-conf username=root \ --sink-conf password=\ --sink-conf jdbc-url=jdbc:mysql://127.0.0.1:9030 \ --sink-conf sink.label-prefix=label \ --table-conf replication_num=1 #### DB2 Whole Database Synchronization​ bin/flink run \ -Dexecution.checkpointing.interval=10s \ -Dparallelism.default=1 \ -c org.apache.doris.flink.tools.cdc.CdcTools \ lib/flink-doris-connector-1.16-24.0.1.jar \ db2-sync-database \ --database db2_test \ --db2-conf hostname=127.0.0.1 \ --db2-conf port=50000 \ --db2-conf username=db2inst1 \ --db2-conf password=doris123456 \ --db2-conf database-name=testdb \ --db2-conf schema-name=DB2INST1 \ --including-tables "FULL_TYPES|CUSTOMERS" \ --single-sink true \ --use-new-schema-change true \ --sink-conf fenodes=127.0.0.1:8030 \ --sink-conf 
username=root \ --sink-conf password=123456 \ --sink-conf jdbc-url=jdbc:mysql://127.0.0.1:9030 \ --sink-conf sink.label-prefix=label \ --table-conf replication_num=1 #### MongoDB Whole Database Synchronization​ /bin/flink run \ -Dexecution.checkpointing.interval=10s \ -Dparallelism.default=1 \ -c org.apache.doris.flink.tools.cdc.CdcTools \ ./lib/flink-doris-connector-1.18-24.0.1.jar \ mongodb-sync-database \ --database doris_db \ --schema-change-mode debezium_structure \ --mongodb-conf hosts=127.0.0.1:27017 \ --mongodb-conf username=flinkuser \ --mongodb-conf password=flinkpwd \ --mongodb-conf database=test \ --mongodb-conf scan.startup.mode=initial \ --mongodb-conf schema.sample-percent=0.2 \ --including-tables "tbl1|tbl2" \ --sink-conf fenodes=127.0.0.1:8030 \ --sink-conf username=root \ --sink-conf password= \ --sink-conf jdbc-url=jdbc:mysql://127.0.0.1:9030 \ --sink-conf sink.label-prefix=label \ --sink-conf sink.enable-2pc=false \ --table-conf replication_num=1 #### AWS Aurora MySQL Whole Database Synchronization​ bin/flink run \ -Dexecution.checkpointing.interval=10s \ -Dparallelism.default=1 \ -c org.apache.doris.flink.tools.cdc.CdcTools \ lib/flink-doris-connector-1.18-25.0.0.jar \ mysql-sync-database \ --database testwd \ --mysql-conf hostname=xxx.us-east-1.rds.amazonaws.com \ --mysql-conf port=3306 \ --mysql-conf username=admin \ --mysql-conf password=123456 \ --mysql-conf database-name=test \ --mysql-conf server-time-zone=UTC \ --including-tables "student" \ --sink-conf fenodes=127.0.0.1:8030 \ --sink-conf username=root \ --sink-conf password= \ --sink-conf jdbc-url=jdbc:mysql://127.0.0.1:9030 \ --sink-conf sink.label-prefix=label \ --table-conf replication_num=1 #### AWS RDS MySQL Whole Database Synchronization​ bin/flink run \ -Dexecution.checkpointing.interval=10s \ -Dparallelism.default=1 \ -c org.apache.doris.flink.tools.cdc.CdcTools \ lib/flink-doris-connector-1.18-25.0.0.jar \ mysql-sync-database \ --database testwd \ --mysql-conf hostname=xxx.ap-southeast-1.rds.amazonaws.com \ --mysql-conf port=3306 \ --mysql-conf username=admin \ --mysql-conf password=123456 \ --mysql-conf database-name=test \ --mysql-conf server-time-zone=UTC \ --including-tables "student" \ --sink-conf fenodes=127.0.0.1:8030 \ --sink-conf username=root \ --sink-conf password= \ --sink-conf jdbc-url=jdbc:mysql://127.0.0.1:9030 \ --sink-conf sink.label-prefix=label \ --table-conf replication_num=1 ## Usage Instructions​ ### Parameter Configuration​ #### General Configuration Items​ Key| Default Value| Required| Comment| fenodes| \--| Y| Doris FE http addresses. Multiple addresses are supported and should be separated by commas.| benodes| \--| N| Doris BE http addresses. Multiple addresses are supported and should be separated by commas.| jdbc-url| \--| N| JDBC connection information, such as jdbc:mysql://127.0.0.1:9030.| table.identifier| \--| Y| Doris table name, such as db.tbl.| username| \--| Y| Username for accessing Doris.| password| \--| Y| Password for accessing Doris.| auto-redirect| TRUE| N| Whether to redirect StreamLoad requests. After enabling, StreamLoad will write through FE and will no longer explicitly obtain BE information.| doris.request.retries| 3| N| The number of retries for sending requests to Doris.| doris.request.connect.timeout| 30s| N| The connection timeout for sending requests to Doris.| doris.request.read.timeout| 30s| N| The read timeout for sending requests to Doris. 
---|---|---|--- #### Source Configuration​ Key| Default Value| Required| Comment| doris.request.query.timeout| 21600s| N| The timeout for querying Doris. The default value is 6 hours.| doris.request.tablet.size| 1| N| The number of Doris Tablets corresponding to one Partition. The smaller this value is set, the more Partitions will be generated, which can increase the parallelism on the Flink side. However, it will also put more pressure on Doris.| doris.batch.size| 4064| N| The maximum number of rows read from BE at one time. Increasing this value can reduce the number of connections established between Flink and Doris, thereby reducing the additional time overhead caused by network latency.| doris.exec.mem.limit| 8192mb| N| The memory limit for a single query. The default is 8GB, in bytes.| source.use-flight-sql| FALSE| N| Whether to use Arrow Flight SQL for reading.| source.flight-sql-port| -| N| The arrow_flight_sql_port of FE when using Arrow Flight SQL for reading. ---|---|---|--- **DataStream-Specific Configuration** Key| Default Value| Required| Comment| doris.read.field| \--| N| The list of column names for reading Doris tables. Multiple columns should be separated by commas.| doris.filter.query| \--| N| The expression for filtering read data. This expression is passed to Doris. Doris uses this expression to complete source data filtering. For example, age=18. ---|---|---|--- #### Sink Configuration​ Key| Default Value| Required| Comment| sink.label-prefix| \--| Y| The label prefix used for Stream load import. In the 2pc scenario, it is required to be globally unique to ensure the EOS semantics of Flink.| sink.properties.*| \--| N| Import parameters for Stream Load. For example, 'sink.properties.column_separator' = ', ' defines the column separator, and 'sink.properties.escape_delimiters' = 'true' means that special characters as delimiters, like \x01, will be converted to binary 0x01. For JSON format import, 'sink.properties.format' = 'json', 'sink.properties.read_json_by_line' = 'true'. For detailed parameters, refer to [here](/cloud/4.x/user-guide/data- operate/import/import-way/stream-load-manual). For Group Commit mode, for example, 'sink.properties.group_commit' = 'sync_mode' sets the group commit to synchronous mode. The Flink connector has supported import configuration group commit since version 1.6.2. For detailed usage and limitations, refer to [group commit](/cloud/4.x/user-guide/data-operate/import/group-commit- manual).| sink.enable-delete| TRUE| N| Whether to enable deletion. This option requires the Doris table to have the batch deletion feature enabled (enabled by default in Doris 0.15+ versions), and only supports the Unique model.| sink.enable-2pc| TRUE| N| Whether to enable two-phase commit (2pc). The default is true, ensuring Exactly-Once semantics. For details about two-phase commit, refer to [here](/cloud/4.x/user-guide/data-operate/import/import- way/stream-load-manual).| sink.buffer-size| 1MB| N| The size of the write data cache buffer, in bytes. It is not recommended to modify it, and the default configuration can be used.| sink.buffer-count| 3| N| The number of write data cache buffers. It is not recommended to modify it, and the default configuration can be used.| sink.max-retries| 3| N| The maximum number of retries after a Commit failure. The default is 3 times.| sink.enable.batch- mode| FALSE| N| Whether to use the batch mode to write to Doris. 
After enabling, the writing timing does not rely on Checkpoint, and it is controlled by parameters such as sink.buffer-flush.max-rows, sink.buffer-flush.max-bytes, and sink.buffer-flush.interval. Meanwhile, after enabling, Exactly-once semantics will not be guaranteed, but idempotency can be achieved with the help of the Uniq model.| sink.flush.queue-size| 2| N| The size of the cache queue in batch mode.| sink.buffer-flush.max-rows| 500000| N| The maximum number of rows written in a single batch in batch mode.| sink.buffer- flush.max-bytes| 100MB| N| The maximum number of bytes written in a single batch in batch mode.| sink.buffer-flush.interval| 10s| N| The interval for asynchronously flushing the cache in batch mode.| sink.ignore.update-before| TRUE| N| Whether to ignore the update-before event. The default is to ignore it. ---|---|---|--- #### Lookup Join Configuration​ Key| Default Value| Required| Comment| lookup.cache.max-rows| -1| N| The maximum number of rows in the lookup cache. The default value is -1, which means the cache is not enabled.| lookup.cache.ttl| 10s| N| The maximum time for the lookup cache. The default is 10 seconds.| lookup.max-retries| 1| N| The number of retries after a lookup query fails.| lookup.jdbc.async| FALSE| N| Whether to enable asynchronous lookup. The default is false.| lookup.jdbc.read.batch.size| 128| N| The maximum batch size for each query in asynchronous lookup.| lookup.jdbc.read.batch.queue-size| 256| N| The size of the intermediate buffer queue during asynchronous lookup.| lookup.jdbc.read.thread-size| 3| N| The number of jdbc threads for lookup in each task. ---|---|---|--- #### Full Database Synchronization Configuration​ **Syntax** bin/flink run \ -c org.apache.doris.flink.tools.cdc.CdcTools \ lib/flink-doris-connector-1.16-1.6.1.jar \ \ --database \ [--job-name ] \ [--table-prefix ] \ [--table-suffix ] \ [--including-tables ] \ [--excluding-tables ] \ --mysql-conf [--mysql-conf ...] \ --oracle-conf [--oracle-conf ...] \ --postgres-conf [--postgres-conf ...] \ --sqlserver-conf [--sqlserver-conf ...] \ --sink-conf [--table-conf ...] \ [--table-conf [--table-conf ...]] **Configuration** Key| Comment| \--job-name| The name of the Flink task, which is optional.| \--database| The name of the database synchronized to Doris.| \--table-prefix| The prefix name of the Doris table, for example, --table-prefix ods_.| \--table-suffix| The suffix name of the Doris table, similar to the prefix.| \--including-tables| The MySQL tables that need to be synchronized. Multiple tables can be separated by |, and regular expressions are supported. For example, --including-tables table1.| \--excluding-tables| The tables that do not need to be synchronized. The usage is the same as that of --including- tables.| \--mysql-conf| The configuration of the MySQL CDCSource, for example, --mysql-conf hostname=127.0.0.1. You can view all the configurations of MySQL- CDC [here](https://nightlies.apache.org/flink/flink-cdc-docs- release-3.2/docs/connectors/flink-sources/mysql-cdc/). Among them, hostname, username, password, and database-name are required. When the synchronized database and table contain non-primary key tables, scan.incremental.snapshot.chunk.key-column must be set, and only one non-null type field can be selected. 
For example: scan.incremental.snapshot.chunk.key-column=database.table:column,database.table1:column..., and columns of different databases and tables are separated by commas.| \--oracle-conf| The configuration of the Oracle CDCSource, for example, --oracle-conf hostname=127.0.0.1. You can view all the configurations of Oracle-CDC [here](https://nightlies.apache.org/flink/flink-cdc-docs-release-3.2/docs/connectors/flink-sources/oracle-cdc/). Among them, hostname, username, password, database-name, and schema-name are required.| \--postgres-conf| The configuration of the Postgres CDCSource, for example, --postgres-conf hostname=127.0.0.1. You can view all the configurations of Postgres-CDC [here](https://nightlies.apache.org/flink/flink-cdc-docs-release-3.2/docs/connectors/flink-sources/postgres-cdc/). Among them, hostname, username, password, database-name, schema-name, and slot.name are required.| \--sqlserver-conf| The configuration of the SQLServer CDCSource, for example, --sqlserver-conf hostname=127.0.0.1. You can view all the configurations of SQLServer-CDC [here](https://nightlies.apache.org/flink/flink-cdc-docs-release-3.2/docs/connectors/flink-sources/sqlserver-cdc/). Among them, hostname, username, password, database-name, and schema-name are required.| \--db2-conf| The configuration of the DB2 CDCSource, for example, --db2-conf hostname=127.0.0.1. You can view all the configurations of DB2-CDC [here](https://nightlies.apache.org/flink/flink-cdc-docs-release-3.2/docs/connectors/flink-sources/db2-cdc/). Among them, hostname, username, password, database-name, and schema-name are required.| \--sink-conf| The configuration of the Doris Sink. All available items are listed in the Parameter Configuration section above.| \--mongodb-conf| The configuration of the MongoDB CDCSource, for example, --mongodb-conf hosts=127.0.0.1:27017. You can view all the configurations of Mongo-CDC [here](https://nightlies.apache.org/flink/flink-cdc-docs-release-3.2/docs/connectors/flink-sources/mongodb-cdc/). Among them, hosts, username, password, and database are required. --mongodb-conf schema.sample-percent is the configuration for automatically sampling MongoDB data to create tables in Doris, and the default value is 0.2.| \--table-conf| The configuration items of the Doris table, that is, the content included in PROPERTIES (except for table-buckets, which is not a PROPERTIES attribute). For example, --table-conf replication_num=1, and --table-conf table-buckets="tbl1:10,tbl2:20,a.*:30,b.*:40,.*:50" specifies the number of buckets for different tables, matched in the order of the regular expressions. If there is no match, tables are created with the BUCKETS AUTO method.| \--schema-change-mode| The mode for parsing schema changes, either debezium_structure or sql_parser. The debezium_structure mode is used by default; it parses the data structure produced when the upstream CDC synchronizes data and detects DDL changes from that structure. The sql_parser mode parses the DDL statements emitted when the upstream CDC synchronizes data, so this parsing mode is more accurate. Usage example: --schema-change-mode debezium_structure. This function is available in versions later than 24.0.0.| \--single-sink| Whether to use a single Sink to synchronize all tables.
After enabling, it can also automatically identify newly created tables upstream and create tables automatically.| \--multi-to-one-origin| The configuration of the source tables when multiple upstream tables are written to the same table, for example: --multi-to-one-origin "a_.*|b_.*", refer to [#208](https://github.com/apache/doris-flink-connector/pull/208)| \--multi-to- one-target| Used in combination with multi-to-one-origin, the configuration of the target table, for example: --multi-to-one-target "a|b"| \--create-table- only| Whether to only synchronize the structure of the table. ---|--- ### Type Mapping​ Doris Type| Flink Type| NULL_TYPE| NULL| BOOLEAN| BOOLEAN| TINYINT| TINYINT| SMALLINT| SMALLINT| INT| INT| BIGINT| BIGINT| FLOAT| FLOAT| DOUBLE| DOUBLE| DATE| DATE| DATETIME| TIMESTAMP| DECIMAL| DECIMAL| CHAR| STRING| LARGEINT| STRING| VARCHAR| STRING| STRING| STRING| DECIMALV2| DECIMAL| ARRAY| ARRAY| MAP| STRING| JSON| STRING| VARIANT| STRING| IPV4| STRING| IPV6| STRING ---|--- ### Monitoring Metrics​ Flink provides multiple [Metrics](https://nightlies.apache.org/flink/flink- docs-master/docs/ops/metrics/#metrics) for monitoring the indicators of the Flink cluster. The following are the newly added monitoring metrics for the Flink Doris Connector. Name| Metric Type| Description| totalFlushLoadBytes| Counter| The total number of bytes that have been flushed and imported.| flushTotalNumberRows| Counter| The total number of rows that have been imported and processed.| totalFlushLoadedRows| Counter| The total number of rows that have been successfully imported.| totalFlushTimeMs| Counter| The total time taken for successful imports to complete.| totalFlushSucceededNumber| Counter| The number of times that imports have been successfully completed.| totalFlushFailedNumber| Counter| The number of times that imports have failed.| totalFlushFilteredRows| Counter| The total number of rows with unqualified data quality.| totalFlushUnselectedRows| Counter| The total number of rows filtered by the where condition.| beginTxnTimeMs| Histogram| The time taken to request the Fe to start a transaction, in milliseconds.| putDataTimeMs| Histogram| The time taken to request the Fe to obtain the import data execution plan.| readDataTimeMs| Histogram| The time taken to read data.| writeDataTimeMs| Histogram| The time taken to execute the write data operation.| commitAndPublishTimeMs| Histogram| The time taken to request the Fe to commit and publish the transaction.| loadTimeMs| Histogram| The time taken for the import to complete. 
---|---|--- ## Best Practices​ ### FlinkSQL Quickly Connects to MySQL Data via CDC​ -- enable checkpoint SET 'execution.checkpointing.interval' = '10s'; CREATE TABLE cdc_mysql_source ( id int ,name VARCHAR ,PRIMARY KEY (id) NOT ENFORCED ) WITH ( 'connector' = 'mysql-cdc', 'hostname' = '127.0.0.1', 'port' = '3306', 'username' = 'root', 'password' = 'password', 'database-name' = 'database', 'table-name' = 'table' ); -- Supports synchronizing insert/update/delete events CREATE TABLE doris_sink ( id INT, name STRING ) WITH ( 'connector' = 'doris', 'fenodes' = '127.0.0.1:8030', 'table.identifier' = 'database.table', 'username' = 'root', 'password' = '', 'sink.properties.format' = 'json', 'sink.properties.read_json_by_line' = 'true', 'sink.enable-delete' = 'true', -- Synchronize delete events 'sink.label-prefix' = 'doris_label' ); insert into doris_sink select id,name from cdc_mysql_source; ### Flink Performs Partial Column Updates​ CREATE TABLE doris_sink ( id INT, name STRING, bank STRING, age int ) WITH ( 'connector' = 'doris', 'fenodes' = '127.0.0.1:8030', 'table.identifier' = 'database.table', 'username' = 'root', 'password' = '', 'sink.properties.format' = 'json', 'sink.properties.read_json_by_line' = 'true', 'sink.properties.columns' = 'id,name,bank,age', -- Columns that need to be updated 'sink.properties.partial_columns' = 'true' -- Enable partial column updates ); ### Flink Imports Bitmap Data​ CREATE TABLE bitmap_sink ( dt int, page string, user_id int ) WITH ( 'connector' = 'doris', 'fenodes' = '127.0.0.1:8030', 'table.identifier' = 'test.bitmap_test', 'username' = 'root', 'password' = '', 'sink.label-prefix' = 'doris_label', 'sink.properties.columns' = 'dt,page,user_id,user_id=to_bitmap(user_id)' ) ### FlinkCDC Updates Key Columns​ Generally, in a business database, a number is often used as the primary key of a table. For example, for the Student table, the number (id) is used as the primary key. However, as the business develops, the number corresponding to the data may change. In this scenario, when using Flink CDC + Doris Connector to synchronize data, the data of the primary key column in Doris can be automatically updated. **Principle** The underlying collection tool of Flink CDC is Debezium. Debezium internally uses the op field to identify corresponding operations. The values of the op field are c, u, d, and r, corresponding to create, update, delete, and read respectively. For the update of the primary key column, Flink CDC will send DELETE and INSERT events downstream, and the data of the primary key column in Doris will be automatically updated after the data is synchronized to Doris. **Usage** The Flink program can refer to the above CDC synchronization examples. After successfully submitting the task, execute the statement to update the primary key column on the MySQL side (for example, update student set id = '1002' where id = '1001'), and then the data in Doris can be modified. ### Flink Deletes Data According to Specified Columns​ Generally, messages in Kafka use specific fields to mark the operation type, such as {"op_type":"delete",data:{...}}. For this kind of data, it is hoped to delete the data with op_type=delete. The DorisSink will, by default, distinguish the types of events according to RowKind. Usually, in the case of CDC, the event type can be directly obtained, and the hidden column `__DORIS_DELETE_SIGN__` can be assigned a value to achieve the purpose of deletion. 
However, for Kafka, it is necessary to judge according to the business logic and explicitly pass in the value of the hidden column. -- For example, the upstream data:{"op_type":"delete",data:{"id":1,"name":"zhangsan"}} CREATE TABLE KAFKA_SOURCE( data STRING, op_type STRING ) WITH ( 'connector' = 'kafka', ... ); CREATE TABLE DORIS_SINK( id INT, name STRING, __DORIS_DELETE_SIGN__ INT ) WITH ( 'connector' = 'doris', 'fenodes' = '127.0.0.1:8030', 'table.identifier' = 'db.table', 'username' = 'root', 'password' = '', 'sink.enable-delete' = 'false', -- false means not to obtain the event type from RowKind 'sink.properties.columns' = 'id, name, __DORIS_DELETE_SIGN__' -- Explicitly specify the import columns of streamload ); INSERT INTO DORIS_SINK SELECT json_value(data,'$.id') as id, json_value(data,'$.name') as name, if(op_type='delete',1,0) as __DORIS_DELETE_SIGN__ from KAFKA_SOURCE; ### Flink CDC Synchronize DDL Statements​ Generally, when synchronizing upstream data sources such as MySQL, when adding or deleting fields in the upstream, you need to synchronize the Schema Change operation in Doris. For this scenario, you usually need to write a program for the DataStream API and use the JsonDebeziumSchemaSerializer serializer provided by DorisSink to automatically perform SchemaChange. For details, please refer to [CDCSchemaChangeExample.java](https://github.com/apache/doris-flink- connector/blob/master/flink-doris- connector/src/test/java/org/apache/doris/flink/example/CDCSchemaChangeExample.java) In the whole database synchronization tool provided by the Connector, no additional configuration is required, and the upstream DDL will be automatically synchronized and the SchemaChange operation will be performed in Doris. ## Frequently Asked Questions (FAQ)​ 1. **errCode = 2, detailMessage = Label [label_0_1] has already been used, relate to txn [19650]** In the Exactly-Once scenario, the Flink Job must be restarted from the latest Checkpoint/Savepoint, otherwise the above error will be reported. When Exactly-Once is not required, this problem can also be solved by disabling 2PC submission (sink.enable-2pc=false) or changing to a different sink.label- prefix. 2. **errCode = 2, detailMessage = transaction [19650] not found** This occurs during the Commit stage. The transaction ID recorded in the checkpoint has expired on the FE side. When committing again at this time, the above error will occur. At this point, it's impossible to start from the checkpoint. Subsequently, you can extend the expiration time by modifying the `streaming_label_keep_max_second` configuration in `fe.conf`. The default expiration time is 12 hours. After doris version 2.0, it will also be limited by the `label_num_threshold` configuration in `fe.conf` (default 2000), which can be increased or changed to -1 (-1 means only limited by time). 3. **errCode = 2, detailMessage = current running txns on db 10006 is 100, larger than limit 100** This is because the concurrent imports into the same database exceed 100. It can be solved by adjusting the parameter `max_running_txn_num_per_db` in `fe.conf`. For specific details, please refer to [max_running_txn_num_per_db](https://doris.apache.org/zh-CN/docs/dev/admin- manual/config/fe-config/#max_running_txn_num_per_db). Meanwhile, frequently modifying the label and restarting a task may also lead to this error. In the 2pc scenario (for Duplicate/Aggregate models), the label of each task needs to be unique. 
And when restarting from a checkpoint, the Flink task will actively abort the transactions that have been pre-committed successfully but not yet committed. Frequent label modifications and restarts will result in a large number of pre-committed transactions that cannot be aborted and therefore keep occupying transactions. In the Unique model, 2pc can also be disabled to achieve idempotent writes.

4. **tablet writer write failed, tablet_id=190958, txn_id=3505530, err=-235** This usually occurs with Connector versions earlier than 1.1.0 and is caused by too high a write frequency, which leads to an excessive number of versions. You can reduce the frequency of Stream Load by setting the `sink.batch.size` and `sink.batch.interval` parameters. After Connector version 1.1.0, the default writing timing is controlled by Checkpoint, and you can reduce the write frequency by increasing the Checkpoint interval.

5. **How to skip dirty data when Flink is importing?** When Flink imports data, dirty data such as field format or length issues will cause StreamLoad to report errors, and Flink will keep retrying. If you need to skip such data, you can disable the strict mode of StreamLoad (by setting `strict_mode=false` and `max_filter_ratio=1`) or filter the data before the Sink operator.

6. **How to configure when the network between Flink machines and BE machines is not connected?** When Flink initiates a write to Doris, Doris redirects the write operation to BE, and the address returned is the internal network IP of BE, which is the IP seen through the `show backends` command. If Flink cannot reach that network, an error will be reported. In this case, you can configure the external network IP of BE in `benodes`.

7. **stream load error: HTTP/1.1 307 Temporary Redirect** Flink first requests FE and, after receiving a 307, requests BE following the redirect. When FE is under pressure (Full GC, high load, or network delay), HttpClient by default sends the data without waiting for the response beyond a certain period of time (3 seconds). Since the request body is an InputStream by default, the data cannot be replayed once the 307 response arrives, and an error is reported directly. There are three ways to solve this problem: 1. Upgrade to Connector 25.1.0 or above, which increases the default waiting time; 2. Set auto-redirect=false to initiate requests directly to BE (not applicable to some cloud scenarios); 3. For the Unique Key model, batch mode can be enabled.

--- # Source: https://docs.velodb.io/cloud/4.x/integration/data-processing/spark-doris-connector Version: 4.x On this page # Spark Doris Connector Spark Doris Connector supports reading data stored in Doris and writing data to Doris through Spark. Github: * Support reading data in batch mode from `Doris` through `RDD`, `DataFrame` and `Spark SQL`.
It is recommended to use `DataFrame` or `Spark SQL`. * Support writing data to `Doris` in batch or streaming mode with the DataFrame API and Spark SQL. * You can map a `Doris` table to a `DataFrame` or `RDD`; `DataFrame` is recommended. * Support filtering data on the `Doris` side to reduce the amount of data transferred.

## Version Compatibility Connector| Spark| Doris| Java| Scala| 25.1.0| 3.5 - 3.1, 2.4| 1.0 +| 8| 2.12, 2.11| 25.0.1| 3.5 - 3.1, 2.4| 1.0 +| 8| 2.12, 2.11| 25.0.0| 3.5 - 3.1, 2.4| 1.0 +| 8| 2.12, 2.11| 24.0.0| 3.5 ~ 3.1, 2.4| 1.0 +| 8| 2.12, 2.11| 1.3.2| 3.4 ~ 3.1, 2.4, 2.3| 1.0 ~ 2.1.6| 8| 2.12, 2.11| 1.3.1| 3.4 ~ 3.1, 2.4, 2.3| 1.0 ~ 2.1.0| 8| 2.12, 2.11| 1.3.0| 3.4 ~ 3.1, 2.4, 2.3| 1.0 ~ 2.1.0| 8| 2.12, 2.11| 1.2.0| 3.2, 3.1, 2.3| 1.0 ~ 2.0.2| 8| 2.12, 2.11| 1.1.0| 3.2, 3.1, 2.3| 1.0 ~ 1.2.8| 8| 2.12, 2.11| 1.0.1| 3.1, 2.3| 0.12 - 0.15| 8| 2.12, 2.11 ---|---|---|---|---

## How To Use

### Maven
<dependency>
    <groupId>org.apache.doris</groupId>
    <artifactId>spark-doris-connector-spark-3.5</artifactId>
    <version>25.1.0</version>
</dependency>

::: tip Starting from version 24.0.0, the naming rules of the Doris connector package have been adjusted: 1. It no longer contains Scala version information. 2. For Spark 2.x versions, use the package named `spark-doris-connector-spark-2` uniformly, which by default is compiled against Scala 2.11 only. If you need the Scala 2.12 version, please compile it yourself. 3. For Spark 3.x versions, use the package named `spark-doris-connector-spark-3.x` according to the specific Spark version. Applications based on Spark 3.0 can use the package `spark-doris-connector-spark-3.1`. :::

**Note** 1. Please replace the Connector version according to your Spark and Scala versions. 2. You can also download the jar package of the relevant version from [here](https://repo.maven.apache.org/maven2/org/apache/doris/).

### Compile To compile, execute `sh build.sh` in the source code directory and enter the Scala and Spark versions you need according to the prompts. After successful compilation, the target jar package will be generated in the `dist` directory, such as `spark-doris-connector-spark-3.5-25.1.0.jar`. Copy this file to the `ClassPath` of `Spark` to use `Spark-Doris-Connector`. For example, if `Spark` is running in `Local` mode, put this file in the `jars/` folder. If `Spark` is running in `Yarn` cluster mode, put this file in the pre-deployment package. For example, upload `spark-doris-connector-spark-3.5-25.1.0.jar` to HDFS and add the jar package path on HDFS to the `spark.yarn.jars` parameter: 1. Upload `spark-doris-connector-spark-3.5-25.1.0.jar` to HDFS. hdfs dfs -mkdir /spark-jars/ hdfs dfs -put /your_local_path/spark-doris-connector-spark-3.5-25.1.0.jar /spark-jars/ 2. Add the `spark-doris-connector-spark-3.5-25.1.0.jar` dependency in the cluster.
spark.yarn.jars=hdfs:///spark-jars/spark-doris-connector-spark-3.5-25.1.0.jar ## Example​ ### Batch Read​ #### RDD​ import org.apache.doris.spark._ val dorisSparkRDD = sc.dorisRDD( tableIdentifier = Some("$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME"), cfg = Some(Map( "doris.fenodes" -> "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT", "doris.request.auth.user" -> "$YOUR_DORIS_USERNAME", "doris.request.auth.password" -> "$YOUR_DORIS_PASSWORD" )) ) dorisSparkRDD.collect() #### DataFrame​ val dorisSparkDF = spark.read.format("doris") .option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME") .option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT") .option("user", "$YOUR_DORIS_USERNAME") .option("password", "$YOUR_DORIS_PASSWORD") .load() dorisSparkDF.show(5) #### Spark SQL​ CREATE TEMPORARY VIEW spark_doris USING doris OPTIONS( "table.identifier"="$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME", "fenodes"="$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT", "user"="$YOUR_DORIS_USERNAME", "password"="$YOUR_DORIS_PASSWORD" ); SELECT * FROM spark_doris; #### pySpark​ dorisSparkDF = spark.read.format("doris") .option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME") .option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT") .option("user", "$YOUR_DORIS_USERNAME") .option("password", "$YOUR_DORIS_PASSWORD") .load() // show 5 lines data dorisSparkDF.show(5) #### Reading via Arrow Flight SQL​ Starting from version 24.0.0, data can be read via Arrow Flight SQL (Doris version >= 2.1.0 is required). Set `doris.read.mode` to arrow, set `doris.read.arrow-flight-sql.port` to the Arrow Flight SQL port configured by FE. For server configuration, refer to [High-speed data transmission link based on Arrow Flight SQL](https://doris.apache.org/zh-CN/docs/dev/db-connect/arrow- flight-sql-connect). 
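The options can be set either on a DataFrame read (shown right after this note) or on a temporary view in Spark SQL. The following is an illustrative sketch of the Spark SQL form only: the view name `spark_doris_arrow` and the port value `12345` are placeholders, and it assumes the `doris.read.*` keys from the Configuration section can be passed through `OPTIONS` in the same way as in the batch-read example above.

```sql
-- Sketch: reading via Arrow Flight SQL from a temporary view.
-- Assumes doris.read.mode / doris.read.arrow-flight-sql.port are accepted in OPTIONS;
-- replace the $YOUR_* placeholders and the port with your own values.
CREATE TEMPORARY VIEW spark_doris_arrow
USING doris
OPTIONS(
  "table.identifier" = "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME",
  "fenodes" = "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT",
  "user" = "$YOUR_DORIS_USERNAME",
  "password" = "$YOUR_DORIS_PASSWORD",
  "doris.read.mode" = "arrow",
  "doris.read.arrow-flight-sql.port" = "12345"
);

SELECT * FROM spark_doris_arrow LIMIT 10;
```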
val df = spark.read.format("doris") .option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME") .option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT") .option("doris.user", "$YOUR_DORIS_USERNAME") .option("doris.password", "$YOUR_DORIS_PASSWORD") .option("doris.read.mode", "arrow") .option("doris.read.arrow-flight-sql.port", "12345") .load() df.show() ### Batch Write​ #### DataFrame​ val mockDataDF = List( (3, "440403001005", "21.cn"), (1, "4404030013005", "22.cn"), (33, null, "23.cn") ).toDF("id", "mi_code", "mi_name") mockDataDF.show(5) mockDataDF.write.format("doris") .option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME") .option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT") .option("user", "$YOUR_DORIS_USERNAME") .option("password", "$YOUR_DORIS_PASSWORD") //other options //specify the fields to write .option("doris.write.fields", "$YOUR_FIELDS_TO_WRITE") // Support setting Overwrite mode to overwrite data // .mode(SaveMode.Overwrite) .save() #### Spark SQL​ CREATE TEMPORARY VIEW spark_doris USING doris OPTIONS( "table.identifier"="$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME", "fenodes"="$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT", "user"="$YOUR_DORIS_USERNAME", "password"="$YOUR_DORIS_PASSWORD" ); INSERT INTO spark_doris VALUES ("VALUE1", "VALUE2", ...); -- insert into select INSERT INTO spark_doris SELECT * FROM YOUR_TABLE; -- insert overwrite INSERT OVERWRITE SELECT * FROM YOUR_TABLE; ### Streaming Write​ #### DataFrame​ ##### Write structured data​ val df = spark.readStream.format("your_own_stream_source").load() df.writeStream .format("doris") .option("checkpointLocation", "$YOUR_CHECKPOINT_LOCATION") .option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME") .option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT") .option("user", "$YOUR_DORIS_USERNAME") .option("password", "$YOUR_DORIS_PASSWORD") .start() .awaitTermination() ##### Write directly​ If the first column of data in the data stream is formatted data that conforms to the `Doris` table structure, such as CSV format data with the same column order, or JSON format data with the same field name, it can be written directly to `Doris` by setting the `doris.sink.streaming.passthrough` option to `true` without converting to `DataFrame`. Taking kafka as an example. And assuming the table structure to be written is: CREATE TABLE `t2` ( `c0` int NULL, `c1` varchar(10) NULL, `c2` date NULL ) ENGINE=OLAP DUPLICATE KEY(`c0`) COMMENT 'OLAP' DISTRIBUTED BY HASH(`c0`) BUCKETS 1 PROPERTIES ( "replication_allocation" = "tag.location.default: 1" ); The value of the message is `{"c0":1,"c1":"a","dt":"2024-01-01"}` in json format. val kafkaSource = spark.readStream .format("kafka") .option("kafka.bootstrap.servers", "$YOUR_KAFKA_SERVERS") .option("startingOffsets", "latest") .option("subscribe", "$YOUR_KAFKA_TOPICS") .load() // Select the value of the message as the first column of the DataFrame. kafkaSource.selectExpr("CAST(value as STRING)") .writeStream .format("doris") .option("checkpointLocation", "$YOUR_CHECKPOINT_LOCATION") .option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME") .option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT") .option("user", "$YOUR_DORIS_USERNAME") .option("password", "$YOUR_DORIS_PASSWORD") // Set this option to true, and the first column will be written directly without processing. 
.option("doris.sink.streaming.passthrough", "true") .option("doris.sink.properties.format", "json") .start() .awaitTermination() #### Write in JSON format​ Set `doris.sink.properties.format` to json val df = spark.readStream.format("your_own_stream_source").load() df.write.format("doris") .option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT") .option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME") .option("user", "$YOUR_DORIS_USERNAME") .option("password", "$YOUR_DORIS_PASSWORD") .option("doris.sink.properties.format", "json") .save() ### Spark Doris Catalog​ Since version 24.0.0, support accessing doris through Spark Catalog. #### Catalog Config​ Key| Required| Comment| spark.sql.catalog.your_catalog_name| true| Set class name of catalog provider, the only valid value for Doris is `org.apache.doris.spark.catalog.DorisTableCatalog`| spark.sql.catalog.your_catalog_name.doris.fenodes| true| Set Doris FE node in the format fe_ip:fe_http_port| spark.sql.catalog.your_catalog_name.doris.query.port| false| Set Doris FE query port, this option is unnecessary if `spark.sql.catalog.your_catalog_name.doris.fe.auto.fetch` is set to true| spark.sql.catalog.your_catalog_name.doris.user| true| Set Doris user| spark.sql.catalog.your_catalog_name.doris.password| true| Set Doris password| spark.sql.defaultCatalog| false| Set Spark SQL default catalog ---|---|--- tip All connector parameters that apply to DataFrame and Spark SQL can be set for catalog. For example, if you want to write data in json format, you can set the option `spark.sql.catalog.your_catalog_name.doris.sink.properties.format` to `json`. #### DataFrame​ val conf = new SparkConf() conf.set("spark.sql.catalog.your_catalog_name", "org.apache.doris.spark.catalog.DorisTableCatalog") conf.set("spark.sql.catalog.your_catalog_name.doris.fenodes", "192.168.0.1:8030") conf.set("spark.sql.catalog.your_catalog_name.doris.query.port", "9030") conf.set("spark.sql.catalog.your_catalog_name.doris.user", "root") conf.set("spark.sql.catalog.your_catalog_name.doris.password", "") val spark = builder.config(conf).getOrCreate() spark.sessionState.catalogManager.setCurrentCatalog("your_catalog_name") // show all databases spark.sql("show databases") // use databases spark.sql("use your_doris_db") // show tables in test spark.sql("show tables") // query table spark.sql("select * from your_doris_table") // write data spark.sql("insert into your_doris_table values(xxx)") #### Spark SQL​ Start Spark SQL CLI with necessary config. spark-sql \ --conf "spark.sql.catalog.your_catalog_name=org.apache.doris.spark.catalog.DorisTableCatalog" \ --conf "spark.sql.catalog.your_catalog_name.doris.fenodes=192.168.0.1:8030" \ --conf "spark.sql.catalog.your_catalog_name.doris.query.port=9030" \ --conf "spark.sql.catalog.your_catalog_name.doris.user=root" \ --conf "spark.sql.catalog.your_catalog_name.doris.password=" \ --conf "spark.sql.defaultCatalog=your_catalog_name" Execute query in Spark SQL CLI. 
-- show all databases show databases; -- use databases use your_doris_db; -- show tables in test show tables; -- query table select * from your_doris_table; -- write data insert into your_doris_table values(xxx); insert into your_doris_table select * from your_source_table; -- access table with full name select * from your_catalog_name.your_doris_db.your_doris_table; insert into your_catalog_name.your_doris_db.your_doris_table values(xxx); insert into your_catalog_name.your_doris_db.your_doris_table select * from your_source_table; ## Configuration​ ### General​ Key| Default Value| Comment| doris.fenodes| \--| Doris FE http address, support multiple addresses, separated by commas| doris.table.identifier| \--| Doris table identifier, eg, db1.tbl1| doris.user| \--| Doris username| doris.password| Empty string| Doris password| doris.request.retries| 3| Number of retries to send requests to Doris| doris.request.connect.timeout.ms| 30000| Connection timeout for sending requests to Doris| doris.request.read.timeout.ms| 30000| Read timeout for sending request to Doris| doris.request.query.timeout.s| 21600| Query the timeout time of doris, the default is 6 hour, -1 means no timeout limit| doris.request.tablet.size| 1| The number of Doris Tablets corresponding to an RDD Partition. The smaller this value is set, the more partitions will be generated. This will increase the parallelism on the Spark side, but at the same time will cause greater pressure on Doris.| doris.read.field| \--| List of column names in the Doris table, separated by commas| doris.batch.size| 4064| The maximum number of rows to read data from BE at one time. Increasing this value can reduce the number of connections between Spark and Doris. Thereby reducing the extra time overhead caused by network delay.| doris.exec.mem.limit| 8589934592| Memory limit for a single query. The default is 8GB, in bytes.| doris.write.fields| \--| Specifies the fields (or the order of the fields) to write to the Doris table, fileds separated by commas. By default, all fields are written in the order of Doris table fields.| doris.sink.batch.size| 500000| Maximum number of lines in a single write BE| doris.sink.max-retries| 0| Number of retries after writing BE, Since version 1.3.0, the default value is 0, which means no retries are performed by default. When this parameter is set greater than 0, batch-level failure retries will be performed, and data of the configured size of `doris.sink.batch.size` will be cached in the Spark Executor memory. The memory allocation may need to be appropriately increased.| doris.sink.retry.interval.ms| 10000| After configuring the number of retries, the interval between each retry, in ms| doris.sink.properties.format| \--| Data format of the stream load. Supported formats: csv, json, arrow [More Multi-parameter details](/cloud/4.x/user-guide/data- operate/import/import-way/stream-load-manual)| doris.sink.properties.*| \--| Import parameters for Stream Load. For example: Specify column separator: `'doris.sink.properties.column_separator' = ','`. [More parameter details](/cloud/4.x/user-guide/data-operate/import/import- way/stream-load-manual)| doris.sink.task.partition.size| \--| The number of partitions corresponding to the Writing task. After filtering and other operations, the number of partitions written in Spark RDD may be large, but the number of records corresponding to each Partition is relatively small, resulting in increased writing frequency and waste of computing resources. 
The smaller this value is set, the less Doris write frequency and less Doris merge pressure. It is generally used with doris.sink.task.use.repartition.| doris.sink.task.use.repartition| false| Whether to use repartition mode to control the number of partitions written by Doris. The default value is false, and coalesce is used (note: if there is no Spark action before the write, the whole computation will be less parallel). If it is set to true, then repartition is used (note: you can set the final number of partitions at the cost of shuffle).| doris.sink.batch.interval.ms| 0| The interval time of each batch sink, unit ms.| doris.sink.enable-2pc| false| Whether to enable two- stage commit. When enabled, transactions will be committed at the end of the job, and all pre-commit transactions will be rolled back when some tasks fail.| doris.sink.auto-redirect| true| Whether to redirect StreamLoad requests. After being turned on, StreamLoad will write through FE and no longer obtain BE information explicitly.| doris.enable.https| false| Whether to enable FE Https request.| doris.https.key-store-path| -| Https key store path.| doris.https.key-store-type| JKS| Https key store type.| doris.https.key-store-password| -| Https key store password.| doris.read.mode| thrift| Doris read mode, with optional `thrift` and `arrow`.| doris.read.arrow-flight-sql.port| -| Arrow Flight SQL port of Doris FE. When `doris.read.mode` is `arrow`, it is used to read data via Arrow Flight SQL. For server configuration, see [High-speed data transmission link based on Arrow Flight SQL](https://doris.apache.org/zh-CN/docs/dev/db-connect/arrow- flight-sql-connect)| doris.sink.label.prefix| spark-doris| The import label prefix when writing in Stream Load mode.| doris.thrift.max.message.size| 2147483647| The maximum size of a message when reading data via Thrift.| doris.fe.auto.fetch| false| Whether to automatically obtain FE information. When set to true, all FE node information will be requested according to the nodes configured by `doris.fenodes`. There is no need to configure multiple nodes and configure `doris.read.arrow-flight-sql.port` and `doris.query.port` separately.| doris.read.bitmap-to-string| false| Whether to convert the Bitmap type to a string composed of array indexes for reading. For the specific result format, see the function definition [BITMAP_TO_STRING](/cloud/4.x/sql- manual/sql-functions/scalar-functions/bitmap-functions/bitmap-to-string).| doris.read.bitmap-to-base64| false| Whether to convert the Bitmap type to a Base64-encoded string for reading. For the specific result format, see the function definition [BITMAP_TO_BASE64](/cloud/4.x/sql-manual/sql- functions/scalar-functions/bitmap-functions/bitmap-to-base64).| doris.query.port| -| Doris FE query port, used for overwriting and obtaining metadata of the Catalog. ---|---|--- ### SQL & Dataframe Configuration​ Key| Default Value| Comment| doris.filter.query.in.max.count| 100| In the predicate pushdown, the maximum number of elements in the in expression value list. If this number is exceeded, the in-expression conditional filtering is processed on the Spark side. ---|---|--- ### Structured Streaming Configuration​ Key| Default Value| Comment| doris.sink.streaming.passthrough| false| Write the value of the first column directly without processing. 
---|---|--- ### RDD Configuration​ Key| Default Value| Comment| doris.request.auth.user| \--| Doris username| doris.request.auth.password| \--| Doris password| doris.filter.query| \--| Filter expression of the query, which is transparently transmitted to Doris. Doris uses this expression to complete source-side data filtering. ---|---|--- ## Doris & Spark Column Type Mapping​ Doris Type| Spark Type| NULL_TYPE| DataTypes.NullType| BOOLEAN| DataTypes.BooleanType| TINYINT| DataTypes.ByteType| SMALLINT| DataTypes.ShortType| INT| DataTypes.IntegerType| BIGINT| DataTypes.LongType| FLOAT| DataTypes.FloatType| DOUBLE| DataTypes.DoubleType| DATE| DataTypes.DateType| DATETIME| DataTypes.TimestampType| DECIMAL| DecimalType| CHAR| DataTypes.StringType| LARGEINT| DecimalType| VARCHAR| DataTypes.StringType| STRING| DataTypes.StringType| JSON| DataTypes.StringType| VARIANT| DataTypes.StringType| TIME| DataTypes.DoubleType| HLL| DataTypes.StringType| Bitmap| DataTypes.StringType ---|--- tip Since version 24.0.0, the return type of the Bitmap type is string type, and the default return value is string value `Read unsupported`. ## FAQ​ 1. How to write Bitmap type In Spark SQL, when writing data through insert into, if the target table of doris contains data of type `BITMAP` or `HLL`, you need to set the option `doris.ignore-type` to the corresponding type and map the columns through `doris.write.fields`. The usage is as follows: **BITMAP** CREATE TEMPORARY VIEW spark_doris USING doris OPTIONS( "table.identifier"="$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME", "fenodes"="$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT", "user"="$YOUR_DORIS_USERNAME", "password"="$YOUR_DORIS_PASSWORD" "doris.ignore-type"="bitmap", "doris.write.fields"="col1,col2,col3,bitmap_col2=to_bitmap(col2),bitmap_col3=bitmap_hash(col3)" ); **HLL** CREATE TEMPORARY VIEW spark_doris USING doris OPTIONS( "table.identifier"="$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME", "fenodes"="$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT", "user"="$YOUR_DORIS_USERNAME", "password"="$YOUR_DORIS_PASSWORD" "doris.ignore-type"="hll", "doris.write.fields"="col1,hll_col1=hll_hash(col1)" ); tip Since version 24.0.0, `doris.ignore-type` has been deprecated and there is no need to add this parameter when writing. 2. **How to use overwrite to write?** Since version 1.3.0, overwrite mode writing is supported (only supports data overwriting at the full table level). The specific usage is as follows: **DataFrame** resultDf.format("doris") .option("doris.fenodes","$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT") // your own options .mode(SaveMode.Overwrite) .save() **SQL** INSERT OVERWRITE your_target_table SELECT * FROM your_source_table 3. **How to read Bitmap type** Starting from version 24.0.0, it supports reading converted Bitmap data through Arrow Flight SQL (Doris version >= 2.1.0 is required). **Bitmap to string** `DataFrame` example is as follows, set `doris.read.bitmap-to-string` to true. For the specific result format, see the option definition. spark.read.format("doris") .option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME") .option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT") .option("user", "$YOUR_DORIS_USERNAME") .option("password", "$YOUR_DORIS_PASSWORD") .option("doris.read.bitmap-to-string","true") .load() **Bitmap to base64** `DataFrame` example is as follows, set `doris.read.bitmap-to-base64` to true. For the specific result format, see the option definition. 
spark.read.format("doris") .option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME") .option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT") .option("user", "$YOUR_DORIS_USERNAME") .option("password", "$YOUR_DORIS_PASSWORD") .option("doris.read.bitmap-to-base64","true") .load()

4. **An error occurs when writing in DataFrame mode: `org.apache.spark.sql.AnalysisException: TableProvider implementation doris cannot be written with ErrorIfExists mode, please use Append or Overwrite modes instead.`** You need to set the save mode to Append: resultDf.write.format("doris") .option("doris.fenodes","$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT") // your own options .mode(SaveMode.Append) .save()

--- # Source: https://docs.velodb.io/cloud/4.x/integration/data-source/doris-kafka-connector Version: 4.x On this page # Doris Kafka Connector [Kafka Connect](https://docs.confluent.io/platform/current/connect/index.html) is a scalable and reliable tool for data transmission between Apache Kafka and other systems. Connectors can be defined to move large amounts of data in and out of Kafka. The Doris community provides the [doris-kafka-connector](https://github.com/apache/doris-kafka-connector) plug-in, which can write data from Kafka topics to Doris.

## Version Description Connector Version| Kafka Version| Doris Version| Java Version| 1.0.0| 2.4+| 2.0+| 8| 1.1.0| 2.4+| 2.0+| 8| 24.0.0| 2.4+| 2.0+| 8| 25.0.0| 2.4+| 2.0+| 8 ---|---|---|---

## Usage

### Download [doris-kafka-connector](https://doris.apache.org/download) Maven dependency:
<dependency>
    <groupId>org.apache.doris</groupId>
    <artifactId>doris-kafka-connector</artifactId>
    <version>25.0.0</version>
</dependency>

### Standalone mode startup Create the plugins directory under $KAFKA_HOME and put the downloaded doris-kafka-connector jar package into it. Configure config/connect-standalone.properties:
# Modify broker address
bootstrap.servers=127.0.0.1:9092
# Modify to the created plugins directory
# Note: Please fill in the direct path to Kafka here. For example: plugin.path=/opt/kafka/plugins
plugin.path=$KAFKA_HOME/plugins
# It is recommended to increase the max.poll.interval.ms time of Kafka to more than 30 minutes; the default is 5 minutes.
# This avoids Stream Load import consumption timeouts and consumers being kicked out of the consumer group.
max.poll.interval.ms=1800000
consumer.max.poll.interval.ms=1800000

Configure doris-connector-sink.properties Create doris-connector-sink.properties in the config directory and configure the following content:
name=test-doris-sink
connector.class=org.apache.doris.kafka.connector.DorisSinkConnector
topics=topic_test
doris.topic2table.map=topic_test:test_kafka_tbl
buffer.count.records=10000
buffer.flush.time=120
buffer.size.bytes=5000000
doris.urls=10.10.10.1
doris.http.port=8030
doris.query.port=9030
doris.user=root
doris.password=
doris.database=test_db
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter

Start Standalone: $KAFKA_HOME/bin/connect-standalone.sh -daemon $KAFKA_HOME/config/connect-standalone.properties $KAFKA_HOME/config/doris-connector-sink.properties

Note: It is generally not recommended to use standalone mode in a production environment.
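The sink properties above point at `test_db.test_kafka_tbl` (via `doris.database` and `doris.topic2table.map`), so that table must already exist in Doris before the connector is started. A minimal sketch of such a target table is shown below; the column names and types are assumptions for illustration (they mirror the plain-JSON example in the Best Practices section) and should be adjusted to match the messages in `topic_test`.

```sql
-- Illustrative only: a possible target table for the standalone example above.
-- Column names/types are assumptions; align them with the JSON fields in topic_test.
CREATE TABLE test_db.test_kafka_tbl (
    user_id BIGINT NOT NULL COMMENT "user id",
    name    VARCHAR(20)     COMMENT "name",
    age     INT             COMMENT "age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 12;
```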
### Distributed mode startup​ Create the plugins directory under $KAFKA_HOME and put the downloaded doris- kafka-connector jar package into it Configure config/connect-distributed.properties # Modify kafka server address bootstrap.servers=127.0.0.1:9092 # Modify group.id, the same cluster needs to be consistent group.id=connect-cluster # Modify to the created plugins directory # Note: Please fill in the direct path to Kafka here. For example: plugin.path=/opt/kafka/plugins plugin.path=$KAFKA_HOME/plugins # It is recommended to increase the max.poll.interval.ms time of Kafka to more than 30 minutes, the default is 5 minutes # Avoid Stream Load import data consumption timeout and consumers being kicked out of the consumer group max.poll.interval.ms=1800000 consumer.max.poll.interval.ms=1800000 Start Distributed $KAFKA_HOME/bin/connect-distributed.sh -daemon $KAFKA_HOME/config/connect-distributed.properties Add Connector curl -i http://127.0.0.1:8083/connectors -H "Content-Type: application/json" -X POST -d '{ "name":"test-doris-sink-cluster", "config":{ "connector.class":"org.apache.doris.kafka.connector.DorisSinkConnector", "topics":"topic_test", "doris.topic2table.map": "topic_test:test_kafka_tbl", "buffer.count.records":"10000", "buffer.flush.time":"120", "buffer.size.bytes":"5000000", "doris.urls":"10.10.10.1", "doris.user":"root", "doris.password":"", "doris.http.port":"8030", "doris.query.port":"9030", "doris.database":"test_db", "key.converter":"org.apache.kafka.connect.storage.StringConverter", "value.converter":"org.apache.kafka.connect.json.JsonConverter" } }' Operation Connector # View connector status curl -i http://127.0.0.1:8083/connectors/test-doris-sink-cluster/status -X GET # Delete connector curl -i http://127.0.0.1:8083/connectors/test-doris-sink-cluster -X DELETE # Pause connector curl -i http://127.0.0.1:8083/connectors/test-doris-sink-cluster/pause -X PUT # Restart connector curl -i http://127.0.0.1:8083/connectors/test-doris-sink-cluster/resume -X PUT # Restart tasks within the connector curl -i http://127.0.0.1:8083/connectors/test-doris-sink-cluster/tasks/0/restart -X POST Refer to: [Connect REST Interface](https://docs.confluent.io/platform/current/connect/references/restapi.html#kconnect- rest-interface) note Note that when kafka-connect is started for the first time, three topics `config.storage.topic` `offset.storage.topic` and `status.storage.topic` will be created in the kafka cluster to record the shared connector configuration of kafka-connect. Offset data and status updates. [How to Use Kafka Connect - Get Started](https://docs.confluent.io/platform/current/connect/userguide.html) ### Access an SSL-certified Kafka cluster​ Accessing an SSL-certified Kafka cluster through kafka-connect requires the user to provide a certificate file (client.truststore.jks) used to authenticate the Kafka Broker public key. 
### Access an SSL-certified Kafka cluster

Accessing an SSL-certified Kafka cluster through kafka-connect requires the user to provide the certificate file (`client.truststore.jks`) used to authenticate the Kafka broker's public key. You can add the following configuration to the `connect-distributed.properties` file:

```properties
# Connect worker
security.protocol=SSL
ssl.truststore.location=/var/ssl/private/client.truststore.jks
ssl.truststore.password=test1234

# Embedded consumer for sink connectors
consumer.security.protocol=SSL
consumer.ssl.truststore.location=/var/ssl/private/client.truststore.jks
consumer.ssl.truststore.password=test1234
```

For instructions on connecting kafka-connect to an SSL-authenticated Kafka cluster, please refer to: [Configure Kafka Connect](https://docs.confluent.io/5.1.2/tutorials/security_tutorial.html#configure-kconnect-long)

### Dead letter queue

By default, any error encountered during conversion or processing causes the connector to fail. Each connector configuration can also tolerate such errors by skipping them, optionally writing the details of each error, the failed operation, and the problematic record (with varying levels of detail) to a dead-letter queue for logging:

```properties
errors.tolerance=all
errors.deadletterqueue.topic.name=test_error_topic
errors.deadletterqueue.context.headers.enable=true
errors.deadletterqueue.topic.replication.factor=1
```

## Configuration items

| Key | Enum | Default Value | Required | Description |
|---|---|---|---|---|
| name | - | - | Y | Connect application name; must be unique within the Kafka Connect environment |
| connector.class | - | - | Y | org.apache.doris.kafka.connector.DorisSinkConnector |
| topics | - | - | Y | List of subscribed topics, separated by commas, e.g. topic1,topic2 |
| doris.urls | - | - | Y | Doris FE connection address. If there are multiple, separate them with commas, e.g. 10.20.30.1,10.20.30.2,10.20.30.3 |
| doris.http.port | - | - | Y | Doris HTTP protocol port |
| doris.query.port | - | - | Y | Doris MySQL protocol port |
| doris.user | - | - | Y | Doris username |
| doris.password | - | - | Y | Doris password |
| doris.database | - | - | Y | The database to write to. It can be empty when writing to multiple databases; in that case the database names must be specified in `doris.topic2table.map`. |
| doris.topic2table.map | - | - | N | The mapping between topics and tables, e.g. `topic1:tbl1,topic2:tbl2`. The default is empty, meaning topic names and table names correspond one to one. The multi-database form is `topic1:db1.tbl1,topic2:db2.tbl2`. |
| buffer.count.records | - | 50000 | N | The number of records each Kafka partition buffers in memory before flushing to Doris. Default 50000 records. |
| buffer.flush.time | - | 120 | N | Buffer flush interval, in seconds. Default 120 seconds. |
| buffer.size.bytes | - | 10485760 | N | The cumulative size of records buffered in memory for each Kafka partition, in bytes. |
| jmx | - | true | N | Whether to expose the connector's internal monitoring metrics through JMX. See [Doris-Connector-JMX](https://github.com/apache/doris-kafka-connector/blob/master/docs/en/Doris-Connector-JMX.md). |
| enable.2pc | - | true | N | Whether to enable two-phase commit (TwoPhaseCommit) of Stream Load. Default true. |
| enable.delete | - | false | N | Whether to synchronize delete records. Default false. |
| label.prefix | - | ${name} | N | Stream Load label prefix when importing data. Defaults to the Connector application name. |
| auto.redirect | - | true | N | Whether to redirect Stream Load requests. When enabled, Stream Load requests are sent to FE and redirected to the BE where the data will be written, so BE addresses no longer need to be exposed. |
| sink.properties.* | - | `'sink.properties.format':'json'`, `'sink.properties.read_json_by_line':'true'` | N | Import parameters for Stream Load, for example a column separator: `'sink.properties.column_separator':','`. See the [Stream Load manual](/cloud/4.x/user-guide/data-operate/import/import-way/stream-load-manual) for the full parameter list. **Enable Group Commit**, for example in sync_mode: `"sink.properties.group_commit":"sync_mode"`. Group Commit supports three modes: `off_mode`, `sync_mode`, and `async_mode`; see [Group-Commit](/cloud/4.x/user-guide/data-operate/import/group-commit-manual). **Enable partial column update**, for example to update only column col2: `"sink.properties.partial_columns":"true"`, `"sink.properties.columns":"col2"`. |
| delivery.guarantee | `at_least_once`, `exactly_once` | at_least_once | N | How data consistency is guaranteed when Kafka data is imported into Doris. Supports `at_least_once` and `exactly_once`; default `at_least_once`. Doris must be 2.1.0 or above to guarantee `exactly_once`. |
| converter.mode | `normal`, `debezium_ingestion` | normal | N | Type conversion mode for upstream data when the Connector consumes Kafka data. `normal` consumes Kafka data as-is without any type conversion. `debezium_ingestion` applies the special type conversion required when the upstream Kafka data is collected by CDC (Change Data Capture) tools such as Debezium. |
| debezium.schema.evolution | `none`, `basic` | none | N | When Debezium collects an upstream database system (such as MySQL) and its structure changes, whether added fields are synchronized to Doris. `none` means upstream structure changes are not synchronized to Doris. `basic` synchronizes the upstream column-add operation. Because changing column structure is a dangerous operation (it may accidentally drop columns of the Doris table), currently only column additions are synchronized. When a column is renamed upstream, the old column is left unchanged; the Connector adds a new column in the target table and sinks the renamed data into the new column. |
| database.time_zone | - | UTC | N | When `converter.mode` is not `normal`, the time zone used to convert date and time data types (datetime, date, timestamp, etc.). Default UTC. |
| avro.topic2schema.filepath | - | - | N | Parses the Avro content of a topic using a locally provided Avro schema file, decoupling the connector from the Confluent schema registry. Used together with the `key.converter` or `value.converter` prefix. For example, to configure local Avro schema files for the avro-user and avro-product topics: `"value.converter.avro.topic2schema.filepath":"avro-user:file:///opt/avro_user.avsc, avro-product:file:///opt/avro_product.avsc"`. See [#32](https://github.com/apache/doris-kafka-connector/pull/32). |
| record.tablename.field | - | - | N | With this parameter, data from one Kafka topic can flow to multiple Doris tables. For configuration details, refer to: [#58](https://github.com/apache/doris-kafka-connector/pull/58) |
| enable.combine.flush | `true`, `false` | false | N | Whether to merge data from all partitions and write them together. Default false. When enabled, only at_least_once semantics are guaranteed. |
| max.retries | - | 10 | N | The maximum number of times to retry on errors before failing the task. |
| retry.interval.ms | - | 6000 | N | The time in milliseconds to wait after an error before attempting a retry. |
| behavior.on.null.values | `ignore`, `fail` | ignore | N | Defines how to handle records with null values. |

For other common Kafka Connect Sink configuration items, please refer to: [connect_configuring](https://kafka.apache.org/documentation/#connect_configuring)
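As a concrete illustration of the multi-database form of `doris.topic2table.map` described above, the sketch below routes two topics into tables in two different Doris databases, leaving `doris.database` empty as the table notes. The topic, database, and table names are placeholders; the connection settings reuse the values from the examples on this page.

```bash
# Hypothetical example: one sink writing two topics into tables in two different databases
curl -i http://127.0.0.1:8083/connectors -H "Content-Type: application/json" -X POST -d '{
  "name":"multi-db-doris-sink",
  "config":{
    "connector.class":"org.apache.doris.kafka.connector.DorisSinkConnector",
    "topics":"orders_topic,users_topic",
    "doris.topic2table.map":"orders_topic:db1.orders,users_topic:db2.users",
    "doris.database":"",
    "doris.urls":"10.10.10.1",
    "doris.http.port":"8030",
    "doris.query.port":"9030",
    "doris.user":"root",
    "doris.password":"",
    "key.converter":"org.apache.kafka.connect.storage.StringConverter",
    "value.converter":"org.apache.kafka.connect.json.JsonConverter"
  }
}'
```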
## Type mapping

Doris-kafka-connector uses logical or primitive type mapping to resolve a column's data type. Primitive types are the simple data types represented by Kafka Connect's `Schema`. Logical data types usually use the `Struct` structure to represent complex types, or date and time types.

| Kafka Primitive Type | Doris Type |
|---|---|
| INT8 | TINYINT |
| INT16 | SMALLINT |
| INT32 | INT |
| INT64 | BIGINT |
| FLOAT32 | FLOAT |
| FLOAT64 | DOUBLE |
| BOOLEAN | BOOLEAN |
| STRING | STRING |
| BYTES | STRING |

| Kafka Logical Type | Doris Type |
|---|---|
| org.apache.kafka.connect.data.Decimal | DECIMAL |
| org.apache.kafka.connect.data.Date | DATE |
| org.apache.kafka.connect.data.Time | STRING |
| org.apache.kafka.connect.data.Timestamp | DATETIME |

| Debezium Logical Type | Doris Type |
|---|---|
| io.debezium.time.Date | DATE |
| io.debezium.time.Time | STRING |
| io.debezium.time.MicroTime | DATETIME |
| io.debezium.time.NanoTime | DATETIME |
| io.debezium.time.ZonedTime | DATETIME |
| io.debezium.time.Timestamp | DATETIME |
| io.debezium.time.MicroTimestamp | DATETIME |
| io.debezium.time.NanoTimestamp | DATETIME |
| io.debezium.time.ZonedTimestamp | DATETIME |
| io.debezium.data.VariableScaleDecimal | DOUBLE |

## Best Practices

### Load plain JSON data

1. Import data sample

   In Kafka, there is the following sample data:

   ```shell
   kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-data-topic --from-beginning
   {"user_id":1,"name":"Emily","age":25}
   {"user_id":2,"name":"Benjamin","age":35}
   {"user_id":3,"name":"Olivia","age":28}
   {"user_id":4,"name":"Alexander","age":60}
   {"user_id":5,"name":"Ava","age":17}
   {"user_id":6,"name":"William","age":69}
   {"user_id":7,"name":"Sophia","age":32}
   {"user_id":8,"name":"James","age":64}
   {"user_id":9,"name":"Emma","age":37}
   {"user_id":10,"name":"Liam","age":64}
   ```

2. Create the table to import into

   In Doris, create the target table with the following syntax:

   ```sql
   CREATE TABLE test_db.test_kafka_connector_tbl(
     user_id BIGINT NOT NULL COMMENT "user id",
     name VARCHAR(20) COMMENT "name",
     age INT COMMENT "age"
   )
   DUPLICATE KEY(user_id)
   DISTRIBUTED BY HASH(user_id) BUCKETS 12;
   ```
3. Create an import task

   On the machine where Kafka Connect is deployed, submit the following import task through curl:

   ```shell
   curl -i http://127.0.0.1:8083/connectors -H "Content-Type: application/json" -X POST -d '{
     "name":"test-doris-sink-cluster",
     "config":{
       "connector.class":"org.apache.doris.kafka.connector.DorisSinkConnector",
       "tasks.max":"10",
       "topics":"test-data-topic",
       "doris.topic2table.map": "test-data-topic:test_kafka_connector_tbl",
       "buffer.count.records":"10000",
       "buffer.flush.time":"120",
       "buffer.size.bytes":"5000000",
       "doris.urls":"10.10.10.1",
       "doris.user":"root",
       "doris.password":"",
       "doris.http.port":"8030",
       "doris.query.port":"9030",
       "doris.database":"test_db",
       "key.converter":"org.apache.kafka.connect.storage.StringConverter",
       "value.converter":"org.apache.kafka.connect.storage.StringConverter"
     }
   }'
   ```

### Load data collected by Debezium components

1. The MySQL database has the following table:

   ```sql
   CREATE TABLE test.test_user (
     user_id int NOT NULL,
     name varchar(20),
     age int,
     PRIMARY KEY (user_id)
   ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

   insert into test.test_user values(1,'zhangsan',20);
   insert into test.test_user values(2,'lisi',21);
   insert into test.test_user values(3,'wangwu',22);
   ```

2. Create the target table in Doris:

   ```sql
   CREATE TABLE test_db.test_user(
     user_id BIGINT NOT NULL COMMENT "user id",
     name VARCHAR(20) COMMENT "name",
     age INT COMMENT "age"
   )
   UNIQUE KEY(user_id)
   DISTRIBUTED BY HASH(user_id) BUCKETS 12;
   ```

3. Deploy the Debezium connector for MySQL component; refer to: [Debezium connector for MySQL](https://debezium.io/documentation/reference/stable/connectors/mysql.html)

4. Create the doris-kafka-connector import task

   Assume the MySQL table data collected through Debezium is in the `mysql_debezium.test.test_user` topic:

   ```shell
   curl -i http://127.0.0.1:8083/connectors -H "Content-Type: application/json" -X POST -d '{
     "name":"test-debezium-doris-sink",
     "config":{
       "connector.class":"org.apache.doris.kafka.connector.DorisSinkConnector",
       "tasks.max":"10",
       "topics":"mysql_debezium.test.test_user",
       "doris.topic2table.map": "mysql_debezium.test.test_user:test_user",
       "buffer.count.records":"10000",
       "buffer.flush.time":"120",
       "buffer.size.bytes":"5000000",
       "doris.urls":"10.10.10.1",
       "doris.user":"root",
       "doris.password":"",
       "doris.http.port":"8030",
       "doris.query.port":"9030",
       "doris.database":"test_db",
       "converter.mode":"debezium_ingestion",
       "enable.delete":"true",
       "key.converter":"org.apache.kafka.connect.json.JsonConverter",
       "value.converter":"org.apache.kafka.connect.json.JsonConverter"
     }
   }'
   ```

### Load Avro serialized data

```shell
curl -i http://127.0.0.1:8083/connectors -H "Content-Type: application/json" -X POST -d '{
  "name":"doris-avro-test",
  "config":{
    "connector.class":"org.apache.doris.kafka.connector.DorisSinkConnector",
    "topics":"avro_topic",
    "tasks.max":"10",
    "doris.topic2table.map": "avro_topic:avro_tab",
    "buffer.count.records":"100000",
    "buffer.flush.time":"120",
    "buffer.size.bytes":"10000000",
    "doris.urls":"127.0.0.1",
    "doris.user":"root",
    "doris.password":"",
    "doris.http.port":"8030",
    "doris.query.port":"9030",
    "doris.database":"test",
    "load.model":"stream_load",
    "key.converter":"io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url":"http://127.0.0.1:8081",
    "value.converter":"io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url":"http://127.0.0.1:8081"
  }
}'
```

### Load Protobuf serialized data

```shell
curl -i http://127.0.0.1:8083/connectors -H "Content-Type: application/json" -X POST -d '{
  "name":"doris-protobuf-test",
  "config":{
    "connector.class":"org.apache.doris.kafka.connector.DorisSinkConnector",
    "topics":"proto_topic",
    "tasks.max":"10",
    "doris.topic2table.map": "proto_topic:proto_tab",
    "buffer.count.records":"100000",
    "buffer.flush.time":"120",
    "buffer.size.bytes":"10000000",
    "doris.urls":"127.0.0.1",
    "doris.user":"root",
    "doris.password":"",
    "doris.http.port":"8030",
    "doris.query.port":"9030",
    "doris.database":"test",
    "load.model":"stream_load",
    "key.converter":"io.confluent.connect.protobuf.ProtobufConverter",
    "key.converter.schema.registry.url":"http://127.0.0.1:8081",
    "value.converter":"io.confluent.connect.protobuf.ProtobufConverter",
    "value.converter.schema.registry.url":"http://127.0.0.1:8081"
  }
}'
```
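After submitting any of the import tasks above, you can check that data is arriving by querying the target table over Doris's MySQL protocol port. The sketch below assumes the `mysql` client is installed and reuses the FE address, query port, and table from the plain JSON example; adjust them to your own environment.

```bash
# Verify that the sink has written rows into the target table (values from the JSON example above)
mysql -h 10.10.10.1 -P 9030 -uroot -e "SELECT COUNT(*), MAX(user_id) FROM test_db.test_kafka_connector_tbl;"
```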
"name":"doris-protobuf-test", "config":{ "connector.class":"org.apache.doris.kafka.connector.DorisSinkConnector", "topics":"proto_topic", "tasks.max":"10", "doris.topic2table.map": "proto_topic:proto_tab", "buffer.count.records":"100000", "buffer.flush.time":"120", "buffer.size.bytes":"10000000", "doris.urls":"127.0.0.1", "doris.user":"root", "doris.password":"", "doris.http.port":"8030", "doris.query.port":"9030", "doris.database":"test", "load.model":"stream_load", "key.converter":"io.confluent.connect.protobuf.ProtobufConverter", "key.converter.schema.registry.url":"http://127.0.0.1:8081", "value.converter":"io.confluent.connect.protobuf.ProtobufConverter", "value.converter.schema.registry.url":"http://127.0.0.1:8081" } }' ### Loading Data with Kafka Connect Single Message Transforms​ For example, consider data in the following format: { "registertime": 1513885135404, "userid": "User_9", "regionid": "Region_3", "gender": "MALE" } To add a hard-coded column to Kafka messages, InsertField can be used. Additionally, TimestampConverter can be used to convert Bigint type timestamps to time strings. curl -i http://127.0.0.1:8083/connectors -H "Content-Type: application/json" -X POST -d '{ "name": "insert_field_tranform", "config": { "connector.class": "org.apache.doris.kafka.connector.DorisSinkConnector", "tasks.max": "1", "topics": "users", "doris.topic2table.map": "users:kf_users", "buffer.count.records": "10", "buffer.flush.time": "11", "buffer.size.bytes": "5000000", "doris.urls": "127.0.0.1:8030", "doris.user": "root", "doris.password": "123456", "doris.http.port": "8030", "doris.query.port": "9030", "doris.database": "testdb", "key.converter": "org.apache.kafka.connect.storage.StringConverter", "value.converter": "org.apache.kafka.connect.json.JsonConverter", "value.converter.schemas.enable": "false", "transforms": "InsertField,TimestampConverter", // Insert Static Field "transforms.InsertField.type": "org.apache.kafka.connect.transforms.InsertField$Value", "transforms.InsertField.static.field": "repo", "transforms.InsertField.static.value": "Apache Doris", // Convert Timestamp Format "transforms.TimestampConverter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value", "transforms.TimestampConverter.field": "registertime", "transforms.TimestampConverter.format": "yyyy-MM-dd HH:mm:ss.SSS", "transforms.TimestampConverter.target.type": "string" } }' After InsertField and TimestampConverter transformations, the data becomes: { "userid": "User_9", "regionid": "Region_3", "gender": "MALE", "repo": "Apache Doris",// Static field added "registertime": "2017-12-21 03:38:55.404" // Unix timestamp converted to string } For more examples of Kafka Connect Single Message Transforms (SMT), please refer to the [SMT documentation](https://docs.confluent.io/cloud/current/connectors/transforms/overview.html). ## FAQ​ **1\. The following error occurs when reading Json type data:** Caused by: org.apache.kafka.connect.errors.DataException: JsonConverter with schemas.enable requires "schema" and "payload" fields and may not contain additional fields. If you are trying to deserialize plain JSON data, set schemas.enable=false in your converter configuration. 
**Reason:** The `org.apache.kafka.connect.json.JsonConverter` converter requires the message to contain matching "schema" and "payload" fields.

**Two solutions; choose one:**

1. Replace `org.apache.kafka.connect.json.JsonConverter` with `org.apache.kafka.connect.storage.StringConverter`.
2. If started in **Standalone** mode, change `value.converter.schemas.enable` or `key.converter.schemas.enable` in `config/connect-standalone.properties` to false; if started in **Distributed** mode, change `value.converter.schemas.enable` or `key.converter.schemas.enable` in `config/connect-distributed.properties` to false.
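For the second solution, the relevant worker settings look like the snippet below, shown in the same configuration-file format used earlier on this page (the distributed worker file is shown; the standalone file uses the same keys).

```properties
# In config/connect-distributed.properties (or config/connect-standalone.properties)
key.converter.schemas.enable=false
value.converter.schemas.enable=false
```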
**2. The consumption times out and the consumer is kicked out of the consumer group:**

```
org.apache.kafka.clients.consumer.CommitFailedException: Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.sendOffsetCommitRequest(ConsumerCoordinator.java:1318)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.doCommitOffsetsAsync(ConsumerCoordinator.java:1127)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsAsync(ConsumerCoordinator.java:1093)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitAsync(KafkaConsumer.java:1590)
at org.apache.kafka.connect.runtime.WorkerSinkTask.doCommitAsync(WorkerSinkTask.java:361)
at org.apache.kafka.connect.runtime.WorkerSinkTask.doCommit(WorkerSinkTask.java:376)
at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:467)
at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:381)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:221)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:206)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:204)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:259)
at org.apache.kafka.connect.runtime.isolation.Plugins.lambda$withClassLoader$1(Plugins.java:181)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
```

**Solution:**

Increase Kafka's `max.poll.interval.ms` according to your scenario; the default value is `300000`.

* If started in Standalone mode, add the `max.poll.interval.ms` and `consumer.max.poll.interval.ms` parameters to `config/connect-standalone.properties` and set their values.
* If started in Distributed mode, add the `max.poll.interval.ms` and `consumer.max.poll.interval.ms` parameters to `config/connect-distributed.properties` and set their values.

After adjusting the parameters, restart kafka-connect.

**3. Doris-kafka-connector reports an error when upgrading from version 1.0.0 or 1.1.0 to 24.0.0:**

```
org.apache.kafka.common.config.ConfigException: Topic 'connect-status' supplied via the 'status.storage.topic' property is required to have 'cleanup.policy=compact' to guarantee consistency and durability of connector and task statuses, but found the topic currently has 'cleanup.policy=delete'. Continuing would likely result in eventually losing connector and task statuses and problems restarting this Connect cluster in the future. Change the 'status.storage.topic' property in the Connect worker configurations to use a topic with 'cleanup.policy=compact'.
at org.apache.kafka.connect.util.TopicAdmin.verifyTopicCleanupPolicyOnlyCompact(TopicAdmin.java:581)
at org.apache.kafka.connect.storage.KafkaTopicBasedBackingStore.lambda$topicInitializer$0(KafkaTopicBasedBackingStore.java:47)
at org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:247)
at org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:231)
at org.apache.kafka.connect.storage.KafkaStatusBackingStore.start(KafkaStatusBackingStore.java:228)
at org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:164)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run
```

**Solution:** Change the cleanup policy of the `connect-configs` and `connect-status` topics to compact:

```shell
$KAFKA_HOME/bin/kafka-configs.sh --alter --entity-type topics --entity-name connect-configs --add-config cleanup.policy=compact --bootstrap-server 127.0.0.1:9092
$KAFKA_HOME/bin/kafka-configs.sh --alter --entity-type topics --entity-name connect-status --add-config cleanup.policy=compact --bootstrap-server 127.0.0.1:9092
```
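To confirm the change took effect, you can describe the topic configuration with the same tool; this is a standard `kafka-configs.sh` invocation using the broker address from the examples above.

```bash
# Verify that cleanup.policy is now compact for both internal topics
$KAFKA_HOME/bin/kafka-configs.sh --describe --entity-type topics --entity-name connect-configs --bootstrap-server 127.0.0.1:9092
$KAFKA_HOME/bin/kafka-configs.sh --describe --entity-type topics --entity-name connect-status --bootstrap-server 127.0.0.1:9092
```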
**4. Table schema change fails in `debezium_ingestion` converter mode:**

```
[2025-01-07 14:26:20,474] WARN [doris-normal_test_sink-connector|task-0] Table 'test_sink' cannot be altered because schema evolution is disabled. (org.apache.doris.kafka.connector.converter.RecordService:183)
[2025-01-07 14:26:20,475] ERROR [doris-normal_test_sink-connector|task-0] WorkerSinkTask{id=doris-normal_test_sink-connector-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted. Error: Cannot alter table org.apache.doris.kafka.connector.model.TableDescriptor@67cd8027 because schema evolution is disabled (org.apache.kafka.connect.runtime.WorkerSinkTask:612)
org.apache.doris.kafka.connector.exception.SchemaChangeException: Cannot alter table org.apache.doris.kafka.connector.model.TableDescriptor@67cd8027 because schema evolution is disabled
at org.apache.doris.kafka.connector.converter.RecordService.alterTableIfNeeded(RecordService.java:186)
at org.apache.doris.kafka.connector.converter.RecordService.checkAndApplyTableChangesIfNeeded(RecordService.java:150)
at org.apache.doris.kafka.connector.converter.RecordService.processStructRecord(RecordService.java:100)
at org.apache.doris.kafka.connector.converter.RecordService.getProcessedRecord(RecordService.java:305)
at org.apache.doris.kafka.connector.writer.DorisWriter.putBuffer(DorisWriter.java:155)
at org.apache.doris.kafka.connector.writer.DorisWriter.insertRecord(DorisWriter.java:124)
at org.apache.doris.kafka.connector.writer.StreamLoadWriter.insert(StreamLoadWriter.java:151)
at org.apache.doris.kafka.connector.service.DorisDefaultSinkService.insert(DorisDefaultSinkService.java:154)
at org.apache.doris.kafka.connector.service.DorisDefaultSinkService.insert(DorisDefaultSinkService.java:135)
at org.apache.doris.kafka.connector.DorisSinkTask.put(DorisSinkTask.java:97)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:583)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:336)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:237)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:206)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:202)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:257)
at org.apache.kafka.connect.runtime.isolation.Plugins.lambda$withClassLoader$1(Plugins.java:177)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
```

**Solution:**

In `debezium_ingestion` converter mode, table schema changes are disabled by default. You need to set `debezium.schema.evolution` to `basic` to enable them. Note that enabling schema evolution does not keep the Doris table strictly aligned with the upstream columns (see the `debezium.schema.evolution` parameter description): for example, a renamed upstream column is added as a new column rather than renamed in place. If you need the upstream and downstream columns to stay strictly consistent, it is best to manually apply the column change to the Doris table first and then restart the Connector task; the Connector will continue consuming from the unconsumed offset to maintain data consistency.
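As a sketch of the fix, the Debezium ingestion task from the best-practices section above could be re-submitted with schema evolution enabled. The connector name and connection values below are the ones used in that example; note that `PUT /connectors/{name}/config` replaces the whole configuration, so all existing keys must be supplied along with the new one.

```bash
# Hypothetical: update the Debezium ingestion sink from the example above so that
# upstream column additions are synchronized to Doris.
curl -i http://127.0.0.1:8083/connectors/test-debezium-doris-sink/config -H "Content-Type: application/json" -X PUT -d '{
  "connector.class":"org.apache.doris.kafka.connector.DorisSinkConnector",
  "tasks.max":"10",
  "topics":"mysql_debezium.test.test_user",
  "doris.topic2table.map": "mysql_debezium.test.test_user:test_user",
  "buffer.count.records":"10000",
  "buffer.flush.time":"120",
  "buffer.size.bytes":"5000000",
  "doris.urls":"10.10.10.1",
  "doris.user":"root",
  "doris.password":"",
  "doris.http.port":"8030",
  "doris.query.port":"9030",
  "doris.database":"test_db",
  "converter.mode":"debezium_ingestion",
  "debezium.schema.evolution":"basic",
  "enable.delete":"true",
  "key.converter":"org.apache.kafka.connect.json.JsonConverter",
  "value.converter":"org.apache.kafka.connect.json.JsonConverter"
}'
```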
On This Page * Version Description * Usage * Download * Standalone mode startup * Distributed mode startup * Access an SSL-certified Kafka cluster * Dead letter queue * Configuration items * Type mapping * Best Practices * Load plain JSON data * Load data collected by Debezium components * Load Avro serialized data * Load Protobuf serialized data * Loading Data with Kafka Connect Single Message Transforms * FAQ --- # Source: https://docs.velodb.io/cloud/4.x/integration/overview Version: 4.x On this page # Integration Overview VeloDB integrations are categorized into **BI, Lakehouse, Observability, SQL client, Data Source, Data Ingestion and Data Processing** categories. This list of VeloDB / Apache Doris integrations is continuously being updated and is not yet complete. We welcome any contributions of relevant VeloDB / Apache Doris integrations to help expand it. [Contact Us](mailto:contact@velodb.io) to update the integration list. ## Lakehouse​ Name| Logo| Description| Resources| Apache Iceberg| ![iceberg](/assets/images/iceberg-8d6eafac442eb014b2759dc49dfdf387.png)| Doris supports accessing Iceberg table data through various metadata services. In addition to reading data, Doris also supports writing to Iceberg tables.| [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/iceberg-catalog)| Apache Hudi| ![Hudi](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAfQAAAFWCAMAAACo3J9dAAAAV1BMVEVHcEwAreYAr+sAPGEASl0Ase0AsOsAr+oAs+4Asu4AregAHT4AMGEAH0EAAgQAMF0AM2IAOWYAAwQAg7kANmMAMWAALlsANGEANWYAN2cARncATn8AXJD6etjjAAAAG3RSTlMAKFEPBpL34f/HeiRoPmaKpr+Z/tHY5fL8//zXIiTrAAAW3ElEQVR4AezbB3YzIQwEYIo0Evivyf3Pmt4T3OV9Zue7wrwBS2YTERERERERERHNIpdSRdKakChMS66SVoOkmDvaqnKn0vwB1pQ7icL9Nfeea6IV6PBXsKe+J1rH+f7GHvsuaWpUG/wTPOWeaCXn++dzPk2LivlPMHHulJsPwLSIpAmR+hhMS54veCrwrbi3mfh8H8N0uVPznQDubWYf2sZ9z2kKlOF7eprfJdEEGnxfAPe0c+i+Pzjm7zuHNu7rplSbbzXe1yW6VqJ+FNj8fef5zn3dRLL58fie9jpV9ePxXeV1kg4/Dda1r+P5zvd11yk3+OnA97RXpcPPAtzTXo9ifjbc007zkoL7uulUhZ8ZmpYqaahneZRoKVLMA5iWOjxbtNbK1JdUmoewQeqi3vJms2TsVNVDaE4/kdwcunlQFw2dS7kALcv4N4T9+v37seuJJjrfYaUOi+6A/nlIfbNY1Smrn531um1YgP39s2joFDC0ad5+rODfnwVTJ8mKS13oqXb4k8eq/17oVqeiBj8zKzIsOvzZ/4WqTlLUAD8zdEnbiv7E/i1Vdbbc4WendY9BAU9Vv+ysTjUrPELLaXvR36r+Z7O5XNVJalF4CCt7bgQuW3WqRc1joKc9iv52q1+o6lRzb/AY0Ly16MtUnWIiH0zo46K/3erhqVMtzTwKrNS9i36hWZ0kDwbzsJX7uOiXmdWp9gZ4HM1pRHJz/7HqkamTFAU8UCsHfiCLv7GzOkk3D2XlkE9p4m91ktzMQ6HLwf/f2k3crU61m8eC5oOL7mhRVSfJ6tFaTmOi8B/dPt3qcvbUKZfm0axI2v+ne/CfbSS5m0dDP+IzmrA/26goPJzWIx/cIuQHPNeu8HAty7FP67mWC9nBeTgrxxX9CW/1e/bOhNtRFQjCY7rFE4ERZkDg5v3/3/m88e1LKAgab441+770+SyqabCxuub75kBCB7rula7O+z8QzprjLXcc9BLUScmzqDtfOIAkdBz08hEa1lbRWdaH2m4JByR0HHQ8q8vJmRP1R+LLPpwPIwOgA6gzcxZ0507UH4ja3zCBt9xx0GFXX0F3J+qPo9o+NV8MHQAdRD0P+v6onzXHDR0HHRuhWUFfZE/UX1xzMTIMOpbVM6AvUv/1s05x+5pjho5vr5WP0NAK+unq+C1h+yd0/JY67Ly6dov2R/2sOW7oOOigq6+gP+vqRHKRlnqV1HIRnRcHFRl6u3fEiB+rq+dBr0GdpNRKGWOs86tm/5uctcZMSmuis/eKt9zxybh6V+cV9HJXZ6nVZK2fQ4ghLkqf+kgfy8flw6IYYwjhY/bOTErSu9UcuMRZFBs6MBkHafhzWi4DOp7VabIu/F7mFNP/K8b18+DtpL7a+Yv6mwaG69iVLAeuF0ZAb4I6r45ejDq5mGIqVZydUvrrFJ56UQv5pWMueNHDMFI+OUK/EzBCw9q6GtTZxFSn4I3Skvh9p9vFn3c2jwJP6MiKsg3qcnJ/l9HfIKmYqhWDnbSkNw1rf32d4mVAW+5c5Oh4W445D7qzIOpyToCAur/ZBrpYSs7wOhA3dAatAhiMXUGvQJ3JpqcU4/05T++1iBPXsSMgWQOGDiwvQFfPgY6jzio9rzvub2ToQ98x8PYmNKHjjo6jvi7da1HXIT2tGD8W3OW7GPp1MXMGcj6S0PHtNcDVM6DjqEuXmih6q+Q7dGXEPzBfdcVb7uWgoyM0GUeHszqTaVPz5eN8vKlM6ssxJ/jtfMCxNRx0YDA2Azqe1VmF1EzBKuJDPdxF8ZwT1bXResZ3fepRz4GOtuW0T+ldy94NhUj971qMr/ixtXLQ0RGaDOg46tKkhopxNpr4az7crw8y1wgbOg56OeoZ0GHUaYqprfwk6RgP98JH+6MDh53AE3o56PAMfAZ0FHVWPjVWtEoef+WO1hz4zfoOqfnD3wLswD/I6EUjNNKm5vJGvxx2vM
sN9c173NChZlw56mvRc6BjIzTSxNRcwSrJX+l12H2ucBeBpzxoMq4e9YegY65OKqQNFF4Me9eL4prjZoEYej3ombvltHssxe1DG6jgXurs49Cg5kAUEH3XMj4Cg7EZ0CFXZ2nSJoqzkS8Evb7m+BUxqKEvKgL94Y2x2uaKrrh9aIMVrObjvSAXqDmO6vXStE8EuHoGdBR1NaetFBQdPa4JkNWux8+h45Nx5fvqGUdHN9u0TZspGsmv2EUXBfmc6x8ePWG/FgAdRF1mQAdRZzJxu6JHo+nI51mul/IMiBs63jNAO/AA6BDq9aEN7M/RcR19uDC+NgTvcsczejnqGgEdGqFRLm0pp+RRHX0Yu/rxeTHW/sp6V//uXCPUpY1pQ0U/yVeDDodsfPRK9LTv6Ulhfnz/4bxvgjrTFNKm8kbyEUG/duW/L27o+LIS1DBhNQdQx0NbvcKOVSd4Xmboyo9H4VPuOOiohp92nr3DUJfZ0ObSxgpW71b0a/1RcnzTBf/FKOiAfvrZ46jnbr6xMW2suFvVYdB7qk+Cou/2v6NS/AzzDJbd51BnnsLmRY9un6pzX2/oeGhDfnF70G8fAUUdGKHRc9pc0e3Sie9QbBZPLtUocEPHHR0H/XYLAOmoq0sX18LcFWIIIa5qG9h3qPqIpjWqPww5jPQK0H+53T7uVXe+Beps5uBn75y11iyaJrPILvL+fjNJq+YcvzCvAYELeb7jht4e9Kaos56UUlpLSSTpU0x3aa2VUsZYN4fY4glPx1jGDSPXzt3hht4e9FtTV2e+f8Z/+771O5ePn8VXk7HuycLHaDUdYhyy7+oH7woMnbrGoK+oN1nAA2ImkkoZ65+ru9X86kMta90qzUMUxXvqG4L+cbujPs8zinobNyX6JH6OR83rfBH4Kq7yQVLwa7k96CjqjS+HJqknG54Zq9iw6FeBgV4/b1kS77k96It2R321epJq8vVVJ97u1dgY6Fw/cHlpbzaI/gC9bVYvEH9j0pOLtbsvtFlIF2Avrlodt+8ZFID+5wLet0cdqbtULsaqqk+7rt3xjRZAXPLc+ZW961y6lMeZAxLwgE0BVQRPuP/b/MIGb97R8XHTvKqjyX+nq2mFtjTUV/QXqQ44+SCxqIhLZ1RiZyYTHRzaQYie23KWgHjVQhnse6R83bOigwKfumNrdXsuv+xXSeGmrCWww21Er9mMy/FyrQ5JoGRej5dBv9bA+rp3fB92KdEz6D9eSOAxfRENcbteRn0RzkbQvsVgDO+6F1Id5kCX+eWyPR1RGJLej39wottVHX3ISSXuid2Fty0WckD0lxN4RcE+rxevH2vPlofGA9FzrU5K4HOE5eTKejsYyHVXvTb1fWWicxJ4Q6vmRVmf9XZJb9UD0fkJfA6dX7RS7+HuKn1sPCj6QxL4vMrkpf5cWuTeKr3vxAHRC4ZtCkX9NWG/4r3/0X3rheg5geeqekY9vdKODbdaIsfGCdEz1XdurZ4r9uOlKavc+UC5EwdEJ1DdgPorBXu1DF463CSdP17LQZirG1E/CR/4ZuwhLxxyMC2w/zGBL7fQALhu1/Uz6l3m535smEQvd8ZVSeC/gSPE65Ue/G0LgCe5pRmHx5xhjDW0ZO2oX4ve9HBxmKhHRACgP4rqKqsd9X2ukjED7O50Z1xG15DAU6mee3O3GuGlsz9ggwbAAstP4O2opzvLtqZ7xoRNsan7v+/AH+QOfA6d7baKNdySvE8OavTSYVueq8Nbc0bUf5zvU73pDStBmTU6QNH5c3X7zlkI1duekMdxiF7ulpvxoGswnw94m+raMvI4PtEZz10M+4usCXyFjgg/j2vHHkh0fq1uCNX5RKu6vTruO8IqYjjRs6ofj6G6fSth+YjVfuHcyRzdQnWCqhfcZz/Dmwthez7oQiD6qwun9A7U58Nql0NvAR4bf0TPVD/5ql5ylf9QcG9mbNQj0TPV7Wsk8WFtwqe3Ujm1OCgcEJ2dwNvbsca5ukAbcn0nLolOcMsZQpYr2Ro02N3PnXprxj2Z6mFL8PVDDfmdA94ZZ1D140lUN5qi3xn4tgbQ/RKd/4i1+C7/EZGgDxNY0b9Iis7YGGsItVE9LQp0yPWtG2fcH0PVZU3g7/tU88Am2TDzsygeSPXdZoGHgu6X6ASqG8JG9bTo05jO3xkHS+DxEQ/cEye8phMssA4SeOO5vz3ime7AAst3yxkjXvbHLiDQHdToBmPs8SRV36FbxohMJzjjamyh0Vs8NAn5mJGi6fzxGtVCY4hwmoo2HNMnB84407CN75bLIat93RQfdL4F1gfVY0K8euAPXFiKzt8tZ4hgSeUKLy+rBXT1TfScwO8PSuBlAYp6izRREIjupVbXORlePURBmSjGxjvRszH2QVQPO65SF8MeCnFP9PKjy9zv+xYU9FJ5bPzV6M8ftmm8YPZIMfjeWw8WWLwxlvB9vyJqBfTQeiM6f2OsIYKlPxOl7JwypTsjgGuqSKoTLDQSTc9XFQA6plBX/hydvHDKEPE0GykAE+2u8Zm6ly+c2uMd3/cNduJjIix7BzjjHCbwYUWl79p+ESbqHVzRHRhjZblA6bs2PT6TAxI9f9z5qs4wTaWoKAC6B2+BLUPdYIzlD9vm3QC6aP1CHZDJKV7RXRhjNWyolUPSWUTdQ+rOb8sB7DOrKKhN0rfqrhn3/DWSarFHbsI/1sR/1OLIQqPxQhXqbY8fqROI7sAtp/FAbZRrRruou1L052+MNayVuwJsy38/iTvDTAVVD2jL1G/T9/RjrnDPA92JlW6gER2g6vTu+zUr7Nbq0HhX9IK5+rEG9vqZlGIh6BbuTeJf0Z93iXVNoD5szuSg33cC0QmqTtgFHhUn6n3rwRkH8MDTQZfCix4WJDp5ONEdztU1wphuE/Whca/oBY9Y9yjk7syi33Ci/jX5IjqB6oCNQyktilTaUfwr+tNOPsQdt1hsMjVMWr9z9KdaaOKBA932fR/1TdD5zjiIWw7ZfP+RcKCb4BhajxbYR6t6PHCg5+87lOodG3P+cxeAppeXD82IpjpB0R2oejxgdXpuytkTeLKi40HPl7apw7b5/FG/I5fDVEz1Q+vr9Rr/vjqN6fn7DrBNEbbAol62Eaiu8UcyMB2MydCq82bcs9xyy5lAo9XspABMWDldd4JbjjVlSxF4czU/a5PHEd3vXF1+b3y/3vrITEaqtx6ccYAEXjlPXK6g+H5Z3zX+iZ4TeISq6zxbgQpbslig8XX00DoYrzETeI37scwmrOYNfT6//bLF2DggOqsDn6Hc1zmIvv9WeX8zmzC+QugncZO6E4yx+QJX2pcYRA3GGcMDRmDVlot1B0QnJvC5ELu2Jc6ib1jk0vom6M3Y2z/wbhSdYIzNT9Qy7mK45GI+7gBwxdoXy/GJTqS6PSlPf8I96L87umpfGQluwNs/8HxF5y+HNjD4//5xbmucRV9P3lOU2w7pjM3zb6/xt9C8otXXsf8z7mp5qZxBB1C9fPWQPN0ZR7ivHv4tmumf+K4xfQd2YXOYqd5P6tAZB7LQ2C+kp+tY4/wX8oblO
7o3kxN4s6x7maMTjLH/eXqW/oy75tfp5acdrCGtGaKxcUB0Vltu/k0BfmwxyHyUX2YDmKHtqCvfAss0xpYf0E3XthXueweYoXMMXcO1wPJr9XJVX76Xh71MByTbwyQUCyyB6tWNsTGlGphfdTz32o521FslWGBd1OpyViH6Od+/520SfjOOqerlFhrZqoC+Z9Bv69CMrfrouhM2xtYR9WrePLXmckPrheiE3XLzZYcWmrxnrKxuOcJySC+1ethrgF7x7WQ72DB3YIGlDdtkqZHH1XwbPxkF3YGis+bqGivUbMdcEXQde8MxHweGGYZbzj43xXfezfV1Ns84IDpR1WWtkbzrjTeP+7HhGmYICXzl3XKypPc1fc3+uhrRdL1h2OKG6ASqy1yjKXfta/bXgff89a04IDpZ1ec6Tbl0bdVwl3b4L4IuzohOoLqWFW0GnxXmA981vrzugKPLBgAk2yOr4J59lRjUx1bpG2YcuOVMRZs9UvqTvr/99mHo7QNVx4pecIl1j8Y3D3Uj417bRdNPDd8Z50LVzUWbPdKffZVB6t1KtDveu56MOd8Db9wHWR/2P/uoRSvK+tjSLbAOVD1fTUZFOv+f7xVu+mRBZznj+FGwmsJ4gQtA+T995+X9JvzQiYPlkAy3HKBoM+TzxxJneXO23neNI6Lz23K5aMPFVbDYUHXq7RW612ZcyXnGfQ1EUc+xRylfAZdNcX5qdLxbTn9ftCUo5GmLqoUpvKHl7p3oebdcTWOsxhP9bX/32McozpZD8o2xYUdifi5By300Y77G54no/LacCrJoO7K+FKLef/Wtlq8w4RCd34GX34r6BZPzfRZ9B/T/d7B38iF6/Q58PGApXBB9e84qvFVSjmt12RMmhavw9kUd7IzjeOAp3/d0VlpF/iE6ZI1kvNjVuf+zyXxVx3diz23GQc53xjl42SZbIlTnH6JzVb22fWa/GXOZ3I3X8AunQtX2ewKeg+SvkvJioVGpKeoXXs75yyGZCXwlY6yuFTEHfNoZzjj/CXxMFTsy3+4OwrpnXtRbDh2uOqinI9qx+hCdS3Xdask5HmP+FlgnHXhdUnq+nLuywPJ3y+l8Vfi0L+EbgejArruDDjzWPpPK7I8fovPuq8tCqM4/4zXucmiJCWB//HTd+aqOe6h+AuT8o+jwBP6tNw8HQc7pW2AdqLrEcsxJcs6foxMT+BdUHfHmIWU5/yg682VbwSNWDTtEzj+pO1/VtfJD9ZS7+g6J7j+B13gRqvOPonPbcvNe9PDcP9EdW2g0bAQ5/zjjGMbY8u1iCV+df7rucKrHk29//KTudqpX2Rg7b9Rm+8cCS6C6ypoIfomPBZbilit4yHhEMubZGdf/7Wf/Dz/t8fXr54+fd//MfyVT3T5py3JOjKZtp/Z/2zkPHclBGAxbwqqO0wQkuvd/0BvCrFOZy142Uhb8/QYzvVi/SG+atvuitwt1/XmsH57DGFuIqRP9GSe9QvA/sQBP3RMPf0yAhpjruukv47wf/I0M6ybDoG3EFDR3ohjyZO8vWv38Sts4NASPKLohruu6r/sddYioRkYp9VKae+Izw6qXFJqMQ0iW6kvNz9A1mN4SOz543/kaRApWP4YlSZPRtgWn/05kV9sJWv7vLbFj2MGCDym6ISJOQZOYQuMpxxFtGtc2+CW0m/AxJE8RJd02bYdyxxIn+1wu7Vc37Q1Xeb2x6oauYrix1t2Kffd2d+s64vQLVsdm+Dydt4TwpKoH6CVDKwXwSzGmO8JgoRdYt2+zuNsUkCy3nGR39o0SFb9qdfbPXjtfgYDngaO7AKiRktzGqvChl/tv5LzVkezH6dzA40DpZvClcyB32wr9ZDqPO/0qJ/ouqfJ92BI7doyQG9TYcmjpu1cXk+k8J8ToJZA+Wo586niJh5ZcjX59Vjdd4vBHeCpq9OuzejMeTufwbNTol7bA83i0dg45G12tbtz+WgMGs6y5eRtdrY7t/loDmLnR1erVdt95pjWH2ehq9dXVZ3ybbc2BbYlU+K9LQnuZzrM2ulodm8TxEhkbXa0uK22OEaAEo6vVkdz7WgMEGcOdLZTKpPa0jfN0XsI6us7q1SgnJ2ZsdGcLJH1gLLvV2rludS9gZxtS23KWPtcZPX20nGFCgKKMrlbPHhSjq9XXlLTorlZXo3dfStKuWpuiuYf2E/LFYjqmQigMw9UMV7yD3jAtz5/AhSYkxzuPQkb7dhRozkTi82KbWCZciEQIxYGIgHJDbsk9kTCC4lAURVEURVEURVEURVEURVF+HX8Bf+j+GWBPtYcAAAAASUVORK5CYII=)| By connecting to the Hive Metastore, or a metadata service compatible with the Hive Metastore, Doris can automatically obtain Hudi's database and table information and perform data queries.| [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/hive-catalog)| Amazon Glue| | Using AWS Glue Catalog to access Iceberg tables or Hive tables through CREATE CATALOG.| [Documentation](/cloud/4.x/user-guide/lakehouse/metastores/aws-glue)| Apache Paimon| | Doris currently supports accessing Paimon table metadata through various metadata services and querying Paimon data.| [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/paimon-catalog)| Apache Hive| | By connecting to Hive Metastore or metadata services compatible with Hive Metastore, Doris can automatically retrieve Hive database and table information for data querying.| [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/hive-catalog)| BigQuery| ![BigQuery](/assets/images/google-bigquery-logo-icon-ff6d8331afc34882c370b0e6089ed461.png)| BigQuery Catalog uses the Trino Connector compatibility framework to access BigQuery tables through the BigQuery Connector.| [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/bigquery-catalog)| Apache Kudu| 
![Kudu](/assets/images/Apache-Kudu-logo-1a36a540e2fcc40ba59cf2d86354419b.png)| Kudu Catalog uses the Trino Connector compatibility framework to access Kudu tables through the Kudu Connector.| [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/kudu-catalog)| LakeSoul| | Doris supports accessing and reading LakeSoul table data using metadata stored in PostgreSQL.| [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/lakesoul-catalog)| MaxCompute| ![MaxCompute](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAMgAAADICAMAAACahl6sAAABWVBMVEVHcEz/aQD/agD/bAD/awD/agD/agD/agD/agD/agD/agD/agD/aAD/agD/agD/agD/agD/agD/agD/agD/agD/agD/agD/agD/aQD/agD/agD/bAD/awD/agD/aQD/aQD/aQD/agD/agD/agD/agD/awD/agD/agD/aQD/agD/bgD/agD/aAD/agD/bAD/aAD/awD/agD/awD/agD/aQD/agD/aQD/agD/agD/agD/aAD/agD/agD/agD/agD/agD/awD/agD/agD/agD/bAD/aQD/agD/agD/agD/agD/aQD/agD/aQD/agD/agD/aAD/agD/agD/agD/bQD/aQD/agD/aQD/agD/agD/agD/YgD/agD/aQD/gAD/aQD/agD/aQD/awD/VQD/agD/ZgD/agD/aQD/aQD/awD/agD/awD/aQD/agD/agD/agD/agD/awD/aQD/agD+BKpeAAAAcnRSTlMAM4AHmcPHtuaOanUWUtXpJLLuNJU9g6RAyvIqJs4ibWbZiFVlLZD2XKwY4R7bEDuSORKvRN2/m0rTMKl4oX26TFhyYEdovZ70ixr8SEVPLF7Blw6nsbR6bhQLKDcEitAfHAPkD+oFQV3rYgnMt/mGK3eZHc2oAAAHXklEQVR42u2d6VvbOBDGnZAQQoCEcIQ7HAlpSEIh5W65CpSznC2wUM5CC+32WP//H3aXpRAk2Rodfipl9X7FHvOzpfFoZuRYlpGRkZGRkZGRkZGRUdnrpNg11TLVVTzRmmJqptL+pWAkpivGs0eK/1T/VkeMlvc2rsSwdhzNNlnNmnEM2U4a0oqjYDuroBHHte2mqDYcY7a7ZjXhaLRpWtWCY86mq0UHkHoAyPsyeSC2/aw8HohtVyrPsWbDVKN8pAgEmVMd5CUQRHkPDORQfpJcQEGqFAcZhoLYiq99+8AgXWqDvAaDKB6lVINB/lAbZKRcJnsXFET5NUkbEKRfdZA/gSDVqoP0AkFGVAdJ22USxgOjxtfqg2xCOOKXGqx1W8vBZ0EX7Rda5IMAjqtXC5BPQTpJkxYkn+kgbVqAWCNVVJKXepBsdVJJtvUgKe7rviD5pXkqSKseIPTagj2jBcgk4LXY4P2/UbGYap4em25OLVZIStQNN+AgHz2miCVDJVcLJXk6FqbQKLGClGBZ9hLDF8CuF/AxWzkkLNEJHQR9nmG0B4ijOdDOaCdLCK7W4pjd9a9ecdQ5zMs6RpJr0q0nuOQ3HnG4uBgmkj1ygpTgkyNecDRJC1gX0Qfq7JR/yufooLj9DripH2hXkPOtmpTOMThKARkdBNsKOuUaDnCzjbJBXlDfxC+4X4ePqfdZ3OyEXA5IFRNaHY8h52WOHv6Uz2BWb+UGJZ0AkE5gwLKNnLdT8rcF3GxWJsgrUF7tFczYOHJauPSPy7jZpESQHhBID8xYwXVIjnpZh3sOzD0/hxg7Qs86frqix80GpYH4gSB+iDG0tJtD/k5o3PwhC6QOCFIHMRahZUx2cMOfJfkscA0T4rf2afHUOeFtKwekBgwC6eppoxZ1lrxaZHWAQQABF3a/i/gx73DLCzJAGsAggITBJqQ6heciVwY8D+AZ/e8qpPea4INDEkDSModWK2jMhHHbf4mD5CV2XmAeMA3j/UcfhEGuwCDs8y3ucByhSaJT/JFAOQCxRDd0FUNowAkLgyTktSImwCleQqJrUxSkFghyQzeVQ07ZcD4UT3RVXQmC7MVhzUl0X3/BEAu8xa/wTvSR7MoKfidYwkyCD14SBNnKyHkg1gxyzi5j9WH9WJDkVNJqBE2TpFyP7pKfRc1HpbTvnXxHTqJk8gk+2F//byCWCyZ6a7k66JeoIN2QqgQ6UlgXL0ihMdzuwXyvBxhZYB4olMnZ5t9jJvlGIwHkBBuZp9UGdXk9z0wyRDEJKPpNsr+q+6WMafc4ibmiPMDT0FSgkrBvJ52vE1uyL/IUbmN0x/+NPVbxt4kk5CMOhRF3Jekkuxz1t7BbM9+G+8lZvvbeEJ2EK3Spqe1NBHNEezn3M6Ocaz46SDQv9L7HX/cHboejfQL2J3m1gFMhkBvcYJGhMALtAPoIAMlsCZHMMpVlVvmWroDJzjff3QaLbbt8giLI16JRCQKJ7wmRhBmqPTV8BRzoOrtWLL6/hSfSRvh6ScNAkIQYCKF66VRwRzuZ9mFXyMpLqrnqDThDMM3XaNIJBRFMsfjA9aUevvYV6HYmOy/4SPqBDSQ/0cOOJFeZ0qK5ojhsI8s8XzH7AxhEuKP+C26TdLPRbQrjv6HKRBFehp0GxLA3MONS637s8x3fa1ts47uu3Eos82oer2W0cLp9ubVxdhd5iB5y6NIT5Cqp3QrsaQms1RVt+wF/Sktq/whV+H4j1Cm1wqPkp5La0cMz358aPoZGZPSIQKjHijmHiHYCoF9JYPiMltSuN7pWMNNPciTNtJ4gF78lsw+RZ76vlOYW6vkyQXeS2RnKtSIt6dm9yolEeDJ7dbnm+2PeqkloFXROKzNFzyWCWAGXBgK09HTGZnpCvKLBIrwok3KCZN0gHXHlkL2pAZ/vDzvTK0VTHhv8GWcO4fnt+zsfyzC3EaEadqqTFDz4AiphvvuIuakMh/Eth31RW5YHCjjUxf3ClZm7+yRlpxpM37FLbZNeBLzxhIy9g7zzPWrFWlpG5UWq4rs5YYJ9i8ZSXz4Ix6gGIFYAALKsA4i1DmiImtABpBsyuA50IAHN98rBMpnvwhUmVea7tPyNpwImcCKqc4BrTJtqc4yDs7VnSs94+Fcl1f6SFqAftURz6oL0snCAOjp/j9pXmEDUfSSNbBzKPpK8zSpFf9JmnhkkWR4jS862PA+UYwa5VZIjbbMrrSJIHwfIFxVBDjlATlUEOeUAGVcRJMEBklURZIwDZEzvpYjEvYVeKMIBElYRpJYDRMmVO88LUcnPln7lAFHzZxA53FaFkiDVzByKrqw6mEGW1ASxgqwgx4qC+Bk5lP31gcGeMgjiOab7jqWumGZJ
n8IgtfpnHu41BOZYUfwXgUO6v0MeVIBxaPDzHJpn4tlGV9jSQgc6O94nSrl3sjdZ2sjnXG6v8g9YOskBRTeMO5QZLGAJJfXDuNNaKnt2/53467PsQtHSWpdrDQ1rl5aRkZGRkZGRkZGRkdH/RH8DfK5abpw1wSwAAAAASUVORK5CYII=)| MaxCompute is an enterprise-level SaaS (Software as a Service) cloud data warehouse on Alibaba Cloud.| [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/maxcompute-catalog) ---|---|---|--- ## Observability​ Name| Logo| Description| Resources| Opentelemetry| ![Opentelemetry](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAOEAAADhCAMAAAAJbSJIAAAAt1BMVEX///9CXMf1qAD0ogD1pgA/Wsb1pAAyUMQ7V8b0oAA3VMX3+P31qQD858YwT8T5+v3n6vf//vpOZsqFk9j86crAx+opSsPj5vb60pT2ryb+9udMZMqTn9z968/+8+K9xOn73a72tT/4x3Vhdc54iNRtf9Hs7vn2sjP605abpt7GzOz5zor74bj3uErP1O9Xbcz4w2n616D3vFVwgtKnseKyu+Vec86Kl9n2rh3Z3fL4wGT3vFH4xHDIFbmOAAAJAklEQVR4nO2d6ULiOgBGW5qGtiJaF1RGEdARd0fHGbk47/9ctwWVrE1SQpbK9xMEezhZ2jRJg2CTTTbZhJvB8G78fGb7KNaX/ksEi6Sjge0jWVP6HdiaB0TN1NgHoPURAJuI2G99ATYTETHYzIKKGWwiIgXYNEQGYLMQ+x0GYJMQiUameYgcg81B5BqcI8ID28e3cioMLhB9t1hpcFFQ/bbI7CaaVBclAP22mFfXwQYg/oIygB4X1EEqB+ivxWdJhf5aHMsTempRhdBPi9cqhF5aPJNuaRaIHp6jjqS6Q58tDiIlQh8RtzeIFKJ/dXFbrbX5FhY9RNxYpBJt2z5k1Shb9LC52dTFJiB+A4ubrv8bWvwOnYbbFreudqjXGmRx62av3U1+3pKvN6brf2xncRjGSe83+Y56p+GkxcNewTdPtrdFvKdeFx1EXAKGYUIj+t/1H/ZCJBoQoWMWUYMcRMWC2kqdsogbnCP+WNWiU50GaVCTRXfqIm1wjhg2xiIbUItFN5obHmBj6iIfsCEWqwAbYbEaUAuiXYusbkJ7QbXZaYgM6rFo8RxVbFCPRWvT/GQM6rJoZbWNLKAWi1OnAXVYjO6NA54rAGqwCH6ZBlQxyEFUsghGhgHVDGpABBOzgKoGOYgKBRW8GAVUNzhHXKW5gc8mAesYXFgkx8PlLcLcIGA9gwuLFKKkxXRoELCuwYVFsqAeSFmMjj0BZBXUAwmL0Z1BwPpFlGtRiOiRQTaiqC56ZXCBqFZQvQNkdhoViB4Cqlk0CnirCbBAfKIQOZ2GUcArqSELSUSq6z9gzu83ChicZtoAZS2aBQwSfQpDZl2E5Ax/w4BXXZ2ATItEc2O0oy+yq5mQiQjsGSwctjUTChBNGyyyp7UeChBBah4weNXZln4gXpD/ZHASgSIQmB88LBJqlxh2/1D/5X7W6Tw8m7yiX+ZKb39RJt6zQsLNrn7E7q5tKDz6EduPtpmIaEfsukYY7GZ6EbtXtomo6LUYP9nmYUSrxe65bRxWNFrMLm3DsKPNYkad0rgSTRbdBdRk0WVALYhuAwbBEa+gxnFSJhb9Aq4DFhbbFEOctLu9+OL35eXl74u9bred8eur+4CUxTjr/jg9/w+5pN29ff2ZtBM24E97B87I0enfcO/ykHp5aTFOek+vR4yP7txedhlV1jHAm14WF7Wr+0SeQn5ZzNpv/IugnT9P3dhpwP3PMbYkoRDnFpPeDXm/jMgjzuga4HKILU5IU0WnEffeBHxlzsPMA8ASkbYY/if3RW+fdz1cBmRZPKLXU3LyGCfuA7Isymfrb+Y+YHnKssJF+X47o9ZeWg0DcEXEN+cNrozoVDiAYsQ8H/RzO2PWSuECViHm9+PRFEZFWpPxsG/0gFVTAchtUbdnKYQfm+sCAGE6tXT7QSaVgKx+MQiG04i8Sw0gOHaUUQBYEJId/dmEugs/j6XbZKKIAOmx+GvK35fHyPgke3GEgAk5zPlSNTENnrjW5AgBwx7R0Iyqd/MEHbcQxYDZKf6Jd9F2paDlUnsjBiTvF0ls/Gx6sURVJACJWjiUmagN/1nioSIBGHaxa94+ZbCcS0G1rKkjnYYMYPwD+8gDDgNg1Bn9ey9O3YjXO05URRnAMLtBP0LM7o2mz4t2M78f4YzQ9BwuVqQAiUKK7RgMsKU721OsABtd9MKOHGCcoZ/Bdn0GHXwpa/6CIppduMSKHCDRkqJ7kzOq2gP6tu0eQxIwzF7RT6GNJmMxcj5F3k/tPhpQFhCf4nOGtDOQNZkQLcVwaIiFGWlA/JwUeQgCAMyWBOlMwMwQDCvygHGGXhn+Wh4/HDO/+R75DWwsOP+IPCDR3yOGONtz5GhbY62/UAAkJjEhj5NJOYd/gvwIti6iVADD+C/6UYQQcr79HZm5bYlQCTCMsfvvSDPS4Xz9DCG0012oARKEiMOI8/3WHSoCEqUU6dBTzuFPLROqAhITs1FB7AvAvuW2VBmQGClFTksB+yp+iPzFiRkoNJc1lsFg5zTI8XP6AqSQcn6DdUbdYEgMBg9QQtbIL/YTDA1xfaWOweLaAlsDgl070DURG8ThtUVrSy2DxfXhPvoldygC9Wgq7OIJPBiEK1MTMIxD9FuwhwIBoj0doIAtaHi07bX2akJ8lcsLNtwUzZYlMT/GhqK4Jz1ryk79GbB4RSQeBAii2f2g6Pb622OAD6SabmcO6y+XxM/bgn/EiDCIYNSCRYjxUtOjNKusy+5hxTQXPTN2EeODNPvsia1SwYeE5bYJYg7irDU3KzjEBzKKHkO8wQx8Nw0Y3K6wMDvOiGkKYxEitDFUKpxLzwek52EIEOHIxgBN7W1KWBNNguu0ormJLI0intYrp0zAIDho8W4Eg8ja/YpaiFQd/Ew+ZmoE0cjiWH4NRI7BeQaziJw0BKOJ3Vu/yohcg4sM7lrziW2LwAj+s/4cXEXEKoMfGQzHDyedIpPZs3W8MkqIAoNIcoemlyogShh0MtKIvgJKI8oXUfciheivwTISiHHbZ0AJRL8NlhEgxm3W4km/UonoocE+3R9XIMaZZwaHkzRNJ0PyZS6ib91E/jAfpgXRO+mRg+ibwfzk80IVnkgh+geIzJ2QQYwTfwHLATAhom/dBA4oYdFvgxIWfTcotOhzI1Np8XPZfNIAg3NE6tkmf7qcXTDcDg+QdVNo6/Ui3Nt3bis8PAezaWeErN7kA/LnFDqdu3Q+Xvm1NqAKkJ5A4UGuP25ZArBArAS0PL+8VpbzjgEsEasBbTwL
a9Wgc+UKiwJA26sg6gRbxgLPJtWA1ley1MgYX08lmCnhY0MzVHnWl9GHDOlKLl6W6zeg5MJcnwGD4NjF52DpzbGURY8B5RC9BixPTYWA1hetrhiRRc8Nlqlubrw3WKbKYgMMluFbbITBMjyLjQHkWWwQINtiowBZiPYmQ64pJKL5KeVrzzWBCPwbtRCFaG5AExGJhTuwgYjfz6JT26fpCWGRs0uH18EtAmD7eNYQvNMwviTXRDCLjSRELYKW7YNZT5bNjRO7w60jnwXVkR3+1pHnqFzRAqeNrIWL9O9G05eh7aPYZJNNvM7/6YK8aWTv89QAAAAASUVORK5CYII=)| Using AWS Glue Catalog to access Iceberg tables or Hive tables through CREATE CATALOG.| Documentation| Logstash| | Logstash is a log ETL framework (collect, preprocess, send to storage systems) that supports custom output plugins to write data into storage systems.| [Documentation](/cloud/4.x/ecosystem/observability/logstash)| Beats| | Doris supports accessing Iceberg table data through various metadata services. In addition to reading data, Doris also supports writing to Iceberg tables.| [Documentation](/cloud/4.x/ecosystem/observability/beats)| Fluentbit| ![Fluentbit](data:image/webp;base64,UklGRqoPAABXRUJQVlA4TJ0PAAAvx4AiEMcHOQDbthEAqqQskNzNcPs/bpWb4F69WaSwBizZdtNI70mye5jpd/Y7+5sNMDM5lh4kAW4cKSUPOecr/+PvcNoEtCRI2rYdW67ftv3/YWqNvAvXChq5NtDRrDXYrqFt27bteD7/+aT/n7gYr7I/o3fE6fvQXEydHblRHoTGIcQbOcrU6vuFU5f4N/osRp6KFGLX5cMypTrO/w78LmAmaskf6X6Hz4dXso1lBP/zJFg0sLNtr+MrRrp8/S+zYJkpBqpX6bEmQ5kphrKw0S+j0m2hYGCiEKx87UC6FfrHmagPPql5IMdISGgrwNBKSEISASEMgYEJgUEAYQgIlaCLKCEBObQa/v/yd1TuTnVxfuV2Pfz8upil9r000vcpxv8lWht3bbDmCrlZxr8FrrwAN8rp/PTkCr9xixAJTfwTCAQoz3fWtg1opchFxcpLL1fz10/Xov0/ltt7vniOax7Y//RyA+sNzGaEaYLdDMwwDNzQAqzICz0wgrwwAiMQfWWPSu4bQPfwhMoL0A965YygKABfDEtBdhUJNhoWLUugDhkx0KpVN2pG3SiCxZVa1LIq2b1BahCWVYJKra17gvm/p5tvpypwz/D332PT+jlvwrJF2i+LuLb0eXWnPv4TrrbSGJlfr9cZ9XVpM+nNSRU+HvL9c728+VH2xv6SS4u5Jf96n0qngd+m4LSTIotlJYJIAeUGnP2YiVWJmCEASFCBCihOgK6wSABQCQOqCLARem2c8wIfXoX/6+Km7Hh63UEy9iV++Tt7b6CdeeHnsqvzXNY/dWdtKZ4it+KcI48A2iOz5iZLZgaKudPv/66wO558m/w94vrxsiPK+i5nnxO3t4d+Ds9Os78qPgw6r+J2jsX73kP4a92uPF18BYyjIG0DpvVve8dCREwAt8EZa9e2LU9jVaHu7o5PjZK21J0ilXEXSupMt+9fNJLOt71WsEIl1e1CJgmW4DYbAtkD6/36PPfzoHnn017ruiL6D0GS3LjNAtgFCJLOLVrRCyjbtq26jf5KfxKGwlaYEzMzi8xsgZ/YMEawy8ycfjFXSc3cu88VNG639onov2tJUuNmJRbBgp7sxEclH+D8f4vl/VIg0dPdg9AvwNRwVz6PHO7qrXT9bu3hI8j38HBHd23I3b3nsCf/CEl62qO94Tyfe1PvYZGRpJrek8hzbyiiLXnW8si1p3Tmm3L0yHEcw7OmYMjvyt7tMQM4DpzAyfZ3g658MVMNGDgucfpUY0vQ78Jlhyd/DLSeOHnqbONgpfsO9niEHD1uFjgF4lxDtR+uO9lzxKatONF68hR5+uyVgfqQy3qLgnYKPx5rhcaps+euvp1w18FHauOjFIpY1skTIkieRcnVqlFXvQZ3eXRNkasQpILkJZddVNvuMTpQ/ZiFzpzF+ZJt71a654RvulRb6gnlgIgzROTsOZRsvVY/6nfNdr/nyHEtseOtstYRwVnyHC9s3fh20DXTLg/zbYWOU5HYGVA4S9ZfGwrnuaS0e9p0JmtTQZLnSZZs2fCOS24Z7vYIP2EqSVmcj1w4f+Fcydb1q6rDPjfcWAntnVjktGJqnOcF8tLWzRveTrjiJi2eNp44rnIIyojEeIZGcUH4uvXXRvzueLpjW6sgoaJj1ORFaMfK4cok+skUNkU3OtrIMToxSo9cFAm2rt/MlODoFN5lDAHwTwL8AEJAZUJbOOSfAonmK5cdce0FDboYgbAClW5eO+T4pyQLPerpbo5ioKulqrbS8Y/vY/f7VM+pjA8Np74zi3Nma5vFwSl5Q+rpazshSDjk7AuR8xd3a8Wli8XrgJUtU3AJC/d2dfbnMJvMZAbT0NBUHczzObT9jPid0YTa752UtSuXkZyxdMY8aJuJ1PCUnIr60KYgp1lmXjSoANQLVq9PiU92Vh3tz9mfvTcbmUB6BtO5Z3da40BVvDLs+AHHXxmMD759fe3KDWu0LSWwjDNnz0lJHRqJJ6YggIfPPNSulBGzw4iOCypTQdVcTa6qn9RmsKk/dz/3qhCeka6BXdiZdm3g7aohvKvmbti0Gdi0RmDZylkpg0Nx49oyOjV/AnoOx/qAmB6j0VGqakkkL3OrwvKVg+FJg/9RxwHul8gSkZG+R4PcSXI7gE07NkFhDbhh7ZzUwZHKsIOpPrMw2gftUH3ZEXQetKDg8pbVJJelVk7S4Df1nQdzKdV0C3Zt19AZNm/auDZleCReGcLUjwDw1bYL17D0RQEUGCi8vHk1V3BZyiT5o85cQKtJ6WlGLZQ7sF3HxuvDI4kQMDqdbhH0oa6vTqU4K0mE1VnIzSsALF81Gdv7Eh25EiJUJy3a1JCmfBdJnWHTDmy8PhgPhziK6XWaiNYpj9XFjNmCbDULCwrJ4tUrloNL1g5NwmfRfNASCsjMaXroVDbv1rWRY9vbtdPyDt03te2a8LhOqsj3yByqI6ioeMXSFUuWz185FJroJ1HdfygXB2DUyMxJq/U5GEmzzhyI+6btObWvDlCOmMrOy2VCzQ6LCrF5xZIVS7B82fDEXjEv2HkIkmSouvYbx1edpvoUeq16Gh8LH0b7JJ7E5CQaNchCrZcL
i4DVioD5TJ3QefWbpoPMPWQrVJ0RDX4TbxBKlQ888k3rf/C161AeeKxc51aiUkVqaAl0gpQJ3CHMU2Oa9h0wA0hnYwPUSrn67Urf9H4H1PNUQVhA5LtNumy6ghoSNn9ucLz+TaLjoMABs6ORk05jNbc3+Tjd3wFFdaEp8ITGJHO1GhMhBfOXzF+4uH50nB99k5qmY5/Z0SC521g5kARvz394Jv0JGUBkjCksLCgSji1SVTV7aFw++ie1owZUbSl0Trm64WEy3PjpVUOBx0+UAKhRsKjoSF8hCVgwc3Bcp9LOgxJK7cE9Uhvjo0nx3rpFz3mCgEhYYzudkpSkREL4cGjMHSqjRgCyk2UpxOy0quS458OH0YASSXipBi1UZHS0mISFTHXG2D7UdDDXUE3ZmqyBnQPfONKSYI5w017WXLwkLluFNsJmoaaHrO7r6T90UAW07tsryFJobEuOMQDIU3MCdYbyBV6y5lKBbUwAS+SYsIUpCZ8lr2pXc2W9D9YOjNVp7zpIGnzznnJpzwN4QepEo9DmK4Qs0LSIC2cF/ZbjnOhq2pc71kzhO69VOknk4WYbPWeAyllTUWCnzcYYsAgk5xjpe/0HhULUBmVmZWaY+uGQL6lubD786WWdJECpTklWyF2lKVogEh3k4hEnT98GEJnuGIFs0Uk3qOh2ZZLdav7hmUUh4hVf8jV/rCg0CHK1GVy4eKSyuiNXhQUHrDXkWOHn8W+c5DJffbvsvIQioa+FvfmxQmKzIJvzZoP+80jZMVbLjhEE7r2bbAGoDbTiORkgX7wiX7988/oN8Ld/4e8/VrB4s4XAhYt+ZhnLPeKiK/tCaamFF3x2P+FLOlCdJ7S8DBj6So++VhAD/+T75f8T9jNZJs1bHhP7mnsZsy2HiILPvFUOkHz+TfNfxXWLulRKQAegAsC/794p//LeV17yhgHgXo1syPc8xmEIZtzz3vomSf+LEX2q8ueSXovhv2kGIPElya9YegM6KS3TOz3YJycZCtsNs4p73nsjTnLCF/xJ7SF2tYr1gh49NeFxPWbtewGVKABqcN8hUynv0Ah9vxTeVF/S/qPvh7+8CBB2lWTnPVNs5OPHLAQZrAfLa3J15NrfglZ8WQbcTyTzA7nP/hqwMSr3iCqUGHz9jcIKZSaLl0H5g7f26Y4xO/KgvAxe8L2k/odg7TOVWScVDB9j2kLtLgdvoLT8QU0MWdmxzJqK9++U3lC/tbiVcJIZvvr/OpR7jk6jGFiv0uwThJXhHkEIu4FSL9Wa5PbaX+BL6zbGrvD5sa9puwFjs3QAXvkFLSNvh5xkt5dmCnpEkXuqfHHsa0f1eBgibuhxzaW4mXAAN0jvqo1RecYnJA8BvkMaJ1u/ml4O+1zh0bAH4lsL/SOr8PgxVMSYyHaifyvhCgi9fPDwtugU3ZRKQwAvOOR34Ar+6ueHt+VFBAYvXF++Ex8ohfd22IFL+F+fHoICeGoVsWsjgNKym3EHruFXP96XOz+ljSkxCQNevut3l0e+7z98dA9i8qnd1SqNQKk+0LkLgpcPRDdcnH6DqH3Y+l4Ouc5zbtlPD4k3by9kXAdsqta40PNhe+8/ewhe7lr/UHoz6EpP7N38+SHo9mKoERD6nuNKFlzd84u9VcBdG7GG91Z41HGp5PdPhxsIeG2XLBfaItd6GoPvGuE/q3Kxp0JvFTs7Wk4fa4zivftOyLWyIJPv6FQ1ixPfse71V59eCzruZYVOkh3tqq2XT9W5+vLuR5/0utij0rlStzFtb2Mr2XS69sRXd+5+hA8HXezR9XCxWy2pQQubqU2Xvy6/5zVPDhJf3vn4ow8AvBN2rzWZYm93l4O+QLVJ/8g/4Ddfk+XS3hem46eEi71WvrdbFdLGtZBgvyf5u99o+/UDkjqjovvutUPwtgBmXZtxkqCCVBmgQ4C8Wu9ikwo9qoJUHYEgBRqgwIfv+d1L0/2ATizBwqYaiB+8Dx1u9oIDPX29BrBow0ZiYPEHhnz4bsi98lx/H3tUtQbSS1ccRKw8sKfir5EeGOwzEwjSUAJhDXTWX+vAqEMGPXQTDSmQxTmlwIHP/lKT6f6hQRfhrPbF5So867Hd3hbGh0cA2AglbhIbAQ7s9ddPAzObxhzU0wkkQFvoTJ+9cC1Z8Zblxzk6pDUSm9xk5slb3lq9CafGxzhiYlUoqZrGRjHP7mRD3rqUS09OjOnoCN1k0aZY4iDjL+xe9dfn/GZ0gmOjAHdjh4OcFxIXrr3xloVTkSkDM2sgmXZcoF7YuVXxlZpJRyNTagngmg44v7gufPTX3Vo5FeOUIsk2gqnrQqJJCDs/JCu+yiaisXgECYTqpEiA0p6Mty6AwsuxuEYAWIiupzo1G1faveWrbPB2JRpX1bqIjkbu+Os/CpKJjekZVUEq2nSwz+FcxJ3tNyH11MVoeHl+xpiCm5KNQ9UvtO2itLN91VOZIFm2K4GICKpeSHQKV9r6IRlSPyEZXplfmCVhQppILoiwo1QsZP11e5NYXlrkHGAaIQLFEke6t7i1nfHXe4jExtKqMdXaSAWxt7t3YCv3ylfHMZPllfVV0oBzDidIpoH6i5u5q17LrK1SQBCYIo0bNzzSz7ymb2VCFfrsfPwC2sUCdW5hzpmOGRon+ycH8oVcOPOmElJ6DIGSmfKKLm+sL9G0qZgxMjqZ4nIhnchUGKhT81opmWQmW2ZYEzjCiXC2nOHVt/BsnD4dgSpJKv97A0P8Mw7/72EAAA==)| Doris currently supports accessing Paimon table metadata through various metadata services and querying Paimon data.| [Documentation](/cloud/4.x/ecosystem/observability/fluentbit) ---|---|---|--- ## Data Processing​ Name| Logo| Description| Resources| Apache Spark| Apache Spark logo| Spark Doris Connector can support reading data stored in Doris and writing data to Doris through Spark.| [GitHub](https://github.com/apache/doris-spark- connector) [Documentation](/cloud/4.x/integration/data-processing/spark-doris-connector)| Apache Flink| ![flink](/assets/images/flink-b58e6d26344e5697b440d497b9368294.png)| The Flink Doris Connector is used to read from and write data to a Doris cluster through Flink.| [GitHub](https://github.com/apache/doris-flink-connector) 
[Documentation](/cloud/4.x/integration/data-processing/flink-doris-connector)| dbt| dbt| The dbt-doris adapter is developed based on dbt-core and relies on the mysql-connector-python driver to convert data to doris.| [Documentation](/cloud/4.x/integration/data-processing/dbt-doris-adapter) ---|---|---|--- ## BI​ Name| Logo| Description| Resources| Tableau| ![tableau](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAgMAAAHxCAMAAADHt0/qAAAAAXNSR0IArs4c6QAAAKJlWElmTU0AKgAAAAgABgESAAMAAAABAAEAAAEaAAUAAAABAAAAVgEbAAUAAAABAAAAXgEoAAMAAAABAAIAAAExAAIAAAARAAAAZodpAAQAAAABAAAAeAAAAAAAAAEsAAAAAQAAASwAAAABQWRvYmUgSW1hZ2VSZWFkeQAAAAOgAQADAAAAAQABAACgAgAEAAAAAQAAAgOgAwAEAAAAAQAAAfEAAAAAGsyxxAAAAAlwSFlzAAAuIwAALiMBeKU/dgAAActpVFh0WE1MOmNvbS5hZG9iZS54bXAAAAAAADx4OnhtcG1ldGEgeG1sbnM6eD0iYWRvYmU6bnM6bWV0YS8iIHg6eG1wdGs9IlhNUCBDb3JlIDYuMC4wIj4KICAgPHJkZjpSREYgeG1sbnM6cmRmPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5LzAyLzIyLXJkZi1zeW50YXgtbnMjIj4KICAgICAgPHJkZjpEZXNjcmlwdGlvbiByZGY6YWJvdXQ9IiIKICAgICAgICAgICAgeG1sbnM6dGlmZj0iaHR0cDovL25zLmFkb2JlLmNvbS90aWZmLzEuMC8iCiAgICAgICAgICAgIHhtbG5zOnhtcD0iaHR0cDovL25zLmFkb2JlLmNvbS94YXAvMS4wLyI+CiAgICAgICAgIDx0aWZmOk9yaWVudGF0aW9uPjE8L3RpZmY6T3JpZW50YXRpb24+CiAgICAgICAgIDx4bXA6Q3JlYXRvclRvb2w+QWRvYmUgSW1hZ2VSZWFkeTwveG1wOkNyZWF0b3JUb29sPgogICAgICA8L3JkZjpEZXNjcmlwdGlvbj4KICAgPC9yZGY6UkRGPgo8L3g6eG1wbWV0YT4KUVd6EgAAALdQTFRFAAAAcJ+fcJen548wIEWAWoWacJqlxSA1IESAWIebXGSPcJenxyA053gs648sIEV9WmWSWoeacJqlxyA153Ut6pItcJmmIER+WoebXGaRcJmlxyA253Ys65EsIEN+WmWRxiA17JEscJmm6pEtH0R+WYebXGWRcJmmxyA16HYs65EsH0N+WYeaxyA16HYs65EsH0R+WYebW2WRxyA2W2WR6HYsH0R+WYebW2WRcJmmxyA16HYs65EskyvZSQAAADZ0Uk5TABAgIDAwMDBAQEBAQEBAYGBgYGBgYHCAgICAgICAkJCQkKCgwMDAwMDAwNDQ0NDQ4ODg4PDwwxFv/AAABYJJREFUeNrt3ElPFEEYgOGRKIq4IgRxw8QlnriqiX/fePIyB1EjkdHELWZcYGAAt+jJdqqZkupu26nnuX9fT2pe6blYU50Mzd4t6uRtqkNHA2gADaABNIAG0AAaQANoAA2gATSABtAAGkADaAANoAE0gAbQABpAA2gADaABNIAG0AAaQANoAA2gATSABtAAGkADaAANoAE0MDHuxlo56OAFDaCBdKfuBWV15Dduh/k7gAbQABpAA2gADaABNIAG0AAaqEA31tpBB99ooOUG0Q46+E0DaAANoAE0gAbQABpAA2gADaABNIAG0AAaQANoAA2gATSABtAAGkADaAANoAE0gAbQABpAA2gADaABNIAG0AAaQANoAA2gATQwKbb+uI5cA/n5vv815hpAA2gADaABNIAG0MAYl5MsNLi0PmfPpzhRsjVp6fkmG1hOslTH0sYbmE/6uNO1nKx3ARpAA2gADaABNIAG0AAaQANoAA2gATSABtAAGqDJBvpJNupYuu1rbLaBR0meN7i0PuvdFGX/xTVpabfJBuh09rZT7JRsTVq6rQE0gAbQABpAA2gADaABNIAG0AAaQANoAA2gATSABtAAGog1f6nUvAbysNCiG281gAbQABpAA2gADaABNIAG0AAaQANoAA2gATSABtAAGkADaAANoAE0gAbQABpAA2gADaABEt28M2pJA2iAmhuYulAw6+zya2CmeH3PaWfnXYAG0AAaQANoAA2ggQMZ9IKyOvL3vZYdQsMNfH0ZlFUDH96EeRegATSABtAAGkADaAANoAE0gAbQABpAA2gADaABNIAG0AAaQANoAA2gATSABtDAWCercKR8/5FKHqCBOq1UYZ/v6GQlD9AAGkADaIBUd6LNBaaXo6d/v+V+ajZa8XHHoucO+WZb/ncg+hfvteLgYvTgcafsXYAG0AAaQANoAA2gATSABtAAGkADtLyB3nhr3Z/WioNvu6MePggZOOWWN/BqvHeDX4qDw8GoXefpXUCqXrTQP7h+9PQnDbTW62ibgemN6OktDaAB/ssG+lXY51frbiUP0ECdVquwWb5/s5IHaAANoAE0gAbQABpAA2gADaABNIAG0AAaQANoAA2gATSABtAAGkADaAANoAE0MNatsKs5HfmV62G5NDAXdiynBqZLDsG7AA2gATSABtAAGkAD/AcN7BUvYRs6u/wa2H1S0Hd23gVoAA3wb70bcyO9BibfxzE30msADaABNIAG0AAaQANoAA2gATSABtAAGkADaAANoAE0gAbQABpAA2gADaABNIAG0AAaQAMlVu+XWtUAGkADaAANoAE0gAbQABpAA2gADaABNIAG0AAaQANoINbFJCVLZ5OWnmv60E4vpjhRsjVp6WKTDSwnKVl6NGnpQtMNLCR93OkmT9a7AA2gATSABtAAGkADaAANoAE0gAbQABpAA2gADaABammgm6Rk6UbS0udNH9p60scdNHmytTSwmaRk6V7S0p2mG/gyTFH2cZOWDr0L0AAaQANoAA38jeWbo85oICtH50bNaAANoAE0gAbQABpAA2gADaABNIAG0AAaQANoAA2gATSABtAAGkADaAANoAE0gAbQABpAA2gADaABNIAG0AAaQANoAA2gATSABtDABAjdSB92KjC9FD19RgOtFbqRPuxwYHo6enpGA2gADaABNIAG0AAaQANoAA2ggUT9sGFOR75bcgi5NPA4bD2nBl48C/MuQANoAA2k2unH+vIXv9NGbWugtXpPY30OTL+Onv6oATSABtAAGkADaAANoAE0gAbQABpAA2gADaABNIAG0AAaQANoAA2gATSABtAAGkADaAANoAE0gAbQABpAA2gADaABNIA
G0AAaQAMTYCePG+nj/QDIBPZcEoLBMwAAAABJRU5ErkJggg==)| Interactive data visualization software focused on business intelligence| [Documentation](/cloud/4.x/integration/bi/tableau)| Power BI| ![powerbi](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAEAAAABACAYAAACqaXHeAAAABGdBTUEAALGPC/xhBQAAACBjSFJNAAB6JgAAgIQAAPoAAACA6AAAdTAAAOpgAAA6mAAAF3CculE8AAAABmJLR0QA/wD/AP+gvaeTAAADbElEQVR42u1au24TURScXRLbcYLych4EEBShIBIKlqhoQBES/AD5CcQHpIsQNP4NuiBaBBbuaNIATWqKPCA2FIjXrr33DIWdKIXz2Lt79pFkpZFLa8/OzJlz7nWRkaf5dmq+Va80WvWK36pXGBF+q15pNN9OzR/3vy4y83CNxBKJAglERIHEEsC1XBTg65vZKZJVChEryOq3dzPTmS8A6ZUpgAbge8O5kEAMtO+L3HhA7PTvIT8FONMM8HT0TzmXADwvLzmASsgDAzycdQl4513gPAeoSSBDJugcBr8NR4Q4CqQdUmVAc2N15OfG6sTW+srk1vrKxGEoTS+MwS2hH4gBiADGANJD5k2ws/38rr/5bGN01P1VGnV/TF0e+n4U5hYffbpy7wn64er9p5iuLsMZvNgtgoQrgpd0AfhltWSk85owC6RBZIAojs9hcuFB9+UFEHZ/M2mCHcfchJjZfb7GhNLYpe5LG4KGvSqcAH7CBfCCoEwxiBsgQHH3v7xktg0GAUhRwZ4E9vR9Ei9IwQQ7iJv+e6AIKISYjCfBWMyvL/KQBINAkQE5SIIB0KOqAog8JMEAoFFBV9fhGOAnXoBOBxptkLYSSGUW2I9sMcPKBf3kJaDWBXIxDgeBngSQl4WIqglmfSUWdPTaYBom6G3WbrS3aw1/p+b7OzUeh+Frdz4MzVXRH7dRnJyH4w7YBSEbD4zqAY5rXhFmCTSF6DQWuIUyCuPXkzPBKAz4vftiBjSLcWvZHSzCgWMRhS1OhqIU4MI/Kau1NMexGIkT7wK+nqObIC9JUHRgMxHywKEfD/wehcgrMS0GwGIrlDwDPLUCWLVB0yVPohLQ2+6EB9I5GdLyALFsg6dIAnYboXCIfkNEqwDIwzDkeYqTnYTOAZJGDtAzNYvFSTr7AE0TDAuGh5fVKGwhgXQORtS6QPj1OZnGLKDFAIbfCB1MgUgsB2h5AMWiaCkkQb0VdzLjcGaDkNVaPHkTVIzCFp/TLgl6EQrga5qgOeMbIavCIenb4r7ebG8ThCS8DDK7EuvO+OGi8KlaiXVdPRd3hBSDkM1GiEkfjmp2gQQORiJdkTGu/1cvCVoGoZAeUAT+WBdg5Nb6Lmg+65igJHE2+PHhSzQjSUAYPAalAZp2vBIIrE3wBAdCbRDvAwfLx73ff820v44DcBl+AAAAJXRFWHRkYXRlOmNyZWF0ZQAyMDIxLTAxLTEyVDExOjA1OjE0KzAwOjAwUgW7cAAAACV0RVh0ZGF0ZTptb2RpZnkAMjAyMS0wMS0xMlQxMTowNToxNCswMDowMCNYA8wAAAAASUVORK5CYII=)| Microsoft Power BI is an interactive data visualization software product developed by Microsoft with a primary focus on business intelligence.| [Documentation](/cloud/4.x/integration/bi/powerbi)| QuickSight| | Amazon QuickSight powers data-driven organizations with unified business intelligence (BI).| [Documentation](/cloud/4.x/integration/bi/quicksight)| Apache Superset| | Apache Superset is an open-source data exploration platform. 
It supports a rich variety of data source connections and numerous visualization methods.| [Documentation](/cloud/4.x/integration/bi/apache-superset)| FineBI| ![finebi](data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMUFRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBT/wgARCADIAMgDASIAAhEBAxEB/8QAHAABAAIDAQEBAAAAAAAAAAAAAAYHAQUIBAID/8QAGwEBAAMBAQEBAAAAAAAAAAAAAAUGBwQDAgH/2gAMAwEAAhADEAAAAeqQAAAAPmL0XNQPTH68mTjvj78eb01a2g/QAAAAAAAGM/JzVofX5NfxcPfnuG1KWunMNWCImgAAAAAABgzqnPM1BeXXbK07rRab/S05twyHxKyhaEYefrkAAAAADGdN9fFf1vq2pZL7f01tx+frMZEZlqgfHoj0gpaSrf4fXrWTObF39MXNWdKyI+wAAAAI1sOeZyA0rYSC/Z3Drz0dtVG5fqKncQMU5cmq74Gp/nZfnZc4890c/wBkx89OBW9HAARvacyz1em33Xa50bc+fXdF83X6JDlnWlh8enkqnSQa8UGdzyiEnF9aZhE3zrS2MvH3itUdA13PUTx+OLyiVq+1n9GyHgnLZYVvR4TQPTvMt8z34FqqOOiudulancJAKLoD5+vh+crebb6jYMWGPbxsu66rtTMNVCIm1WWnXUnWdBcVAz+Srnqrr3ejsh7ifSnbDmBz178/NmekE/XeV9tdvPlgrnTfv5qmNbs9w0n4YRLQyTT2z/P25m3HQPxwddDW/Ao910+/VboG9SioEkn6HrfPd/6RVpp+zdpngnQ4poABDpjoenl5oYzruM4tCr7kgrBaOTNtQA/KkrxpKeonjxJfmXqWnuqmbrhblkQl2AAAAYyKIgPWfhtNR5bkF1UBP17pr3813NUblLkDlsdJ++HTF5/nPub3xPUWMTIgL2Hn1AAAAAAIxJ8enlyr5Nh4NdxnGx177+emt3TdyZVrgcEiAAAAAAAAxkcv6q9aQ0/KPxY30lGTm39XtMq1wOPuAAAAAAAAAa4/fnWb0+vn9x8egAAAAH//xAAmEAABBAEDBAMBAQEAAAAAAAADAQIEBQYAEBESIDBAFBUxEyIy/9oACAEBAAEFAvHzxpDMX1728ZTgm20qwcjnNWoymRCeA7ZA/UvJizrTfCZqvD6a/kwail74ONVk+nY2Aa2PYTPnzNhCccmPVf1kP0edWNiKtj2tqW2kQYBbA7sKC4LMG/1W0Eat9K/ytwSutJj3SZp5moEAtieoqB1gOy2tW1wz2sqQ6PbSo7qq0bYC8dvIWLW74QrXR+26MprKoq/sHW9X9c6jMorBPHkBwiq98OhOjwe2+jKCwrrMlc6xsyWT6VrVsE8VhYirY9raltpEGCWwM7FZ7XVWHKhBjQbe2fAHPEfHZInMx+U5HseAlLdf377a8j1LD5rMe6Pm0pjrW1NbSIMAtgeoqBVgN5MkcUM7N3dTczntdUZWGwdz2XFO2axzHgIDJnijOySWqxMoXqCZh2any2wYkqUSbI3oa8ESH2ZZaumTtvzWLWq2ELsygYmo1quUWOyCMlRCQyY/PUEjWYKqUvZjCqtRuv5KVVk74Oq/K7MpVfm0LGvnprKGN+JEXiSP/i2h/PryMcJ+9G4K127l4S2UTrLfCgsGLsypGaCZ0cosqag7KzJYkpYiyZiJwmr3GWWSnx+fHdHx6dIWXEJCNTXJKg8KaKcDTnI1MkyRZionUrsbntE2lnPV2OniCjyHxS1doyeLa1tWV4jGfJLHqZEkS10lqxaKQd0GAyELfhNcJq7pR2wJcQkE9NdFqDOzsvVcZSWzAicrjmN/w10px0pp40I24p1hujyHxSuyp3SmVO6DnfILT1Cy3DGg29Ka48GSVgJkHfD60J9InHYRiEbZhZHnRohJbpMQkR9WFkiYIaDb4rwSmqt8HYqRu23YrLGgsAxFyCwFMfTsV9gnjVOUv8bLEN+aVqt1TXJKg8KaKcDsvan5iEE4ThieV1HU/ETyK3nToIHLcUIbKPLiEgnprktQd+UV4xNzKvc6LODNZsSMMmmRRj9G6pB2wJUUkORtCnGrzVNkyziemv5YGWRO3wc6oT01/LWMsSx3wiMvPqZLQfZMKF8d+qynPZkr4LIEb1ZMAEtGY9AYowsEnj//xAA2EQABAwIDBAYIBwEAAAAAAAABAgMEAAUREiEGEDFBExQgMDJRFRYiI2FxofAzQoGRsdHhwf/aAAgBAwEBPwHsz7uzBOTxKqPtC04rK6nL9a48O7kKUt1Sl8cd1nUpcFsq+9e6udzRBRlTqs0xb5VwCnkj/aj7PyVq997IpttLKA2jgO4WoISVHlUq4yJS8ylfpVuYE+UEPKpCEtpCEDADdfrgu3xczfiVoK6C6Jj+ks5w+etWSeq4xA454hoe1dbi3EbLfFR5U1CkPJztoJFWW1uR1dO+MDyG+8270lF6NPiGooxL10XUihWTy5fv/tbM3FpgGC6Mqsf3Pl8+xdrl1FIS34jXpabmzdIat0RV0fK3lfOkIS2kIQMAN1zvTvSFmOcAOdR71LZVipWYfGmHkyG0uo4HdtHamltGcj2VDj8f9qPMvVwGEdROXy0+tWm+SUSRDn89PiD8d20bSg+lzkRu2bHvlnHlveGVxQx57rMnLBb1+8d20aM9tX8MP5rZ27xIsYsPnKccfnUp5N2u6DGGmI+nPc+w3JR0boxFersXNjmOFTYT1peDzR05H/h+9aj32MtoKeOCqut56YdBFOnM1BsAW1mk6E/SlbORy2pIUcTzqHMk7PSTHkD2PvUV6w23HDpPof6qVKlbRSgyyMED7xNHZOGUgZjj/NW+0xbb+CNfM8exKQFsLBGOm6xoC5gxGO/a1I6qhWH5uP6UqDBFr6yHPe+X68MK2USOpKVh+b+u3K2fbeXnaVlqXBftDodbOnnUG7MyWszhykcaalMP6NLBqVFamNFl4aGvVBOf8b2flrUaO3EaDLQwA7hxtDyChYxBp3KHFBHCkLU2oKScDVvkGVGQ6ePd3OAuG8dPZPCmWVvrCGxiahR+qR0s+XdkBQwNIbQ34Bh2/wD/xAAxEQABAwIEAwYFBQEAAAAAAAABAgMEAAUREiExEBNBFBUgIzAyBiJRofBSYXGBweH/2gAIAQIBAT8B8MG1PTRn2TUiwOtpzNKzVt6bCUoaSlO2HC7JSmYsJ/NPSt1uVNVidECn58aBlZUf+VIvsdCfJ+Y044p1ZWvc+ghJWoJHWo0BiMjKE/3U98wYxU0mlrU4oqU
)| FineBI supports rich data source connection and analysis and management of tables with multiple views.| [Documentation](/cloud/4.x/integration/bi/finebi)| SmartBI| smartbi| Smartbi is a collection of software services and application connectors that can connect to a variety of data sources, including Oracle, SQL Server, MySQL, and Doris, enabling users to integrate and cleanse their data easily.| [Documentation](/cloud/4.x/integration/bi/smartbi)| QuickBI| quickbi| Quick BI is a data warehouse-based business intelligence tool that helps enterprises set up impressive visual analyses quickly.| [Documentation](/cloud/4.x/integration/bi/quickbi) ---|---|---|---

## SQL Client

| Name | Logo | Description | Resources |
| --- | --- | --- | --- |
| DBeaver | dbeaver | DBeaver is a cross-platform database tool for developers, database administrators, analysts and anyone who works with data. | [Documentation](/cloud/4.x/integration/sql-client/dbeaver) |
| DataGrip | icon_DataGrip | DataGrip is a powerful cross-platform database tool for relational and NoSQL databases from JetBrains. | [Documentation](/cloud/4.x/integration/sql-client/datagrip) |

## Data Source

| Name | Logo | Description | Resources |
| --- | --- | --- | --- |
| Apache Kafka | | Doris integrates with Kafka via its efficient Routine Load for real-time streaming (CSV/JSON, Exactly-Once) and the Doris Kafka Connector for advanced formats (a Routine Load sketch follows this table). | [GitHub](https://github.com/apache/doris-kafka-connector) [Documentation](/cloud/4.x/user-guide/data-operate/import/data-source/kafka) |
| Doris Kafka Connector | | Doris integrates with Kafka via its efficient Routine Load for real-time streaming (CSV/JSON, Exactly-Once) and the Doris Kafka Connector for advanced formats. | [GitHub](https://github.com/apache/doris-kafka-connector) [Documentation](/cloud/4.x/integration/data-source/doris-kafka-connector) |
| MySQL | | Doris JDBC Catalog supports connecting to MySQL databases via the standard JDBC interface (a catalog sketch follows this table). | [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/jdbc-mysql-catalog) |
| PostgreSQL | | Doris JDBC Catalog supports connecting to PostgreSQL databases via the standard JDBC interface. | [Documentation](/cloud/4.x/user-guide/lakehouse/catalogs/jdbc-pg-catalog) |
| Amazon S3 | amazon s3 | Doris supports loading S3 files using both asynchronous (S3 Load) and synchronous (TVF) methods (a TVF sketch follows this table). | [Documentation](/cloud/4.x/user-guide/data-operate/import/data-source/amazon-s3) |
| Azure | | Doris supports loading Azure Storage files using both asynchronous (S3 Load) and synchronous (TVF) methods. | [Documentation](/cloud/4.x/user-guide/data-operate/import/data-source/azure-storage) |
| Google Cloud Storage | ![gcp](/assets/images/google-bigquery-logo-icon-ff6d8331afc34882c370b0e6089ed461.png) | For loading files from Google Cloud Storage, Doris provides two methods: the asynchronous S3 Load and the synchronous TVF. | [Documentation](/cloud/4.x/user-guide/data-operate/import/data-source/google-cloud-storage) |
| MinIO | MinIO | Doris supports loading MinIO files using both asynchronous (S3 Load) and synchronous (TVF) methods. | [Documentation](/cloud/4.x/user-guide/lakehouse/storages/minio) |
| HDFS | | By connecting to Hive Metastore or metadata services compatible with Hive Metastore, Doris can automatically retrieve Hive database and table information for data querying. | [Documentation](/cloud/4.x/user-guide/lakehouse/storages/hdfs) |

## Data Ingestion

| Name | Logo | Description | Resources |
| --- | --- | --- | --- |
| Doris Streamloader | doris streamloader | Doris Streamloader is a client tool designed for loading data into Apache Doris. Compared to a single-threaded load using curl, it reduces the load latency of large datasets through its concurrent loading capabilities. | [Documentation](/cloud/4.x/integration/data-ingestion/doris-streamloader) |
| Apache SeaTunnel | ![seatunnel](/assets/images/seatunnel-d67346ce0fadbd01f59a673817ec4629.png) | SeaTunnel is a very easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive data. | [Documentation](/cloud/4.x/integration/data-ingestion/seatunnel) |
| BladePipe | | BladePipe is a real-time end-to-end data replication tool, moving data between 30+ databases, message queues, search engines, caches, real-time data warehouses, data lakes and more, with ultra-low latency of less than 3 seconds. | [Documentation](/cloud/4.x/integration/data-ingestion/cloudcanal) |

## More

| Name | Logo | Description | Resources |
| --- | --- | --- | --- |
| AutoMQ | ![automq](data:image/png;base64,
1y7ukLwD4nBp6NGyu9Z31xSyYImYnMGYNTdWCd4A47+4IikIWBDYBTWh1JOk/e+eKLj8/MuBzI/uTvHHUAiQOU/C88w0Tqyd3rGQkeKMGwjr7njxaqzHmeyrrm6Rdfuu+yYTTBOqdfvndIoc30e9rmnqt4nxI6qwvAPicCHlp3ypeGjMeZeVisktIrsX+ZNCJSCY3tCQQVg6RqarVn7L8+cBCzSnxw4cHd8IdAwIxdSOYhofYC6MInaxd/1Ys7ZYxk0z7jmm++ef/+u6cxy8acdNLGvmPHju7W9odKwnP4LLyJlCeku9689tL7z81+GA7zlGb+IOOaV8cOftwYa2sAjwI/DpZKyUVVIViFd7YFDZG3TcRssnXN1XBb29JdJ07zVB5c/5WxjLd7OIAbGNiOioiV765lQsAnZN2zLDOkmKBK0rRrfvoV915I6NiAbCqF2WI+wu3aSzCsVLliYODbHZinNGPwrH3jN08ztLnFNKLCVUzYbxRPZlKxPdhJNTxQgnHsSnfqDVMabbf87MlLT5zmE0hDtyhXi2N3SP7DqS+NALLAhAxwHnys3WBCUjQgB2KiU6V/+ov733pXs6456vUdZq6tizO3WzUQ7Z5tg+666Jyr71uOeUozAQ+tXLmyp37kuWybbQ8Qb5VxJQWKD3ER1uHEmCEUHSb/ctfeuOYX3YbHHqvihJOS9YnsmvyTr59RrE4U2TQcGKGu+V/y6gIU2TrFsYYe7exeue3lp546giZc8zOuuP96QvuQZXW47iv6WBXbttnLDzTpzpHx6h9g3fnzwj4zAU/SedZtl5hZ80tNWL/C6T9IGySFFh0cUzqfhQ4sgWIQ1Mxd8+MkP4fl9VX+x81w+9rEICZ+PQLL2EKPo95ItsWDUlWMa449B7+26Wk0YSSv2vLEELDkOhO/6nDt4psJfZt5SETrdo2uay648t/m5acwXw081jVP6rCvgcvpXdI/KSmJ/hRCnMRfyxqvtYizpKZc45pfOWPXvGFli+eAmAojVhc4O4UBP6qfUEMllcXZhrMcfNmG25p2zQcG3tw5PXbYqCvdG/ejOCbOdv54+eGR9Jr5YJ9XBk/mmr/x77do6wznUU0n194+0BCgeGVPK1YJ5Mo6knQMztw1b9iQrHzFWEOqolxLEUprbABG+wA4sJxEO3aBlHQC9zLtA6ZSqjftmg9d+uBZmtouMn1eyZtVBgtnf86k2TIVjSXzwj6vBJ7cNU/Tnea4E2jAMJqrp0DvZTBJuiXWaG1ccxPnMq751lm65qVqww2un5wE/H52x3Y2z9WV3eeOg1MQ2qgLGwPFsWuD/ZfYGdLhJ+9tzjXPfnKpPnlktzns430q6kxhkhaIFqXlH8sPj9WuwZo1LY37HB88mWs+efDjlNbXgI2GiJfQK0gEk2zwoGGxN90Lu0qMa75yVzM075IfbDgaDydE6ACcDYEAOGJgA7uWtdGtBmSMZLNSPYGkOdd8o2H69Vd+bwvpjtXeQrTP9Awd6hoxJymmook6zKT77y+v9Ld0wvS44Mlcc6rDz5rbSqpgHDoajyglH5Cig30gS7GBdCkPCk1VSBnX/L3N0HxoSr4eQwxusGlcx8KD21O+u0wzJinFq0IbfYHuydkvVROads2XveuRQaU7dupsrxgCw7g6xwIhGEeH9mbrlYh6lp954T0XtXK5RiPw2Pfu1cef366ta04y/E6OsikMUsw0RF4+/C4CRAOWu5oHn/3+bF3z46XI1inybL5CADhnHLC2+OvhBwy8DEBIfNEXox3dJzftmo+PHdlmosWDedH8DWqBAVFifHiBAev/zG2nVH145YbelhnOjcBjXPPPXYJUX0qoV/JNbu4UNbicdT4AYRcVUq0Z8BBspzHj+TflmjdKJYYj6ZG49T6ExmBBpHbDgCEMEtg1Kqmafpob1zxV15mQSAd5GgRK8SeKGTGAOuTDdrpG7xkbfudrLZswjcGTu+Zp9t49ne81L1ZGBS+Ks0zI5xJSFAXXGGkH2byamYe44+gPP3g7mnDNGycPTnD3PLBQ3tUibsMHSUOwi2dLcKGAZy8F9VLS05RrTv39r++amhzJgrC9KNZQQzC1bAPXAj5IyNRrsIeos15NtrdqwlSCxxhsy3/duOaUDISgn2K6N0hj/LsJIvQvpAOsQcV5qLFK56obnnvu0XHMEev4DmZP5KxDrG6uLVIQyvWXQGEGs2cqNZUq45p/qynXXK257D82U9q2NXPNeVhDGsisfoyBZL2ZSQFbIzNj1HPumR9szXIN8RqLzDWv19LsN7A6Q0UhJRNgdA5hZMJ3LpeOIEF5J6iaVmrP0YcvOYQ5TCJwSUHNyF83jliGEEWOSTIOmMQT2MsMUPzSnnHNv/PuZl3zXlTr9jXDgHy2ZyBy8alIOBDGpGgFqNwvnbWqbslyjbB7wrrmz5hZc6yxCpMj3w88Gkgu2Ptl+DlADEB+YLwS9XSbWr7riTlwzXly78/xNoqtGMX9G75ETMW31Di15gx+bveQ7wt9TKlK0655x1vv3ZLW2wft0lIqnqdk/8mqR0zKxsexZFQZM0urN0xPj835u5g881jXPCXrmgcDEz4YleXokrGWVzrYQ4yFeKPyE9niqAlFHZ/82TOXPo05NJLzllh1WMqmqJ655lTCEA1MFeeFusOpk9zQSI0Y7H3ygbd8talZ83c+MqTTJfYNqxwImi8Hifo8tnGoAAw1AA6Tjdm9A/ZVUobGzDXvzlxzRboHgjo527COhJNGFS5F2Hcu7SFXXGYI6mef//7v7sHIgTk2kvOqkI50PqK4jqs/uSUNrN5MRfvYiq223zpTZGYH6gh6l33k8P79s2Yd+7ukEyMfN3OZayjEDiDDAJDq30mzv4blI8SE3Paioo21hNQ9L3f3T+PlpzCXyf5uYu/ZN281T85fie+kTEnXm7jkwdkS3qr250ugCWWMpej41MjIyXNmJJeb4moTpFG8xqX4F/8EpfTDgpry0s4ASaqtqpDu+eVX39FUYHPtZY8MGOBssfNXrmbMQOaCGDMgIOsf1DWiTZQ2Y6y9r/sGzLGZkCW1fPm6Ll2f3m4e2B26MdC3ZI9Q8RhY4TyX9vxao65qlLnmj7zfuOb75p51kFGo8s/zAiDqVtSbYlbkKkC+5QKQy2uzFyWYo4Np57JmplPyrUtTo7uM8oxmzWXfWVXPXlQV1K5rI+DHjIHGb2xUmNY0ftsT//iOg2iBwKq+c/72ImPLnMajwJz6HEjkYITGikEIN/jB0nmrxlTHqXPqmsfJ7TT3r5wFxA+YcGAE950QhJ3nBYEJUxU2fxpU333kO1uaGIxNyWsv37sVqn2ziQIn3iRw6hSS5cP0RNEmQStOC4TVhSzpRGO4t6Pn8wcOPNCS7cgVpPW/MA9dIhCPRhKK0r5ySUpBEoqz+aBpVUtBe8bn2DUvJa9qNGKDPtQ9dLb0CJlH5XNZu7PvmWuu0+EnjGs+OvyTWbvmdtenVkUQ1vVTgwtZfzb6pef4JiHwFoR60oTTdj1/98fmYN6wccr07Woq+FB2IEFUkliQjKLOp
qCjfT7ZvdxGUFrjmsdJga8Y5CqJdTwBVDrvRToMFor12AggM5/HKFFNuebZ/FX/+Q9cm+qOoXzy0pfNfhO+YBwKAOYvbeB1jo1jn7JL9PiBH99x3p6RAz9qiZmQpYqtOkW6HZJxfNLsEscw0Vvc+aY7pdREknR+8meHWuCaN0jZPJx2P2UUMUtufZJfr08M8K69AvisTYZ2UkJ175P3nWdc8/2zds2zXZ9TxyavNgzWzpnPP6uBSgIDDnlhyMGiIzvHV1djpH1pT1Pbm2eSKsE2I/ZJTA2wxumQ5xtoBwW+oZ6GyahzpM8+++AFrXHNG6RyiECqIc1YKH4VnQ60WTRFHB9p71retGs+OTay00w3DVrl6p+f1y5e4Zh3dWAY3iahnrRsP+zLr8b3/Pet7zqAFgusCizDPlkj4k/uzYADjpjDm183VtPtLXTN45YoWf2ivu6tGDwqTlH9gzqDL8Ab12bWHGbW/OnmXHM6/epHhoiWZFuX2oIqDfX0ABFCzIS5gZ2DyC3PsJSQHu7qqczJ4rpXS4qj22O/pLIi9xuys929YWmmqpkvLXXN42TfL0isdr6v40FAOfjGmdbeq9ynuTQ52NPbnGue7fqcHhvdbR6xFAhbm3koIcTJAHAhFSq1KJAdcBYy1Z5K6egth+68qmVGMk8VV4sQiyrPXQkwETEtFSTF3VNoN+6az0M6mK3mS/2L7RiISoxTtAFooKIBYfgba3+a1PTuQ9+4sinX/Iyr9lxo5prtrk9idQuJ1ZUDnF0j1BOCyvKqy36ODz96+9tuM97gHC6uO35SnjYRqSbPIqHDg3TwWImLTRS3kKF5rW6b2HfVHC1on0E61F8zxvlDJh5Ti1WqHyhic3VFPj/OPxjAshcy6erTT9zT3Ky5dc1T2mGs2K6ClZm6ZKAmlFhItkOmspelj1YqeqcBzqxXM55okjH9UpITbdpfR1482Y+XmdNqMk3T/W3dp7QsMNU47as+89333mEOHjLcbdSLfT9tsCUAAfoQaZYD6YxUSgwIkR5ua2//sBmMWXssa9Zs6tjwew9cb4zkIUcX/KXfgg3RiA2Pn9hths7ScSQv/8Pj3910N1o0CdooKW/ssq24JcOtZLwBTqKVGyCFUeMq3zrd1fPOn97+T7/APKHfpUxFqrYlVxDqtxoYD5N7TaKXYIBHbT0V2UPHova/ukLbQ0rrrZ0/v/iBJtpBv3LZvw/otOd6U0SHK79kl4k7cmHkYZMGQ2JTMemZGqtsGDT250/ufvefPbtv3wTmsd//FweoD/GV7xV7AAAAAElFTkSuQmCC)| AutoMQ is a cloud-native fork of Kafka by separating storage to object storage like S3.| [Documentation](/cloud/4.x/integration/more/automq-load)| DataX| ![datax](data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEAkACQAAD/4g+sSUNDX1BST0ZJTEUAAQEAAA+cYXBwbAIQAABtbnRyUkdCIFhZWiAH4wABAAUADQAKABBhY3NwQVBQTAAAAABBUFBMAAAAAAAAAAAAAAAAAAAAAAAA9tYAAQAAAADTLWFwcGwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABFkZXNjAAABUAAAAGJkc2NtAAABtAAABIRjcHJ0AAAGOAAAACN3dHB0AAAGXAAAABRyWFlaAAAGcAAAABRnWFlaAAAGhAAAABRiWFlaAAAGmAAAABRyVFJDAAAGrAAACAxhYXJnAAAOuAAAACB2Y2d0AAAO2AAAADBuZGluAAAPCAAAAD5jaGFkAAAPSAAAACxtbW9kAAAPdAAAAChiVFJDAAAGrAAACAxnVFJDAAAGrAAACAxhYWJnAAAOuAAAACBhYWdnAAAOuAAAACBkZXNjAAAAAAAAAAhEaXNwbGF5AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAbWx1YwAAAAAAAAAmAAAADGhySFIAAAAUAAAB2GtvS1IAAAAMAAAB7G5iTk8AAAASAAAB+GlkAAAAAAASAAACCmh1SFUAAAAUAAACHGNzQ1oAAAAWAAACMGRhREsAAAAcAAACRm5sTkwAAAAWAAACYmZpRkkAAAAQAAACeGl0SVQAAAAUAAACiGVzRVMAAAASAAACnHJvUk8AAAASAAACnGZyQ0EAAAAWAAACrmFyAAAAAAAUAAACxHVrVUEAAAAcAAAC2GhlSUwAAAAWAAAC9HpoVFcAAAAMAAADCnZpVk4AAAAOAAADFnNrU0sAAAAWAAADJHpoQ04AAAAMAAADCnJ1UlUAAAAkAAADOmVuR0IAAAAUAAADXmZyRlIAAAAWAAADcm1zAAAAAAASAAADiGhpSU4AAAASAAADmnRoVEgAAAAMAAADrGNhRVMAAAAYAAADuGVuQVUAAAAUAAADXmVzWEwAAAASAAACnGRlREUAAAAQAAAD0GVuVVMAAAASAAAD4HB0QlIAAAAYAAAD8nBsUEwAAAASAAAECmVsR1IAAAAiAAAEHHN2U0UAAAAQAAAEPnRyVFIAAAAUAAAETnB0UFQAAAAWAAAEYmphSlAAAAAMAAAEeABMAEMARAAgAHUAIABiAG8AagBpzuy37AAgAEwAQwBEAEYAYQByAGcAZQAtAEwAQwBEAEwAQwBEACAAVwBhAHIAbgBhAFMAegDtAG4AZQBzACAATABDAEQAQgBhAHIAZQB2AG4A/QAgAEwAQwBEAEwAQwBEAC0AZgBhAHIAdgBlAHMAawDmAHIAbQBLAGwAZQB1AHIAZQBuAC0ATABDAEQAVgDkAHIAaQAtAEwAQwBEAEwAQwBEACAAYwBvAGwAbwByAGkATABDAEQAIABjAG8AbABvAHIAQQBDAEwAIABjAG8AdQBsAGUAdQByIA8ATABDAEQAIAZFBkQGSAZGBikEGgQ+BDsETAQ+BEAEPgQyBDgEOQAgAEwAQwBEIA8ATABDAEQAIAXmBdEF4gXVBeAF2V9pgnIAIABMAEMARABMAEMARAAgAE0A4AB1AEYAYQByAGUAYgBuAP0AIABMAEMARAQmBDIENQRCBD0EPgQ5ACAEFgQaAC0ENAQ4BEEEPwQ7BDUEOQBDAG8AbABvAHUAcgAgAEwAQwBEAEwAQwBEACAAYwBvAHUAbABlAHUAcgBXAGEAcgBuAGEAIABMAEMARAkwCQIJFwlACSgAIABMAEMARABMAEMARAAgDioONQBMAEMARAAgAGUAbgAgAGMAbwBsAG8AcgBGAGEAcgBiAC0ATABDAEQAQwBvAGwAbwByACAATABDAEQATABDAEQAIABDAG8AbABvAHIAaQBkAG8ASwBvAGwAbwByACAATABDAEQDiAOzA8cDwQPJA7wDtwA
gA78DuAPMA70DtwAgAEwAQwBEAEYA5AByAGcALQBMAEMARABSAGUAbgBrAGwAaQAgAEwAQwBEAEwAQwBEACAAYQAgAEMAbwByAGUAczCrMOkw/ABMAEMARHRleHQAAAAAQ29weXJpZ2h0IEFwcGxlIEluYy4sIDIwMTkAAFhZWiAAAAAAAADzFgABAAAAARbKWFlaIAAAAAAAAHHAAAA5igAAAWdYWVogAAAAAAAAYSMAALnmAAAT9lhZWiAAAAAAAAAj8gAADJAAAL3QY3VydgAAAAAAAAQAAAAABQAKAA8AFAAZAB4AIwAoAC0AMgA2ADsAQABFAEoATwBUAFkAXgBjAGgAbQByAHcAfACBAIYAiwCQAJUAmgCfAKMAqACtALIAtwC8AMEAxgDLANAA1QDbAOAA5QDrAPAA9gD7AQEBBwENARMBGQEfASUBKwEyATgBPgFFAUwBUgFZAWABZwFuAXUBfAGDAYsBkgGaAaEBqQGxAbkBwQHJAdEB2QHhAekB8gH6AgMCDAIUAh0CJgIvAjgCQQJLAlQCXQJnAnECegKEAo4CmAKiAqwCtgLBAssC1QLgAusC9QMAAwsDFgMhAy0DOANDA08DWgNmA3IDfgOKA5YDogOuA7oDxwPTA+AD7AP5BAYEEwQgBC0EOwRIBFUEYwRxBH4EjASaBKgEtgTEBNME4QTwBP4FDQUcBSsFOgVJBVgFZwV3BYYFlgWmBbUFxQXVBeUF9gYGBhYGJwY3BkgGWQZqBnsGjAadBq8GwAbRBuMG9QcHBxkHKwc9B08HYQd0B4YHmQesB78H0gflB/gICwgfCDIIRghaCG4IggiWCKoIvgjSCOcI+wkQCSUJOglPCWQJeQmPCaQJugnPCeUJ+woRCicKPQpUCmoKgQqYCq4KxQrcCvMLCwsiCzkLUQtpC4ALmAuwC8gL4Qv5DBIMKgxDDFwMdQyODKcMwAzZDPMNDQ0mDUANWg10DY4NqQ3DDd4N+A4TDi4OSQ5kDn8Omw62DtIO7g8JDyUPQQ9eD3oPlg+zD88P7BAJECYQQxBhEH4QmxC5ENcQ9RETETERTxFtEYwRqhHJEegSBxImEkUSZBKEEqMSwxLjEwMTIxNDE2MTgxOkE8UT5RQGFCcUSRRqFIsUrRTOFPAVEhU0FVYVeBWbFb0V4BYDFiYWSRZsFo8WshbWFvoXHRdBF2UXiReuF9IX9xgbGEAYZRiKGK8Y1Rj6GSAZRRlrGZEZtxndGgQaKhpRGncanhrFGuwbFBs7G2MbihuyG9ocAhwqHFIcexyjHMwc9R0eHUcdcB2ZHcMd7B4WHkAeah6UHr4e6R8THz4faR+UH78f6iAVIEEgbCCYIMQg8CEcIUghdSGhIc4h+yInIlUigiKvIt0jCiM4I2YjlCPCI/AkHyRNJHwkqyTaJQklOCVoJZclxyX3JicmVyaHJrcm6CcYJ0kneierJ9woDSg/KHEooijUKQYpOClrKZ0p0CoCKjUqaCqbKs8rAis2K2krnSvRLAUsOSxuLKIs1y0MLUEtdi2rLeEuFi5MLoIuty7uLyQvWi+RL8cv/jA1MGwwpDDbMRIxSjGCMbox8jIqMmMymzLUMw0zRjN/M7gz8TQrNGU0njTYNRM1TTWHNcI1/TY3NnI2rjbpNyQ3YDecN9c4FDhQOIw4yDkFOUI5fzm8Ofk6Njp0OrI67zstO2s7qjvoPCc8ZTykPOM9Ij1hPaE94D4gPmA+oD7gPyE/YT+iP+JAI0BkQKZA50EpQWpBrEHuQjBCckK1QvdDOkN9Q8BEA0RHRIpEzkUSRVVFmkXeRiJGZ0arRvBHNUd7R8BIBUhLSJFI10kdSWNJqUnwSjdKfUrESwxLU0uaS+JMKkxyTLpNAk1KTZNN3E4lTm5Ot08AT0lPk0/dUCdQcVC7UQZRUFGbUeZSMVJ8UsdTE1NfU6pT9lRCVI9U21UoVXVVwlYPVlxWqVb3V0RXklfgWC9YfVjLWRpZaVm4WgdaVlqmWvVbRVuVW+VcNVyGXNZdJ114XcleGl5sXr1fD19hX7NgBWBXYKpg/GFPYaJh9WJJYpxi8GNDY5dj62RAZJRk6WU9ZZJl52Y9ZpJm6Gc9Z5Nn6Wg/aJZo7GlDaZpp8WpIap9q92tPa6dr/2xXbK9tCG1gbbluEm5rbsRvHm94b9FwK3CGcOBxOnGVcfByS3KmcwFzXXO4dBR0cHTMdSh1hXXhdj52m3b4d1Z3s3gReG54zHkqeYl553pGeqV7BHtje8J8IXyBfOF9QX2hfgF+Yn7CfyN/hH/lgEeAqIEKgWuBzYIwgpKC9INXg7qEHYSAhOOFR4Wrhg6GcobXhzuHn4gEiGmIzokziZmJ/opkisqLMIuWi/yMY4zKjTGNmI3/jmaOzo82j56QBpBukNaRP5GokhGSepLjk02TtpQglIqU9JVflcmWNJaflwqXdZfgmEyYuJkkmZCZ/JpomtWbQpuvnByciZz3nWSd0p5Anq6fHZ+Ln/qgaaDYoUehtqImopajBqN2o+akVqTHpTilqaYapoum/adup+CoUqjEqTepqaocqo+rAqt1q+msXKzQrUStuK4trqGvFq+LsACwdbDqsWCx1rJLssKzOLOutCW0nLUTtYq2AbZ5tvC3aLfguFm40blKucK6O7q1uy67p7whvJu9Fb2Pvgq+hL7/v3q/9cBwwOzBZ8Hjwl/C28NYw9TEUcTOxUvFyMZGxsPHQce/yD3IvMk6ybnKOMq3yzbLtsw1zLXNNc21zjbOts83z7jQOdC60TzRvtI/0sHTRNPG1EnUy9VO1dHWVdbY11zX4Nhk2OjZbNnx2nba+9uA3AXcit0Q3ZbeHN6i3ynfr+A24L3hROHM4lPi2+Nj4+vkc+T85YTmDeaW5x/nqegy6LzpRunQ6lvq5etw6/vshu0R7ZzuKO6070DvzPBY8OXxcvH/8ozzGfOn9DT0wvVQ9d72bfb794r4Gfio+Tj5x/pX+uf7d/wH/Jj9Kf26/kv+3P9t//9wYXJhAAAAAAADAAAAAmZmAADypwAADVkAABPQAAAKW3ZjZ3QAAAAAAAAAAQABAAAAAAAAAAEAAAABAAAAAAAAAAEAAAABAAAAAAAAAAEAAG5kaW4AAAAAAAAANgAAp0AAAFWAAABMwAAAnsAAACWAAAAMwAAAUAAAAFRAAAIzMwACMzMAAjMzAAAAAAAAAABzZjMyAAAAAAABDHIAAAX4///zHQAAB7oAAP1y///7nf///aQAAAPZAADAcW1tb2QAAAAAAAAGEAAAoC8AAAAA0OXuAAAAAAAAAAAAAAAAAAAAAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMUFRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBT/wgARCADIAMgDAREAAhEBAxEB/8QAHAABAAEFAQEAAAAAAA
AAAAAAAAYBAwQHCAUC/8QAGwEBAAIDAQEAAAAAAAAAAAAAAAYHAQMFBAL/2gAMAwEAAhADEAAAAeqQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAMKAFRkAwoAVGQAAAAsa9upofPobxJFa+dgMbtnVaSrr8ICP87raKgFo/OPoff18S3r8DbcxgGZv84AAA1DDLBgvAlAufXxJ+txJx3YzNe7GgAIhxpBBY/KItye7Z+NglvY4G651WoAAA5urG5LGvbsOSRHZkphWTt0gAAAWdezXUbl+sorNq5dIWbTd/ZqAAFMOaavummHQVjVL63s8AAAAAGNq3c4VlcdGejLLp7N3+YAAUw5pq+6aYdCWPUnqevwjF1bvL8fvAA9L1eLM3+cY+vbzfWNyUZ6Msuns3f5gABTDmmr7pph0JY9Sep6/CIfxZDpeD2QABtyY1/PJDFhj69vN9Y3JRnoyy6ezd/mAAFMOaavummHQlj1J6nr8Isa9mDo9QAGbu8+Rt0jH17eb6xuSjPRll09m7/MAAKYc01fdNMOg7GqX1fZ4AAAAAMbVu5wrK46M9GWXT2bv8wAAHN1Y3JY17dhySI7MlMKydukAAACxr2a7jcv1lFZtXLpCzabv7NQAAGoYZYMF4EoFz6+ZP1eHN+9GZt3Y0ABEONIILH5RFuT3bPxsEt7HA3XOq1AAAFjXt1ND59DeJIrXzsBjds6rSVdfhAR/ndbRUAtH5x9C5n4lnY4G25jAMzf5wAAAAKYAVGQDCgBUZAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAD/xAAlEAABAgUDBQEBAAAAAAAAAAADAgQAAQUGEBEVMCAxMzQ1E3D/2gAIAQEAAQUC/mZzobCqV0HcKWVZZ5t+lbc1zWqZKptFSUhWEEUOdOuZy1m0djeg4LpqMzusIGoqmtsvXEU61wsy9NTtsNQK6td6CCiWFWLbqU2T3gOSZjRb9vodCC2E2TwlCM6a7bohgiU9IAv9QcNJlpTeNzLVvE+zL0+Gl/Ow4ciaj31hG+sI31hG+sI31hDZ4F4jDjwRPsy9PhpfzsXV8nps/wBDDjwRPsy9PhpfzsFCM6NqZxtTONqZxtTONqZwFuJunDjwRPsy9PhpM9abxuZ6N4n2Zen1nHMJot+4ENRBcicp4SmGBNduIRARprAEfkDrummzA6wgihKa3M9bxTroC8L01O5A08rq6Hp4KVZlYtumzevuA4EORVK1zt1LCsU82/Vdxa5rVTlTGipqWrCBqJOn2y5dTaNRsgcekaRp1aRpGn8e/8QANBEAAQIDBAYKAgIDAAAAAAAAAQIDAAQRBRAScRQhMDEzUxMVICIyNEGRobFRYXDRBsHw/9oACAEDAQE/Af4zaaW8sNtipMSdhssjE/3lfEJbQgUSKdi1p/THcKPAnd/fYs2dMk9iPhO+EkKFRepCValCJuxJd8Va7qviH2HJZwtuih2NhSYaZ0hW9X1epaUDEo0EP21KM6gcWUTluOzCC2gYQfftSVsuyiOiIxJhi3JR3Uvun9whxDoxINb7ZkxMy5WPEnYtI6NCUD0F1rWsthfQMb/Uw4848cTiq7JDi2jiQaRZVsOLcDEwa13G6ldUOpwOKT+NlPms05mdozqcTncImeOvM/eynvNOZm9llx9WBoVMdVTvLMdVTvLMdVTvLMdVTvLMdVTvLMPS7sucLqaG9riJzuETPHXmfvZT3mnMzfYXnBke1/kPmE5f7va4ic7hEzx15n72U95pzM3tuLaOJs0MadNcw+8adNcw+8adNcw+8adNcw+8adNcw+8OPOPGriq3tcROdwiZ468z97KfFJpzM7RnW4nO4RM8deZ+9g0sOISset1rWSt9fTsb/UQ4y4yaOJpskNrdOFArFlWO4hwPzApTcLq01w6rG4pX52FhTgdZ0dW9P1epKVjCoVh+xZR7cMJ/UTlhuy6C4hWID37UlYzs2gOk4UwxYco1rV3j+4Q2hoYUCgvticEtLlA8Sv8AjsWnVsrDjZoREnbjLwwv91XxCXELFUmvYtaQ0N3EjwK3f12LNkjOvYT4RvhICRhF6lpTrUYm7bl2BRrvK+IffcmXC66an+NP/8QALhEAAQIEAQsEAwEAAAAAAAAAAQIDAAQFERASFCAhMDEyM0FScRMVIpFwobE0/9oACAECAQE/Afxm44lpJWs6omas64bM/EfuFLUrWo6FOlM2buriOhPSommrdRuggg2OIUpO4xLVR5k2X8hDLyH0BaDq2NWmS456I3D+4pSVGyRDVLmXN4t5iWpTbKgtRuRpTVMbmVZYNjDtJmG+H5QpCkGyxbGmTJYeyTuVsVqy1lR64U6nJdT6z27oIQ0hoWQLbJaErFlC8VCmoSgus6rdML21w2rKQFbKTFpdHjaOcCsWOUnwNlKchHgYuOoZTlLNhHuEr3iPcJXvEe4SveI9wle8R7hK94hp5t4XbN8XOA4scpPgbKU5CPAxq3+Y+RpUXkq84ucBxY5SfA2UpyEeBitCXBkrF4zSX7B9Rmkv2D6jNJfsH1GaS/YPqM0l+wfUIbQ2LIFsXOA4scpPgbKTN5dHgbRzgVixyk+BsFpyFlJ6YU6opaT6L27oYQ6h0XQb7Ja0ti6zaKhUkKQWmdd+uG/VDaclATsKtLFtz1huP9xSopN0mGqpMt7zfzEtVW3lBChYnSmqm3LqyALmHarMOcPxhS1LN1m+NMli+9lHcnYuNpcSULGqJmkutm7PyH7hSFJ1KGhTpvOW7K4hoT00JVq/U7oJKjc4hKlagIlqW88bufEQyyhhGQgavxp//8QANBAAAgEBBAcGBQQDAAAAAAAAAQIDABARElEEITAxNGFyICNic5KxFDJBkcEiQlJwEzNx/9oACAEBAAY/Av6zaSRsKLvJoro3cRZ/uNXu7OfEexicd/Jrbly7BUf7V1oaKteCNRFt6sVPI0BKfiIvF833pZYmxIdj8Mp7uLfza3CilmyUX1eyCFc5D+KSWRzM66xquHaMocwynfduNXoFnXwb/tWGRGQ5MLrRGT3Uv6TyP0Oxkc72YmwaTpIxIfkTPnWGKNYx4RssMiK4yYX02kaKuDDraP6XWX5VG/8AJQdlo3ljaSjwmw1B0L7bLRvLFv8AklcImZrikrikrikrikrikotDIJFBuvFsnSbDUHQvtstG8sWt1L2pfM/FsnSbDUHQvtstG8sW4ZEDrkwrhYvTXCxeiuFi9FcLF6K4WL0VhijWMZKLZOk2GoOhfbZaN5Y2kvSbDUHQvtsJEO9WIsGjaSbkHyPlyrFFIsg8Jv2WKR1QZsbqbR9FbHi1NJ9LrLs6jT+KgbD4lR3cu/k1uJGKtmpuq5nEy5SD80sUiGF21DXeO0YghmlG+7cKuQrAvgGusUjs7Zsb7RIR3UX6j/36DYtHIuJG3g0W0fv48v3CrnRlPMdjC57+PU3Pn2Cw1ytqQUWa8
k6ybblVmPIUDKPh4vF832pYolwoP60//8QAKBAAAQMBBwQDAQEAAAAAAAAAAQARUTEQICEwQWHwcZGh8YHB0XCx/9oACAEBAAE/If446cSnEpxKd7jpxKcSnEp8sF4R0i7HQ+9oj2SClOZKcyUHNHJWmYDmnS4ZCHz+OhQxC7ioKcyU5krdjhgUTvqCh0/SBG7kGDvk7a9HDD9tFCNIJrIW47MUVLTGBPS9UvSAEkkJ1BGpN3EXyGCtxDWAGnAGSQpyn5NhuBFVgEltAEAyimAwCeOBZVaiIsIbCxxAp8K9xF80KNTYEcBh9OYAhiCLwUKLwFzUL5oUamznYtOwkQNN179e/Xv179e/TybDRNvAQbPAXNQvmhRqbOdi3gpveT/xbwEGzwFzUL5oUamznYtx1u+MC9bXrS9aXrS9aR0QS5YB7eAg2eAuahfNCjU2APBcfTmAMOABvBQovAXNQyMAop+DYTgbRcbi3gKDKbJDAJ94FlgGoCbARsxOAI5CvYRkbz1HDH9tFANIJjAW47MUT7TmJPW9QvSIAkEp1JGjncUX7hBbiCsEngOSC8I2qQdzo/W1TVK0OExgpjBQcUcLRMBzTpcMhfL89AjiF3FSUxgpjBWzGDFH7ahqdP0vLGpMnfLZMhMhMhM1xkyEyEyEzfx3/9oADAMBAAIAAwAAABCSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSBf8Ackf/AGJJJJIAwKpJIBXpJJJBu6JJBe/0JJJD/IJJJIHeBJJJx9JJJJJM4ZJJOGJKkktJXDJJJwxJttt5K4ZJJOGJOSSVJXDJJJx1JJJJJM4ZJJDfJJJJJPcBJJJH/vBJJO/8JJJIAwKJJIA3pJJJJD/3JH/3JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ//xAAoEQEAAQEFCAMBAQAAAAAAAAABEQAhMUFRYRAgMHGRobHwgeHxwXD/2gAIAQMBAT8Q/wAcioahqGo3AWoahqGo4dwgQFHR0sHxjzbNKiTGgHioKgpgJatEwNTj8sNObuG7SwGmfMv6mND3EbRMRxqCoKDgJqD5pkXKvcz+neowh31Mx4MKMrT7NvKNrMWYsHepZa0WdWDpNBvsTbKOdgTjBvFhhdMiaDlzKgUdJZ1J7xURYzEfG0Q7GTUxP6a8+AUFyAOhGx08HUicDXN+KmqeqvCnrWYp4qCksYg4DmOd47EGVzZQk4k6Kb5fRdsRb7LxGqMB5rGrxXts2+X0XUV6bN2yu3kGlfjfdfjfdfjfdfjfdfjfdE1MTDlt7R52XivbZt8vouor02bt9xlvdv8ALb2jzsvFe2zb5fRdRXps3bL7mDDX7av21ftq/bV+2og5lkrO3tHnZeK9tm3y+i6ikG+y8QojEedl4r22bfKuCAPU2Kjl6Exia5lT3PUThT1rIF8VBSWsRcFyC+L52IMrihJxL1V4EoMrX6NnKNsbQySTo1LL6izoydIoN1qbIhnFoxjDvFhS6RV1DLm1Ci6yzof2aiLyADxtcOxg0MX8NXTg3iBCe9qOrrYvnDk2a1EmNEfFSVJViQ1ZJiaHH4Yactw3YWi0y5t3fCh4gFgZBUlSUHBDVDzTJuVc5v8ACeZU4A7aGQZcOalqWpal3JipalqWp/x3/8QAKREBAAEBBQcFAQEAAAAAAAAAAREAECExUWEgMEFxkbHwgcHR4fGhcP/aAAgBAgEBPxD/AB2SpKkqZ2JqSpKkqd3BQGLTLnPF8enWk5C6q2ktxV2/FdDgfOuwqOctcuT90yOEubXJYdFKGLzcTk/NSBl5DruZy5+v0w5zbJouQT2q/QPV0L+1O3vhdBPLF67T1j4xeOsZ+tTSAaY9H7qbRakWmG4To8H2aNw+JkvVsPhK9TV9iopxoRuodhqTQ98R4IzMoywsFccKmriD1NtwtAJk7bwiTk9mjCnBrzmRtuFvjMrYhsx2iIiJWcF12dv8L2owpwa85kbbhb4zK3zGdTU1NTU14mhb/C9qMKcGvOZG24W+MythoGTfX5SvylflK/KV+Up0cORFv8L2owpwa85kbbhaCTwN4wZye1GFODXnMjcPiZJ0bD6QPQ0fZqKcaM7qGQasUvfAeCMjOc8LAbjjUlcAOhuIS5+n2x5zbJoOZd2q7QPV1L+9O3uhxJ58J5bT9h4xcHN+qnhA0x6vtFTaLVm0wXGdXge+5k4LEplyng+fS/Sk5A6iWkmFX75upwfnXYUHKGufI+qYPK4trkpdBaEBzcXkfNAIw8l1d3FQVBUFRGxFQVBUFR/jv//EACgQAQACAAMGBwEBAAAAAAAAAAEAESExYTBBUXGRoRAggbHB8PHRcP/aAAgBAQABPxD/ABu4gMWuc0nWaTrNJ1gFQjyfIgzQ5zSdZpOs0nWAckeUu9leF/GAfK5AYrAIZQAp4uL0mOseNW1j3Z+xP2InBIwAtVnqCxZg8s3V0PJjWIhqq4t4GD6O6KxbZglETiM/Yn7ECh2IVeoy3sKRg8d9ytzIGNcHLeg3DebG2gQWYMW3ytGr4mJpQxfQtmjDg7zqqFRckC5OZacS3MGpl5N0PZJXsKE1jVWiXULxpMieg9Fgpng9GElV4JfFmxOAetLR0hl51oXhFHIB3qvz4LGMAgGlTFLEDS2YVcquvOs/WBWwqOHOfRhIrq+lRZFzGdZJeSYwU5VTMTEe0fMQOYPz5+2nfPvMhgTgrDl2gMikO8ZCgcAndPZ2Fjtp3z7zc8p93w+KI8KUE0HrP3X8n7r+T91/J+6/k/dfyWRwapQNdE8joyOU7p7OwsdtO+febnlPu+HxdY9f2S3j3lvHvLePeW8e8t4947xb8gujI5Tuns7Cx207595ueU+74fF0zBAZMmmfQ/ifQfifQfifQfifQfiVKtzalWhvwPI6MjlO6ezsLHbTvH3lWMGclacu0Qu0l3BJWHSd09nYWEsT0gjoAeKnx4PGYggKqJiF3TrTlMKtXXDnWWycHOfVhYxqulSzLmcryC6vdDRqqjNXAJngA5gfGwSJUAmBFN8pZqeJiaWMD1EZr04O06rhiXJCuRkRXAszQuDflPZBfsLE3jWYDVzIKTIHrPQIqdZv6i4eKT1ZmB4h60tDUhhsK4v8wHw7xMRiku2CA8HJzmOkSOylz3J+RPyIHEINiWI8SZwssTkjzydTU8iWWCC7rizgYutG+KxbZqm1Xis/In5ECpWA9fQJc/AgweG6505MGKHAMd6Tet7s0OZfOaDpNB0mg6QGQDkeRDmDzmg6TQdJoOkBkA5TL/HP/9k=)| The DataX Doriswriter plugin supports synchronizing data from various data sources, such as MySQL, Oracle, and SQL Server, into Doris using the Stream Load method.| 
[Documentation](/cloud/4.x/integration/more/datax)| Kettle| ![kettle](/assets/images/pentaho-0787550d4b1fddad87fb31fd3c0e5d02.png)| Kettle Doris Plugin is used to write data from other data sources to Doris through Stream Load in Kettle.| [Documentation](/cloud/4.x/integration/more/kettle)| Apache Kyuubi| ![kyuubi](data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAkGBxAPEA0QEA0OEBQQDhAQEBAQEA8TDw8PFhIYFxcSFhgZHikhGRsmHBYWIjIiJiosLy8vGCA1OjUuOSkuLywBCgoKDg0OHBAQHC4mICYuLi4uLi4uLi4uLi4uLi4uLi4uLi4uLi4uLi4uLi4uLi4uLi4uLi4uLi4uLi45Li4uLv/AABEIAOEA4QMBIgACEQEDEQH/xAAcAAEAAgIDAQAAAAAAAAAAAAAAAQcFBgMECAL/xAA6EAACAQMCAwYFAwIDCQAAAAAAAQIDBBEFBhIhMQciQVFhcRMyQlKRFIGSYsFygqEjMzRDU1RzsuH/xAAbAQEAAQUBAAAAAAAAAAAAAAAABQECAwQGB//EADMRAAICAQEFBQcDBQEAAAAAAAABAgMEEQUSITFRBhMiQWEycYGRocHRFDPwI0JS4fGx/9oADAMBAAIRAxEAPwC8QAAAAAAQ2a5uTdtCyi05cU/CMebT9S2UlFasy00WXTUK1q2Z2vXjTTlKSSXVs1TWt/21BNU2qr8otFa7h3ZcXknmbjHwUW1leprzl6kfbnPlA6/B7Lx03sl/Bfk3bUu0e5nn4T+Gn6Jmu3mv3FX55p/sYok052zlzZ0lOzsan2IJfA+61ZzeWz6o3MofK8HCSWam3uLTTQzNlui6o44KmMejNm0vtLrRa+NF1PwivzM6BtuveyxTjhebXdMtdtuukWyNzMLBcXK+MUuvIt/R95WtzhcahJ/S3zNki0+aK6sOzWMUnOpJS84SaRs2l6XcWyUYVeOHj8STlLHoStcrf70cHm0YOreNZ8GvubCD4hLKR9mcigAAAAAAAAAAAAAAAAAAAAAQ2GzRe0Hdn6aLoUpLjku819KLLJqEd5mziYtmVaqq1xf09WfO996xt80KLUpvq19K9GVJd3M6snOcnKTeW2fFarKbcpNtt5bZ8ENddKx6s9K2dsyrCr3Y8X5vqCQDCSQAAKgAAEGTstduaC4aNedOPlF4RjSCqbXIxzrhNaSSfv4mcp7tv00/1lV+jl1/0MlZ9oF1TazLj/xM1EF6tmuTNazZ+NNaSrj8l9i1NK7Toywq9NQ83HLNz0vcNtcrNOqv82Ezzwc1tdzpyUoSaa6c2bNebNe1xIbL7M41i1q8L+a/nxPS+SSptsdok4uNO57yfLifVIs7T76nXgp0pqUWk8okKro2LgcdnbNvw5aWLh1XJ/z1O2ADKaAAAAAAAAAAAAAIyScVeooRlJvCSbYBhd3a7Gzt5yyuNpqC9fMoi+u51pyqTbbk23kzu+ddld3E8PuQfDBeDXma0Q2Vd3kuHJHpGwtmrEo3pe3Li/sgSAaxOg+STddjbPd1JVaqapxax/V/8L4Qc5bqNbLy68Wp2WPRI0toFgdpe240FSq0YJRxwySXJNdGV8Vsrdct1luDmQy6VbDk/oSCCTGbYAAKgAAAAAEGd21uWtZ1E1JuP1ReXy9DBkF0ZOL1RhuohdBwmtUz0RoOt0rymp03zx3o55pmWPPW2deqWVaM4t8Oe/Hwa8y9dG1SndUo1abyn+UyXx71YuPM852xsmWFPejxg+T6ejMgADZIUAAAAAAAAAGmdpWtfprb4cZYnV5LzxzNxk8cyiu0HVf1N3UxLKhyS8Mx5M1sqzchw8ya2DhrJyk5co8X9jWZSBBJDHpYIJIAMptzTHdXFKkvF8/Zcy/tOs40KUKcEkoxSwituyLT05VazXSKUX65SZahLYVe7De6nn3aXLdmT3K5R/8AWY7XtNjdUKlKSzlcvcoDV9PnbVZ05p5i2vf1PSJonaPtn9RT+PSj/tILmkvmj5DLo347y5ot7P7TWNb3Vj8MvoynQTODTafVPDXqQRJ6GAACoAAAAAAIJO7oluqtxQpy6SnFP2yVS1ehZZNQi5PyOoqUvtn/ABkbdsHckrOsqc3L4dSWGn9LfiWvZ6LbwpRgqUMcK5tLJhtd2Lb105QXBP6Xyxk31iWV+KL4nIWdoMXLi6b4NRfnz+JtdKopJSTymk0chq20LirTUrWvnjpfK39cfNG0IkIS3lqcjkUumxw59H1XkyQAXGEAAAAAAxG5b1ULWvPOH8NqPvg8916jnKUn1lJyfu3kt3tZveG2hSTw5Ti/25lPkVmz1np0O+7L4+5jStfOT+iJABpHUAgkgFGXL2TUcWXH5zkvxg3k0DsiuM206eflnKWPc38m8b9pHlu2E1nW69QfEoprDWUfYM5GFY7+2VniuLePrKC/sisZwabTTTXVPqj0zKOcp+Jo+7diU7jiq0e5PrwpfM/U0MjE18UPkdbsbtAq0qcnl5S/P5KcJO9qejV7aXDVp4fom1+ToEa01wZ2tdkZrei9USAChkAAABy2dZ05wnHqpRaOIgqWyWq0PRmhXqr29GpF5zCKf+JLmZErPso11YlaTfTLp+viyzETlNm/BM8p2liPFyZVvl5e58jp3NmpThUXKUX184+J3ESDIloaTk2kn5AAFSgAAAAABUXa7dcVejBPlGnJNeuSvzau0mtxX9ZfZJr8mqEHe9bGz1PZFfd4VcfTX58SQAYSTBAOWjQnPKhCTwsvCzhFS2UklqzdOyzVVRuHSb5VVheSxzLkR5qtK8qNSM03GUZJ+q9C9tpa9C8oRaa44pKUfH3JLCt4bjOH7T4MlYsmK4Pg/RmwAgk3zkwAADpX2nUq8XGpTi0/RZNM1rs3ozy7dqm+vey+ZYAMc6oT9pG3jZ2RjPWqTXp5fI89a5tu4s2/iU5cPhPHJmGPSeoWNOvCUKkFJNeKKT3xtp2Nbu5dObzF+XuRuRi934lyO32Pt1Zb7qxaT+jNZBBJpnRgAAqdvSr6dvVhVg8NSi/dZ5ovrbmswvKMKkWs4XEvFPxPPJsG0NxTsqyeW4N9+PXl5pG1jX93LR8iA25sr9ZVvQ9uPL19C/USdLTdQp3FONSnJNNefQ7iJhPXiedSi4txa0aJAALQAAAAACgd+tvULltYfxDAG8dqelOnc/GS7tXnJ+T8EaMQV0XGxpnq2zLY2Ylco/4r6LQkgHd0vS6tzNU6cHJt+XReZjS14G7OcYLek9EcNlaTrTjCEW3J45F07O2jTtKWakVKpOOJZxyXkRs/Z9OzipzxKo0strp6G3YJXGxtzxS5nA7b23+pfc0vwrm+v+iu91dnqquVS27rf0clHJqFjC+0qspfDk+
fejHiakvXBeeDjnRjLrGL90i6eLFvejwZr423rq6+5uSnHTTR8/mYLb256d3GK4ZRn4xcZJL92bEdSOnUk8qCT9DtmxBSS8REXyqlPWpNLo3qACMlxhJBjbnWrelLgnVjGWcYO7RrRmk4tNPyKJp8i6Vc4pNp6M5Gazv3TFc2dRY70O8n4rCbNmOOtSU4yi+kk0/ZlJxUouLL8e6VNsbI809TzO1za9QbBvbRpWlzUjjuybkvLD8DXyBlFxejPW8e6N1asjya1AALTMCCQChnNtbmrWU1wybj4wb5Fvbe3bb3cViajPHOL5JfkoQ5KNaUWnGTTXkzZpyZV8OaIXaWxKMzxezLqvuemIyT6NP2JyUbpO+7uhhOfFFeGDZbXtSXJTtveXEjfjmVy58Dkr+zmbW/ClJej/JZuRkr2fafSSeKDfpxI6UO0apXqwp0qXBxTUeqfVlzyq+pgjsLOfFw0Xq0WfkGB4bz7n/Fgyb66Gl+mX+Ufmd3W9Ip3lJ0qnR9H4plW6p2c16c38NxlFvl1bS9S5CMFttELPaNjB2rkYa3a3wfkyrNH7Mp5Trzjw+UcplgaPoVC0io0oLl9TS4vyZMkrXRCHJFuZtPJy+FsuHTkiESAZSPAAAAAABgd46v+ktalRPvYxH3M8Y7VtKp3SgqkVJRecPoy2abi9DNjyrjbGVq1inxR57uL2rUlxynNtvPWRnds7tr2k1mU5Qb7yk23j0LhW2rL/tKP8TGahse0qppQVP1jFEcsO2L1T4nXS7Q4V8e7tre78DNaPqlO6pRq05ZTWWvGPozv5NG0bbtfTq0XSm6lFvE1J9F5pG8Lng365Sa8S0Zy2ZVVXZ/RlvRfLr7ma3vfbqvqDxhVIJ8D/sUbeWs6U5QnFxaeMM9Lmobv2bTvE5wXDUS5YwuJ+pr5OPv+KPMmdh7ZWL/AEbvYfJ9P9FIknf1TRq9tNwqU5LHik3H8nQIpprgzvoWRmt6L1QBAKF5JBIAIBJABBvPZfo3xrh1pLu0ln/M+aZptnbSqzhCCb4njBfm09GjZ28KaXeaTk/Fs28Srfnq+SOf7Q56x8Z1xfilw+HmzN5ABLnnJIABUAAAAAAAAAAAAAAAAAAjAJAAAAB077TqVePDVpqSNV1Ds7tajbppU/RG7HWvLuFGLnUkoxXVvoWTrhL2kbWPmZFL0pk16L8FfXXZxQpQlOVwkopttorXUY01UkqbzFPCfn6m4763o7nNGi2qafN/d+5opEZDr10gj0HZFWXud5lSer5LoACDXJokmEG2kllvkctpaTrSUKcXKT8F1LR2VsRU+Gvc85dYw8vcy1UyseiI/P2lThw3rHx8l5v+dSezrafwkrmtHm13Ivw9SxEj4hFJYSwl0RyEzXWq47qPNc3Msy7nbP8A4iASDIagAAAAAAAAAAAAAAAAIYBJ8ykl1aRrm6N2UbGPVSqeEMlXavvi6ryfDUlTj5JtxNe3JhXw5smMDYmTlrfXCPV/b+aF2u6pr/mQ/kj5eo0P+tS/kjz5PWrh5zVlz9Tryvaj5Oc/yazz+iJmPZN+dn0PQ09Xt0m/1FHl/UjqV9zWsVn4sX7NHn7jl90v5Mjjf3P8lrz5dDPHsnV52P5Fw6v2kW9NNUVKUvVLhK61/dNxeN8UuGP2xb4TBA17Mic+bJjD2Ni4r1hHV9XxZJBJBgJUzekbXuLqKlSUcPxZtul9mU21+okkv6Hk63ZnuNUJfp6r7s2uFvwfkW7Fp9OZI4+PVOO8cVtna2djXOpaJPk9Oa95htG23b2iXBBNr6mlxGaRIJBRUeCOTstnbLem9X6gAFTGAAAAAAAAAAAAAAAAAADAbv1yNlbynnvSTUF4t4M9kpHtJ1h3F1KCfcp93H9S5MwZFvdw1XMldjYCzMlRl7K4v8fE1nUL6pXnKpUk25PPsdYEkLrqenRiorRciASChcQSAAAACoAABMJNNNPDXNNFkbO39wKNG5eUuSm8LC9StgZK7ZVvWJo5uBTlw3LF7n5o9IWOpUa6zTqRl7M7iPN1pqVxTx8OrNekZyRduxaVZWsJ1pylKfPm88vAlKMnvHpocLtXYv6GHeb+qb0S8zZQAbRAgAAAAAAAAAAAAAAAAAHU1GrwUqsvtpyf4R521Otx1q0vunN/lnoPX/8Ahbr/AMFT/wBTzk+r92Rue+KR2nZOtbtkvVIkAEedkAAAAAAAAAAAACAc9nbSqzjTgsyk8JIqWykorVmc2RobvLmCx3IPifk0vAvWhSUIxjFYUUkjBbM0GNlbxjhcclmb9TYsEzjU93DjzZ5ntraP6zI8Psx4L8/zyJABsEOAAAAAAAAAAAAAAAAAAdTVKfHRrR+6lJf6HnTUKXBVqw+2TX4Z6UZRfaFpTt7uo8YjN8Sfq+ZoZ0NYqR1fZXIUbZ1PzWq+BrAIJIw7oAAFQAAAAAACD7pU3JpRi234JNsqUb0IhBtpJZb5JLxZbPZ5tH4Kjc1V32sxT8P2OvsXZHDw3FxHL6xg+nuWRCKSwlhIkcXG08cjiNu7bU9ceh8P7n9l9z6JAJA5EAAAAAAAAAAAAAAAAAAAAAGr760BXlvLhjmpBZi/HobQQy2UVJaMzUXzosVkOaPM9zbypylCSw4vDRxl27s2VTvMzp4hUx18CrdT2pdUJNOlJpfUlyZD248631R6Ps/bOPlQXHSXmmYQHJVt5w+aLXucRgJdNPkSCEjtUdPrT+WlKXsEm+RSU1Fat6HWINi03Zl3WaXw5QXnJPBu+idmkIYlcTU2nnu5Rmhj2T5IjcrbOJjrxTTfRcSutH0KvdySpU5Pmsvy9S2NqbIpWijOpipU65a+X0NmsdPpUIqNOCSXpzO4SNOLGHF8WcZtLb12VrCHhj9X7z5SPoA2iBAAAAAAAAAAAAAAAAAAAAAAAABAABDMbuH/AHL/AHIBZLl8DLj/ALqKX3H1qf42a+AQdntHq2F+0iafzL3LA2p/eIBlx+ZpbY/ZLVtPkh7HKwCaR5k+bB9AAtQAAKgAAAAAAAAH/9k=)| Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on Data Warehouses and Lakehouses.| [Documentation](/cloud/4.x/integration/more/kyuubi) ---|---|---|--- On This Page * Lakehouse * Observability * Data Processing * BI * SQL Client * Data Source * Data Ingestion * More --- # Source: https://docs.velodb.io/cloud/4.x/integration/sql-client/dbeaver Version: 4.x On this page # DBeaver ## introduce​ DBeaver is a cross-platform database tool for developers, database administrators, analysts and 
DBeaver is a cross-platform database tool for developers, database administrators, analysts, and anyone who works with data. Apache Doris is highly compatible with the MySQL protocol, so you can use DBeaver's MySQL driver to connect to Apache Doris and query data in both the internal catalog and external catalogs.

## Preconditions

DBeaver is installed. You can visit the DBeaver website to download and install it.

## Add data source

Note: The steps below were verified with DBeaver version 24.0.0.

1. Start DBeaver.

2. Click the plus sign (**+**) icon in the upper left corner of the DBeaver window, or select **Database > New Database Connection** in the menu bar, to open the **Connect to a database** dialog. ![add connection 1](/assets/images/dbeaver1-08e265526a12a1b560d84b179eac1238.png) ![add connection 2](/assets/images/dbeaver2-a7f26e8015598024cb730df1f3f341d2.png)

3. Select the MySQL driver. In the **Select your database** window, select **MySQL**. ![chose driver](/assets/images/dbeaver3-599f75b71d72b8454f6641c2e575f96c.png)

4. Configure the Doris connection. In the **main** tab of the **Connection Settings** window, configure the following connection information:

   * Server Host: FE host IP address of the Doris cluster.
   * Port: FE query port of the Doris cluster, such as 9030.
   * Database: The target database in the Doris cluster.
   * Username: The username used to log in to the Doris cluster, such as admin.
   * Password: The password used to log in to the Doris cluster.

   tip The Database field can be used to distinguish between the internal catalog and external catalogs. If only a database name is filled in, the data source connects to the internal catalog by default. If the format is catalog.db, the data source connects to the catalog specified in the Database field, and the database tables shown in DBeaver are the tables of that connected catalog. You can therefore create multiple Doris data sources with DBeaver's MySQL driver to manage different catalogs in Doris (you can also switch catalogs from the SQL console; see the example after these steps). Note Managing an external catalog through the catalog.db form of the Database field requires Doris version 2.1.0 or above.

   * internal catalog ![connect internal catalog](/assets/images/dbeaver4-9b79f13badba5713605d6647f4648ed9.png)
   * external catalog ![connect external catalog](/assets/images/dbeaver5-f92fa21b93bffc3ce13dfb830da8dd13.png)

5. Test the data source connection. After filling in the connection information, click Test Connection in the lower left corner to verify it. DBeaver returns the following dialog box to confirm the connection configuration. Click OK to confirm that the configured connection information is correct, then click Finish in the lower right corner to complete the connection configuration. ![test connection](/assets/images/dbeaver6-fac1178b7798f028a57c79991dd9a036.png)

6. Connect to the database. After the database connection is established, you can see the created data source in the database navigation panel on the left and use it to connect to and manage the database through DBeaver. ![create connection](/assets/images/dbeaver7-68de28fe0f0fe59c23972aa3bc39c354.png)
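For reference, below is a minimal sketch of statements you might run in DBeaver's SQL console once the connection is established, to verify the connection and move between catalogs. The external catalog name `hive_catalog` and the table `demo.orders` are placeholders for illustration, not objects defined by this document:

```sql
-- List the catalogs visible to the connected user
-- (the built-in internal catalog plus any external catalogs).
SHOW CATALOGS;

-- Query a table in the internal catalog using its fully qualified
-- catalog.database.table name; demo.orders is a placeholder table.
SELECT COUNT(*) FROM internal.demo.orders;

-- Switch to a hypothetical external catalog and browse its databases.
SWITCH hive_catalog;
SHOW DATABASES;
```

Whether the data source points at the internal catalog or at an external catalog via the catalog.db form, the same statements work, since the SQL console is fully supported (see Function support below).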
## Function support

* Fully supported
  * Visual viewing class
    * Databases
    * Tables
    * Views
    * Users
    * Administer
    * Session Manager
    * System Info
    * Session Variables
    * Global Variables
    * Engines
    * Charsets
    * User Privileges
    * Plugin
  * Operation class
    * SQL editor
    * SQL console
* Basic support

  The basic support part means that you can click to view without errors, but due to protocol compatibility issues, the display may be incomplete.

  * Visual viewing class
    * Dashboard
    * Users/user/properties
    * Session Status
    * Global Status
* Not supported

  The unsupported part means that when using DBeaver to manage Doris, some visual operations may report errors or have not been verified, such as visual creation of database tables, schema changes, and insertion, deletion, or modification of data.

On This Page * Introduction * Preconditions * Add data source * Function support

--- # Source: https://docs.velodb.io/cloud/4.x/management-guide/backup Version: 4.x On this page

# Backup and Restore

VeloDB Cloud supports backing up databases to object storage either periodically or as a one-time operation, and allows users to quickly restore data from a specified backup set, comprehensively ensuring high availability of data.

## Backup

### Create a Backup Plan

Click **Backup** in the left navigation bar, then click **Create Backup Plan** on the Backup page. You can choose between periodic and one-time backups as needed. Periodic and one-time backups are mutually exclusive: updating the backup plan overwrites the original plan.

If periodic backup is selected, you need to choose whether to enable it, the backup execution cycle, start time, backup objects, retention days, and the cluster used for backup, and then save the settings for them to take effect. ![backup-periodic](/assets/images/backup-periodic-8f7b6b62f6d4df40f129b3e8cc2efcea.png)

If you choose one-time backup, you need to select the start time, backup objects, retention days, and the cluster used for backup, and likewise save the settings for them to take effect. ![backup-one-time](/assets/images/backup-one-time-24d271d69de3ec5713279a4c0861636b.png)

| Parameter | Description |
|---|---|
| Backup Every | Multiple selections are allowed from Monday to Sunday, with at least one day and at most seven days. |
| Start Time | The start time of the backup task. |
| Backup Objects | Internal catalog: databases. External catalog: only DDL is backed up, not data. |
| Backup Retention Days | Set the retention days for backup sets; backup sets exceeding the retention days will be cleared. |
| Backup Cluster | The backup process consumes computing resources. When there are multiple clusters, you must specify the cluster to be used for backup operations. |

### View Backup Tasks

VeloDB Cloud will automatically execute backup tasks according to the plan you set. View all backup tasks in the **Backup Tasks** list, including backup status, retention days, data size, and backup start and completion times. ![backup-list](/assets/images/backup-list-bfc1b861f701aea82ace15bce4f5e154.png)

Click **View Details** in the operation column to obtain detailed information about backup task execution. ![backup-detail](/assets/images/backup-detail-f6e8d9392855dcb23fe4c9843ed36961.png)

## Restore

In the list of backup tasks, select the row of the target backup set, click **Restore** in the operation column, specify the target warehouse and cluster for the restore task, and then restore the backup. ![backup-restore](/assets/images/backup-restore-10cb20e55017073d000fcb77c506b5ad.png)

Restore tasks will be displayed in the **Restore Tasks** list, where you can view detailed information such as task status, data size, and start and completion times.
![restore-list](/assets/images/restore-list-bb7cfccacba2dcb121071b3ffcd69f85.png)

Click **View Details** in the operation column to obtain detailed information about restore task execution. ![restore-detail](/assets/images/restore-detail-96e6a7d2eb901ef4c7915b70dbccb7b1.png)

On This Page * Backup * Create a Backup Plan * View Backup Tasks * Restore

--- # Source: https://docs.velodb.io/cloud/4.x/management-guide/cluster-management Version: 4.x On this page

# Cluster Management

In each paid warehouse, you can create multiple clusters to support different workloads, such as data writing, customer-facing reporting, user profiling, and behavior analytics. A cluster contains only compute resources, cache resources, and cached data. All clusters in the warehouse share the stored data.

## New Cluster

To create a new cluster in a paid warehouse, click **Clusters** on the navigation bar. If a cluster already exists, you will see the **Cluster Overview** page. ![cluster list existing](/assets/images/cluster-list-existing-2d9d00a1d4eeb41bf33100aba6983194.png)

Click **New Cluster** on the wizard page or the **Cluster Overview** page to create a new cluster. ![create_cluster](/assets/images/create-cluster-fdda47a3ec19565f7dab70360bf89667.png)

| Parameter | Description |
|---|---|
| Cluster Name | Required. Must start with a letter and be at most 32 characters; letters (case insensitive), numbers, and underscores (_) are allowed. |
| Compute | Default is a minimum of 4 vCPU and a maximum of 1024 vCPU per cluster; if you need a higher quota, please [get help](mailto:support@velodb.io) to apply. Currently, the ratio of vCPU to memory is fixed at 1:8. |
| Cache | The upper and lower limits of the cache space vary depending on the compute size. |
| Storage | Pay as you go; there is no need to preset storage space. All clusters in the warehouse share the stored data. |
| Billing Method | Default is **On-Demand (Hourly)** billing, suitable for scenarios that need to be changed or deleted flexibly at any time, such as temporary test verification. |
| Auto Pause/Resume | When enabled, the compute cluster will automatically pause after a period of inactivity and automatically resume upon a new query request. |

Creating a new cluster will incur a charge. Therefore, before creation, please ensure a sufficient available balance or enable the cloud marketplace deduction channel; otherwise, you will see the following error prompt. ![insufficient cash-balance](/assets/images/insufficient-cash-balance-a173fc56f41649ad5bc6f813c36eb480.png)

> **Note**
>
> * After confirming the creation, you can see the new cluster on the **Cluster Overview** page. It takes about 3 minutes to complete the creation, and the cluster status will change from "**Creating**" to "**Running**".
> * The SaaS model free trial clusters do not support new cluster creation.

## Reboot Cluster

In certain situations (such as cluster exceptions or modification of certain parameters), you may need to reboot the cluster. On the **Cluster Overview** page, find the target cluster card, click the **Reboot** operation, and confirm again. The cluster status will change to "**Rebooting**", and no other operations can be performed on the cluster in this status. ![cluster rebooting](/assets/images/cluster-rebooting-en-a729a4a2e55fc2603b10261f34c08b3f.png)

> **Note**
>
> * It takes about 3 minutes for the cluster to reboot. When it is done, the cluster status will change from "**Rebooting**" to "**Running**".
> * The rebooting of cluster may cause business requests to experience > crashes or delayed responses. > * During the cluster rebooting process, VeloDB Cloud will still meter and > charge the cluster. > ## Pause/Resume Cluster​ ### Manual Pause/Resume Cluster​ You may wish to save costs when the cluster is idle. On the **Cluster Overview** page, find the target cluster card. When the cluster status is "**Running** " and it is confirmed that the cluster is unloaded, you can manually pause the cluster, click **Pause** operation and confirm again. The cluster status will be changed to "**Pausing** ", and no other operations can be performed on the cluster at this time. VeloDB Cloud will release the computing resource of the cluster while retaining the cache space and its data. ![cluster pausing](/assets/images/cluster-pausing- en-3f7440c003695f4f90659947221cb39c.png) ![cluster paused](/assets/images/cluster-paused- en-27e7474bb39d489dc1ed92b5fcc11ed2.png) > **Note** > > * It takes about 3 minutes for the cluster to pause. When it is done, the > cluster status will be changed from "**Pausing** " to "**Paused** ". > * The cluster will not respond to business requests during the pause > period. > * During the cluster suspension period, VeloDB Cloud will no longer meter > and charge for computing resource, but will still meter and charge for cache > space. > * Clusters containing monthly/yearly billing resources do not support the > pause/resume function. > When you need the cluster to continue responding to business requests, you can manually resume the "**paused** " cluster. On the **Cluster Overview** page, find the target cluster card, click **Resume** operation and confirm again. The cluster status will be changed to "**Resuming** ", and no other operations can be performed on the cluster at this status. VeloDB Cloud will pull up computing resource and mount the reserved cache space and its data. ![cluster resuming](/assets/images/cluster-resuming- en-d84b09981edbf0147fa70031b091a7b2.png) ![cluster running](/assets/images/cluster-running- en-93a53c5f3f329474bf025b4f3e87e67f.png) > **Note** > > * It takes about 3 minutes for the cluster to resume. When it is done, the > cluster status will be changed from "**Resuming** " to "**Running** ". > * The cluster will not respond to business requests during the resuming > process. > * After the cluster is resumed, it can respond to business requests, and > VeloDB Cloud will restore metering and billing for the pulled up computing > resource. > * Clusters containing monthly/yearly billing resources do not support the > pause/resume function. > ### Auto Pause/Resume Cluster​ If you want to automatically start and stop idle clusters, you can click **Set Auto Start/Stop** to the right of **Started On** or in the upper right corner on the **Cluster Details** page, and turn on the **Auto Start/Stop** switch to customize the idle duration of the shutdown trigger condition. ![cluster-auto-stop-start_en](/assets/images/cluster-auto-stop-start- en-3098e717998861a4e7853fa534564964.png) ## Cluster Details​ Before performing any operation on a cluster, you may need to first know the detailed information of the cluster. On the **Cluster Overview** page, find the target cluster card, and if the cluster status supports, click on the cluster card to enter the **Cluster Details** page. 
![cluster-detail-CPU-arch-en](/assets/images/cluster-detail-CPU-arch- en-d0ab3caebc2428a5d157daa391c05263.png) The **Cluster Details** page includes two main content areas: basic information and on-demand billing resources, as well as corresponding functional operations. The specific explanation is as follows: **Basic Information** : **Parameter** | **Description**| Cluster ID| The globally unique ID of the cluster. Start with "c-", followed by 18 characters, randomly combined with 26 lowercase letters and 10 numbers.| Cluster Name| It is unique in a warehouse, and supporting one click copying and locally renaming. If you need to modify the cluster name, click the edit icon, enter the new cluster name in the input box that appears (it is recommended that the name should indicate the meaning), click the confirm icon and confirm again. **Note** \- The VeloDB Core syntax will use the cluster name, for example: `USE { [catalog_name.]database_name[@cluster_name] }` \- The cluster name must start with a letter, up to 32 characters, you can use letters (case insensitive), numbers and _. \- After modifying the cluster name, it is necessary to ensure that the business uses the new cluster name or sets the default cluster for the relevant database users, otherwise it will cause the relevant requests to fail.| Created By| The user who created the cluster. Multiple users in the same organization can perform corresponding operations on warehouses and their clusters according to their privileges.| Created At| The time when the cluster was created.| Started At| The time when the cluster was last rebooted or resumed.| Running Time| The running time of the cluster since it was last rebooted or resumed.| Zone| The availability zone where the cluster is located.| CPU Architecture| The CPU architecture of cluster computing resource. **Note** \- Currently, only VeloDB Cloud warehouses deployed on AWS can see the CPU architecture of the cluster, which may be x86 or ARM. \- Core version 4.0.4 or above is required to create an ARM architecture cluster. If the core version is too low, please upgrade the core version. \- On the same specifications, ARM architecture has a performance improvement of over 30% compared to x86 architecture. \- In the SaaS model, the pricing of cluster computing resources for ARM architecture and x86 architecture is consistent in the same cloud platform and region. In the BYOC model, the pricing of computing resources for different CPU architectures may vary within the same cloud platform and region, depending on the cloud provider. \- It cannot be modified after the cluster is created.| **On-Demand Resources** :| ---|--- **Parameter** | **Description**| Compute| Displays the current compute resource of the cluster.| Cache| Displays the current cache space of the cluster.| Scale Out/In| If the performance of the current cluster does not meet the business requirements, you can increase or decrease compute resource or cache space to adjust the capacity of the current cluster by clicking **Scale Out/In**. ---|--- ## Scale Cluster​ ### Manual Scaling​ Based on your business requirements, you can click **Scale Out/In** in the upper right corner on the **On-Demand Resources** content area of the **Cluster Details** page, and select **Manual Scaling** to adjust the capacity of the current cluster. 
![cluster scaling manual en](/assets/images/cluster-scaling-manual- en-8d7009c8374a257ad01053374d906e42.png) > **Note** > > * After confirming the scaling, you can see the cluster status be changed > from "**Running** " to "**Scaling** " on the **Cluster Overview** page. It > takes about 3 minutes to complete the scaling, and the cluster status will > be changed from "**Scaling** " to "**Running** ". > * The SaaS free trial clusters do not support scaling. > ### Time-based Scaling​ If the cluster needs to deal with periodic business peaks and lows, you can click **Scale Out/In** in the upper right corner on the **On-Demand Resources** content area of the **Cluster Details** page, and select **Time- based scaling** , customize and add at least two different target vCPU time- based rules, and enable time-based scaling policy. ![cluster scaling time based en](/assets/images/cluster-scaling-time-based- en-3f994662caceeb78e1b5e67b94a8483c.png) > **Note** > > * The SaaS free trial clusters do not support scaling. > * The on-demand billing cluster does not support configuring a time-based > rule with a target vCPU of 0. > * The time-based rule is valid and executed when the cluster is running > normally. When the cluster is not running normally (such as pausing, > rebooting, upgrading, etc.), it will wait for a retry, and will not be > executed after more than 30 minutes. > * If the current organization does not have sufficient available amount or > open and enable the cloud marketplace deduction channel, the time-based rule > will be considered invalid and abandoned by VeloDB Cloud. > * The execution period of the time-based rule defaults to every day and > does not currently support modification. > * There should be at least an hour interval between the time-based rules, > so a maximum of 23 time-based rules can be configured. > * The execution time of the time-based rule cannot be repeated with > existing time-based rules. > * Scaling cluster may cause some requests to experience crashes or delayed > responses. > * When scaling in, the cache space will automatically scale in > proportionally with the computing resource (vCPU), and cache data that > exceeds the target cache space will be eliminated. The response time of some > requests may experience significant delays. > ## Delete Cluster​ If the business no longer requires the current cluster, you can delete it. In the upper right corner of the **Cluster Details** page, click **Delete Cluster** operation and confirm again. ![delete cluster en](/assets/images/delete-cluster- en-229056ed54eb5f80def638c01800e4da.png) > **Note** > > * Deleting the SaaS model free trial cluster will also delete the free > trial warehouse, storage resources, and their data. > * Clusters containing monthly/yearly billing resources do not support > early deletion. You need to wait until the cluster expires and be converted > to on-demand billing by default. If you want monthly billing resources to > expire and be converted to on-demand billing as soon as possible, you need > to confirm that the auto renew function is not enabled, otherwise the > cluster may not expire. > * All resources and cached data of the cluster will be deleted by VeloDB > Cloud, and you need to adjust the business accessing the cluster in a timely > manner, otherwise related business requests will fail. 
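Several of the notes above point out that after a cluster is renamed or deleted, requests that still reference the old cluster name will fail. As an illustration only, the sketch below uses the `USE { [catalog_name.]database_name[@cluster_name] }` syntax quoted in the Cluster Name description; `demo_db`, `reporting_cluster`, and `reporting_cluster_v2` are placeholder names, not objects defined by this document:

```sql
-- Route the current session's queries to a specific cluster by name.
USE demo_db@reporting_cluster;

-- If reporting_cluster is later renamed (for example to reporting_cluster_v2)
-- or deleted, the statement above starts to fail; update the reference in the
-- application, or set a default cluster for the relevant database users.
USE demo_db@reporting_cluster_v2;
```

Setting a default cluster for the relevant database users, as mentioned in the Cluster Name note, avoids hard-coding the cluster name in every session.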
> ## Multi-Availability Zone Disaster Recovery​ The virtual cluster provides high availability and disaster recovery capabilities across Availability Zones by establishing an active-standby cluster architecture. In the event of a failure in the primary Availability Zone, the system automatically triggers a failover to ensure business continuity. Leveraging a real-time data synchronization mechanism, it effectively prevents service interruptions and data loss, thereby guaranteeing high availability for your business. ![virtual cluster intro](/images/cloud/virtual-cluster-intro-en.png) Before creating a high-availability virtual cluster, two physical clusters must be prepared. They must be in the Running state and located in different Availability Zones. ![virtual cluster physical](/images/cloud/virtual-cluster-physical.png) On the Virtual Cluster page, click **New Virtual Cluster** to navigate to the cluster configuration page. ![virtual cluster create](/images/cloud/virtual-cluster-create.png) ![virtual cluster new intro](/images/cloud/virtual-cluster-new.png) **Parameter** | **Description**| Virtual Cluster Name| The cluster name must start with a letter, up to 32 characters, you can use letters (case insensitive), numbers and _.| Active Cluster| The cluster that is actively serving traffic.| Standby Cluster| The disaster recovery cluster that becomes active upon failover. Note: Identical specifications are recommended. ---|--- After the virtual cluster is successfully created, you can click on its card on the overview page to navigate to the details page. There, you can modify the active/standby cluster configuration or delete the virtual cluster. ![virtual cluster detail](/images/cloud/virtual-cluster-detail.png) On This Page * New Cluster * Reboot Cluster * Pause/Resume Cluster * Manual Pause/Resume Cluster * Auto Pause/Resume Cluster * Cluster Details * Scale Cluster * Manual Scaling * Time-based Scaling * Delete Cluster * Multi-Availability Zone Disaster Recovery --- # Source: https://docs.velodb.io/cloud/4.x/management-guide/connections Version: 4.x On this page # Connections ## Private Link​ Private Link can help you securely and stably access services deployed in other VPCs through a private network in VPC environments, greatly simplifying network architecture and avoiding security risks associated with accessing services through the public network. The VeloDB Cloud warehouse is created and run in the VeloDB VPC, and application systems or clients within the user's VPC can access the VeloDB Cloud warehouse across VPC via Private Link. Private Link includes two parts: endpoint service and endpoint. When the user needs to access VeloDB in their own private network, VeloDB Cloud will create and manage the endpoint service, and the user creates and manages the endpoint. When the user needs to use VeloDB to access their own private network, they need to create an endpoint service and register it in VeloDB Cloud. Subsequently, VeloDB Cloud will create an endpoint to connect to the user's endpoint service. ### Access VeloDB from Your VPC​ ![Access VeloDB from Your VPC](/assets/images/AccessVeloDBfromYourVPC-9044402274dba781c781989f2e1cd2c9.gif) Creating a connection to allow your data applications, such as reporting, profiling, and log analytics, within your private network to access the VeloDB Cloud warehouse. > **Note** There is no additional fee on the VeloDB Cloud service side, but > users need to pay the cloud platform for endpoint instances and traffic > fees. #### AWS​ 1. 
Switch to the target warehouse, click **Connections** on the navigation bar, and click **Set up Connection** to **Connect Your VPC to VeloDB** on the **Private Link** tab to create an endpoint. ![private link ad](/assets/images/private-link- ad-5308ac3208b4d3765739491fedc22f9b.png) 2. The page displays the Endpoint Service information required for creating an endpoint. You can click **Set up one or more endpoints** to go to the cloud platform's Private Link product console and create an endpoint. ![private link add endpoint](/assets/images/private-link-add- endpoint-b0011c0707c63e95567bdec58229fa99.png) 3. On the cloud platform's Private Link product console, you need to confirm that the current region is the same as the warehouse's endpoint service (limited by the cloud platform's Private Link product) and click **Create endpoint**. ![private link create endpoint on aws](/assets/images/private-link-create- endpoint-on-aws-19e9b1a47a9681557c82ebfdd14d13ec.png) > **Note** You need to sign in to AWS with the principal that has been allowed > to access the endpoint service of VeloDB Cloud, so that you can successfully > pass the service name verification when creating the endpoint. 4. Follow the wizard prompts to fill in the form as follows: ![private link create endpoint on aws01](/assets/images/private-link-create- endpoint-on-aws01-78c945819bf263a4482a9ee8f5814889.png) ![private link create endpoint on aws02](/assets/images/private-link-create- endpoint-on-aws02-1f368bf72e0c3aa9940179e50ce53d3f.png) **Parameter**| **Description**| Name tag| Optional. Creates a tag with a key of 'Name' and a value that you specify.| Service category| Required. Select the service category. The endpoint service of the VeloDB Cloud warehouse belongs to **Endpoint services that use NLBs and GWLBs** , so click to select it.| Service name| Required. One-click shortcut to copy the Service Name of the endpoint service of VeloDB Cloud warehouse in the page that displays the Endpoint Service information required for creating an endpoint, fill in the input box and click **Verify service** .| VPC| Required. Select the VPC in which to create your endpoint.| Subnets| Required. Select the same Availability Zone as the one where the endpoint service of the VeloDB Cloud warehouse is located (limited by the cloud vendor's Private Link product), and then select an appropriate subnet ID under it.| Security groups| Required. Select a preset security group. Note that the security rules should allow the protocol and port used by the VeloDB Cloud warehouse, as well as the IP address of the source where the application/client connects to the VeloDB Cloud warehouse.| Tags| Optional. You can add tags associated with the resource. ---|--- 5. After the endpoint is created, its status will be changed from " **Pending** " to " **Available** ", indicating that the endpoint has successfully connected with the warehouse's endpoint service. ![private link create endpoint on aws pending](/assets/images/private-link- create-endpoint-on-aws-pending-91f5a81fe597c6d7bd66cb05f97fbec7.png) 6. After refreshing the **Connections** page of the VeloDB Cloud warehouse, the endpoint list will display the connection information of the endpoint. 
![private link endpoint list table](/assets/images/private-link-endpoint-list- table-cafe9e613032d86472fd091b081efeea.png) ![private link endpoint on aws details](/assets/images/private-link-endpoint- on-aws-details-d7095c7a0337d3fff859ae69af87bc4b.png) > **Note** You need to click **Find DNS Name** to open the **Endpoint > Details** page of AWS Private Link product console, find the **DNS Name** of > the endpoint and use it to access the VeloDB Cloud warehouse. 7. The application/client can access the VeloDB Cloud warehouse through the DNS name of the endpoint by MySQL protocol or HTTP protocol. For the specific connection method, refer to the pop-up bubble for **Connection Examples** . ![private link connection example](/assets/images/private-link-connection- example-435e64f083d0f407a65a11598e96d135.png) > **Note** > > * VeloDB Cloud includes two independent account systems: One is used to > connect to the warehouse, as described in this topic. The other one is used > to log into the console, which is described in the [Registration and > Login](/cloud/4.x/management-guide/user-and-organization) topic. > > * For first-time connection, please use the admin username and its > password which can be initialized or reset on the **Settings** page. > > #### Azure​ 1. Switch to the target warehouse, click **Connections** on the navigation bar, and click **New Connection** to **Access VeloDB from Your VPC** on the **Private Link** tab to create an endpoint. Firstly, you need to approve a subscription to access the endpoint service of VeloDB Cloud warehouse. ![azure private link access velodb 1 1](/assets/images/azure-private-link- access-velodb-1-1-f9cfba82e9cad6bfb73e2f60cb7718fa.png) ![azure private link access velodb 1 2](/assets/images/azure-private-link- access-velodb-1-2-5ea71ca83159813a2649c0f22d6666ac.png) 2. After approving a subscription to access the endpoint service, the page displays the Endpoint Service information required for creating an endpoint. You can click **Go to Create** to go to the cloud platform's Private Link product console and create an endpoint. ![azure private link access velodb 2](/assets/images/azure-private-link- access-velodb-2-9dc52b4a46250ba905eefdf868b7fe0c.png) 3. In the **Basics** tab of the **Create a private endpoint** page on the cloud platform's Private Link product console, you need to confirm that the current region is the same as the endpoint service of VeloDB Cloud warehouse (limited by the cloud platform's Private Link product). Follow the wizard prompts to fill in the form as follows and click **Next: Resource**. ![azure private link access velodb 3](/assets/images/azure-private-link- access-velodb-3-453e657d9a3433175aa84303de1b14b2.png) Parameter| Category| Description| Subscription| Project details| Required. Select the subscription to access the endpoint service of VeloDB Cloud warehouse. All resources in an Azure subscription are billed together.| Resource group| Project details| Required. Select a resource group for the private endpoint to be created in it. If there is no suitable one, you can create a new one. A resource group is a collection of resources that share the same lifecycle, permissions, and policies.| Name| Instance details| Required. The instance name of the private endpoint to be created. You can customize it.| Network Interface Name| Instance details| Required. The network interface name of the private endpoint to be created. 
When you enter the instance name, it will be automatically generated and you can modify it.| Region| Instance details| "Required. Select the region for the private endpoint to be created in it. Note: You need to select the region is the same as the endpoint service of VeloDB Cloud warehouse (limited by the cloud platform's Private Link product)." ---|---|--- 4. In the **Resource** tab of the **Create a private endpoint** page, choose the connection method **Connect to an Azure resource** with a resource ID or alias and fill in the form as follows and click **Next: Virtual Network**. ![azure private link access velodb 4](/assets/images/azure-private-link- access-velodb-4-2d33d206e95aea48e81f401aadf9608d.png) Parameter| Description| Resource ID or alias| Required. When connecting to someone else's resource, they must provide you with the resource ID or alias for that resource in order for you to initiate a connection request. In the current scene, you can one-click shortcut to copy the **Service Alias** value of the endpoint service of VeloDB Cloud warehouse in the page that displays the Endpoint Service information required for creating an endpoint, then fill in the input box.| Request message| Optional. This message will be sent to the resource owner (This refers to VeloDB Cloud.) to assist them in the connection management process. Don't include private or sensitive information. ---|--- 5. In the **Virtual Network** tab of the **Create a private endpoint** page, Select the virtual network and subnet for the private endpoint to be created in it. Follow the wizard prompts to fill in the form as follows and click **Next: DNS**. ![azure private link access velodb 5](/assets/images/azure-private-link- access-velodb-5-b8d8099780eee1eef9c733ede83bf7bc.png) Parameter| Category| Description| Virtual network| Networking| Required. Only virtual networks in the currently selected subscription and location are listed. Select the virtual network for the private endpoint to be created in it. If there is no suitable one, you can create a new one on the cloud platform's Virtual network product console.| Subnet| Networking| Required. Only subnets in the currently selected virtual network are listed. Select a subnet for the private endpoint to be created in it. If there is no suitable one, you can create a new one on the cloud platform's Virtual network product console.| Network policy for private endpoints| Networking| Optional. The network policy for the private endpoint to be created. The default is disabled, you can edit it.| Private IP configuration| Private IP configuration| Optional. You can choose Dynamically allocate IP address or Statically allocate IP address. According to the virtual network and subnet configured above, Dynamically allocate IP address is selected by default.| Application security group| Application security group| Optional. Select the application security group for the private endpoint to be created. If there is no suitable one, you can create a new one. ---|---|--- 6. In the **DNS** tab of the **Create a private endpoint** page, Keep the default settings and click **Next: Tags**. Note: To connect privately with your private endpoint, you need a DNS record. You need to configure the resource configuration to support Private DNS. ![azure private link access velodb 6](/assets/images/azure-private-link- access-velodb-6-e58196b663a472d9a3835b3c3490ceed.png) 7. In the **Tags** tab of the **Create a private endpoint** page. 
, Keep the default settings and click **Next: Review + create**. Note: If you want to categorize the private endpoint and view consolidated billing, you can configure the tag for the private endpoint to be created. ![azure private link access velodb 7](/assets/images/azure-private-link- access-velodb-7-8c4834cd96d68943e96c3a9cd5486ee0.png) 8. In the **Review + create** tab of the **Create a private endpoint** page, you can review the settings for the private endpoint to be created. If some settings are not as expected, you can click **Previous** back to modify. If there is no problem, you can click **Create**. ![azure private link access velodb 8](/assets/images/azure-private-link- access-velodb-8-06865e9b6e047808b718159f3cc32b84.png) 9. After the endpoint is created, its status will be changed from "**Created** " to "**OK** ", indicating that the endpoint has successfully connected with the endpoint service of VeloDB Cloud warehouse. ![azure private link access velodb 9 1](/assets/images/azure-private-link- access-velodb-9-1-34cf0662dbf97d64b3b4aa971b9170d1.png) ![azure private link access velodb 9 2](/assets/images/azure-private-link- access-velodb-9-2-b35003bfc5a86366dbaecdb10ee5de2e.png) 10. After refreshing the **Connections** page of the VeloDB Cloud warehouse, the endpoint list will display the connection information of the endpoint. ![azure private link access velodb 10](/assets/images/azure-private-link- access-velodb-10-004b6563bf76ab9a0659f48692b262be.png) 11. The application/client can access the VeloDB Cloud warehouse through the IP or DNS name of the endpoint by MySQL protocol or HTTP protocol. You can click **Find DNS Name** in the endpoint list to open the details page of the endpoint to find the IP or DNS name of it. ![azure private link access velodb 11](/assets/images/azure-private-link- access-velodb-11-13de08c12429e2a133438a5b0f2c1451.png) 12. For the specific connection method, you can hover the pop-up bubble for **Connection Examples** in the **Connections** page of the VeloDB Cloud warehouse. ![azure private link access velodb 12](/assets/images/azure-private-link- access-velodb-12-a383dae143082e8b495f49c31438139f.png) ### VeloDB Accesses Your VPC​ ![VeloDB Accesses Your VPC](/assets/images/VeloDBAccessesYourVPC-3f95f331da978ca2ce8b056cd7a7e33c.gif) > **Note** The endpoint instance and traffic fees generated by VeloDB's access > to the private network are currently not charged to users. #### AWS​ 1. Switch to the target warehouse, click **Connections** on the navigation bar, and click **New Connection** for **VeloDB Accesses Your VPC** on the **Private Link** tab to create a connection to your endpoint service. ![private link create connection choose endpoint service](/assets/images/private-link-create-connection-choose-endpoint- service-c9034deb178d98da02007ea71cd3b62d.png) ![private link create connection choose endpoint service register](/assets/images/private-link-create-connection-choose-endpoint- service-register-8c009c34d314872005d18bdba7ff5316.png) 2. After clicking **\+ Endpoint Service** , the pages will display the **Current Region** of the warehouse and the **ARN of VeloDB**. You can click **Go to Create** to go to the cloud platform's Private Link product console and create an endpoint service. 3. Sign in to the AWS Console, select VPC-Endpoint services and switch to the same region as the current warehouse. 4. Click **Create endpoint service**. 
![private link create endpoint service on aws](/assets/images/private-link- create-endpoint-service-on-aws-8753ff6a1f9f49f518910ba2798f16fb.png) 5. On the Endpoint Service configuration page, configure the relevant parameters and click **Create**. ![private link create connection choose endpoint service create](/assets/images/private-link-create-connection-choose-endpoint-service- create-ec1bb1494d197a06bb0c780ab7c814e7.png) ![private link create connection choose endpoint service create 1](/assets/images/private-link-create-connection-choose-endpoint-service- create-1-c42467af40a719bbea92b6e54d5ae9e4.png) 6. (Optional) If there is no available network load balancer, you need to click **Create Network Load Balancer** first. After the creation is completed, click the filter button to make a selection. ![private link create connection create nlb 0](/assets/images/private-link- create-connection-create-nlb-0-0bb75f2f55591199c4e13d5a8682f5f9.png) ![private link create connection create nlb 1](/assets/images/private-link- create-connection-create-nlb-1-35368125db89a47f7973f2c7ac6f6ecc.png) ![private link create connection create nlb 2](/assets/images/private-link- create-connection-create-nlb-2-7437087c61b5380d6fc23590440923d8.png) ![private link create connection create nlb 3](/assets/images/private-link- create-connection-create-nlb-3-9098c52fbe35e426110157f924b232e3.png) 7. (Optional) If there is no available target group, you need to click **Create Target Group** first. After the creation is completed, click the refresh button on the right to make a selection. ![private link create connection create tg 0](/assets/images/private-link- create-connection-create-tg-0-05dc539f4a6a2f4c6946abb02012847b.png) ![private link create connection create tg 1](/assets/images/private-link- create-connection-create-tg-1-4fa21cc341f1f0388cc082696c548016.png) 8. After creating the endpoint service, add the **ARN of VeloDB** in the **Allow principals** Tab of the endpoint service. ![private link create connection choose endpoint service details](/assets/images/private-link-create-connection-choose-endpoint- service-details-7605dc4ea57d09a24f3f8e7790062788.png) ![private link create connection choose endpoint service allow principals](/assets/images/private-link-create-connection-choose-endpoint- service-allow-principals-2b0d3f5d4378526b1499cd6b0649fd1f.png) 9. Copy the **Service ID** and **Service Name** from the **Endpoint Service Details** page, and fill them in the Endpoint Service registration page of VeloDB Cloud. ![private link create connection choose endpoint service details02](/assets/images/private-link-create-connection-choose-endpoint- service-details02-9989268d92104bf402a03784c24a04f2.png) 10. After the registration is complete, go to the next step, specify the **Endpoint Name** of VeloDB Cloud warehouse, and click **Create Now**. ![private link create connection choose endpoint service chosen](/assets/images/private-link-create-connection-choose-endpoint-service- chosen-304b58524e42096a965ee5d638eca32f.png) ![private link velodb acdess user vpc new connection create endpoint](/assets/images/private-link-velodb-acdess-user-vpc-new-connection- create-endpoint-b94352e6b17f835d3a47e25eac051288.png) 11. ​Accept endpoint connection request​ in the **Endpoint connections** Tab of the endpoint service. 
![private link velodb acdess user vpc endpoint accept](/assets/images/private- link-velodb-acdess-user-vpc-endpoint- accept-f8530ef4ede58dc29fdae67fe82248fb.png) ![private link velodb acdess user vpc endpoint accept ok](/assets/images/private-link-velodb-acdess-user-vpc-endpoint-accept- ok-743e691ee7c593765a5de75736b1e332.png) 12. Refresh the page and wait for the status of the endpoint of VeloDB Cloud warehouse to be changed from "pendingAcceptance" to "available", which means the connection is successful. ![private link velodb acdess user vpc endpoint pendingacceptance](/assets/images/private-link-velodb-acdess-user-vpc- endpoint-pendingacceptance-e5d64c866b6c2063f73e6f2c431b94a7.png) ![private link velodb acdess user vpc endpoint available](/assets/images/private-link-velodb-acdess-user-vpc-endpoint- available-9c00edb5993ec894c464cd91b3d107ae.png) #### Azure​ 1. Switch to the target warehouse, click **Connections** on the navigation bar, and click **New Connection** for **VeloDB Accesses Your VPC** on the **Private Link** tab to create a connection to your endpoint service. ![azure velodb access vpc 1](/assets/images/azure-velodb-access- vpc-1-2036a40678c0bdacff6e3a68745f2cb8.png) 2. After clicking **\+ Endpoint Service** , the page will display the **Current Region** of the warehouse and the **Subscription ID of VeloDB**. You can click **Go to Create** to go to the cloud platform's Private Link product console and create an endpoint service (This refers to Azure private link service). ![azure velodb access vpc 2](/assets/images/azure-velodb-access- vpc-2-15342d65dd4309de1b03fdebfe11b7b9.png) 3. Sign in to the **[Azure portal](https://portal.azure.com/)** with your Azure account. In the **Basics** tab of the **Create private link service** page on the Private Link product console, you need to confirm that the region is the same as the VeloDB Cloud warehouse (limited by the cloud platform's Private Link product). Follow the wizard prompts to fill in the form as follows and click **Next: Outbound settings**. ![azure velodb access vpc 3](/assets/images/azure-velodb-access- vpc-3-a8c0d23849f7e98e333691c25e07b35f.png) Parameter| Category| Description| Subscription| Project details| Required. Select the subscription to create the private link service for database or datalake. All resources in an Azure subscription are billed together.| Resource group| Project details| Required. Select a resource group for the private link service to be created in it. If there is no suitable one, you can create a new one. A resource group is a collection of resources that share the same lifecycle, permissions, and policies.| Name| Instance details| Required. The instance name of the private link service to be created. You can customize it.| Region| Instance details| Required. Select the Azure region for the private link service to be created and located in it.Note: You need to select the region is the same as the VeloDB Cloud warehouse (limited by the cloud platform's Private Link product). ---|---|--- 4. In the **Outbound settings** tab of the **Create private link service** page. Follow the wizard prompts to fill in the form as follows and click **Next: Access Security**. ![azure velodb access vpc 4](/assets/images/azure-velodb-access- vpc-4-94adfa01b8a961ee99f0be5436c1302b.png) Parameter| Description| Load balancer| Required. Select a load balancer behind the private link service to load balances database or datalake. 
If there is no suitable one, you can create a new one on the cloud platform's Load Balancer product console.| Load balancer frontend IP address| Required. Select frontend IP address of the load balancer you selected above.| Source NAT Virtual network| Required.| Source NAT subnet| Required.| Enable TCP proxy V2| Required. Leave the default of No. If your application expects a TCP proxy v2 header, select Yes.| Private IP address settings| Leave the default settings ---|--- 5. In the **Access Security** tab of the **Create private link service** page, you need to choose **Restricted by subscription** for whom can request access to the private link service, and add the **Subscription ID of VeloDB** into the access whitelist of the private link service and choose **Yes** for auto-approve. Then click **Next: Tags**. ![azure velodb access vpc 5](/assets/images/azure-velodb-access- vpc-5-12901ac888fbbb41ad1836dae2434ed6.png) 6. In the **Tags** tab of the **Create private link service** page, keep the default settings and click **Next: Review + create**. Note: If you want to categorize the private link service and view consolidated billing, you can configure the tag for the private link service to be created. ![azure velodb access vpc 6](/assets/images/azure-velodb-access- vpc-6-728b40c08e93abd61d084586db5f50a1.png) 7. In the **Review + create** tab of the **Create private link service** page, you can review the settings for the private link service to be created. If some settings are not as expected, you can click **Previous** back to modify. If there is no problem, you can click **Create**. ![azure velodb access vpc 7](/assets/images/azure-velodb-access- vpc-7-50fe955adb6152f225b4e524d61f9f9c.png) 8. After the private link service is created, its status will be changed from "**Created** " to "**OK** ", indicating that the private link service has ready to be connected by the private endpoint of VeloDB Cloud warehouse. ![azure velodb access vpc 8](/assets/images/azure-velodb-access- vpc-8-0ad8e3492cc64efb28e4bf0f30ae1d53.png) ![azure velodb access vpc 8 2](/assets/images/azure-velodb-access- vpc-8-2-301315bd076b2c8872a877baca232e2c.png) 9. After creating the private link service, copy the **Rescource ID** and **Alias** from the private link service **Details** page, and fill them in the Endpoint Service registration page of VeloDB Cloud. ![azure velodb access vpc 9 1](/assets/images/azure-velodb-access- vpc-9-1-ebb8740fc6e0aa9aad2fc08f6ab52e8e.png) ![azure velodb access vpc 9 2](/assets/images/azure-velodb-access- vpc-9-2-03179ffc1a27a7fb357749a5b538086c.png) ![azure velodb access vpc 9 3](/assets/images/azure-velodb-access- vpc-9-3-518ed394ddccc66f08dd885bb12b6948.png) 10. After the registration is complete, go to the next step, specify the **Endpoint Name** of VeloDB Cloud warehouse, and click **Create Now**. ![azure velodb access vpc 10 1](/assets/images/azure-velodb-access- vpc-10-1-a3716547db46d049cffffda5b08937c4.png) ![azure velodb access vpc 10 2](/assets/images/azure-velodb-access- vpc-10-2-3cca22bfc176e4513941fc05380615ba.png) 11. Refresh the page and wait for the status of the endpoint of VeloDB Cloud warehouse to be changed from "**pendingAcceptance** " to "**Approve** ", which means the connection is successful. 
![azure velodb access vpc 11 1](/assets/images/azure-velodb-access- vpc-11-1-18047ac80c56fafd9138eb61a5166f16.png) ![azure velodb access vpc 11 2](/assets/images/azure-velodb-access- vpc-11-2-9d0cbfd185f0dde8dec73ad0a4089679.png) ## Public Link​ On the **Connections** page, switch to the **Public Link** tab to manage the public network connection. ### Add IP Whitelist​ In order to access the VeloDB Cloud warehouse via the public network, you need to add the source public network IP address to the whitelist. Click **IP Whitelist Management** on the right of the **Connect Warehouse** card to add the source IP addresses or segments. ![public link](/assets/images/public- link-6b95eb428f3ce8d1b9bfcd1fc4084fb3.png) ![public link ip whitelist](/assets/images/public-link-ip- whitelist-40498cfa0943672f009fb66c31d213d5.png) In the IP whitelist, you can add or delete IP addresses to enable or disable their access to the warehouse. > **Note** By default, the IP segment 0.0.0.0/0 is set, which means the > warehouse is completely open to the public network. It is recommended to > remove it in time after use to reduce security risks. ### Access Warehouse​ After adding the source public network IP address to the whitelist, you can click **WebUI Login** to access the VeloDB Cloud warehouse through the public network. For the specific connection method, please refer to the **Other Methods**. ![public link connect warehouse methods](/assets/images/public-link-connect- warehouse-methods-a71e78dcefb8fb020be24c5670be342a.png) On This Page * Private Link * Access VeloDB from Your VPC * VeloDB Accesses Your VPC * Public Link * Add IP Whitelist * Access Warehouse --- # Source: https://docs.velodb.io/cloud/4.x/management-guide/console-overview Version: 4.x On this page # Overview VeloDB Cloud is a cloud-native data warehouse that runs on multiple clouds, providing a consistent user experience and fully managed service. It is extremely fast, cost-effective, single-unified, and easy to use. This topic gives a brief overview of the main features the VeloDB Cloud console includes and how to navigate it. Later topics provide detailed descriptions of the specific features. ## Main Features​ * **Registration and Login**. * **Warehouse Management** : Provides free trial, paid warehouse creation, warehouse list, etc. * **Cluster Management** : Provides one-click creation, elastic resizing, fast upgrade, deletion, etc. * **Connections** : Provides the connection methods of the warehouses in the private network (VPC) and the public network. The public network connection supports whitelists. * **Metrics** : Provides metrics in dimensions such as resource usage, query, and write and supports flexible and easy-to-use alarm capability. * **Billing Center** : Provides usage statistics for the internal parts of organizations and warehouses. Billing is based on usage statistics. * **Others** : Including organization management, access control, notification, etc. ## Navigate VeloDB Cloud​ The overall layout of VeloDB Cloud console web interface is as follows: ![multi clusters](/assets/images/multi- clusters-964fa4fc141b0b95e31804a1bd10a4db.png) ### Navigation Bar​ Located on the left side of the web interface, **Navigation Bar** provides the main features of VeloDB Cloud's most crucial concept, **Warehouse** , including cluster management, connections, query, metrics, usage statistics, etc. 
### Warehouse Selector​ Located at the top of the left navigation bar,**Warehouse Selector** displays all the warehouses under the current organization. You can switch warehouses, view warehouse info, create a new warehouse, etc. After switching to a warehouse, you can use it to experience all the features in the left navigation bar. ![warehouse-details](/assets/images/warehouse- details-d0bda65a7701f76590d4089c6372f2ff.png) ### User Menu​ Located at the bottom of the left navigation bar, **User Menu** provides some management features related to users and organizations, including security, notifications, users and roles, billings, etc. ![user-menu](/assets/images/user-menu-8c2dcc901b66c85f5a395eba482bdc5d.png) On This Page * Main Features * Navigate VeloDB Cloud * Navigation Bar * Warehouse Selector * User Menu --- # Source: https://docs.velodb.io/cloud/4.x/management-guide/monitoring-overview Version: 4.x On this page # Monitoring Overview VeloDB Cloud provides monitoring and alerting so that you can track the health and performance of your warehouse or clusters and make adjustments. You can find the **Metrics** feature on the navigation bar, and you can * View metrics by warehouse or cluster. * Use **Starred** to display the metrics of interest in warehouse or different clusters together. * View historical metric data by adjusting the time selector, and you can view metric data of the past 15 days. * Use the auto-refresh feature to update metrics in real-time (5s). The metrics you can use in VeloDB Cloud fall into two categories. * **Basic Metrics** \- Basic metrics data helps you monitor physical aspects of your cluster, such as CPU usage, memory usage, and network throughput. * **Service Metrics** \- Query performance data helps you monitor warehouse or cluster activity and performance, such as QPS, query success rates, and more. It helps to understand the specific workload of the cluster. ## Basic Metrics​ ![metrics basic](/assets/images/metrics- basic-7c3b0bfba3df1552318c9d15f109ad67.png) Basic metrics provide physical monitoring information of the cluster by "node" dimension. You can determine whether the cluster is abnormal within a specified time frame by using the cluster's basic metrics. You can also see if historical or current queries are impacting cluster performance. You can use the cluster basic metrics to diagnose the cause of slow queries and take possible measures such as scaling up or scaling down the cluster capacity, optimizing SQL statements, etc. We provide the following cluster base metrics. ### CPU Utilization​ Displays the CPU utilization percentage of all nodes. You can find the lowest cluster utilization time from this chart before planning to scale a cluster and other resource-consuming operations. ### Memory Usage​ Displays the memory usage of all nodes. If memory usage is consistently high, you should consider scaling up your cluster. ### Memory Utilization​ Displays the memory utilization of all nodes. If memory utilization is consistently high, you should consider scaling up your cluster. ### I/O Utilization​ Displays the utilization of hard disk I/O. If I/O utilization is always maintained at a high level, you may consider scaling out more nodes for better query performance. ### Network Outbound Throughput​ Displays the average outbound speed of nodes per second over the network in MB/s. Queries that read data over the network are slower, and you should set up the cache correctly to minimize network reads. 
### Network Inbound Throughput

Displays the average inbound speed of nodes per second over the network, in MB/s.

### Cache Read Throughput

Displays the read throughput per second over the cache, in MB/s.

### Cache Write Throughput

Displays the write throughput per second over the cache, in MB/s.

### Support Range of Basic Metrics

| Metrics | Warehouse | Cluster |
| --- | --- | --- |
| CPU Utilization | Supported | Supported |
| Memory Usage | Supported | Supported |
| Memory Utilization | Supported | Supported |
| I/O Utilization | Supported | Supported |
| Network Outbound Throughput | Supported | Supported |
| Network Inbound Throughput | Supported | Supported |
| Cache Read Throughput | Not supported | Supported |
| Cache Write Throughput | Not supported | Supported |

## Service Metrics

![metrics query](/assets/images/metrics-query-6bf72cd81bd0898211d63e4bc1b1506a.png)

### Query Per Second (QPS)

Displays the number of query requests per second. The compute resources a cluster requires can be determined based on your system's QPS during peak time.

### Query Success Rate

Displays the percentage of successful queries among all queries, updated every minute. When the query success rate drops abnormally, consider whether there is a cluster or node failure.

### Dead Nodes

Displays the number of dead nodes in the current cluster.

### Average Query Runtime

Displays the average query time, updated every minute. If the average query time rises abnormally, consider troubleshooting.

### Query 99th Latency

Displays the response time of the request that ranks at the 99th percentile in ascending order during a given time period, which reflects the speed of slow queries in the cluster.

### Cache Hit Rate

Displays the percentage of I/O operations that hit the cache among all I/O operations. If the cache hit rate is too low, consider changing the cache policy or scaling up the cache space.

### Remote Storage Read Throughput

Displays the amount of data read from remote storage per unit of time.

### Sessions

Displays the number of sessions for the current warehouse, without distinguishing between clusters.

### Load Rows Per Second

Measures the efficiency of data write operations, indicating the speed at which records are currently being successfully written to the database or other data storage systems.

### Load Bytes Per Second

Displays the current write task's rate in terms of data size.

### Finished Load Tasks

Displays the number of load tasks completed in the recent period. A sharp increase or decrease might indicate a business anomaly.

### Compaction Score

Indicates the merging pressure of data files. The greater the score, the greater the merging pressure.

### Transaction Latency

Indicates the transaction latency of the warehouse's write tasks. The smaller the latency, the sooner the data can be queried.

### Support Range of Service Metrics

| Metrics | Warehouse | Cluster |
| --- | --- | --- |
| Query Per Second | Supported | Supported |
| Query Success Rate | Supported | Supported |
| Dead Nodes | Not supported | Supported |
| Average Query Runtime | Supported | Supported |
| Query 99th Latency | Supported | Supported |
| Cache Hit Rate | Not supported | Supported |
| Remote Storage Read Throughput | Not supported | Supported |
| Sessions | Supported | Not supported |
| Load Rows Per Second | Supported | Supported |
| Load Bytes Per Second | Supported | Supported |
| Finished Load Tasks | Supported | Not supported |
| Compaction Score | Not supported | Supported |
| Transaction Latency | Supported | Not supported |

# Alert Overview

In addition to SMS alert notifications, VeloDB Cloud provides monitoring and alerting services at no additional charge.
You can configure alert rules to be notified when cluster monitoring metrics change. ![metrics alerts](/assets/images/metrics- alerts-b8e7954b9cef484d27b9db495d8bbce3.png) ## Alert Configuration​ ### View Alert Rules​ You can view existing alerting rules and their current alerting status on the list page. "Red dot" means the alert rule is in effect, and "green dot" indicates the current alert rule is not triggered. ### Enable One-Click Alert​ ![one click alert](/images/cloud/one-click-alert.png) You can click **Enable One-Click Alert** to quickly set up basic alert rules, which will be applied to both current and future warehouses or clusters. ### New/Edit Alert Rule​ ![metrics alerts new alert rule](/assets/images/metrics-alerts-new-alert- rule-3c16bb005779c80dedf559a79d083a62.png) You can create an alert rule by clicking **New Alert Rule** or copying an existing one. You can also modify a current alert rule. The alert rule configuration consists of four parts. #### Rule Name​ You can customize the rule name, which must be unique within the warehouse. #### Cluster​ You can specify the cluster for which the alert rule is in effect. When a cluster is deleted, its alert rules will not be deleted but invalidated. #### Conditions​ You can set one or more rules for metrics to be met and how these conditions are combined (and, or). #### In Last​ "In Last" means the duration of time to meet the conditions. You should set this time appropriately to balance between timeliness and accuracy of alerts. ### Channel​ You can set one or more notification channels, and the alert messages will be pushed through the channels you set respectively. #### In-site Notification​ Configuration method: Select user. #### Email​ Configuration method: Select user. #### SMS​ Configuration mode: Select user / fill in cell phone numbers. #### WeCom​ Configuration method: fill in the robot webhook. 1. On WeCom for PC, find the target WeCom group for receiving alarm notifications. 2. Right-click the WeCom group. In the window that appears, click **Add Group Bot** . 3. In the window that appears, click **Create a Bot** . 4. In the window that appears, enter a custom bot name and click **Add** . 5. Copy the webhook URL. ![alerts WeCom](/assets/images/alerts- WeCom-a9638f3b7508aab8a1eeee57a826c82c.png) > **NOTE** If you need to restrict message sources, please set up IP > whitelist. VeloDB Cloud server IP address is 3.222.235.198. #### Lark​ Configuration method: fill in the robot webhook. To make a custom bot instantly push messages from an external system to the group chat, you need to use a webhook to connect the group chat and your external system. Enter your target group and click **Settings** > **BOTs** > **Add Bot** . Select **Custom Bot** . Enter a suitable name and description for your bot and click **Next** . ![alerts Lark step1](/assets/images/alerts-Lark- step1-00c5edc8209cee3af755a46669dd14ab.png) You'll then get the webhook URL. ![alerts Lark step2](/assets/images/alerts-Lark- step2-9121c339b509a40df1f44ec74210b61a.png) > **NOTE** If you need to restrict message sources, please set up IP > whitelist. VeloDB Cloud server IP address is 3.222.235.198. #### DingTalk​ Configuration method: fill in the robot webhook. To get the DingTalk robot webhook, please see [here](https://www.alibabacloud.com/help/en/application-real-time-monitoring- service/latest/obtain-the-webhook-url-of-a-dingtalk-chatbot) 1. 
Run the DingTalk client on a PC, go to the DingTalk group to which you want to add a chatbot, and then click the Group Settings icon in the upper-right corner. 2. In the **Group Settings** panel, click **Group Assistant** . 3. In the **Group Assistant** panel, click **Add Robot** . 4. In the **ChatBot** dialog box, click the **+** icon in the **Add Robot** section. Then, click **Custom** . ![alerts DingTalk01](/assets/images/alerts- DingTalk01-0ac9cd9e01d3c2dd99ede66b65fad8e2.png) 5. In the **Robot details** dialog box, click **Add** . 6. In the **Add Robot** dialog box, perform the following steps: ![alerts DingTalk02](/assets/images/alerts- DingTalk02-609efafd7b13357928e2c5bfad519294.png) > **NOTE** If you need to restrict message sources, please set up IP > whitelist. VeloDB Cloud server IP address is 3.222.235.198. 7. Set a profile picture and a name for the chatbot. 8. Select **Custom Keywords** for the **Security Settings** parameter. Then, enter **alert** . 9. Read the terms of service and select **I have read and accepted _DingTalk Custom Robot Service Terms of Service_** . 10. Click **Finished** . 11. In the **Add Robot** dialog box, copy the webhook address of the DingTalk chatbot and click **Finished** . ![alerts DingTalk03](/assets/images/alerts- DingTalk03-a2418b2b06881287e53f9cdeb726d80b.png) ## View Alert History​ You can view the alert history and filter it. On This Page * Basic Metrics * CPU Utilization * Memory Usage * Memory Utilization * I/O Utilization * Network Outbound Throughput * Network Inbound Throughput * Cache Read Throughput * Cache Write Throughput * Support Range of Basic Metrics * Service Metrics * Query Per Second (QPS) * Query Success Rate * Dead Nodes * Average Query Runtime * Query 99th Latency * Cache Hit Rate * Remote Storage Read Throughput * Sessions * Load Rows Per Second * Load Bytes Per Second * Finished Load Tasks * Compaction Score * Transaction Latency * Support Range of Service Metrics * Alert Configuration * View Alert Rules * Enable One-Click Alert * New/Edit Alert Rule * Channel * View Alert History --- # Source: https://docs.velodb.io/cloud/4.x/management-guide/more/amazon-aws/create-data-credential Version: 4.x On this page # Create a Data Credential VeloDB adopts a storage-compute separation architecture, where data is typically stored in object storage. To ensure that the warehouse can access the underlying data properly, a Data Credential must be created in advance. The core of a Data Credential involves creating an IAM policy and an IAM role. VeloDB will automatically attach this role to the EC2 used by the VeloDB warehouse. Below are the detailed steps. ## Step 1: Create an S3 Bucket​ First, you need to prepare an S3 Bucket. If you already have one, you can skip this step and proceed to Step 2. > **NOTE** The S3 bucket you use must be located in the same AWS region where > your VeloDB warehouses are deployed. If you do not already have a bucket in > that region, please create one before proceeding. 1. Log in to the AWS S3 Console as a user with administrator privileges. 2. Click the **Create bucket** button. 3. On the **create bucket** page, set the following options: 1. Enter a name for the bucket. 2. Select the AWS region that you will use for your VeloDB warehouse deployment. 3. Enable Bucket Versioning (recommended). 4. Click **Create bucket**. 5. Copy the bucket name to add to VeloDB console. 
## Step 2: Create an IAM Policy​ After the S3 bucket is provisioned, create an IAM policy that grants read and write access to the bucket. 1. Log into the **[AWS IAM Console](https://console.aws.amazon.com/iam/)** as a user with administrator privileges. 2. Click the **Policies** tab in the sidebar. 3. Click the **Create policy** button. 4. In the policy editor, click the **JSON** tab. 5. Copy and paste the following access policy into the editor, replacing `` with the name of the S3 bucket you prepared in the previous step. { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Resource": "arn:aws:s3:::", "Action": [ "s3:GetBucketLocation", "s3:GetBucketVersioning", "s3:PutBucketCORS", "s3:ListBucket", "s3:ListBucketVersions", "s3:ListBucketMultipartUploads" ] }, { "Effect": "Allow", "Resource": "arn:aws:s3:::/*", "Action": [ "s3:GetObject", "s3:GetObjectVersion", "s3:PutObject", "s3:DeleteObject", "s3:DeleteObjectVersion", "s3:AbortMultipartUpload", "s3:ListMultipartUploadParts" ] }, { "Effect": "Allow", "Action": [ "sts:AssumeRole" ], "Resource": "*" } ] } 6. Click the **Next** button. 7. In the **Name** field, enter a policy name.(e.g.VeloDBDataStorageAccess) 8. Click **Create policy**. ## Step 3: Create a Service IAM Role​ 1. Click the **Roles** tab in the IAM console sidebar. 2. Click **Create role**. 1. Trusted entity type: Select **AWS** service. 2. Use cases: Select **EC2**. 3. Click the **Next** button. 4. Attach Permission Policies: In the policy search box, enter the name of the policy you created in Step 2. 5. In the **role name** field, enter a role name. (e.g. **VeloDBDataStorageAccessRole**) 6. Click **Create role**. 3. Update the Role's Trust Relationships. Now that you have created the role, you must update its trust policy to make it self-assuming. In the IAM role you just created, go to the Trust Relationships tab and edit the trust relationship policy as follows, replacing the`` and `` values. { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "ec2.amazonaws.com", "AWS": "arn:aws:iam:::role/" }, "Action": "sts:AssumeRole" } ] } 4. In the role summary, copy the **Instance Profile ARN** (format: arn:aws:iam::``:instance-profile/``) to add to VeloDB console. On This Page * Step 1: Create an S3 Bucket * Step 2: Create an IAM Policy * Step 3: Create a Service IAM Role --- # Source: https://docs.velodb.io/cloud/4.x/management-guide/studio Version: 4.x On this page # Studio VeloDB Cloud Studio ("Studio") is a data development platform for data development scenarios. It is a data development platform on the cloud provided by VeloDB, which can assist users in managing and exploring data, and can replace Navicat. ## Main Function​ * **Warehouse Login** : Use different database users to log in to the warehouse in the Studio. * **Data query** : * **​ SQL Editor ​** : An easy-to-use SQL query editor that supports query execution, automatic SQL saving, query profiles, historical query records, etc. * ​**Log Analytics** ​: A user-friendly analysis tool for log scenarios, supporting SQL filtering, searching, and other functions. * ​**Session Management** ​ : Manage running SQL queries and allow viewing and terminating SQL queries. * ​**Query Audit** ​ : A one-stop historical query audit tool that can filter slow queries and view their execution. * ​**​ Workload Management ​** ​ : Support quick creation, editing and viewing of Workload Group. * **Data Management** : View and manage data in the database, currently supports viewing. 
* **Privilege Management**: Manage users and roles in the database, and grant and revoke their permissions.
* **Data Integration**: Easily connect to data in object storage on the cloud, connect to data lakes, and import sample data.
* **Import**: Supports viewing import tasks and operating on them.

## Register and Login

### Using the Studio service

In VeloDB Cloud Manager ("Manager"), each warehouse has a corresponding Studio service. In the "Connection" module of Manager, you can find the entrance to Studio through a private network or a public network. You can also save the entry address of Studio for direct access.

![public link connection info](/assets/images/public-link-connection-info-f7ca28959efe23469d159c720b6dd8e3.jpg)

### Login to Studio

![login](/assets/images/login-5e4899565ff7ee32567f34f04c8a1988.jpg)

You need to enter the **Username** and **Password** of the warehouse on the login page. If you clicked the link to log in from the Manager, the warehouse name should be pre-filled. We do not record your login account and password, but you can use the password-saving feature that comes with your browser.

## Data

The Data module is Studio's basic facility for managing the database, and it mainly has two functions:

1. View the data and how it is organized, such as database and table structure, data size, table creation statements, table field information, data preview, etc.
2. Add, delete, and modify database objects, including creating, deleting, and renaming them.

The Data module follows the way data is organized in the database, and is divided into **Catalog** - **Database** - **Table**/**View**.

### Catalog

A catalog is a collection of databases. Catalogs are divided into the internal catalog and external catalogs. The internal catalog contains VeloDB's own databases; external catalogs can connect to Hive, Iceberg, Hudi, etc., since VeloDB supports data lake features. VeloDB Studio supports direct deletion of catalog objects.

![data catalog](/assets/images/data-catalog-ba07cbfc6bfa488bcf285ed295f28210.jpg)

### Database

A database is a collection of tables, views, materialized views, and functions. A database belongs to a catalog. When a catalog is selected, you can view the databases under it and their sizes. On the same page, you can also create, delete, and rename databases.

![data internal database](/assets/images/data-internal-database-a7acabc50312d722870fbeb0236b6087.jpg)

### Table

A table is the basic unit of the VeloDB data warehouse, and a table belongs to a database. When a database is selected, you can see the tables under it, as well as each table's size and its creation and modification time.

![data internal table](/assets/images/data-internal-table-afe701d6f09ad79dc8f046da21f53ec7.jpg)

When you click on a table, you enter the table's details page and can view its DDL definition, fields, indexes, and other information.

![data internal table details](/assets/images/data-internal-table-details-08c6e68e902bfa2d085622bbf4105a11.jpg)

The Data Preview page is used to quickly preview the data in the table; by default, the first 100 rows are shown. The "Total x data" count is obtained from the metadata service, so it may be delayed.
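The same information that the Data module surfaces can also be pulled over the MySQL protocol, for example from the SQL Editor or any MySQL-compatible client. A minimal sketch, using `demo_db` and `demo_table` purely as placeholder names:

```sql
-- Browse the same hierarchy the Data module shows
SHOW DATABASES;
SHOW TABLES FROM demo_db;

-- Table details page: DDL definition and column information
SHOW CREATE TABLE demo_db.demo_table;
DESCRIBE demo_db.demo_table;

-- Data Preview is roughly a limited scan of the first 100 rows
SELECT * FROM demo_db.demo_table LIMIT 100;
```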
![data internal table data preview](/assets/images/data-internal-table-data-preview-8c8f659763bdd3be54e9b9edb192fd20.jpg)

### View

A view is a virtual table based on the result set of a SQL statement. The view page is roughly similar to the table page; attributes that a view does not have (such as indexes and details) are not displayed. Views also support data preview (the first 100 rows).

### Materialized View

A materialized view is a table that pre-computes and stores query results, which can be used to accelerate queries and reduce real-time computing pressure. The Studio database page lists the materialized views under the database.

### Function

The Studio database page lists the functions under the database and supports viewing the function type, return type, creation statement, and other information.

## SQL Editor

Query results are returned below the edit box, along with the error or success status and any messages returned by the query. You can also click the drop-down button on the right side of **Run (LIMIT 1000)** and switch to **Run and Download** to download your query results.

![sql console](/assets/images/sql-console-78ad98f40e5f20bfd4e16513f1ec1ec2.jpg)

Session records are the history of the tabs you open in the SQL Editor. You can click an SQL statement in the record and copy it to the SQL Editor for execution.

![query session](/assets/images/query-session-a30837a2199be5945a4ad9a835e32745.jpg)

Query history is the history of the SQL statements you execute in the SQL Editor. You can click an SQL statement in the record to view the Profile information of that statement.

> **NOTE** There is no Query ID for non-query statements, nor for failed statements.

![query history](/assets/images/query-history-2cf93dd459770b509f2a7ab6a33eef45.jpg)

By default, query plans are enabled for queries initiated in Studio, which will not affect the performance of a single query. Click "Query Statement" to enter the execution plan page. The download button can download Profile information, both in plain TEXT format and as visual Profile images. The Import Profile button can import Profile information in TEXT format; after importing, you can view the Profile visually. This helps you visually analyze queries initiated from other clients.

![profile](/assets/images/profile-948f59e8626af97eba2170c61fc8162c.jpg)

We have built-in sample query statements for some test datasets in Studio to help you do some simple performance testing.

![sql templates](/assets/images/sql-templates-7cc7d8bac230de0e708f0a8e1891cd91.jpg)

In the results panel, you can see the execution results of SQL statements, including query results, execution time, number of rows, etc. You can also search the results through the search box, or click a table header to sort them.

![sql console result](/assets/images/sql-console-result-62e1bdfbfa81f55ed37c79f10d848cb6.jpg)

## Session Management

Session management allows administrator users to manage resource usage and prioritize critical queries to improve system performance. It provides detailed information about each session, such as execution time, the user who initiated the query, and the resources being used. You can view all currently running SQL queries and terminate any that cause problems or run longer than expected.
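The session list and the terminate action correspond to what you could also do with MySQL-compatible statements. A minimal sketch, where `12345` is a placeholder for the connection Id shown in the session list:

```sql
-- List currently running sessions and the statements they are executing
SHOW PROCESSLIST;

-- Terminate a problematic session by its Id (placeholder value)
KILL 12345;
```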
![session](/assets/images/session-34ad6f91deba9f0965b1036076ca5f81.jpg)

You can configure the table to display more information about running SQL queries, such as scan size, number of scanned rows, number of returned rows, etc.

![session display row](/assets/images/session-display-row-287ebcce9fbbb93e33960ee29053ffc4.jpg)

Click the Query ID of a session to view the complete information of that session, including the executing user, the FE node that received the session, and the execution plan (Profile) of the SQL.

![session detail](/assets/images/session-detail-af06f87051c80fc9008f1e9f158eab31.jpg)

## Query Audit

Query Audit is used to audit and analyze the query history executed in the system. It allows you to filter and identify poorly performing queries so that you can optimize database performance. The tool includes analytics that give insight into the execution plan and resource usage of each query, serving as a one-stop solution for tracking query performance, discovering trends, and diagnosing problems.

You can filter historical queries and, in List Selection, select more dimensions to assist in analysis. Click a "Query ID" to enter the query details page, where you can view more query information. If Profile is enabled, you can view the query profile on this page.

![audit log](/assets/images/audit-log-bd66f1aaa8a3d2bc4354b78f4c070a53.jpg)

## Search Analysis

Search Analysis is a query tool provided by VeloDB Studio for log analysis scenarios; it makes it easy to search, query, and count logs. The interactive search and analysis interface is similar to the Kibana Discover page, is optimized for log retrieval, and is divided into four areas:

* **Input area at the top**: Select the cluster, table, time field, and query time period. The main input box supports two modes: keyword search and SQL.
* **Field display and selection area on the left**: Displays all fields in the current table. You can select which fields are shown in the detailed display area on the right. Hovering over a field shows its top 5 values and the proportion of each. You can further filter by value; the filter conditions appear in the filter part of the input area.
* **Trend chart display and interaction area in the middle**: Displays the number of logs that meet the conditions in each time interval. You can select a period of time by boxing it on the trend chart to adjust the query time period.
* **Detailed data display and interaction area below**: Displays log details; you can click to view the details of a specific log. It supports two formats, table and JSON, and the table format also supports interactive creation of filter conditions.

Click `Query > Search Analysis` and select the table `internal_schema > audit_log`; Studio will automatically query the fields in the table and select the first time field.

![discover](/assets/images/discover-2b32d1e6be25c37e2a674a866fd67f82.jpg)

Hover over the state field on the left to display the highest-frequency state values (EOF, OK, ERR) along with their proportions. In addition, you can create filter conditions by clicking the plus (+) or minus (-) button; for example, clicking the minus (-) button to the right of ERR adds state != ERR to the filter conditions.
![discover top field](/assets/images/discover-top-field-f51f8a11fd20bde2c222775b99b35b5a.jpg)

In the main input box, you can query keywords in either search mode or SQL mode. Search mode is supported only on tables with inverted indexes. Under the search box, select Search, enter GET on the right, and click Query. In search mode, this searches for logs containing the keyword GET. The GET in the details will be highlighted, and the number of entries in the trend chart will change accordingly.

![discover search](/assets/images/discover-search-ee0bfe2effe0846932375ef4478a955e.jpg)

> **NOTE** Keyword search uses the MATCH_ANY statement, which matches any of the keywords in any field of the log. Note that the highlighting of the search results will match all search keywords as much as possible, but due to some special characters, it does not always match the search keywords exactly.

You can wrap a phrase in double quotes, such as `"GET /api/v1/user"`, to match the entire phrase; phrase searches use `MATCH_PHRASE`. If more precise matching is required, you can use SQL mode. Under the search box, select `SQL`, enter the SQL WHERE condition, and click `Query`.

![discover sql](/assets/images/discover-sql-5586a4f8bd0308b1002f7d7fe74f22ef.jpg)

Expand the log details in either Table or JSON format; the Table format supports interactive creation of filters.

![discover row detail](/assets/images/discover-row-detail-a248ecdc9dea85826250bb0fe09720cb.jpg)

Click the context search on the right to view the 10 logs before and after this log. You can continue to add filter conditions in the context search.

![discover surrounding](/assets/images/discover-surrounding-fcb7512a64b52e5ca4d00eff75e55f5e.jpg)

A new data type, `VARIANT`, has been introduced to store semi-structured JSON data. The `VARIANT` type is especially suitable for handling complex nested structures that may change at any time. Studio recognizes the `VARIANT` data type, automatically expands its hierarchy, and provides a dedicated filtering method. Take the github_events table as an example of how to filter `VARIANT` fields: in the filter condition, select a field of the `VARIANT` type and then select one of its subfields to filter on.

![discover variant filter](/assets/images/discover-variant-filter-17ddc9296e1e0714dba6aca6a39c8739.jpg)

## Workload Group Management

> **NOTE** Workload Group Management is supported in VeloDB Cloud 4.0.0 and above.

Workload Group Management supports quick creation, editing, and viewing of Workload Groups. Using Workload Groups, you can manage the CPU/memory/IO resources used by query and import workloads in the cluster, and control the maximum query concurrency in the cluster.

![workload](/assets/images/workload-bdd91ce83b367166973abdfc2dd1fa14.jpg)

You can display more items using the table filter above the Workload Group list.

![workload more](/assets/images/workload-more-089942b6dd811fca7499b3286e4b8293.jpg)

In the New Workload Group interface, you can click the question mark next to a parameter to display its description.

![workload add](/assets/images/workload-add-79eff7df20afe70ab5d3d6cebdd655cc.jpg)

## Integrations

Integrations are portals connecting VeloDB Cloud with data outside the warehouse. Currently, you can create two kinds of integrations: Stage integration (object storage) and sample data.
![integration](/assets/images/integration- ba1dbbd31b557807ecfd414fd7ca0068.jpg) ### Object Storage​ By creating a new object storage integration, you can establish a **Connection** with data in object storage. Through the **Integrate + Copy Into** command, you can **Import** the data in the object storage to the warehouse. When creating a new object storage integration, you need to enter the following: * **Integration Name** : Consistent with the database object naming rules, up to 64 characters, letters, numbers, and underscores can be used. * **Comments** : Integrated comments. * **Bucket** : The bucket you need to integrate. * **Default file path** : The file path to be accessed in the bucket. VeloDB will only access the files under the path you fill in. If you do not fill in, the default is that the data in the entire bucket can be accessed. * **Access Authorization** : The way to allow VeloDB to access your bucket. It is divided into Access key and cross-account authorization. We recommend using cross-account authorization for better security. For guidelines on cross-account authorization, please refer to: [ IAM Cross-Account Access Guide​ ](https://docs.velodb.io/cloud/management-guide/studio#iam-cross-account-access-guide-aws)。You must pass the permissions check to successfully create an integration. * **Advanced Configuration** : Details below. ![integration new object storage s](/assets/images/integration-new-object- storage-s-9235ddc5181d226235e40254a2ee0d3c.jpg) Divided into **File Type** and **Import Configuration**. These are the parameters that you may use when importing integrated data. You can set them here, or specify them when importing. If you do not set or specify them, the system will execute the import task of the integration with the default configuration. ![object storage advanced configuration](/assets/images/object-storage- advanced-configuration-3eeca7579f5d9dd1de76fca0627ce7fb.jpg) * **File type** : The default type of the integrated storage file, currently supports `csv`, `json`, `orc`, `parquet`. The default is that the system infers from the filename suffix. * **Compression method** : The default compression type of the integrated storage file, currently supports `gz`, `bz2`, `lz4`, `lzo`, `deflate`. The default is that the system infers from the filename suffix. * **Column separator** : The default column separator of the integrated storage file, the default `\t`. * **Line separator** : The default line separator of the integrated storage file, the default `\n`. * **File size** : When importing files under this integration, the default import size limit is unlimited by default. * **On Error** : When importing files under this integration, when the data quality is unqualified, the default error handling method. There are three types: continue importing, stop importing, and continue importing when the proportion of error data does not exceed a certain value. * **Strict Mode** : Strictly filter the column type conversion during the import process. Default is off. ### Sample Data​ Creating a new sample data integration will import sample data into the database on the basis of creating an object storage integration. Therefore, you need to select the cluster to complete the new creation. TPCH, Github Event, SSB-FLAT test data size has the following choices: sf1 (1GB), sf10 (10GB), sf100 (100GB), select through the drop-down menu, and the test warehouse can only choose 1sf (1GB). 
Clickbench only has the option of sf100 (100GB), we recommend that you use a larger cluster to import Clickbench sample data. ![new sample data clickbench](/assets/images/new-sample-data- clickbench-818bb16529ed7ffd899182840a8b8786.jpg) You can view the import progress in the sample data details. ![clickbench importing](/assets/images/clickbench- importing-554810447efab919104dc37b061a13e1.jpg) ## Permissions​ ### User​ Display the users in the VeloDB repository. Note that the root user will not be displayed here. Only users with Admin authority can add and modify other users. ![privileges](/assets/images/privileges-fabb3bee53d3f9fbfb140841052fc25e.jpg) You can create a new user on this page, except for the username, other content is optional. However, we strongly recommend that you add passwords for users and restrict access to hosts for enhanced security. ![privileges users](/assets/images/privileges- users-37beeddb576a226a284e85e17d957a6b.jpg) ### Role​ Here you can manage the roles in VeloDB, and also perform authorized operations on the roles. Only users with Admin permissions can add and modify other roles. VeloDB currently does not support managing users under roles through roles, which means you need to specify your roles when creating or modifying users. ![privileges roles](/assets/images/privileges-roles- afa47cfebf18c621b9e386db7c8a74dc.jpg) ![roles new](/assets/images/roles-new-79b0b684b2dec4b4d41631baacda8fbb.jpg) ### Authorize​ On the user or role details page, click on the specific user or role name to enter the permission configuration page, and you can perform authorization/revocation operations. You need to have Admin or Grant permissions at the corresponding level in order to perform authorization/revocation. In VeloDB, permissions are divided into the following categories: * **Global** : Global permissions are permissions at the entire database level, with global permissions, and automatically have corresponding permissions for all corresponding objects in the database. * **Data** : refers to the permissions of data resources. You can authorize them according to the level, have the permissions at the parent level, and automatically have the corresponding permissions of its children's content. * **Workload Group** : Usage permissions only. * **Resource** : It is the permission of Resource, including Grant and Usage. * **Compute Group** : The memory separation cluster exists in VeloDB 3.0 and controls the Usage permissions of different computing groups. * **Cluster** : Exist in VeloDB Cloud connections, controlling Usage permissions for different clusters. ![privileges authorize](/assets/images/privileges-authorize- ae88c6b2af6452c2d041cc07642cd8ac.jpg) ## Import​ VeloDB Studio supports the management of load tasks such as Stream Load, Routine Load, Broker Load, and Insert Into in connection, and currently supports the following operations: * Information query for load tasks * Stop Routine Load, Broker Load, Insert Into * Pause/Edit/Recover Routine Load You first select a database and then view all load tasks under that database in the load task list. ![load task](/assets/images/load-task-a47efca04b4599659a8bce1fe9749134.jpg) Click the load task name to view the detailed information of the load task. ![load task detail](/assets/images/load-task- detail-b23f5f407f839807838db8fd9024bbec.jpg) ## IAM Role Setup Guide (AWS)​ Please use the following steps to create the role and add permissions in your AWS console: 1. 
1. Access the **IAM** service and select **Roles** from the menu. Click on the **Create role** button.

![create iam role](/assets/images/create-iam-role-89e578168fc692a0defe9aa652185adc.png)

2. Select **Custom trust policy** in the **Select trusted entity** section.

![trust entity](/assets/images/trust-entity-a68f27f8a8c740ca4c802c6f374e9c3b.png)

Replace the placeholder in the `Principal.AWS` field of the following trust policy with the actual IAM Role ARN of your VeloDB warehouse.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": ""
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

3. Select the permission policies you would like to attach to the role. Click on the **Next** button.

![permission policies](/assets/images/permission-policies-1e59901120aefa595094ce652aaab21b.png)

4. Configure the **Role name**, and click on the **Create role** button to finish.

![iam create role](/assets/images/iam-create-role-797ee853d0d4713f3e9d795c6bac601d.png)

5. Click on the role name in the list of roles. Copy the value of the **ARN** from the **Summary** section and provide it in VeloDB Cloud.

![iam role detail](/assets/images/iam-role-detail-7f054f4ab4a6e1782bbe15fac69fed4f.png)

On This Page * Main Function * Register and Login * Using the Studio service * Login to Studio * Data * Catalog * Database * Table * View * Materialized View * Function * SQL Editor * Session Management * Query Audit * Search Analysis * Workload Group Management * Integrations * Object Storage * Sample Data * Permissions * User * Role * Authorize * Import * IAM Role Setup Guide (AWS)

---

# Source: https://docs.velodb.io/cloud/4.x/management-guide/usage-and-billing

Version: 4.x

On this page

# Billings

This topic describes how organization administrators manage fee deduction channels and view bills. Before using the service in a production environment, it is recommended to link a credit card or open a cloud marketplace deduction channel to ensure continuous operation of the service.

## Deduction Channels

VeloDB Cloud currently supports four deduction channels: credit card, cloud marketplace, cash, and vouchers. VeloDB Cloud generates bills periodically and deducts fees from these channels. Click **User Menu** > **Billings** to enter the Billing Overview page and view the overall usage of these fee deduction channels.

![billing overview](/assets/images/billing-overview-7121fc824d666a11f1a39d57f8617750.png)

The following describes the use of the above deduction channels:

### Credit Card

In the Billing Overview page, click **Add** on the **Credit Card** card to complete the setup.

![billing-credit-card](/assets/images/billing-credit-card-11e053348897e005265a7c4110a244f0.png)

You can’t remove a credit card in the Billing Overview page, but you can update it anytime. This helps ensure your organization always has a valid payment method. If you need to remove your credit card, please contact VeloDB Cloud support for help.

### Open Cloud Marketplace

#### AWS Marketplace

This topic mainly describes how to use the AWS Marketplace deduction channel. The specific opening process is as follows:

1. In the Billing Overview page, click **Subscribe** on the **Cloud Marketplace** card, find **AWS Marketplace** in the drawer page, then click **Go to Subscribe** to enter the VeloDB Cloud product page on the AWS Marketplace.

![billing cloud marketplace deduction channel](/assets/images/billing-cloud-marketplace-deduction-channel-d8dc98319256941a3d6deae61252114e.png)

2. Click **View purchase options** to enter the Subscription page of the AWS Marketplace.
![velodb cloud on aws marketplace](/assets/images/velodb-cloud-on-aws-marketplace-590f49d5366456c8631f265a2b6fb829.png)

3. Click **Subscribe**. When the page displays "You are currently subscribed to this offer", click **Set up your account** to go to the authorization page of VeloDB Cloud.

![subscribe to velodb cloud on aws marketplace](/assets/images/subscribe-to-velodb-cloud-on-aws-marketplace-034bc037473d91e09e9e3419832c487d.png)

4. Log in with your VeloDB Cloud account on the authorization page.

![authorized to aws marketplace deduction login](/assets/images/authorized-to-aws-marketplace-deduction-login-a7d235cf8875842808567317a34361f0.png)

5. Select the target organization from the organization list, and click **Confirm Authorization**. Once the authorization is successful, subsequent expenses will be deducted from your AWS account.

![authorized to aws marketplace deduction choose organization](/assets/images/authorized-to-aws-marketplace-deduction-choose-organization-6cc5d0950a1376c15d6a49d78dcc5158.png)

6. Click **Check** to go back to the Billing Overview page after completing the authorization. If **AWS Marketplace** is displayed among the cloud marketplace deduction channels, it means you have successfully opened the cloud marketplace deduction channel.

![authorized to aws marketplace deduction succeeded](/assets/images/authorized-to-aws-marketplace-deduction-succeeded-9b32578a6d9703484398a9cc82e62164.png)

![billing overview opened cloud marketplace deduction channel](/assets/images/billing-overview-opened-cloud-marketplace-deduction-channel-32704398f2eb4d70a282168fe62eb2b0.png)

#### GCP Marketplace

This topic mainly describes how to use the GCP Marketplace deduction channel.

> Note: The additional commission rate in GCP Marketplace is 3% of the paid amount.

**1. Go to the VeloDB Cloud product in GCP Marketplace**

You can jump to the GCP Marketplace through the VeloDB Cloud console, or search for VeloDB directly in the Marketplace.

* Jump to the GCP Marketplace through the VeloDB Cloud console

  On the **Billing Overview** page in the [VeloDB Cloud console](https://www.velodb.cloud/), click **Open** on the **Cloud Marketplace Deduction Channel** card, find **GCP Marketplace** on the drawer page, then click **Go to Subscribe** to jump to the VeloDB Cloud product page in GCP Marketplace.

  ![billing-cloud-marketplace-deduction-channel](/assets/images/billing-cloud-marketplace-deduction-channel-d8dc98319256941a3d6deae61252114e.png)

* Search for VeloDB directly in GCP Marketplace

  You can also find the VeloDB Cloud product on [GCP Marketplace](https://console.cloud.google.com/marketplace) by searching for "**VeloDB**" or "**Doris**", and then enter the VeloDB Cloud product page in GCP Marketplace.

  ![marketpalce gcp 1 2](/assets/images/marketpalce-gcp-1-2-3cd460317faf394ec5c2f1c768815187.png)

**2. Subscribe VeloDB Cloud**

On the VeloDB Cloud product page in GCP Marketplace, click the **SUBSCRIBE** button to go to the order confirmation page.

![marketpalce gcp 2 1](/assets/images/marketpalce-gcp-2-1-16480bb47e2d70266deaed0319e506d3.png)

![marketpalce gcp 2 2](/assets/images/marketpalce-gcp-2-2-3d32e34f43483bae60c974c13a178993.png)

On the order confirmation page, check the terms and click the **SUBSCRIBE** button.

![marketpalce gcp 2 3](/assets/images/marketpalce-gcp-2-3-5a317a35b65f42aa57dcf381f0fd9fea.png)

In the secondary confirmation dialog box, you can click the **GO TO PRODUCT PAGE** button to view the subscription effect, or click the **MANAGE ORDERS** button to view the order changes.
![marketpalce gcp](/assets/images/marketpalce-gcp-2-4-6f68f05a202623edef61cbdb757eceb0.png)

![marketpalce gcp 2 5](/assets/images/marketpalce-gcp-2-5-e7d2505c7a8d093dcbe582688b4bb49f.png)

![marketpalce gcp 2 6](/assets/images/marketpalce-gcp-2-6-937418b5dfc5ee93e283487cb801502c.png)

On the VeloDB Cloud product page in GCP Marketplace, you need to click the **MANAGE ON PROVIDER** button to jump to the VeloDB Cloud console, register as a user, and log in to complete the authorization process.

![marketpalce gcp 2 7](/assets/images/marketpalce-gcp-2-7-f5c94a7958821879959abb61a57b6115.png)

![marketpalce gcp 2 8](/assets/images/marketpalce-gcp-2-8-f5c94a7958821879959abb61a57b6115.png)

If you have already registered, you can **log in** directly.

![marketpalce gcp 2 9](/assets/images/marketpalce-gcp-2-9-9eef0e0d77c41f6a9c94702ada4e9e63.png)

You can log in via your mobile phone number or email and proceed to the second step: **Authorize Organization**.

![marketpalce gcp 2 10](/assets/images/marketpalce-gcp-2-10-e6be16f83ed5b512facc7c7f40b7ca78.png)

Select the target organization and click the **Confirm Authorization** button.

![marketpalce gcp 2 11](/assets/images/marketpalce-gcp-2-11-6da4cef80cf54dfcbc4ea9debf9f8163.png)

> Note: There may be a delay in the order status, and you need to wait about 1 minute before authorization.

![marketpalce gcp 2 12](/assets/images/marketpalce-gcp-2-12-234407ef527c06787743cdb6c47a50e2.png)

After successful authorization, proceed to the third step: **View Authorization Result**. Click the **Check** button to go to the **Billing Overview** page and view the activation status of the cloud marketplace deduction channel.

![marketpalce gcp 2 13](/assets/images/marketpalce-gcp-2-13-b260536f15c76ffa6e30eae40e38a764.png)

![marketpalce gcp 2 14](/assets/images/marketpalce-gcp-2-14-f32a74855388a85b761db9473285c8a1.png)

**3. Unsubscribe VeloDB Cloud**

On the **Billing Overview** page in the [VeloDB Cloud console](https://www.velodb.cloud/), click **Change** on the **Cloud Marketplace Deduction Channel** card, find **GCP Marketplace** on the drawer page, then click the **Go to Unsubscribe** button to jump to the order page in GCP Marketplace.

![marketpalce gcp](/assets/images/marketpalce-gcp-3-1-f11ff5f2007f583c8789b32adeaa9154.png)

On the order page in GCP Marketplace, find the target order by **Order Number** (its status is currently "Active"), click the action column on the right, expand the drop-down menu, and click **Cancel order**.

![marketpalce gcp 3 2](/assets/images/marketpalce-gcp-3-2-3a396b76dbb8d7149998c4cd82ce8da6.png)

In the secondary confirmation dialog box, enter the **Order Number** and click **CANCEL ORDER**.

![marketpalce gcp](/assets/images/marketpalce-gcp-3-3-c90cddbe7334445a3cfc3eb6597e6440.png)

After success, you can see that the status of the target order has changed to "**Canceled**" and the **GCP Marketplace Deduction Channel** card in the VeloDB Cloud console is restored to the initial unsubscribed state.

![marketpalce gcp](/assets/images/marketpalce-gcp-3-4-cfd6c0f3c4e03d07348e0907da2b832e.png)

> Note: There may be a delay in the order status. You need to wait about 1 minute and refresh the Billing Overview page in the [VeloDB Cloud console](https://www.velodb.cloud/) to see the change of the Cloud Marketplace Deduction Channel.

**4. Contact Sales**

If you want to know more about the product, you can CONTACT SALES by email.
![marketpalce gcp 4 1](/assets/images/marketpalce-gcp-4-1-a4cd95f81cc56fc88bdb6ce8fbf61251.png)

### Recharge Cash

You can pay by bank transfer directly to the VeloDB account below. After you complete the payment, please provide the **payment receipt**, your **organization id**, and your **organization name** to VeloDB. You can contact VeloDB sales or send an email to `support@velodb.io`, and we will recharge your account cash balance. You can find your organization id and organization name in Organization Management. For details, please refer to Organization Management.

VeloDB Bank Account Information:

- Beneficiary Name: VELODB INC
- Beneficiary Address: 1142 Juniper Ct, San Jacinto, CA 92582
- Beneficiary Bank: Citibank, N.A.
- Beneficiary Bank Address: 388 Greenwich Street New York, NY 10013
- Beneficiary Account Number: 40806519
- SWIFT Code: CITIUS33XXX
- BRANCH CODE: 930
- ABA: 021000089

### Activate Voucher

In the Billing Overview page, switch to **Vouchers**.

![voucher management](/assets/images/voucher-management-ccf3c45092bff58175f67e5733de1aa9.png)

Click **add voucher** and input the voucher activation code issued by VeloDB Cloud to activate the voucher.

![billing overview add voucher](/assets/images/billing-overview-add-voucher-a2c444aa4866df412a203db7039fcd88.png)

You can also view voucher usage and voucher activation history in the Voucher Management page.

## Bills Statements

VeloDB Cloud collects the above usage information of the entire organization every minute, deducts fees hourly, and generates hourly and monthly bills, which are mainly provided to organization administrators for reconciliation and cost analysis. On the Billing Statements page, you can view or export the bills.

![billing overview monthly bill](/assets/images/billing-overview-monthly-bill-1f60ecbb5ed3be5e3472854a36d3b634.png)

If the credit card, cash, and voucher balances are insufficient and no cloud marketplace deduction channel can be used, VeloDB Cloud will stop the service, but the data will be retained for 7 days. To keep the online service running continuously, please ensure that the cash balance is sufficient or open a cloud marketplace deduction channel.

## Usage

After creating a warehouse, VeloDB Cloud collects the usage of different resources in the warehouse, including compute, cache, and storage, to help you analyze the cost distribution of a specific warehouse. Click **Usage** in the navigation bar on the left to view the current warehouse usage information.

![usage](/assets/images/usage-53f91700149c65f37cdf8a53e357abba.png)

On This Page * Deduction Channels * Credit Card * Open Cloud Marketplace * Recharge Cash * Activate Voucher * Bills Statements * Usage

---

# Source: https://docs.velodb.io/cloud/4.x/management-guide/user-and-organization

Version: 4.x

On this page

# User and Organization

## Registration and Login

Click to enter the VeloDB Cloud registration and trial page, and fill in the relevant information to complete the registration.

![user register](/assets/images/user-register-f4bf7408e0671addc942afba8a108c54.png)

> **Tip** VeloDB Cloud includes two independent account systems: one is used for logging into the console, as described in this topic; the other is used to connect to the warehouse, which is described in the Connections topic.

If you have already registered on VeloDB Cloud, you can click Go to login below to log in directly.
![user login](/assets/images/user-login-b17b88e370c05389979ac3b6cb956faa.png)

## Account Management

### Change Password

After login, click **User Menu** > **Security** to change the login password for the VeloDB Cloud console.

![change user password](/assets/images/change-user-password-9531a1cd03598f385130a912a4035ef2.png)

Once you have changed the password, use the new password for subsequent logins.

### Manage Multi-Factor Authentication (MFA)

Multi-factor authentication adds additional security by requiring an Authenticator app to generate a one-time verification code for login. When you log in, VeloDB Cloud verifies both your password and the MFA verification code. You can use any Authenticator app from the iOS or Android App Store to generate this code, such as Google Authenticator or Authy.

![security-mfa](/assets/images/security-mfa-de34ce49125251a1a33f498140ff9977.png)

### Notifications

At the bottom of the left navigation bar, click **User Menu** > **Notifications** to go to the message center. Notifications are triggered to remind users of events on the platform concerning users, organizations, authorized warehouses, cluster operations, and alarms. You can filter by time range, filter unread/read messages with one click, view messages in pages, mark all messages as read with one click, mark checked messages as read with one click, etc.

![notifications](/assets/images/notifications-d33eadd0dbfbdca33a8eea18d2a01297.png)

You can switch to the **Scheduled Events** page to see scheduled events. Scheduled events include system-initiated events (for example, the system automatically upgrades the core version according to the policy set by the user) and user-initiated events (for example, manually upgrading the core version by specifying an execution time window). Some events (such as version upgrades) may cause disconnections and other impacts on the business, so please ensure that your application has a reconnection mechanism. Before an event is executed, you can modify the scheduled execution time window or cancel the event.

## Organization Management

The organization is the billing unit; each organization is billed individually. We recommend that you divide organizations by cost unit, and one user can be affiliated with multiple organizations. Multiple warehouses can be created under one organization, and the data of different warehouses is isolated. You can switch the current organization through the switch organization option in the user menu.

![switch-organization](/assets/images/switch-organization-023e35ca045ea5913b6fde9075818d53.png)

### Role Management

In the lower left corner, click **User Menu** > **Access Control** > **Role Management**. There are three roles by default in an organization, and you can create multiple custom roles.

| | **Manage Access Control** | **Manage Billing** | **Manage Organization** | **Manage Warehouse** |
|---|---|---|---|---|
| Organization Admin | Yes | Yes | Yes | All warehouses: Create / Edit / View / Query / Monitor |
| Warehouse Admin | No | No | No | All warehouses: Edit / View / Query / Monitor |
| Warehouse Viewer | No | No | No | All warehouses: View / Query / Monitor |

* View existing roles:

![access control role management](/assets/images/access-control-role-management-c98e5670d120dd5f09a94c73113158fe.png)

* New role: You can specify the role name and its corresponding privileges during creation. Custom roles can also be deleted or edited. The user who creates the organization is assigned the Organization Admin role by default.
![access control new role](/assets/images/access-control-new-role-cef0df279fcd5d92a42a8d6abb138317.png)

### User Management

Organization administrators can invite new users to the current organization and grant them different roles. New users can join the organization by activating the link in the invitation email.

![access control user management invite users](/assets/images/access-control-user-management-invite-users-fb4e5b707d28d4898d9789621c89e63b.png)

### MFA Settings

After enabling MFA, all organization users must complete secondary authentication before logging in.

![mfa-settings](/assets/images/mfa-settings-9a8ef338348c896a8fbdeed228ace470.png)

### Audit

After login, click **User Menu** > **Audit** to see the audit log for the VeloDB Cloud console. VeloDB Cloud logs historical activities at the organization level. An event indicates a change in your VeloDB Cloud organization. You can view the logged activities on the audit page, including the activity, time, IP, and user.

![audit-log](/assets/images/audit-log-42d4d408b758359153ea6832c397eb22.png)

### Organization Details

Click **User Menu** > **Organization Details** to see the organization ID, creation time, and organization name.

![organizaion-details](/assets/images/organizaion-details-38091b7b98e1d21f2ef4b20ae3275077.png)

On This Page * Registration and Login * Account Management * Change Password * Manage Multi-Factor Authentication (MFA) * Notifications * Organization Management * Role Management * User Management * MFA Settings * Audit * Organization Details

---

# Source: https://docs.velodb.io/cloud/4.x/release-notes/platform-release-notes

Version: 4.x

On this page

# Platform Release Notes

This article describes the release notes for the management and control platform of VeloDB Cloud.

## December 2025

**New Features**

* Added a one-click alert feature to rapidly set up an alerting system, enabling timely awareness of exceptions in key monitoring items.
* Optimized the AWS Cloud BYOC template mode by upgrading authentication from AK/SK to IAM Role and supporting reuse through a credential wizard.
* Added support for visual creation of external Catalogs, lowering the barrier for multi-source data integration.

## November 2025

**New Features**

* Added Transparent Data Encryption (TDE) function, providing higher-level security protection for data at rest.
* Supports data backup and recovery, ensuring the reliability and continuity of business data.
* Added operational audit logs, meeting security compliance and operational traceability requirements.
* BYOC supports wizard mode, making bring-your-own-cloud cluster deployment easier and faster.
* Supports seamless data import from Confluent Cloud and Kafka, simplifying real-time data integration.
* Added credit card payment method, providing users with more convenient and flexible payment options.
* AWS Marketplace supports Private Offer/Contract/Free Trial.

**Improvements**

* Deeply integrated with the SQL editor, delivering a seamless and smooth experience for data development and management.

## August 2025

**New Features**

* Added support for Single Sign-On (SSO) via Google and Microsoft.
* MFA now supports authenticator apps (such as Google Authenticator, Microsoft Authenticator, or Authy).

**Pricing**

* Warehouse usage is now free of charge; no separate fees will be applied.

**Improvements**

* Warehouse connections are now public by default, with an optimized private endpoint configuration process and improved connection information display.
* Reorganized the management platform menu for clearer organization and personal configuration options.
* Added a new warehouse usage guide and improved guidance from usage to payment.
* Enhanced alert notifications and added alert recovery reminders.
* Supports synchronous deployment of warehouse and cluster with parallelized processes.

**New Regions**

* The SaaS model was launched in the Tokyo region of AWS.

## June 2025

**New Features**

* Added a premium technical support service billing item. Customers who purchase this service will need to pay an additional fee.
* Supported monitoring and alerting of cache space utilization.

**Improvements**

* Smoothed out the commission fee difference between the cloud marketplace deduction channels and the cash deduction method. Customers who use the AWS Marketplace or GCP Marketplace deduction channels now pay the same cost as recharging on VeloDB Cloud with the cash deduction method; using a cloud marketplace deduction channel incurs no additional commission fees.
* Optimized Studio login prompt information.

**New Regions**

* The BYOC model was launched in the Middle East (Bahrain) region of AWS.

## May 2025

**New Features**

* Supported multi availability zone disaster recovery. By mounting the active and standby clusters through a virtual cluster, the service can automatically fail over to the standby cluster in another availability zone when the active cluster fails and continue to provide service. When users need to test and rehearse, they can also manually switch between the active and standby clusters. This feature has requirements on the core version and region: the core version must not be lower than 4.0.7, and the region must have at least 3 availability zones.

**Improvements**

* Optimized the email notification content for warehouse core version upgrade failures.
* Optimized the BYOC warehouse core version upgrade prompts, reminding users that after upgrading the core version, the cluster HTTP protocol port will change, and they need to add the new port to the access control whitelist to allow outgoing requests to access it.
* The "Deploy on AWS" designation was added to the VeloDB Cloud product introduction page on AWS Marketplace.

## April 2025

**New Features**

* Supported bank corporate transfer recharge.
* Supported VeloDB Professional Services purchase on AWS Marketplace.
* When creating a BYOC warehouse, a subnet segment detection step has been added. If the subnet is too small to allocate the IPs, the process will be interrupted and an error message will be displayed.

**Improvements**

* Optimized the BYOC warehouse core version upgrade function.

**New Regions**

* The BYOC model was launched in the Asia Pacific (Singapore) region of AWS.

## March 2025

**New Features**

* Added basic metrics and service metrics monitoring for warehouses.
* Added alarms for warehouse metrics.

**New Regions**

* The SaaS model was launched in the Asia Pacific (Hong Kong) region of AWS.

## February 2025

**New Features**

* Supported BYOC warehouse and cluster custom tags.

**Improvements**

* Optimized error information when creating a BYOC warehouse.

**New Regions**

* The BYOC model was launched in the us-east4 region of GCP.

## January 2025

**New Features**

* Supported choosing the CPU architecture when creating a new cluster in a VeloDB Cloud SaaS or BYOC warehouse on AWS; the default is x86, and customers can choose ARM. Once a cluster is created, modifying the CPU architecture is not supported.
## December 2024

**New Features and Improvements**

* Support multi availability zone disaster recovery
* Azure clusters support independent cache expansion
* When creating a warehouse, you can specify whether table names are case sensitive

**New Cloud Platforms**

* BYOC mode was launched on Azure

## November 2024

**New Features and Improvements**

* Support GCP Marketplace
* Alert rules and alert history support paging
* Add a permanent Get Help entrance in the lower right corner

## October 2024

**New Features and Improvements**

* Optimization of the registration/login/free trial links between the official website and Cloud
* The SaaS free trial warehouse period has been increased from 7 days to 14 days
* Open personal email registration/login, and add support for mobile phone login
* Automatically create an organization when a new user registers and logs in, reducing operations
* Verify when activating a free warehouse whether the organization has been associated with an enterprise email; if not, it needs to be associated with an enterprise email

## September 2024

**New Features and Improvements**

* BYOC warehouse usage optimization
* Optimize the process of creating a BYOC warehouse, adding preparation guidance and document guidance
* Optimize the deletion of the last BYOC warehouse and clear the BYOC environment
* Optimize the WebUI link availability check, distinguishing connections between public and private IPs
* Optimize the core version upgrade, bundling the Meta Service upgrade with it
* Optimize the minimum permission set of Amazon Web Services
* Optimize the source of the Amazon Web Services security group, narrowing it down to the subnet CIDR
* Optimize the unified alarm link

**New Zones**

* BYOC mode was launched in the Beijing 4 region of Huawei Cloud.

## August 2024

**New Regions**

* BYOC mode was launched in the US West (Oregon) region of AWS.

## July 2024

**New Features**

* Supported setting the **O&M Time Window** for each warehouse.
* Supported setting the **Patch Version Upgrade Policy** of VeloDB Core; users can choose automatic or manual upgrade.
* Supported **Scheduled Events** for each warehouse. The only event type was "**Upgrade Version**", covering events where the system automatically upgrades the patch version of VeloDB Core according to the policy set by the user and events where the user manually upgrades the version of VeloDB Core by specifying an execution time window.
* Supported the **Message Center**, currently including **In-site Messages** and **Scheduled Events** list management functions.

**New Cloud Platforms**

* BYOC mode was launched on GCP.
* SaaS mode was launched on Azure.

**New Regions**

* BYOC mode was launched in the Oregon (us-west1) region of GCP.
* SaaS mode was launched in the West US 3 (Arizona) region of Azure.

**Improvements**

* In-site message function optimization, supporting list management, including: filtering by time range, one-click filtering of unread/read messages, paging messages, one-click marking of all messages as read, one-click marking of checked messages as read, etc.

## June 2024

**New Features**

* Supported presenting the port information of clusters, allowing users to conveniently import data using the Stream Load method.
* Supported users to directly view the statistical results of Consumption Amount, Pretax Amount, and Arrears Amount.
* Supported _compaction score_ metric monitoring and alarms.
* Supported whitelisted personal emails registering an organization (account) on VeloDB Cloud.
**Improvements** * Organization administrators (including organization creators) cannot modify their own roles. * The Cash Balance, Voucher Balance, Cloud Marketplace Deduction Channel and other information layout optimization in Billing Center -> Billing Overview page. ## March 2024​ **New Features** * Supported users to individually adjust the cache space of the cluster (currently only scaling out is supported). * The yearly billing resources of VeloDB Cloud cluster on AWS support scaling out. **Improvements** * Users can view monitoring information when the cluster is not running. * When deleting the SaaS mode trial cluster, the SaaS mode trial warehouse will also be deleted. * When users select the WeCom group, Lark group, or DingTalk group as the alert channel, they will be reminded that the VeloDB Cloud server IP address can be added in the access control whitelist of **Webhook**. ## February 2024​ **New Features** * Supported **on-demand(hourly)** , **subscription(monthly)** , and **subscription(yearly)** billing method for the paid clusters. The paid clusters can have only one of these billing methods, or a combination of [monthly + hourly] or [yearly + hourly] billing methods. Users can directly convert the on-demand(hourly) billing resources after testing and stabilization to monthly or yearly billing to save long-term ownership and use costs; they can also flexibly scale out/in the on-demand(hourly) resources at any time to cope with temporary increases and decreases in business on the basis of monthly or yearly billing resources. **New Cloud Platforms** * SaaS mode was launched on Alibaba Cloud. **New Regions** * SaaS mode was launched in the Singapore region of Alibaba Cloud. **Improvements** * The SaaS mode on Alibaba Cloud is officially commercialized with price. * The SaaS mode on HUAWEI CLOUD is officially commercialized with price. * When creating a new warehouse, the configuration parameter _region_ supports classification, corresponding to different price classifications. ## December 2023​ **New Features** * BYOC mode supported distinguishing between the free warehouse and paid warehouses, and the free warehouse can be upgraded to paid use. * BYOC free warehouse quota limit. Each organization can only activate one free warehouse. Only one free cluster can be created in the free warehouse. The maximum computing resources are 64 vCPU. The upper and lower limits of the cache space are limited by the computing resources and vary. **Improvements** * Optimization of the description, graphics and hypertext links for HUAWEI CLOUD **Private Network Connection** function in SaaS mode. ## November 2023​ **New Features** * Supported customizing the cache space when creating a new cluster. The upper and lower limits of the cache space are affected by computing resources and vary. **New Cloud Platforms** * SaaS mode was launched on HUAWEI CLOUD. **New Regions** * SaaS mode was launched in the AP-Jakarta region of HUAWEI CLOUD. **Improvements** * The **WebUI** login entrance had been added to the warehouse function menu, making it more convenient and faster. ## October 2023​ **New Features** * A new private warehouse (**BYOC, Bring Your Own Cloud**) product mode had been added, and whitelist customers were invited to experience it for free. For customers who need to run the VeloDB data warehouse in their own cloud account and VPC, they can use this product mode. 
This mode of product has the same capabilities as a proprietary warehouse (**SaaS, Software as a Service**) mode, including: cloud native computing and storage separation, elastic scaling, monitoring and alarming, etc. In addition, it can also meet customers' additional needs, including: higher compliance requirements, better cloud resource discounts, and better connection with the surrounding big data ecosystem. **New Cloud Platforms** * BYOC mode was launched on AWS. **New Regions** * BYOC mode was launched in the US East (N. Virginia) region of AWS. **Improvements** * Overall optimization of monitoring metrics. * Storage resource usage statistics were more accurate. ## September 2023​ **New Features** * Supported **Auto Resume** when receiving a business request when the on-demand cluster was shut down, improving the **Auto Pause/Resume** function. * Supported the **Auto Pause** function of the SaaS free trial cluster. This function is enabled by default (disable is not supported). It will be automatically paused after being idle for 360 minutes (user-definable). Users need to manually resume it. **Improvements** * The functional constraints of the free trial warehouse and cluster in various states are more standardized, and usage statistics are more accurate. * Usage information display optimization. * Added 3 new monitoring metrics: Load Rows Per Second (Row/s), Load Bytes Per Second (MB/s), and Finished Load Tasks. * When deleting a warehouse, the current operator's email address is displayed for receiving verification codes. ## August 2023​ **New Features** * Supported creating and modifying organizations. * Supported new customer self-registration organizations (login is registration). **New Regions** * AWS Europe (Frankfurt) **Improvements** * The list of AWS endpoints for private network connection was optimized, and tips and links were given on where to find the Endpoint DNS Name. * **IP Whitelist Management** optimization for public network connection. * Quota prompts for **New Organization** , **New Warehouse** , and **New Cluster**. * Update the content of **In-site Notifications** and **Email Notifications**. ## June 2023​ **New Features** * On-demand billing clusters support **Time-based Scaling** , which can not only meet the needs of business load scenarios with obvious peaks and lows in a day and have time-periodical regularity, but also avoid the situation that the configuration is too low to cause insufficient resources or the configuration is too high to cause resource waste. * On-demand billing clusters supported **Manual Pause/Resume** , and **Auto Pause**. It can release computing resources while retaining cache space when the cluster has no load, reducing resource waste and saving costs. It can also quickly pull up computing resources and mount reserved cache resources and data, so that business requests can be quickly responded to. * WebUI supports multiple tab pages, which is convenient for users to process multiple SQL queries in parallel. **Improvements** * WebUI space utilization optimization and database table directory tree optimization provide larger query statement/result display space. ## May 2023​ **New Features** * The cluster supported cloud disk caching, the ratio of vCPU memory is fixed at 1:8, and the ratio of vCPU cache is temporarily 1:50. 
* Supported "Lake House", integrate structured or semi-structured source data such as Hive, object storage (S3), MySQL, and Elasticsearch from user data lake through public network or private network connections, and perform federated query analysis in one VeloDB Cloud data warehouse; At the same time, the style of the private network connection had been reconstructed, and two methods are supported: access to VeloDB Cloud data warehouse from the user's clients or applications and access to the user's data lake from VeloDB Cloud data warehouse. * Supported **Multi-Factor Authentication (MFA)** , strengthen login identity authentication and sensitive operation security (related functions include: MFA policy settings, batch invite users, profile, enroll mobile phone, SMS verification, password reset, etc). * Added 3 information cards to the **Usage** page: Latest Compute Capacity (vCPU), Latest Cache Space (GB), and Latest Storage Size (GB). **New Regions** * AWS Asia Pacific (Singapore) **Improvements** * The cluster was adjusted to the configuration of the cluster's overall resources (vCPU, memory, and cache) from the configuration of multiplying the node size and the number of nodes. * Cloud marketplace deduction authorization process optimization (new user guidance prompts, authorized organizations directly enter the console). * Security certification: Passed six certifications of ISO. * WebUI login entrance optimization (prominent position, early prediction and prompts whether and how to log in). * Optimized the **IP whitelist** for public network connections (adding the last operator information). * Warehouse navigation and detail optimization (added zone and creator information, rearranging the overall information). ## February 2023​ **New Features** * The **Billing Center** page had been revised, and it supported **Monthly Bill** , **Hourly Bill** , **Billing Details** , and **Voucher Management**. **New Regions** * AWS US West (N. California) **Improvements** * The account system was restructured, and the permissions of VeloDB Cloud users and the database users were separated. * The **Query** function module was independently used as a **WebUI** tool, and users need to log in to the warehouse to query data. * The **Usage** page had been revised, and the Unit metering mechanism had been changed to vCPU-Hour and GB-Hour metering mechanisms. * The **Billing Center** page had been revised, and the Unit billing and deduction mechanism had been changed to currency billing and deduction mechanism. * Improved message templates for **In-site Notification** function and **Email Notification** function, updates related links and description. ## November 2022​ **New Features** * The core version can be configured when creating a new warehouse, and in the drop-down selection box, only the latest patch version was retained for each minor version x.y. * The **Warehouse Details** card added the core version number information. If the current version is not the highest version in the region of the cloud, there will be an upgrade reminder. Click the link icon can go to the **Settings** page to upgrade the version. * The **Warehouse Details** added creation time information. * The Warehouse statuses added "upgrading". * **In-site Notification** function, adding support for notification of core version upgrade success and notification of core version upgrade failure. * Supported the reminder card for the remaining time of the trial warehouse, which can be upgraded to paid warehouse with one click. 
**Improvements**

* Adjusted the position of the core version upgrade entry: it was moved from the **Cluster Details** page to the **Warehouse Details** card, and it can upgrade the core version of the warehouse and all clusters in it. The core version number was divided into three levels: Major, Minor, and Patch, in the format x.y.z.
* Both the cluster card on the **Cluster Overview** page and the basic information on the **Cluster Details** page hid the core version number, and the function operation area on the **Cluster Details** page hid the **Version Upgrade** function.
* The **Cluster Resize** function and the **Cluster Scaling** function were integrated, and the name of the new function was unified as "**Cluster Scaling**".

## October 2022

**New Features**

* The cluster was reconstructed and split into the warehouse service and the computing cluster.
* Supported the storage-computing separation architecture, multiple computing clusters, and shared object storage data.
* Supported local disk as cluster cache.
* Supported the **AWS Marketplace Deduction Channel**; AWS customers can reuse the balance of their AWS cloud account, with bills and invoices issued uniformly from AWS.
* **In-site Notification** function, adding support for notifications of warehouse creation success, notifications of warehouse creation failure, notifications of warehouse deletion success, notifications of warehouse deletion failure, reminders that a trial warehouse is about to expire and stop service, notifications of trial warehouse expiration and suspension of service, reminders that a trial warehouse and its data will soon be deleted, notifications of trial warehouse recovery of service, notifications of trial warehouse and its data deletion, reminders of suspension of service of paid warehouses due to arrears of payment, notifications of suspension of service of paid warehouses due to arrears of payment, reminders that paid warehouses and their data will be deleted, notifications of paid warehouse recovery of service, and notifications that paid warehouses and their data are deleted.
* **Email Notification** function, adding support for notifications of welcome to join the organization, notifications of verification codes, reminders that a trial warehouse is about to expire and stop service, notifications of trial warehouse expiration and suspension of service, reminders that a trial warehouse and its data will soon be deleted, notifications of trial warehouse recovery of service, notifications of trial warehouse and its data deletion, reminders of suspension of service of paid warehouses due to arrears of payment, notifications of suspension of service of paid warehouses due to arrears of payment, reminders that paid warehouses and their data will be deleted, notifications of paid warehouse recovery of service, and notifications that paid warehouses and their data are deleted.
* The console **Login** page supported switching between the Chinese station and the international station.

**New Regions**

* AWS US West (Oregon)

**Improvements**

* For operations that would cause cost changes (including **New Cluster**, **Cluster Resize**, and **Cluster Scaling**), a second confirmation was added.
* The **Organization Management** function supported organization IDs (unique identifiers) and setting duplicate organization names.
* The **Data Query** function was enhanced.
* The entrance position of the **Access Control** function was adjusted, moving it from the warehouse operation area to the user operation area.
* The console interface had been revised and optimized, and the overall layout and UI components had been unified and standardized. ## August 2022​ **New Features** * Supported SaaS mode, that is, both the cluster and the management and control platform were deployed in the VeloDB VPC. * The **Connection** module was independent from **Cluster Management** , and supported public network connection and private network connection, and the **Private Network Connection** function supported AWS PrivateLink. * Supported cloud disk storage. * The cluster added the "Trial" free trial node size. * Supported the On-Demand billing method, and charged for the overall resources of the cluster. * Both **In-site Notification** function and **Email Notification** function supported reminders of upcoming arrears, notifications of suspension of services due to arrears, reminders of imminent deletion of data, notifications of cluster recovery service, and notifications of cluster release and data deletion. **Improvements** * Console interface revision and optimization, including: **New Cluster** , **Cluster Details** , **Cluster Upgrade** , **Cluster Resize** , **Cluster Scaling** , **Cluster Deletion** , **Billing Overview** , **Billing Help** , **Purchase Units** , **Historical Orders** , etc. * The **Metering and Billing** page was split into the **Usage** page and the **Billing Center** page. The **Usage** page remained in the navigation bar of the cluster operation area, and the entrance to the **Billing Center** page was moved to the user operation area. * Removed the function of **AK &SK Authorization of Customer Cloud Account**. ## July 2022​ **New Features** * Supported hybrid mode, that is, the cluster was deployed in the customer VPC, and the management and control platform is deployed in the VeloDB VPC. * Supported basic functions such as **Cluster Management** , **Data Query** , **Performance Monitoring** , **Access Control** , **AK &SK Authorization of Customer Cloud Account**, and **Metering and Billing**. * Supported the On-Demand billing method, and only charge value-added service fees. **New Cloud Platforms** * AWS **New Regions** * AWS US East (N. Virginia) On This Page * December 2025 * November 2025 * August 2025 * June 2025 * May 2025 * April 2025 * March 2025 * February 2025 * January 2025 * December 2024 * November 2024 * October 2024 * September 2024 * August 2024 * July 2024 * June 2024 * March 2024 * February 2024 * December 2023 * November 2023 * October 2023 * September 2023 * August 2023 * June 2023 * May 2023 * February 2023 * November 2022 * October 2022 * August 2022 * July 2022 --- # Source: https://docs.velodb.io/cloud/4.x/security/audit-plugin Version: 4.x On this page # Audit Log Doris provides auditing capabilities for database operations, allowing the recording of user logins, queries, and modification operations on the database. In Doris, audit logs can be queried directly through built-in system tables or by viewing Doris's audit log files. ## Enabling Audit Logs​ The audit log plugin can be enabled or disabled at any time using the global variable `enable_audit_plugin` (disabled by default), for example: `set global enable_audit_plugin = true;` Once enabled, Doris will write the audit logs to the `audit_log` table. You can disable the audit log plugin at any time: `set global enable_audit_plugin = false;` After disabling, Doris will stop writing to the `audit_log` table. The already written audit logs will remain unchanged. 
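Before moving on, a quick sanity check can confirm that the plugin is recording statements. The sketch below is illustrative: the `audit_log` table lives in the `__internal_schema` database (described in the next section), and its exact columns vary by version, so the `time` column used for ordering is an assumption that you can verify with `DESC` first.

```sql
-- Turn the audit log plugin on (it is disabled by default).
SET GLOBAL enable_audit_plugin = true;

-- Inspect which columns your version records, then look at recent entries.
-- The column used in ORDER BY is an assumption; adjust it to your schema.
DESC __internal_schema.audit_log;
SELECT * FROM __internal_schema.audit_log ORDER BY `time` DESC LIMIT 10;

-- Turn the plugin off again if needed; rows already written are kept.
SET GLOBAL enable_audit_plugin = false;
```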
## Viewing the Audit Log Table

> Note: Before version 2.1.8, a system version upgrade may add new fields to the audit log. After upgrading, you need to add the new fields to the `audit_log` table yourself with the `ALTER TABLE` command, based on the fields of the audit log.

Starting from Doris version 2.1, Doris can write user behavior operations to the `audit_log` table in the `__internal_schema` database by enabling the audit log feature. The audit log table is a dynamically partitioned table, partitioned daily by default and retaining the most recent 30 days of data. You can adjust the retention period of the dynamic partitions by modifying the `dynamic_partition.start` property with the `ALTER TABLE` statement.

## Audit Log Files

In `fe.conf`, `LOG_DIR` defines the storage path for FE logs. All database operations executed by an FE node are recorded in `${LOG_DIR}/fe.audit.log` on that node. To view all operations in the cluster, you need to traverse the audit logs of each FE node.

## Audit Log Format

In versions before 3.0.7, the symbols `\n`, `\t`, and `\r` in statements would be replaced with `\\n`, `\\t`, and `\\r`. These modified statements were then stored in the `fe.audit.log` file and the `audit_log` table. Starting from version 3.0.7, for the `fe.audit.log` file, only `\n` in statements is replaced with `\\n`; the `audit_log` table stores statements in their original form.

## Audit Log Configuration

**Global Variables:**

Audit log variables can be modified with `set [global] <variable_name> = <value>`.

| Variable | Default Value | Description |
|---|---|---|
| `audit_plugin_max_batch_interval_sec` | 60 seconds | Maximum write interval for the audit log table. |
| `audit_plugin_max_batch_bytes` | 50MB | Maximum data volume per batch for the audit log table. |
| `audit_plugin_max_sql_length` | 4096 | Maximum length of SQL statements recorded in the audit log table. |
| `audit_plugin_load_timeout` | 600 seconds | Default timeout for audit log import jobs. |
| `audit_plugin_max_insert_stmt_length` | Int.MAX | The maximum length limit for `INSERT` statements. If larger than `audit_plugin_max_sql_length`, the value of `audit_plugin_max_sql_length` is used. This parameter is supported since 3.0.6. |

Because some `INSERT INTO VALUES` statements can be very long and are submitted frequently, they can make the audit log too large. Doris therefore added `audit_plugin_max_insert_stmt_length` in version 3.0.6 to limit the audited length of `INSERT` statements separately. This avoids audit log bloat while ensuring that SQL statements are otherwise fully audited.

**FE Configuration Items:**

FE configuration items can be modified by editing the `fe.conf` file.

| Configuration Item | Description |
|---|---|
| `skip_audit_user_list` | If you do not want operations of certain users to be recorded in the audit logs, you can modify this configuration (supported since version 3.0.01). For example, use the config to exclude `user1` and `user2` from audit log recording: `skip_audit_user_list=user1,user2` |

On This Page * Enabling Audit Logs * Viewing the Audit Log Table * Audit Log Files * Audit Log Format * Audit Log Configuration

---

# Source: https://docs.velodb.io/cloud/4.x/security/auth/authentication-and-authorization

Version: 4.x

On this page

# Authentication and Authorization

The Doris permission management system is modeled after the MySQL permission management mechanism. It supports fine-grained permission control at the row and column level, role-based access control, and a whitelist mechanism.

## Glossary

1.
User Identity Within a permission system, a user is identified as a User Identity. A User Identity consists of two parts: `username` and `host`. The `username` is the user's name, consisting of English letters (both uppercase and lowercase). `host` represents the IP from which the user connection originates. User Identity is represented as `username@'host'`, indicating `username` from `host`. Another representation of User Identity is `username@['domain']`, where `domain` refers to a domain name that can be resolved into a set of IPs through DNS. Eventually, this is represented as a set of `username@'host'`, hence moving forward, we uniformly use `username@'host'` to denote it. 2. Privilege Privileges apply to nodes, data directories, databases, or tables. Different privileges represent different operation permissions. 3. Role Doris allows the creation of custom-named roles. A role can be viewed as a collection of privileges. Newly created users can be assigned a role, automatically inheriting the privileges of that role. Subsequent changes to the role's privileges will also reflect on the permissions of all users associated with that role. 4. User Property User properties are directly affiliated with a user, not the User Identity. Meaning, both `user@'192.%'` and `user@['domain']` share the same set of user properties, which belong to the user `user`, not to `user@'192.%'` or `user@['domain']`. User properties include but are not limited to: maximum number of user connections, import cluster configurations, etc. ## Authentication and Authorization Framework​ The process of a user logging into Apache Doris is divided into two parts: **Authentication** and **Authorization**. * Authentication: Identity verification is conducted based on the credentials provided by the user (such as username, client IP, password). Once verified, the individual user is mapped to a system-defined User Identity. * Authorization: Based on the acquired User Identity, it checks whether the user has the necessary permissions for the intended operations, according to the privileges associated with that User Identity. ## Authentication​ Doris supports built-in authentication schemes as well as LDAP authentication. ### Doris Built-in Authentication Scheme​ Authentication is based on usernames, passwords, and other information stored within Doris itself. Administrators create users with the `CREATE USER` command and view all created users with the `SHOW ALL GRANTS` command. When a user logs in, the system verifies whether the username, password, and client IP address are correct. #### Password Policy​ Doris supports the following password policies to assist users in better password management. 1. `PASSWORD_HISTORY` Determines whether a user can reuse a historical password when resetting their current password. For example, `PASSWORD_HISTORY 10` means the last 10 passwords cannot be reused as a new password. Setting `PASSWORD_HISTORY DEFAULT` will use the value from the global variable `PASSWORD_HISTORY`. A setting of 0 disables this feature. The default is 0. Examples: * Set a global variable: `SET GLOBAL password_history = 10` * Set for a user: `ALTER USER user1@'ip' PASSWORD_HISTORY 10` 2. `PASSWORD_EXPIRE` Sets the expiration time for the current user's password. For instance, `PASSWORD_EXPIRE INTERVAL 10 DAY` means the password will expire after 10 days. `PASSWORD_EXPIRE NEVER` indicates the password never expires. 
Setting `PASSWORD_EXPIRE DEFAULT` will use the value from the global variable `default_password_lifetime` (in days). The default is NEVER (or 0), indicating it does not expire. Examples: * Set a global variable: `SET GLOBAL default_password_lifetime = 1` * Set for a user: `ALTER USER user1@'ip' PASSWORD_EXPIRE INTERVAL 10 DAY` 3. `FAILED_LOGIN_ATTEMPTS` and `PASSWORD_LOCK_TIME` Configures the number of incorrect password attempts after which the user account will be locked and sets the lock duration. For example, `FAILED_LOGIN_ATTEMPTS 3 PASSWORD_LOCK_TIME 1 DAY` means if there are 3 incorrect logins, the account will be locked for one day. Administrators can unlock the account using the `ALTER USER` statement. Example: * Set for a user: `ALTER USER user1@'ip' FAILED_LOGIN_ATTEMPTS 3 PASSWORD_LOCK_TIME 1 DAY` 4. Password Strength This is controlled by the global variable `validate_password_policy`. The default is `NONE/0`, which means no password strength checking. If set to `STRONG/2`, the password must include at least three of the following: uppercase letters, lowercase letters, numbers, and special characters, and must be at least 8 characters long. Example: * `SET validate_password_policy=STRONG` For more help, please refer to [ALTER USER](/cloud/4.x/sql-manual/sql- statements/account-management/ALTER-USER). ## Authorization​ ### Permission Operations​ * Create user: [CREATE USER](/cloud/4.x/sql-manual/sql-statements/account-management/CREATE-USER) * Modify user: [ALTER USER](/cloud/4.x/sql-manual/sql-statements/account-management/ALTER-USER) * Delete user: [DROP USER](/cloud/4.x/sql-manual/sql-statements/account-management/DROP-USER) * Grant/Assign role: [GRANT](/cloud/4.x/sql-manual/sql-statements/account-management/GRANT-TO) * Revoke/Withdraw role: [REVOKE](/cloud/4.x/sql-manual/sql-statements/account-management/REVOKE-FROM) * Create role: [CREATE ROLE](/cloud/4.x/sql-manual/sql-statements/account-management/CREATE-ROLE) * Delete role: [DROP ROLE](/cloud/4.x/sql-manual/sql-statements/account-management/DROP-ROLE) * Modify role: [ALTER ROLE](/cloud/4.x/sql-manual/sql-statements/account-management/ALTER-ROLE) * View current user's permissions and roles: [SHOW GRANTS](/cloud/4.x/sql-manual/sql-statements/account-management/SHOW-GRANTS) * View all users' permissions and roles: [SHOW ALL GRANTS](/cloud/4.x/sql-manual/sql-statements/account-management/SHOW-GRANTS) * View created roles: [SHOW ROLES](/cloud/4.x/sql-manual/sql-statements/account-management/SHOW-ROLES) * Set user property: [SET PROPERTY](/cloud/4.x/sql-manual/sql-statements/account-management/SET-PROPERTY) * View user property: [SHOW PROPERTY](/cloud/4.x/sql-manual/sql-statements/account-management/SHOW-PROPERTY) * Change password: [SET PASSWORD](/cloud/4.x/sql-manual/sql-statements/account-management/SET-PASSWORD) * View all supported privileges: [SHOW PRIVILEGES] * View row policy: [SHOW ROW POLICY] * Create row policy: [CREATE ROW POLICY] ### Types of Permissions​ Doris currently supports the following permissions: 1. `Node_priv` Node modification permission. Includes adding, deleting, and offlining FE, BE, BROKER nodes. Root users have this permission by default. Users who possess both `Grant_priv` and `Node_priv` can grant this permission to other users. This permission can only be granted at the Global level. 2. `Grant_priv` Permission modification authority. Allows execution of operations including granting, revoking, adding/deleting/modifying users/roles. 
Before version 2.1.2, when granting permissions to other users/roles, the current user only needed the respective level's `Grant_priv` permission. After version 2.1.2, the current user also needs permission for the resource they wish to grant. When assigning roles to other users, Global level `Grant_priv` permission is required. 3. `Select_priv` Read-only permission for data directories, databases, and tables. 4. `Load_priv` Write permission for data directories, databases, and tables. Includes Load, Insert, Delete, etc. 5. `Alter_priv` Alteration permissions for data directories, databases, and tables. Includes renaming libraries/tables, adding/deleting/modifying columns, adding/deleting partitions, etc. 6. `Create_priv` Permission to create data directories, databases, tables, and views. 7. `Drop_priv` Permission to delete data directories, databases, tables, and views. 8. `Usage_priv` Usage permissions for Resources and Workload Groups. 9. `Show_view_priv` Permission to execute `SHOW CREATE VIEW`. ### Permission Levels​ #### Global Permissions​ Permissions granted through the GRANT statement with `*.*.*` scope. These permissions apply to any table within any catalog. #### Catalog Permissions​ Permissions granted through the GRANT statement with `ctl.*.*` scope. These permissions apply to any table within the specified catalog. #### Database Permissions​ Permissions granted through the GRANT statement with `ctl.db.*` scope. These permissions apply to any table within the specified database. #### Table Permissions​ Permissions granted through the GRANT statement with `ctl.db.tbl` scope. These permissions apply to any column within the specified table. #### Column Permissions​ Column permissions are primarily used to restrict user access to certain columns within a table. Specifically, column permissions allow administrators to set viewing, editing, and other rights for certain columns, controlling user access and manipulation of specific column data. Permissions for specific columns of a table can be granted with `GRANT Select_priv(col1,col2) ON ctl.db.tbl TO user1`. Currently, column permissions support only `Select_priv`. #### Row-Level Permissions​ Row Policies enable administrators to define access policies based on fields within the data, controlling which users can access which rows. Specifically, Row Policies allow administrators to create rules that can filter or restrict user access to rows based on actual values stored in the data. From version 1.2, row-level permissions can be created with the `CREATE ROW POLICY` command. From version 2.1.2, support for setting row-level permissions through Apache Ranger's `Row Level Filter` is available. #### Usage Permissions​ * Resource Permissions Resource permissions are set specifically for Resources, unrelated to permissions for databases or tables, and can only assign `Usage_priv` and `Grant_priv`. Permissions for all Resources can be granted with the `GRANT USAGE_PRIV ON RESOURCE '%' TO user1`. * Workload Group Permissions Workload Group permissions are set specifically for Workload Groups, unrelated to permissions for databases or tables, and can only assign `Usage_priv` and `Grant_priv`. Permissions for all Workload Groups can be granted with `GRANT USAGE_PRIV ON WORKLOAD GROUP '%' TO user1`. ### Data Masking​ Data masking is a method to protect sensitive data by modifying, replacing, or hiding the original data, such that the masked data retains certain formats and characteristics while no longer containing sensitive information. 
For example, administrators may choose to replace part or all of the digits of sensitive fields like credit card numbers or ID numbers with asterisks `*` or other characters, or replace real names with pseudonyms. From version 2.1.2, support for setting data masking policies for certain columns through Apache Ranger's Data Masking is available, currently only configurable via [Apache Ranger](/cloud/4.x/security/auth/authorization/ranger). ### Doris Built-in Authorization Scheme​ Doris's permission design is based on the RBAC (Role-Based Access Control) model, where users are associated with roles, and roles are associated with permissions. Users are indirectly linked to permissions through their roles. When a role is deleted, users automatically lose all permissions associated with that role. When a user is disassociated from a role, they automatically lose all permissions of that role. When permissions are added to or removed from a role, the permissions of the users associated with that role change accordingly. ┌────────┐ ┌────────┐ ┌────────┐ │ user1 ├────┬───► role1 ├────┬────► priv1 │ └────────┘ │ └────────┘ │ └────────┘ │ │ │ │ │ ┌────────┐ │ │ │ role2 ├────┤ ┌────────┐ │ └────────┘ │ ┌────────┐ │ user2 ├────┘ │ ┌─► priv2 │ └────────┘ │ │ └────────┘ ┌────────┐ │ │ ┌──────► role3 ├────┘ │ │ └────────┘ │ │ │ │ │ ┌────────┐ │ ┌────────┐ │ ┌────────┐ │ userN ├─┴──────► roleN ├───────┴─► privN │ └────────┘ └────────┘ └────────┘ As shown above: User1 and user2 both have permission `priv1` through `role1`. UserN has permission `priv1` through `role3`, and permissions `priv2` and `privN` through `roleN`. Thus, userN has permissions `priv1`, `priv2`, and `privN` simultaneously. For ease of user operations, it is possible to directly grant permissions to a user. Internally, a unique default role is created for each user. When permissions are granted to a user, it is essentially granting permissions to the user's default role. The default role cannot be deleted, nor can it be assigned to someone else. When a user is deleted, their default role is automatically deleted as well. ### Authorization Scheme Based on Apache Ranger​ Please refer to [Authorization Scheme Based on Apache Ranger](/cloud/4.x/security/auth/authorization/ranger). ## Common Questions​ ### Explanation of Permissions​ 1. Users with ADMIN privileges or GRANT privileges at the GLOBAL level can perform the following operations: * CREATE USER * DROP USER * ALTER USER * SHOW GRANTS * CREATE ROLE * DROP ROLE * ALTER ROLE * SHOW ROLES * SHOW PROPERTY FOR USER 2. GRANT/REVOKE * Users with ADMIN privileges can grant or revoke permissions for any user. * Users with ADMIN or GLOBAL level GRANT privileges can assign roles to users. * Users who have the corresponding level of GRANT privilege and the permissions to be assigned can distribute those permissions to users/roles. 3. SET PASSWORD * Users with ADMIN privileges or GLOBAL level GRANT privileges can set passwords for non-root users. * Ordinary users can set the password for their corresponding User Identity. Their corresponding User Identity can be viewed with the `SELECT CURRENT_USER()` command. * ROOT users can change their own password. ### Additional Information​ 1. When Doris is initialized, the following users and roles are automatically created: * operator role: This role has `Node_priv` and `Admin_priv`, i.e., all permissions in Doris. * admin role: This role has `Admin_priv`, i.e., all permissions except for node changes. 
* root@'%': root user, allowed to log in from any node, with the operator role. * admin@'%': admin user, allowed to log in from any node, with the admin role. 2. Deleting the default users and roles, or altering their permissions, is not supported. * Deleting the users root@'%' and admin@'%' is not supported, but creating and deleting root@'xxx' and admin@'xxx' users (where xxx refers to any host except %) is allowed (Doris treats these users as regular users). * Revoking the default roles of root@'%' and admin@'%' is not supported. * Deleting the roles operator and admin is not supported. * Modifying the permissions of the roles operator and admin is not supported. 3. There is only one user with the operator role, which is root. There can be multiple users with the admin role. 4. Some potentially conflicting operations are explained as follows: 1. Domain and IP conflict: Suppose the following user is created: `CREATE USER user1@['domain'];` And granted: `GRANT SELECT_PRIV ON *.* TO user1@['domain']` This domain is resolved to two IPs: ip1 and ip2. Suppose later, we grant a separate permission to `user1@'ip1'`: `GRANT ALTER_PRIV ON *.* TO user1@'ip1';` Then `user1@'ip1'` will have both Select_priv and Alter_priv. And when we change the permissions for `user1@['domain']` again, `user1@'ip1'` will not follow the change. 2. Duplicate IP conflict: Suppose the following users are created: CREATE USER user1@'%' IDENTIFIED BY "12345"; CREATE USER user1@'192.%' IDENTIFIED BY "abcde"; In terms of priority, `'192.%'` takes precedence over `'%'`, so when user `user1` from machine `192.168.1.1` tries to log into Doris using password `'12345'`, access will be denied. 5. Forgotten Password If you forget the password and cannot log into Doris, you can add `skip_localhost_auth_check=true` to the FE's config file and restart the FE, thus logging into Doris as root without a password from the local machine. After logging in, you can reset the password using the `SET PASSWORD` command. 6. No user can reset the root user's password except for the root user themselves. 7. `Admin_priv` permissions can only be granted or revoked at the GLOBAL level. 8. `current_user()` and `user()` Users can view their `current_user` and `user` by executing `SELECT current_user()` and `SELECT user()` respectively. Here, `current_user` indicates the identity the user authenticated with, while `user` is the actual User Identity at the moment. For example: Suppose `user1@'192.%'` is created, and then user `user1` logs in from `192.168.10.1`, then the `current_user` would be `user1@'192.%'`, and `user` would be `user1@'192.168.10.1'`. All permissions are granted to a specific `current_user`, and the real user has all the permissions of the corresponding `current_user`. ## Best Practices Here are some examples of use cases for the Doris permission system. 1. Scenario 1 Users of the Doris cluster are divided into administrators (Admin), development engineers (RD), and users (Client). Administrators have all permissions over the entire cluster, primarily responsible for cluster setup and node management. Development engineers are responsible for business modeling, including creating databases and tables, importing, and modifying data. Users access different databases and tables to retrieve data. In this scenario, administrators can be granted ADMIN or GRANT privileges. RDs can be granted CREATE, DROP, ALTER, LOAD, and SELECT permissions for any or specific databases and tables.
Clients can be granted SELECT permissions for any or specific databases and tables. Additionally, different roles can be created to simplify the authorization process for multiple users. 2. Scenario 2 A cluster may contain multiple businesses, each potentially using one or more datasets. Each business needs to manage its users. In this scenario, an administrative user can create a user with DATABASE-level GRANT privileges for each database. This user can only authorize users for the specified database. 3. Blacklist Doris itself does not support a blacklist, only a whitelist, but we can simulate a blacklist through certain means. Suppose a user named `user@'192.%'` is created, allowing users from `192.*` to log in. If you want to prohibit a user from `192.168.10.1` from logging in, you can create another user `cmy@'192.168.10.1'` with a new password. Since `192.168.10.1` has higher priority than `192.%`, the user from `192.168.10.1` will no longer be able to log in with the old password. On This Page * Glossary * Authentication and Authorization Framework * Authentication * Doris Built-in Authentication Scheme * Authorization * Permission Operations * Types of Permissions * Permission Levels * Data Masking * Doris Built-in Authorization Scheme * Authorization Scheme Based on Apache Ranger * Common Questions * Explanation of Permissions * Additional Information * Best Practices --- # Source: https://docs.velodb.io/cloud/4.x/security/encryption/encryption-function Version: 4.x # Encryption and Masking Function Doris provides the following built-in encryption and masking functions. For detailed usage, please refer to the SQL manual. * [AES_ENCRYPT](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/aes-encrypt) * [AES_DECRYPT](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/aes-decrypt) * [SM4_ENCRYPT](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/sm4-encrypt) * [SM4_DECRYPT](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/sm4-decrypt) * [MD5](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/md5) * [MD5SUM](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/md5sum) * [SM3](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/sm3) * [SM3SUM](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/sm3sum) * [SHA](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/sha) * [SHA2](/cloud/4.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/sha2) * [DIGITAL_MASKING](/cloud/4.x/sql-manual/sql-functions/scalar-functions/string-functions/digital-masking) --- # Source: https://docs.velodb.io/cloud/4.x/security/integrations/aws-authentication-and-authorization Version: 4.x On this page # AWS authentication and authorization Doris supports accessing AWS service resources through two authentication methods: ​​`IAM User`​​ and `​​Assumed Role`​​. This article explains how to configure security credentials for both methods and use Doris features to interact with AWS services. 
# Authentication Methods Overview ## IAM User Authentication​ Doris enables access to external data sources by configuring `AWS IAM User` credentials(equal to `access_key` and `secret_key`), below are the detailed configuration steps(for more information, refer to the AWS doc [IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users.html)): ### Step1 Create an IAM User and configure policies​ 1. Login to the `AWS Console` and create an `IAM User`​ ![create iam user](/assets/images/create-iam- user-2f5798289db11561f58996f843a57007.png) 2. Enter the IAM User name and attach policies directly​ ![iam user attach policy1](/assets/images/iam-user-attach- policy1-82e3e146a7e676f875e6b8ec7a976eb8.png) 3. Define AWS resource policies in the policy editor​​, below are read/write policy templates for accessing an S3 bucket ![iam user attach policy2](/assets/images/iam-user-attach- policy2-3dd286ac1d3849842ffc68dbe63ff671.png) S3 read policy template​, applies to Doris features requiring read/list access, e.g: S3 Load, TVF, External Catalog **Notes:** 1. **Replace and with actual values.** 2. **Avoid adding extra / separators.** { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:GetObjectVersion", ], "Resource": "arn:aws:s3:::/your-prefix/*" }, { "Effect": "Allow", "Action": [ "s3:ListBucket", "s3:GetBucketLocation" ], "Resource": "arn:aws:s3:::" } ] } S3 write policy template​​ (Applies to Doris features requiring read/write access, e.g: Export, Storage Vault, Repository) **Notes:** 1. **Replace`your-bucket` and `your-prefix` with actual values.** 2. **Avoid adding extra`/` separators.** { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:PutObject", "s3:GetObject", "s3:GetObjectVersion", "s3:DeleteObject", "s3:DeleteObjectVersion", "s3:AbortMultipartUpload", "s3:ListMultipartUploadParts" ], "Resource": "arn:aws:s3::://*" }, { "Effect": "Allow", "Action": [ "s3:ListBucket", "s3:GetBucketLocation", "s3:GetBucketVersioning", "s3:GetLifecycleConfiguration" ], "Resource": "arn:aws:s3:::" } ] } 4. After successfully creating the IAM User, create access/secret key pair ![iam user create ak sk](/assets/images/iam-user-create-ak- sk-6085deeaca04ff0daec92673dda55f45.png) ### Step2 Use doris features with access/secret key pair via SQL​ After completing all configurations in Step 1, you will obtain `access_key` and `secret_key`. 
Use these credentials to access Doris features as shown in the following examples: #### S3 Load LOAD LABEL s3_load_2022_04_01 ( DATA INFILE("s3://your_bucket_name/s3load_example.csv") INTO TABLE test_s3load COLUMNS TERMINATED BY "," FORMAT AS "CSV" (user_id, name, age) ) WITH S3 ( "provider" = "S3", "s3.endpoint" = "s3.us-east-1.amazonaws.com", "s3.region" = "us-east-1", "s3.access_key" = "", "s3.secret_key" = "" ) PROPERTIES ( "timeout" = "3600" ); #### TVF SELECT * FROM S3 ( 'uri' = 's3://your_bucket/path/to/tvf_test/test.parquet', 'format' = 'parquet', 's3.endpoint' = 's3.us-east-1.amazonaws.com', 's3.region' = 'us-east-1', "s3.access_key" = "", "s3.secret_key"="" ) #### External Catalog CREATE CATALOG iceberg_catalog PROPERTIES ( 'type' = 'iceberg', 'iceberg.catalog.type' = 'hadoop', 'warehouse' = 's3://your_bucket/dir/key', 's3.endpoint' = 's3.us-east-1.amazonaws.com', 's3.region' = 'us-east-1', "s3.access_key" = "", "s3.secret_key"="" ); #### Storage Vault CREATE STORAGE VAULT IF NOT EXISTS s3_demo_vault PROPERTIES ( "type" = "S3", "s3.endpoint" = "s3.us-east-1.amazonaws.com", "s3.region" = "us-east-1", "s3.bucket" = "", "s3.access_key" = "", "s3.secret_key"="", "s3.root.path" = "s3_demo_vault_prefix", "provider" = "S3", "use_path_style" = "false" ); #### Export EXPORT TABLE s3_test TO "s3://your_bucket/a/b/c" PROPERTIES ( "column_separator"="\\x07", "line_delimiter" = "\\x07" ) WITH S3 ( "s3.endpoint" = "s3.us-east-1.amazonaws.com", "s3.region" = "us-east-1", "s3.access_key" = "", "s3.secret_key"="" ) #### Repository CREATE REPOSITORY `s3_repo` WITH S3 ON LOCATION "s3://your_bucket/s3_repo" PROPERTIES ( "s3.endpoint" = "s3.us-east-1.amazonaws.com", "s3.region" = "us-east-1", "s3.access_key" = "", "s3.secret_key"="" ); #### Resource CREATE RESOURCE "remote_s3" PROPERTIES ( "s3.endpoint" = "s3.us-east-1.amazonaws.com", "s3.region" = "us-east-1", "s3.bucket" = "", "s3.access_key" = "", "s3.secret_key"="" ); You can specify different IAM User credentials (`access_key` and `secret_key`) across different business logic to implement access control for external data. ## Assumed Role Authentication Assumed Role allows accessing external data sources by assuming an AWS IAM Role (for details, refer to the AWS documentation [assume role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_manage-assume.html)); the following diagram illustrates the configuration workflow: ![assumed role flow](/assets/images/assumed-role-flow-b0b29edd8bd38d2280ff7b1780cfa697.png) Terminology: `Source Account`: The AWS account initiating the Assume Role action (where Doris FE/BE EC2 instances reside); `Target Account`: The AWS account owning the target S3 bucket; `ec2_role`: A role created in the source account, attached to EC2 instances running Doris FE/BE; `bucket_role`: A role created in the target account with permissions to access the target bucket. **Notes:** 1. **The source and target accounts can be the same AWS account;** 2. **Ensure all EC2 instances on which Doris FE/BE are deployed have `ec2_role` attached, especially during scaling operations.** More detailed configuration steps are as follows: ### Step1 Prerequisites 1. Ensure the source account has created an `ec2_role` and attached it to all `EC2 instances` running Doris FE/BE; 2.
Ensure the target account has created a `bucket_role` and corresponding bucket; After attaching `ec2_role` to `EC2 instances`, you can find the `role_arn` as shown below: ![ec2 instance](/assets/images/ec2-instance-8a07fa1579df9c49f28d1afeb8b43320.png) ### Step2 Configure Permissions for Source Account IAM Role (EC2 Instance Role)​ 1. Log in to the [AWS IAM Console](https://us-east-1.console.aws.amazon.com/iamv2/home#/home),navigate to ​​`Access management` > `Roles`; 2. Find the EC2 instance role and click its name; 3. On the role details page, go to the ​​`Permissions`​​ tab, click ​​`Add permissions`​​, then select `​​Create inline policy`​​; 4. In the ​​`Specify permissions​​ section`, switch to the `​​JSON`​​ tab, paste the following policy, and click ​​`Review policy`​​: ![source role permission](/assets/images/source-role- permission-39ded5d47d26095cb01555e8fd159001.png) { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": ["sts:AssumeRole"], "Resource": "*" } ] } ### Step3 Configure Trust Policy and Permissions for Target Account IAM Role​ 1. Log in [AWS IAM Console](https://us-east-1.console.aws.amazon.com/iamv2/home#/home), navigate to ​​Access management > Roles​​, find the target role (bucket_role), and click its name; 2. Go to the `​​Trust relationships`​​ tab, click `​​Edit trust policy`​​, and paste the following JSON (replace with your EC2 instance role ARN). Click ​​Update policy ![target role trust policy](/assets/images/target-role-trust- policy-86ffe376e4663a12a675419bfb4e6e70.png) **Note: The`ExternalId` in the `Condition` section is an optional string parameter used to distinguish scenarios where multiple source users need to assume the same role. If configured, include it in the corresponding Doris SQL statements. For a detailed explanation of ExternalId, refer to [aws doc](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_common- scenarios_third-party.html)** { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "AWS": "" }, "Action": "sts:AssumeRole", "Condition": { "StringEquals": { "sts:ExternalId": "1001" } } } ] } 3. On the role details page, go to the ​​`Permissions`​​ tab, click `​​Add permissions`​​, then select `​​Create inline policy`​​. In the `​​JSON`​​ tab, paste one of the following policies based on your requirements; ![target role permission2](/assets/images/target-role- permission2-62b5c0254e966762f0c5f299499b6bdd.png) S3 read policy template​, applies to Doris features requiring read/list access, e.g: S3 Load, TVF, External Catalog **Notes:** 1. **Replace`your-bucket` and `your-prefix` with actual values.** 2. **Avoid adding extra`/` separators.** { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:GetObjectVersion" ], "Resource": "arn:aws:s3::://*" }, { "Effect": "Allow", "Action": [ "s3:ListBucket", "s3:GetBucketLocation" ], "Resource": "arn:aws:s3:::", } ] } S3 write policy template​​ (Applies to Doris features requiring read/write access, e.g: Export, Storage Vault, Repository) **Notes:** 1. **Replace`your-bucket` and `your-prefix` with actual values.** 2. 
**Avoid adding extra`/` separators.** { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:PutObject", "s3:GetObject", "s3:GetObjectVersion", "s3:DeleteObject", "s3:DeleteObjectVersion", "s3:AbortMultipartUpload", "s3:ListMultipartUploadParts" ], "Resource": "arn:aws:s3::://*" }, { "Effect": "Allow", "Action": [ "s3:ListBucket", "s3:GetBucketLocation" ], "Resource": "arn:aws:s3:::" } ] } ### Step4 Use doris features with Assumed Role via SQL, according to `role_arn` and `external_id` fields​ After completing the above configurations, obtain the target account's `role_arn` and `external_id` (if applicable). Use these parameters in doris SQL statements as shown below: Common important key parameters:​​ "s3.role_arn" = "", "s3.external_id" = "" -- option parameter #### S3 Load​ LOAD LABEL s3_load_2022_04_01 ( DATA INFILE("s3://your_bucket_name/s3load_example.csv") INTO TABLE test_s3load COLUMNS TERMINATED BY "," FORMAT AS "CSV" (user_id, name, age) ) WITH S3 ( "provider" = "S3", "s3.endpoint" = "s3.us-east-1.amazonaws.com", "s3.region" = "us-east-1", "s3.role_arn" = "", "s3.external_id" = "" -- option parameter ) PROPERTIES ( "timeout" = "3600" ); #### TVF​ SELECT * FROM S3 ( "uri" = "s3://your_bucket/path/to/tvf_test/test.parquet", "format" = "parquet", "s3.endpoint" = "s3.us-east-1.amazonaws.com", "s3.region" = "us-east-1", "s3.role_arn" = "", "s3.external_id" = "" -- option parameter ) #### External Catalog​ CREATE CATALOG iceberg_catalog PROPERTIES ( "type" = "iceberg", "iceberg.catalog.type" = "hadoop", "warehouse" = "s3://your_bucket/dir/key", "s3.endpoint" = "s3.us-east-1.amazonaws.com", "s3.region" = "us-east-1", "s3.role_arn" = "", "s3.external_id" = "" -- option parameter ); #### Storage Vault​ CREATE STORAGE VAULT IF NOT EXISTS s3_demo_vault PROPERTIES ( "type" = "S3", "s3.endpoint" = "s3.us-east-1.amazonaws.com", "s3.region" = "us-east-1", "s3.bucket" = "", "s3.role_arn" = "", "s3.external_id" = "", -- option parameter "s3.root.path" = "s3_demo_vault_prefix", "provider" = "S3", "use_path_style" = "false" ); #### Export​ EXPORT TABLE s3_test TO "s3://your_bucket/a/b/c" PROPERTIES ( "column_separator"="\\x07", "line_delimiter" = "\\x07" ) WITH S3 ( "s3.endpoint" = "s3.us-east-1.amazonaws.com", "s3.region" = "us-east-1", "s3.role_arn" = "", "s3.external_id" = "" ) #### Repository​ CREATE REPOSITORY `s3_repo` WITH S3 ON LOCATION "s3://your_bucket/s3_repo" PROPERTIES ( "s3.endpoint" = "s3.us-east-1.amazonaws.com", "s3.region" = "us-east-1", "s3.role_arn" = "", "s3.external_id" = "" ); #### Resource​ CREATE RESOURCE "remote_s3" PROPERTIES ( "s3.endpoint" = "s3.us-east-1.amazonaws.com", "s3.region" = "us-east-1", "s3.bucket" = "", "s3.role_arn" = "", "s3.external_id" = "" ); On This Page * IAM User Authentication * Step1 Create an IAM User and configure policies * Step2 Use doris features with access/secret key pair via SQL * Assumed Role Authentication * Step1 Prerequisites * Step2 Configure Permissions for Source Account IAM Role (EC2 Instance Role) * Step3 Configure Trust Policy and Permissions for Target Account IAM Role * Step4 Use doris features with Assumed Role via SQL, according to `role_arn` and `external_id` fields --- # Source: https://docs.velodb.io/cloud/4.x/security/privacy-compliance/security-features Version: 4.x On this page # Security Features VeloDB Cloud provides a complete security mechanism to ensure the security of customer data and services, such as isolation, authentication, authorization, encryption, auditing, disaster recovery, etc. 
## Product Architecture​ VeloDB Cloud cloud-native data warehouse contains three key concepts: organization, warehouse and cluster. As the cornerstone of product design, they build independent, isolated, elastic and scalable services to help enterprises quickly and safely build the foundation of big data analysis business. ![](/assets/images/product-architecture-db2080d885facddd0ce9951fdf521f63.png) * **Organization** : An organization represents an enterprise or a relatively independent group. After registering VeloDB Cloud, users use the service as an organization. Organization is the billing settlement object in VeloDB Cloud. The billing, resources and data between different organizations are isolated from each other. * **Warehouse** : A warehouse is a logical concept, which includes computing and storage resources. Each organization can create multiple warehouses to meet the data analysis needs of different businesses, such as orders, advertising, logistics and other businesses. Similarly, the resources and data between different warehouses are also isolated from each other, which can be used to meet the security needs within the organization. * **Cluster** : A cluster is a computing resource in a warehouse, which contains one or more computing nodes and can be elastically expanded and reduced. A warehouse can contain multiple clusters, which share the underlying data. Different clusters can meet different workloads, such as statistical reports, interactive analysis, etc., and the workloads between multiple clusters do not interfere with each other. From a technical perspective, the core technical architecture of VeloDB Cloud is divided into three layers: ![](/assets/images/tech-architecture-c88710115ac065820b9c3cd0413bc5f6.png) **Service Layer** * Manager: Responsible for the management of computing and storage resources. When a user creates a warehouse, the Manager is responsible for creating a storage bucket; when a user creates a cluster, the Manager is responsible for creating computing resources. * Metadata: Stores metadata such as organizations, users, warehouses, clusters, and database tables. * Security: Responsible for security policy settings, using the principle of least privilege. **Compute Layer** * Data warehouse: It is a logical concept, including physical objects such as warehouse metadata, clusters, and data storage. * Cluster: The cluster only contains computing resources and cached data. Multiple clusters of the same warehouse share data storage. **Storage Layer** * Object storage: The data in the warehouse is stored in the object storage on the cloud service in the form of files. ## Security level​ VeloDB Cloud provides complete and full-link data security features from the dimensions of resource isolation, authentication, data transmission and storage: * Resource isolation: Storage and computing between organizations are isolated from each other. * Identity authentication: Prove the identity of the visitor (user or application). * Access control: Set user access rights to data to ensure that users can control data permissions in a fine-grained manner. * Data protection: Storage and transmission encryption ensure that data will not be leaked through physical disks and network monitoring, and support data disaster recovery protection. * Network security: Public network whitelist, private network links, inter-organization security groups, and optional independent VPC ensure the security of network connections. 
* Security audit: Transparent and complete audit of operations in the console and warehouse. * Application security: VeloDB cloud service has the ability to prevent attacks. ### Resource isolation​ **SaaS deployment** VeloDB Cloud ensures complete isolation of data between different organizations through storage and computing isolation: Data storage 1. Each organization uses a separate object storage bucket in each cloud service area, and the bucket is set to private access and uses STS authentication. 2. Each warehouse is assigned a cloud service subaccount, and the storage permission of each warehouse is only granted to this subaccount. 3. Cache data is only stored locally in the cluster, and different warehouses cannot access each other. Computing resources 4. Clusters will not be used across warehouses, that is, a cluster will only belong to one warehouse. 5. Each organization's cluster sets strict firewall rules through security groups to ensure that clusters between different organizations cannot connect to each other. **BYOC deployment** In the VeloDB Cloud BYOC deployment form, data storage and computing resources are completely retained in your own VPC, and data does not leave your VPC, ensuring the security and compliance of data and computing. Data storage Data is completely stored in your own VPC, and data does not leave your VPC. Computing resources 1. Computing resources are completely in your own cloud resource pool, providing data warehouse services. 2. A warehouse can contain multiple clusters, which share underlying data. Different clusters can meet different workloads, such as statistical reports, interactive analysis, etc., and the workloads between multiple clusters do not interfere with each other. ### Identity Authentication​ Any access to the VeloDB Cloud control plane or data plane requires identity authentication, which is mainly used to confirm the identity of the visitor. VeloDB Cloud ensures the reliability of authentication through the following mechanisms: * Control plane * Support multi-factor authentication (MFA), and improve security protection capabilities through combined authentication methods such as email password and mobile phone verification code. * Data plane * Connect using the MySQL authentication protocol. * HTTP protocol data interaction requires identity authentication, and the authentication method is consistent with the MySQL protocol. * Support IP blacklist and whitelist mechanism for identity authentication. * Password policy * Prevent setting weak passwords, and use strong passwords. * Prevent brute force password cracking. * User passwords are encrypted and stored. ### Access Control​ VeloDB has three levels of access control entities: organization, user, and user in warehouse. Organization is a billing unit, and the same organization shares the bill. User is used for control, such as creating and deleting data warehouses and clusters. User in warehouse is used for data, and can operate on database tables, similar to users in MySQL. **RBAC permission control** Multiple warehouses can be created under an organization, and the data between each warehouse is isolated. Organization administrators can set different roles for users in the organization, and control the user's permissions to create/delete/edit/view/query/monitor warehouses through roles. For details, please refer to VeloDB Cloud User Management. 
User in warehouse refers to the permission management mechanism of MySQL, and achieves fine-grained permission control at the table level, role-based permission access control, and supports whitelist mechanism. For details, please refer to VeloDB Cloud Permission Management. **Row-level security** Administrators can perform fine-grained permission control on qualified rows, such as allowing only a certain user to access qualified rows, which is used when multiple users have different permissions for different data rows in a table. Syntax description, row policy documentation CREATE ROW POLICY {NAME} ON {TABLE} AS {RESTRICTIVE|PERMISSIVE} TO {USER} USING {PREDICATE}; Example Create a policy named test_row_policy_1, which prohibits user1 from accessing rows in table1 where the col1 column value is equal to 1 or 2. CREATE ROW POLICY test_row_policy_1 ON db1.table1 AS RESTRICTIVE TO user1 USING (col1 in (1, 2)); Create a policy named test_row_policy_1, which allows user1 to access rows in table1 where the col1 column value is equal to 1 or 2. CREATE ROW POLICY test_row_policy_1 ON db1.table1 AS PERMISSIVE TO user1 USING (col1 in (1, 2)); **Column-level security** Administrators can implement column-level permission control through views. For example, if a user does not have access to a column, a view that does not contain this column can be created for this user. Syntax (The following only shows the basic syntax, please refer to the detailed syntax of view) CREATE VIEW {name} {view_column_list} AS SELECT {table_column_list} FROM {src_table} Example Authorize user user1 to read columns id and name of table t1 create view view2 (id,name) as select id,name from t1 grant SELECT_PRIV to user1 on view2 **Data masking** VeloDB provides a convenient mask function that can mask numbers and strings. Users can use the mask function to create a view, and then manage the view permissions through the access control of users in the warehouse, thereby implementing data masking for users. Syntax Description VARCHAR mask(VARCHAR str, [, VARCHAR upper[, VARCHAR lower[, VARCHAR number]]]) Example Returns a masked version of str. By default, uppercase letters are converted to "X", lowercase letters are converted to "x", and numbers are converted to "n". For example, mask("abcd-EFGH-8765-4321") results in xxxx- XXXX-nnnn-nnnn. You can override the characters used in the mask by providing additional parameters: the second parameter controls the mask character for uppercase letters, the third parameter controls lowercase letters, and the fourth parameter controls numbers. For example, mask("abcd-EFGH-8765-4321", "U", "l", "#") results in llll-UUUU-####-####. // table test +-----------+ | name | +-----------+ | abc123EFG | | NULL | | 456AbCdEf | +-----------+ mysql> select mask(name) from test; +--------------+ | mask(`name`) | +--------------+ | xxxnnnXXX | | NULL | | nnnXxXxXx | +--------------+ mysql> select mask(name, '*', '#', '$') from test; +-----------------------------+ | mask(`name`, '*', '#', '$') | +-----------------------------+ | ###$$$*** | | NULL | | $$$*#*#*# | +-----------------------------+ ### Data Protection​ **Storage Encryption** * Use storage encryption of cloud service object storage to ensure that valid data cannot be directly obtained from object storage or physical disk. * Use cloud service disk encryption to ensure that valid data in cache cannot be directly obtained from disk. 
* Use the encryption function provided by VeloDB to ensure that valid data cannot be directly obtained from object storage, physical disk, and cache disk. * VeloDB key rotation protection: Each customer uses an independent key, rotates the key periodically, and accesses objects through a secure temporary authorization mechanism (STS or pre-signature mechanism) to avoid the risk of key leakage. * Use RSA encryption algorithm to encrypt data **Transmission Encryption** * MySQL and jdbc protocol access supports TLS encrypted transmission and supports two-way TLS verification (two-way TLS). * HTTPS secure transmission for data interaction. **Disaster Recovery Protection** * Data and metadata storage adopts a multi-availability zone storage architecture to ensure that data can be disaster-tolerant across availability zones. * Versioning is enabled by default in object storage to ensure multi-version redundancy of objects at the application level. * Routine metadata backup to provide disaster recovery capabilities. * Routine metadata and data checks to ensure data correctness and reliability. * Support Warehouse-level TimeTravel (to be released soon). * Cross-region replication CCR. ### Network security​ Under the principle of least privilege, VeloDB strictly restricts the network security rules of VPC, including: * External network access must go through the gateway. * Operation and maintenance must go through VPN. * Organizational isolation. The VeloDB warehouse provides two network connection methods: public network and private network connection: * Public network: Only IPs in the whitelist can access, which can effectively avoid excessive public network permissions. * Private network connection: Users can access VeloDB through private network connection in VPC. Private network connection can ensure that only one-way connection and only the set VPC can be connected, which effectively limits the access source. ### Security Audit​ There is a complete audit mechanism for the control operations of the console and the access operations of the warehouse kernel. Customers can obtain the corresponding audit information through the cloud product console. ### Application Security​ VeloDB uses security products such as cloud firewall, Web Application Firewall (WAF), and database audit to ensure the security of cloud service applications. On This Page * Product Architecture * Security level * Resource isolation * Identity Authentication * Access Control * Data Protection * Network security * Security Audit * Application Security --- # Source: https://docs.velodb.io/cloud/4.x/security/security-overview Version: 4.x # Security Overview Doris provides the following mechanisms to manage data security: **Authentication:** Doris supports both username/password and LDAP authentication methods. * **Built-in Authentication:** Doris includes a built-in username/password authentication method, allowing customization of password policies. * **LDAP Authentication:** Doris can centrally manage user credentials through LDAP services, simplifying access control and enhancing system security. **Permission Management:** Doris supports role-based access control (RBAC) or can inherit Ranger to achieve centralized permission management. * **Role-Based Access Control (RBAC):** Doris can restrict users' access to and operations on database resources based on their roles and permissions. 
* **Ranger Permission Management:** By integrating with Ranger, Doris enables centralized permission management, allowing administrators to set fine-grained access control policies for different users and groups. **Audit and Logging:** Doris can enable audit logs to record all user actions, including logins, queries, data modifications, and more, facilitating post- audit and issue tracking. **Data Encryption and Masking:** Doris supports encryption and masking of data within tables to prevent unauthorized access and data leakage. **Data Transmission Encryption:** Doris supports SSL encryption protocols to ensure secure data transmission between clients and Doris servers, preventing data from being intercepted or tampered with during transfer. **Fine-Grained Access Control:** Doris allows configuring data row and column access permissions based on rules to control user access at a granular level. **JAVA-UDF Security:** Doris supports user-defined function functionality, so root administrators need to review the implementation of user UDFs to ensure the operations in the logic are safe and prevent high-risk actions in UDFs, such as data deletion and system disruption. **Third-Party Packages:** When using Doris features like JDBC Catalog or UDFs, administrators must ensure that any third-party packages are from trusted and secure sources. To reduce security risks, it is recommended to use dependencies only from official or reputable community sources. --- # Source: https://docs.velodb.io/cloud/4.x/sql-manual/basic-element/sql-data-types/data-type-overview Version: 4.x On this page # Overview ## Numeric Types​ Doris supports the following numeric data types: ### BOOLEAN​ There are two possible values: 0 represents false, and 1 represents true. For more info, please refer [BOOLEAN](/cloud/4.x/sql-manual/basic-element/sql- data-types/numeric/BOOLEAN)。 ### Integer​ All are signed integers. The differences among the INT types are the number of bytes occupied and the range of values they can represent: * **[TINYINT](/cloud/4.x/sql-manual/basic-element/sql-data-types/numeric/TINYINT)** : 1 byte, [-128, 127] * **[SMALLINT](/cloud/4.x/sql-manual/basic-element/sql-data-types/numeric/SMALLINT)** : 2 bytes, [-32768, 32767] * **[INT](/cloud/4.x/sql-manual/basic-element/sql-data-types/numeric/INT)** : 4 bytes, [-2147483648, 2147483647] * **[BIGINT](/cloud/4.x/sql-manual/basic-element/sql-data-types/numeric/BIGINT)** : 8 bytes, [-9223372036854775808, 9223372036854775807] * **[LARGEINT](/cloud/4.x/sql-manual/basic-element/sql-data-types/numeric/LARGEINT)** : 16 bytes, [-2^127, 2^127 - 1] ### Floating-point​ Including imprecise floating-point types [FLOAT](/cloud/4.x/sql-manual/basic- element/sql-data-types/numeric/FLOAT) and [DOUBLE](/cloud/4.x/sql- manual/basic-element/sql-data-types/numeric/DOUBLE), corresponding to the `float` and `double` in common programming languages ### Fixed-point​ The precise fixed-point type [DECIMAL](/cloud/4.x/sql-manual/basic- element/sql-data-types/numeric/DECIMAL), used in financial and other cases that require strict accuracy. ## Date Types​ Date types include DATE, TIME and DATETIME, DATE type only stores the date accurate to the day, DATETIME type stores the date and time, which can be accurate to microseconds. TIME type only stores the time, and **does not support the construction of the table storage for the time being, can only be used in the query process**. 
Do calculation for datetime types or converting them to numeric types, please use functions like [TIME_TO_SEC](/cloud/4.x/sql-manual/sql-functions/scalar- functions/date-time-functions/time-to-sec), [DATE_DIFF](/cloud/4.x/sql- manual/sql-functions/scalar-functions/date-time-functions/datediff), [UNIX_TIMESTAMP](/cloud/4.x/sql-manual/sql-functions/scalar-functions/date- time-functions/unix-timestamp) . The result of directly converting them as numeric types as not guaranteed. For more information refer to [DATE](/cloud/4.x/sql-manual/basic-element/sql- data-types/date-time/DATE), [TIME](/cloud/4.x/sql-manual/basic-element/sql- data-types/date-time/TIME) and [DATETIME](/cloud/4.x/sql-manual/basic- element/sql-data-types/date-time/DATETIME) documents. ## String Types​ Doris supports both fixed-length and variable-length strings, including: * **[CHAR(M)](/cloud/4.x/sql-manual/basic-element/sql-data-types/string-type/CHAR)** : A fixed-length string, where M is the byte length. The range for M is [1, 255]. * **[VARCHAR(M)](/cloud/4.x/sql-manual/basic-element/sql-data-types/string-type/VARCHAR)** : A variable-length string, where M is the maximum length. The range for M is [1, 65533]. * **[STRING](/cloud/4.x/sql-manual/basic-element/sql-data-types/string-type/STRING)** : A variable-length string with a default maximum length of 1,048,576 bytes (1 MB). This maximum length can be increased up to 2,147,483,643 bytes (2 GB) by configuring the `string_type_length_soft_limit_bytes`setting. ## Semi-Structured Types​ Doris supports different semi-structured data types for JSON data processing, each tailored to different use cases. * **[ARRAY](/cloud/4.x/sql-manual/basic-element/sql-data-types/semi-structured/ARRAY)** / **[MAP](/cloud/4.x/sql-manual/basic-element/sql-data-types/semi-structured/MAP)** / **[STRUCT](/cloud/4.x/sql-manual/basic-element/sql-data-types/semi-structured/STRUCT)** : They support nested data and fixed schema, making them well-suited for analytical workloads such as user behavior and profile analysis, as well as querying data lake formats like Parquet. Due to the fixed schema, there is no overhead for dynamic schema inference, resulting in high write and analysis performance. * **[VARIANT](/cloud/4.x/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT)** : It supports nested data and flexible schema. It is well-suited for analytical workloads such as log, trace, and IoT data analysis. It can accommodate any legal JSON data, which will be automatically expanded into sub-columns in a columnar storage format. This approach enables high compression rate in storage and high performance in data aggregation, filtering, and sorting. * **[JSON](/cloud/4.x/sql-manual/basic-element/sql-data-types/semi-structured/JSON)** : It supports nested data and flexible schema. It is optimized for high-concurrency point query use cases. The flexible schema allows for ingesting any legal JSON data, which will be stored in a binary format. Extracting fields from this binary JSON format is more than 2X faster than using regular JSON strings. ## Aggregation Types​ The aggregation data types store aggregation results or intermediate results during aggregation. They are used for accelerating aggregation-heavy queries. * **[BITMAP](/cloud/4.x/sql-manual/basic-element/sql-data-types/aggregate/BITMAP)** : It is used for exact deduplication, such as in (UV) statistics and audience segmentation. 
It works in conjunction with BITMAP functions like `bitmap_union`, `bitmap_union_count`, `bitmap_hash`, and `bitmap_hash64`. * **[HLL](/cloud/4.x/sql-manual/basic-element/sql-data-types/aggregate/HLL)** : It is used for approximate deduplication and provides better performance than `COUNT DISTINCT`. It works in conjunction with HLL functions like `hll_union_agg`, `hll_raw_agg`, `hll_cardinality`, and `hll_hash`. * **[QUANTILE_STATE](/cloud/4.x/sql-manual/basic-element/sql-data-types/aggregate/QUANTILE-STATE)** : It is used for approximate percentile calculations and offers better performance than the `PERCENTILE` function. It works with functions like `QUANTILE_PERCENT`, `QUANTILE_UNION`, and `TO_QUANTILE_STATE`. * **[AGG_STATE](/cloud/4.x/sql-manual/basic-element/sql-data-types/aggregate/AGG-STATE)** : It is used to accelerate aggregations, utilized in combination with aggregation function combinators like state/merge/union. ## IP Types​ IP data types store IP addresses in a binary format, which is faster and more space-efficient for querying compared to storing them as strings. There are two supported IP data types: * **[IPv4](/cloud/4.x/sql-manual/basic-element/sql-data-types/ip/IPV4)** : It stores IPv4 addresses as a 4-byte binary value. It is used in conjunction with the `ipv4_*` family of functions. * **[IPv6](/cloud/4.x/sql-manual/basic-element/sql-data-types/ip/IPV6)** : It stores IPv6 addresses as a 16-byte binary value. It is used in conjunction with the `ipv6_*` family of functions. On This Page * Numeric Types * BOOLEAN * Integer * Floating-point * Fixed-point * Date Types * String Types * Semi-Structured Types * Aggregation Types * IP Types --- # Source: https://docs.velodb.io/cloud/4.x/sql-manual/sql-functions/scalar-functions/numeric-functions/abs Version: 4.x On this page # ABS ## Description​ Returns the absolute value of `x` ## Syntax​ ABS() ## Parameters​ Parameter| Description| ``| The value for which the absolute value is to be calculated ---|--- ## Return Value​ The absolute value of parameter `x`. ## Example​ select abs(-2); +---------+ | abs(-2) | +---------+ | 2 | +---------+ select abs(3.254655654); +------------------+ | abs(3.254655654) | +------------------+ | 3.254655654 | +------------------+ select abs(-3254654236547654354654767); +---------------------------------+ | abs(-3254654236547654354654767) | +---------------------------------+ | 3254654236547654354654767 | +---------------------------------+ On This Page * Description * Syntax * Parameters * Return Value * Example --- # Source: https://docs.velodb.io/cloud/4.x/sql-manual/sql-functions/scalar-functions/numeric-functions/acos Version: 4.x On this page # ACOS ## Description​ Returns the arc cosine of `x`, or `NULL` if `x` is not in the range `-1` to `1`. ## Syntax​ ACOS() ## Parameters​ Parameter| Description| ``| The value for which the acos value is to be calculated ---|--- ## Return Value​ The acos value of parameter `x`. ## Example​ select acos(1); +-----------+ | acos(1.0) | +-----------+ | 0 | +-----------+ select acos(0); +--------------------+ | acos(0.0) | +--------------------+ | 1.5707963267948966 | +--------------------+ select acos(-2); +------------+ | acos(-2.0) | +------------+ | nan | +------------+ On This Page * Description * Syntax * Parameters * Return Value * Example --- # Source: https://docs.velodb.io/cloud/4.x/sql-manual/sql-functions/scalar-functions/string-functions/concat Version: 4.x On this page # CONCAT ## Description​ Concatenates multiple strings. 
Special cases: * If any of the parameter values is NULL, the result returned is NULL ## Syntax CONCAT ( `<str>` [ , `<str>` ... ] ) ## Parameters Parameter| Description ---|--- `<str>`| The strings to be concatenated ## Return value The string formed by concatenating all the parameters. Special cases: * If any of the parameter values is NULL, the result returned is NULL ## Example SELECT CONCAT("a", "b"),CONCAT("a", "b", "c"),CONCAT("a", null, "c") +------------------+-----------------------+------------------------+ | concat('a', 'b') | concat('a', 'b', 'c') | concat('a', NULL, 'c') | +------------------+-----------------------+------------------------+ | ab | abc | NULL | +------------------+-----------------------+------------------------+ On This Page * Description * Syntax * Parameters * Return value * Example --- # Source: https://docs.velodb.io/cloud/4.x/sql-manual/sql-functions/scalar-functions/string-functions/length Version: 4.x On this page # LENGTH ## Description Returns the number of bytes in a string. ## Syntax LENGTH ( `<str>` ) ## Parameters Parameter| Description ---|--- `<str>`| The string whose bytes need to be calculated ## Return Value The number of bytes in the string `<str>`. ## Example SELECT LENGTH("abc"),length("中国") +---------------+------------------+ | length('abc') | length('中国') | +---------------+------------------+ | 3 | 6 | +---------------+------------------+ On This Page * Description * Syntax * Parameters * Return Value * Example --- # Source: https://docs.velodb.io/cloud/4.x/sql-manual/sql-statements/data-query/SELECT Version: 4.x On this page # SELECT ## Description Mainly introduces the use of the SELECT syntax. Grammar: SELECT [hint_statement, ...] [ALL | DISTINCT | DISTINCTROW | ALL EXCEPT ( col_name1 [, col_name2, col_name3, ...] )] select_expr [, select_expr ...] [FROM table_references [PARTITION partition_list] [TABLET tabletid_list] [TABLESAMPLE sample_value [ROWS | PERCENT] [REPEATABLE pos_seek]] [WHERE where_condition] [GROUP BY [GROUPING SETS | ROLLUP | CUBE] {col_name | expr | position}] [HAVING where_condition] [ORDER BY {col_name | expr | position} [ASC | DESC], ...] [LIMIT {[offset,] row_count | row_count OFFSET offset}] [INTO OUTFILE 'file_name'] 1. **Syntax Description:** 1. select_expr, ... Columns retrieved and displayed in the result; when using an alias, AS is optional. 2. table_references The target tables to retrieve from (one or more tables, including temporary tables generated by subqueries). 3. where_condition The retrieval condition (expression). If there is a WHERE clause, the condition filters the row data. where_condition is an expression that evaluates to true for each row to be selected. Without the WHERE clause, the statement selects all rows. In WHERE expressions, you can use any MySQL-supported functions and operators except aggregate functions. 4. `ALL | DISTINCT`: controls deduplication of the result set; ALL returns all matching rows, DISTINCT/DISTINCTROW removes duplicate rows; the default is ALL. 5. `ALL EXCEPT`: Filter on the full (ALL) result set; EXCEPT specifies the name of one or more columns to be excluded from the full result set. All matching column names will be ignored in the output. This feature is supported since Apache Doris version 1.2. 6. `INTO OUTFILE 'file_name'`: save the result to a new file (which must not already exist); the difference lies only in the saved file format. 7. `GROUP BY ... HAVING`: groups the result set; when HAVING appears, it filters the results produced by GROUP BY.
`Grouping Sets`, `Rollup`, `Cube` are extensions of group by, please refer to [GROUPING SETS DESIGN](https://doris.apache.org/community/design/grouping_sets_design) for details. 8. `Order by`: Sort the final result, Order by sorts the result set by comparing the size of one or more columns. Order by is a time-consuming and resource-intensive operation, because all data needs to be sent to 1 node before it can be sorted, and the sorting operation requires more memory than the non-sorting operation. If you need to return the top N sorted results, you need to use the LIMIT clause; in order to limit memory usage, if the user does not specify the LIMIT clause, the first 65535 sorted results are returned by default. 9. `Limit n`: limit the number of lines in the output result, `limit m,n` means output n records starting from the mth line.You should use `order by` before you use `limit m,n`, otherwise the data may be inconsistent each time it is executed. 10. The `Having` clause does not filter the row data in the table, but filters the results produced by the aggregate function. Typically `having` is used with aggregate functions (eg :`COUNT(), SUM(), AVG(), MIN(), MAX()`) and `group by` clauses. 11. SELECT supports explicit partition selection using PARTITION containing a list of partitions or subpartitions (or both) following the name of the table in `table_reference` 12. `[TABLET tids] TABLESAMPLE n [ROWS | PERCENT] [REPEATABLE seek]`: Limit the number of rows read from the table in the FROM clause, select a number of Tablets pseudo-randomly from the table according to the specified number of rows or percentages, and specify the number of seeds in REPEATABLE to return the selected samples again. In addition, you can also manually specify the TableID, Note that this can only be used for OLAP tables. 13. `hint_statement`: hint in front of the selectlist indicates that hints can be used to influence the behavior of the optimizer in order to obtain the desired execution plan. Details refer to [joinHint using document] () **Syntax constraints:** 1. SELECT can also be used to retrieve calculated rows without referencing any table. 2. All clauses must be ordered strictly according to the above format, and a HAVING clause must be placed after the GROUP BY clause and before the ORDER BY clause. 3. The alias keyword AS is optional. Aliases can be used for group by, order by and having 4. Where clause: The WHERE statement is executed to determine which rows should be included in the GROUP BY section, and HAVING is used to determine which rows in the result set should be used. 5. The HAVING clause can refer to the total function, but the WHERE clause cannot refer to, such as count, sum, max, min, avg, at the same time, the where clause can refer to other functions except the total function. Column aliases cannot be used in the Where clause to define conditions. 6. Group by followed by with rollup can count the results one or more times. 
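For example, a minimal sketch of the `WITH ROLLUP` behavior from point 6, assuming a `tb_book` table with `type` and `price` columns like the one used in the examples below:

```sql
-- One aggregated row per type, plus an extra row (type = NULL) holding the grand total
SELECT type, SUM(price) AS total_price
FROM tb_book
GROUP BY type WITH ROLLUP;
```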
**Join query:** Doris supports JOIN syntax JOIN table_references: table_reference [, table_reference] … table_reference: table_factor | join_table table_factor: tbl_name [[AS] alias] [{USE|IGNORE|FORCE} INDEX (key_list)] | ( table_references ) | { OJ table_reference LEFT OUTER JOIN table_reference ON conditional_expr } join_table: table_reference [INNER | CROSS] JOIN table_factor [join_condition] | table_reference LEFT [OUTER] JOIN table_reference join_condition | table_reference NATURAL [LEFT [OUTER]] JOIN table_factor | table_reference RIGHT [OUTER] JOIN table_reference join_condition | table_reference NATURAL [RIGHT [OUTER]] JOIN table_factor join_condition: ON conditional_expr **UNION Grammar:** SELECT ... UNION [ALL| DISTINCT] SELECT ...... [UNION [ALL| DISTINCT] SELECT ...] `UNION` is used to combine the results of multiple `SELECT` statements into a single result set. The column names in the first `SELECT` statement are used as the column names in the returned results. The selected columns listed in the corresponding position of each `SELECT` statement should have the same data type. (For example, the first column selected by the first statement should be of the same type as the first column selected by other statements.) The default behavior of `UNION` is to remove duplicate rows from the result. The optional `DISTINCT` keyword has no effect other than the default, since it also specifies duplicate row removal. With the optional `ALL` keyword, no duplicate row removal occurs, and the result includes all matching rows in all `SELECT` statements. **WITH statement** : To specify common table expressions, use the `WITH` clause with one or more comma-separated clauses. Each subclause provides a subquery that generates a result set and associates a name with the subquery. The following example defines CTEs named `cte1` and `cte2` in the `WITH` clause, and refers to them in the top-level `SELECT` that follows the `WITH` clause: WITH cte1 AS (SELECT a,b FROM table1), cte2 AS (SELECT c,d FROM table2) SELECT b,d FROM cte1 JOIN cte2 WHERE cte1.a = cte2.c; In a statement containing the `WITH` clause, each CTE name can be referenced to access the corresponding CTE result set. CTE names can be referenced in other CTEs, allowing CTEs to be defined based on other CTEs. Recursive CTE is currently not supported. ## Example 1. Query the names of students whose ages are 18, 20, 25 select Name from student where age in (18,20,25); 2. ALL EXCEPT Example -- Query all information except the students' age select * except(age) from student; 3. GROUP BY Example --Query the tb_book table, group by type, and find the average price of each type of book select type,avg(price) from tb_book group by type; 4. DISTINCT Use --Query the tb_book table to remove duplicate type data select distinct type from tb_book; 5. ORDER BY Example Sort query results in ascending (default) or descending (DESC) order. When sorting ascending, NULLs are listed first; when descending, NULLs are listed last --Query all records in the tb_book table, sort them in descending order by id, and display three records select * from tb_book order by id desc limit 3; 6. LIKE fuzzy query Supports fuzzy matching with two wildcards: `%` and `_`; `%` matches zero or more characters, `_` matches exactly one character --Find all books whose second character is h select * from tb_book where name like('_h%'); 7. LIMIT limits the number of result rows --1. Display 3 records in descending order select * from tb_book order by price desc limit 3; --2.
Display 4 records from id=1 select * from tb_book where id limit 1,4; 8. CONCAT join multiple columns --Combine name and price into a new string output select id,concat(name,":",price) as info,type from tb_book; 9. Using functions and expressions --Calculate the total price of various books in the tb_book table select sum(price) as total,type from tb_book group by type; --20% off price select *,(price * 0.8) as "20%" from tb_book; 10. UNION Example SELECT a FROM t1 WHERE a = 10 AND B = 1 ORDER BY a LIMIT 10 UNION SELECT a FROM t2 WHERE a = 11 AND B = 2 ORDER BY a LIMIT 10; 11. WITH clause example WITH cte AS ( SELECT 1 AS col1, 2 AS col2 UNION ALL SELECT 3, 4 ) SELECT col1, col2 FROM cte; 12. JOIN Example SELECT * FROM t1 LEFT JOIN (t2, t3, t4) ON (t2.a = t1.a AND t3.b = t1.b AND t4.c = t1.c) Equivalent to SELECT * FROM t1 LEFT JOIN (t2 CROSS JOIN t3 CROSS JOIN t4) ON (t2.a = t1.a AND t3.b = t1.b AND t4.c = t1.c) 13. INNER JOIN SELECT t1.name, t2.salary FROM employee AS t1 INNER JOIN info AS t2 ON t1.name = t2.name; SELECT t1.name, t2.salary FROM employee t1 INNER JOIN info t2 ON t1.name = t2.name; 14. LEFT JOIN SELECT left_tbl.* FROM left_tbl LEFT JOIN right_tbl ON left_tbl.id = right_tbl.id WHERE right_tbl.id IS NULL; 15. RIGHT JOIN mysql> SELECT * FROM t1 RIGHT JOIN t2 ON (t1.a = t2.a); +------+------+------+------+ | a | b | a | c | +------+------+------+------+ | 2 | y | 2 | z | | NULL | NULL | 3 | w | +------+------+------+------+ 16. TABLESAMPLE --Pseudo-randomly sample 1000 rows in t1. Note that several Tablets are actually selected according to the statistics of the table, and the total number of selected Tablet rows may be greater than 1000, so if you want to explicitly return 1000 rows, you need to add Limit. SELECT * FROM t1 TABLET(10001) TABLESAMPLE(1000 ROWS) REPEATABLE 2 limit 1000; ## Keywords SELECT ## Best Practice 1. Some additional knowledge about the SELECT clause * An alias can be specified for select_expr using AS alias_name. Aliases are used as column names in expressions and can be used in GROUP BY, ORDER BY or HAVING clauses. Using the AS keyword when specifying column aliases is a good habit. * table_references after FROM indicates one or more tables participating in the query. If more than one table is listed, a JOIN operation is performed. And for each specified table, you can define an alias for it. * The selected columns after SELECT can be referenced in ORDER BY and GROUP BY by column name, column alias or integer (starting from 1) representing the column position SELECT college, region, seed FROM tournament ORDER BY region, seed; SELECT college, region AS r, seed AS s FROM tournament ORDER BY r, s; SELECT college, region, seed FROM tournament ORDER BY 2, 3; * If ORDER BY appears in a subquery and also applies to the outer query, the outermost ORDER BY takes precedence. * If GROUP BY is used, the grouped columns are automatically sorted in ascending order (as if there were an ORDER BY statement with the same columns). If you want to avoid the overhead of this automatic sorting in GROUP BY, adding ORDER BY NULL can solve it: SELECT a, COUNT(b) FROM test_table GROUP BY a ORDER BY NULL; * When sorting columns in a SELECT using ORDER BY or GROUP BY, the server sorts values using only the initial number of bytes indicated by the max_sort_length system variable. * HAVING clauses are generally applied last, just before the result set is returned to the MySQL client, and are not optimized (while LIMIT is applied after HAVING).
(LIMIT is applied after HAVING). The SQL standard requires that HAVING must refer to a column in the GROUP BY list or a column used in an aggregate function. However, MySQL extends this by allowing HAVING to refer to columns in the SELECT clause list, as well as columns from outer subqueries. A warning is generated if the column referenced by HAVING is ambiguous. In the following statement, col2 is ambiguous: SELECT COUNT(col1) AS col2 FROM t GROUP BY col2 HAVING col2 = 2; * Remember not to use HAVING where WHERE should be used. HAVING is paired with GROUP BY. * The HAVING clause can refer to aggregate functions, while WHERE cannot. SELECT user, MAX(salary) FROM users GROUP BY user HAVING MAX(salary) > 10; * The LIMIT clause can be used to constrain the number of rows returned by a SELECT statement. LIMIT can have one or two arguments, both of which must be non-negative integers. /*Retrieve rows 6~15 of the result set*/ SELECT * FROM tbl LIMIT 5,10; /*If you want to retrieve all rows after a certain offset, you can set a very large constant for the second parameter. The following query fetches all data from row 96 onwards */ SELECT * FROM tbl LIMIT 95,18446744073709551615; /*If LIMIT has only one parameter, the parameter specifies the number of rows that should be retrieved, and the offset defaults to 0, that is, starting from the first row*/ * SELECT...INTO allows query results to be written to a file 2. Modifiers of the SELECT keyword * deduplication The ALL and DISTINCT modifiers specify whether to deduplicate rows in the result set (deduplication applies to rows, not columns). ALL is the default modifier, that is, all rows that meet the requirements are retrieved. DISTINCT removes duplicate rows. 3. The main advantages of subqueries * Subqueries allow structured queries so that each part of a statement can be isolated. * Some operations require complex unions and associations. Subqueries provide other ways to perform these operations. 4. Speed up queries * Use Doris's partition and bucket columns as data filtering conditions as much as possible to reduce the scope of data scanning * Make full use of Doris's prefix index fields as data filter conditions to speed up query speed 5. UNION * Using only the union keyword has the same effect as using union distinct. Since the deduplication work is memory-intensive, queries using the union all operation will be faster and consume less memory. If users want to perform order by and limit operations on the returned result set, they need to put the union operation in a subquery, then select from the subquery, and finally put the order by and limit outside the subquery. select * from (select age from student_01 union all select age from student_02) as t1 order by age limit 4; +-------------+ | age | +-------------+ | 18 | | 19 | | 20 | | 21 | +-------------+ 4 rows in set (0.01 sec) 6. JOIN * In the inner join condition, in addition to equal-value joins, unequal-value joins are also supported. For performance reasons, it is recommended to use equal-value joins. * Other joins only support equal-value joins On This Page * Description * Example * Keywords * Best Practice --- # Source: https://docs.velodb.io/cloud/4.x/use-cases/observability/overview Version: 4.x On this page # Overview ## What Is Observability?​ Observability refers to the ability to infer a system's internal state from its external output data. An observability platform collects, stores, and visualizes three core types of data: Logs, Traces, and Metrics.
This helps teams gain a comprehensive understanding of the operational status of distributed systems, supports resource optimization, fault prediction, root cause analysis, improves system reliability, and enhances user experience. ## Why Observability Is Becoming Increasingly Important​ Observability platforms have several critical use cases that are vital for improving system stability, optimizing operations efficiency, and enabling business innovation. 1. **Fault Diagnosis and Root Cause Analysis** : Real-time monitoring, anomaly detection, and tracing capabilities enable quick identification and analysis of faults. For example, in the financial industry, combining observability with transaction tracing and AI technologies can shorten recovery time and ensure business continuity. It also supports chaos engineering to simulate failure scenarios and validate system fault tolerance. 2. **Performance Optimization and Resource Planning** : Analyzing system resource utilization and response times helps identify performance bottlenecks and dynamically adjust configurations (e.g., load balancing, auto-scaling). Historical data can be used to predict resource needs, optimize cloud resource allocation, and reduce costs. 3. **Business Decision Support** : Correlating IT performance data with business outcomes (such as user retention rates and transaction volumes) helps formulate business strategies. For instance, analyzing user experience metrics can guide product feature improvements. 4. **Security and Compliance Monitoring** : Detects abnormal behaviors (e.g., zero-day attacks) and triggers automated responses to enhance system security. At the same time, log auditing ensures compliance with regulatory requirements. 5. **DevOps Collaboration** : During canary releases, traffic tagging enables tracking of new version behavior. Combined with call chain analysis, it informs release progression and helps developers optimize code performance, reducing production incidents. **The growing importance of observability in recent years is mainly driven by two factors:** 1. **Increasing Complexity of Business and IT Systems** : With the development of cloud computing and microservices, business systems are becoming increasingly complex. For example, a GenAI application request might involve dozens of services such as App, service gateway, authentication service, billing service, RAG engine, Agent engine, vector database, business database, distributed cache, message queue, and large model APIs. Traditional methods like checking server status via SSH and analyzing logs are no longer effective in such complex environments. Observability platforms unify Log, Trace, and Metric data collection and storage, providing centralized visualization and rapid issue investigation. 2. **Higher Requirements for Business Reliability** : System failures have increasingly high impacts on user experience. Therefore, the efficiency of fault detection and recovery has become more critical. Observability provides full data visibility and panoramic analytics, allowing teams to quickly locate root causes, reduce downtime, and ensure service availability. Moreover, with global data analytics and forecasting, potential resource bottlenecks can be identified early, preventing failures before they occur. ## How to Choose an Observability Solution​ Observability data has several characteristics, and addressing the challenges of massive data storage and analysis is key to any observability solution. 1. 
**High Storage Volume and Cost Sensitivity** : Observability data, especially Logs and Traces, are typically enormous in volume and generated continuously. In medium-to-large enterprises, daily data generation often reaches terabytes or even petabytes. To meet business or regulatory requirements, data must often be stored for months or even years, leading to storage volumes reaching the PB or EB scale and resulting in significant storage costs. Over time, the value of this data diminishes, making cost efficiency increasingly important. 2. **High Throughput Writes with Real-Time Requirements** : Handling daily ingestion of TB or PB-scale data often requires write throughput ranging from 1–10 GB/s or millions to tens of millions of records per second. Simultaneously, due to the need for real-time troubleshooting and security investigations, platforms must support sub-second write latencies to ensure real-time data availability. 3. **Real-Time Analysis and Full-Text Search Capabilities** : Logs and Traces contain large amounts of textual data. Quickly searching for keywords and phrases is essential. Traditional full-scan and string-matching approaches often fail to deliver real-time performance at this scale, especially under high-throughput, low-latency ingestion conditions. Thus, building inverted indexes tailored for text becomes crucial for achieving sub-second query responsiveness. 4. **Dynamic Data Schema and Frequent Expansion Needs** : Logs originally existed as unstructured free text but evolved into semi-structured JSON formats. Producers frequently modify JSON fields, making schema flexibility essential. Traditional databases and data warehouses struggle to handle such dynamic schemas efficiently, while datalake systems offer storage flexibility but fall short in real-time analytical performance. 5. **Integration with Multiple Data Sources and Analysis Tools** : There are many observability ecosystem tools for data collection and visualization. The storage and analysis engine must integrate seamlessly with these diverse tools. Given options like Elasticsearch, ClickHouse, Doris, and logging services provided by cloud vendors, how should one choose? Here are the key evaluation criteria: ### 1\. **Performance: Includes Write and Query Performance**​ Since observability is often used in urgent situations like troubleshooting, queries must respond quickly—especially for textual content in Logs and Traces, which require real-time full-text search to support iterative exploration. Additionally, users must be able to query near real-time data—queries limited to data from hours or minutes ago are insufficient; fresh data from the past few seconds is needed. * **Elasticsearch** is known for inverted indexing and full-text search, offering sub-second retrieval. However, it struggles with high-throughput writes, often rejecting writes or experiencing high latency during peak loads. Its aggregation and statistical analysis performance is also relatively weak. * **Cloud Logging Services** provide sufficient performance through rich resources but come with higher costs. * **ClickHouse** delivers high write throughput and high aggregation query performance using columnar storage and vectorized execution. However, its full-text search performance lags behind Elasticsearch and Doris by multiples, and the feature remains experimental and unsuitable for production use. * **Doris** , leveraging columnar storage and vectorized execution, optimizes inverted indexing for observability scenarios.
It offers better performance than Elasticsearch, with ~5x faster writes and ~2x faster queries. Aggregation performance is up to 6–21x better than Elasticsearch. ### 2\. **Cost: Includes Storage and Compute Costs**​ Observability data volumes are huge, especially Logs and Traces. Medium-to-large enterprises generate TBs or even PBs of data daily. Due to business or regulatory needs, data must be retained for months or years, pushing storage requirements into the PB or even EB range. Compared to business-critical data, observability data has lower value density, and its value decreases over time, making cost sensitivity critical. Additionally, processing massive volumes of data incurs substantial compute costs. * **Elasticsearch** suffers from high costs. Its storage model combines row-based raw data, inverted indexes, and docvalue columnar storage, with typical compression ratios around 1.5:1. High CPU overhead from JVM and index construction further increases compute costs. * **Doris** includes numerous optimizations for observability scenarios. Compared to Elasticsearch, it reduces total cost by 50–80%. These include simplified inverted indexing, columnar storage with ZSTD compression (5:1–10:1), cold-hot tiered storage, single-replica writes, time-series compaction to reduce write amplification, and vectorized index building. * **ClickHouse** uses columnar storage and vectorized engines, delivering lower storage and write costs. * **Cloud Logging Services** are as expensive as Elasticsearch. ### 3\. **Openness: Includes Open Source and Multi-Cloud Neutrality**​ When selecting an observability platform, consider openness, including whether it's open source and multi-cloud neutral. * **Elasticsearch** is an open-source project maintained by Elastic, available on multiple clouds. Its ELK ecosystem is self-contained and difficult to integrate with other ecosystems; e.g., Kibana only supports Elasticsearch and is hard to extend. * **Doris** is an Apache Top-Level open-source project, supported by major global cloud providers. It integrates well with OpenTelemetry, Grafana, and ELK, maintaining openness and neutrality. * **ClickHouse** is an open-source project maintained by ClickHouse Inc., available across clouds. While it supports OpenTelemetry and Grafana, its acquisition of an observability company raises concerns about future neutrality. * **Cloud Logging Services** are tied to their respective clouds, not open source, and differ between vendors, limiting consistent experiences and migration flexibility. ### 4\. **Ease of Use: Includes Manageability and Usability**​ Due to the volume of data, observability platforms usually adopt distributed architectures. Ease of deployment, scaling, upgrades, and other management tasks significantly affect scalability. The interface provided by the system determines developer efficiency and user experience. * **Elasticsearch**'s Kibana web UI is very user-friendly and manageable. However, its DSL query language is complex and hard to learn, posing integration and development challenges. * **Doris** provides an interactive analysis interface similar to Kibana and integrates natively with Grafana and Kibana (coming soon). Its SQL is standard and MySQL-compatible, making it developer- and analyst-friendly. Doris has a simple architecture that’s easy to deploy and maintain, supports online scaling without service interruption and automatic load balancing, and includes a visual Cluster Manager. * **ClickHouse** provides SQL interfaces but uses its own syntax.
Maintenance is challenging due to exposed concepts like local tables vs. distributed tables and lack of automatic rebalancing during scaling. Typically, developing a custom cluster management system is required. * **Cloud Logging Services** offer SaaS convenience—users don't manage infrastructure and enjoy ease of use. Based on the above analysis, **Doris** achieves high-performance ingestion and queries while keeping costs low. Its SQL interface is easy to use, and its architecture is simple to maintain and scale. It also ensures consistent experiences across multiple clouds, making it an optimal choice for building an observability platform. ## Observability Solution Based on Doris​ ### System Architecture​ Apache Doris is a modern data warehouse with an MPP distributed architecture, integrating vectorized execution engines, CBO optimizers, advanced indexing, and materialized views. It supports ultra-fast querying and analysis on large- scale real-time datasets, delivering an exceptional analytical experience. Through continuous technical innovation, Doris has achieved top rankings in authoritative benchmarks such as ClickBench (single table), TPC-H, and TPC-DS (multi tables). For observability scenarios, Doris introduces inverted indexing and ultra-fast full-text search capabilities, achieving optimized write performance and storage efficiency. This allows users to build high-performance, low-cost, and open observability platforms based on Doris. A Doris-based observability platform consists of three core components: * **Data Collection and Preprocessing** : Supports various observability data collection tools, including OpenTelemetry and ELK ecosystem tools like Logstash and Filebeat. Log, Trace, and Metric data are ingested into Doris via HTTP APIs. * **Data Storage and Analysis Engine** : Doris provides unified, high-performance, low-cost storage for observability data and exposes powerful search and analysis capabilities via SQL interfaces. * **Query Analysis and Visualization** : Integrates with popular observability visualization tools such as Grafana and Kibana (from the ELK stack), offering intuitive interfaces for searching, analyzing, alerting, and achieving real-time monitoring and rapid response. ![doris-observabiltiy-architecture](/assets/images/observability-architecture- doris-861ac9c6b18a7e6e1070f10e1d398984.png) ### Key Features and Advantages​ #### **High Performance**​ * **High Throughput, Low Latency Writes** : Supports stable ingestion of PB-scale (10GB/s) Log, Trace, and Metric data daily with sub-second latency. * **High-Performance Inverted Index and Full-Text Search** : Supports inverted indexing and full-text search, delivering sub-second response times for common log keyword searches—3–10x faster than ClickHouse. * **High-Performance Aggregation Analysis** : Utilizing MPP distributed architecture and vectorized pipeline execution engines, Doris excels in trend analysis and alerting in observability scenarios, leading globally in ClickBench tests. #### **Low Cost**​ * **High Compression Ratio and Low-Cost Storage** : Supports PB-scale storage with compression ratios of 5:1 – 10:1 (including indexes), reducing storage costs by 50–80% compared to Elasticsearch. Cold data can be offloaded to S3/HDFS, cutting storage costs by another 50%. * **Low-Cost Writes** : Consumes 70% less CPU than Elasticsearch for the same write throughput. 
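To make the inverted-index full-text search described under **High Performance** above more concrete, here is a minimal sketch that is not taken from this guide: the table, column, and index names, the bucket count, and the replication setting are all illustrative assumptions.

```sql
-- Illustrative sketch only: names and parameters are examples, not prescribed values.
CREATE TABLE app_log (
    ts      DATETIME NOT NULL,
    host    VARCHAR(64),
    level   VARCHAR(16),
    message STRING,
    -- Inverted index with an English word parser enables keyword search on message
    INDEX idx_message (message) USING INVERTED PROPERTIES("parser" = "english")
)
DUPLICATE KEY(ts)
DISTRIBUTED BY RANDOM BUCKETS 16
PROPERTIES ("replication_num" = "1");  -- adjust replication to match your cluster

-- Keyword search served by the inverted index:
-- MATCH_ALL requires every listed term to appear in the message.
SELECT ts, host, message
FROM app_log
WHERE level = 'ERROR'
  AND message MATCH_ALL 'timeout exception'
ORDER BY ts DESC
LIMIT 100;
```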
#### **Flexible Schema**​ * **Schema Changes at the Top Level** : Users can use Light Schema Change to add or drop columns or indexes (ADD/DROP COLUMN/INDEX), and schema modifications can be completed in seconds. When designing an observability platform, users only need to consider which fields and indexes are needed at the current stage. * **Internal Field Changes** : A semi-structured data type called VARIANT is specially designed for scalable JSON data. It can automatically identify field names and types within JSON, and further split frequently occurring fields into columnar storage, improving compression ratio and analytical performance. Compared to Elasticsearch’s Dynamic Mapping, VARIANT allows changes in the data type of a single field. A hedged sketch of how a VARIANT column might be used follows below. #### **User-Friendly**​ * **Standard SQL Interface** : Doris supports standard SQL and is compatible with MySQL protocols and syntax, making it accessible to engineers and analysts. * **Integration with Observability Ecosystems** : Compatible with OpenTelemetry and ELK ecosystems, supporting Grafana and Kibana (coming soon) visualization tools for seamless data collection and analysis. * **Easy Operations** : Supports online scaling, automatic load balancing, and visual management via Cluster Manager. #### **Openness**​ * **Open Source** : Apache Doris is a top-level open-source project adopted by over 5000 companies worldwide, supporting OpenTelemetry, Grafana, and other observability ecosystems. * **Multi-Cloud Neutral** : Major cloud providers offer Doris SaaS services, ensuring consistent experiences across clouds.
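The following is a minimal sketch of the VARIANT type mentioned under **Flexible Schema**; the table, column, and JSON field names are hypothetical examples, not values from this guide.

```sql
-- Illustrative sketch only: names and values are examples.
CREATE TABLE otel_span (
    trace_id   VARCHAR(64),
    span_id    VARCHAR(32),
    start_time DATETIME,
    attributes VARIANT          -- semi-structured JSON attributes
)
DUPLICATE KEY(trace_id)
DISTRIBUTED BY HASH(trace_id) BUCKETS 16
PROPERTIES ("replication_num" = "1");  -- adjust replication to match your cluster

-- Rows may carry different JSON fields; no schema change is required.
INSERT INTO otel_span VALUES
    ('t1', 's1', '2024-01-01 10:00:00', '{"http_method": "GET", "http_status": 500}'),
    ('t1', 's2', '2024-01-01 10:00:01', '{"db_system": "mysql", "db_statement": "SELECT 1"}');

-- Sub-fields are addressed with bracket syntax and can be filtered or aggregated.
SELECT count(*)
FROM otel_span
WHERE CAST(attributes['http_status'] AS INT) >= 500;
```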
### Demo & Screenshots​ We demonstrate the Doris-based observability platform using a comprehensive [demo](https://github.com/apache/doris-opentelemetry-demo) from the OpenTelemetry community. The observed business system simulates an e-commerce website composed of more than ten modules, including frontend, authentication, cart, payment, logistics, advertising, recommendation, and risk control, reflecting a high level of system complexity and thus presenting significant challenges for observability data collection, storage, and analysis. The Load Generator tool sends continuous requests to the entry service, generating vast volumes of observability data (Logs, Traces, Metrics). These data are collected using OpenTelemetry SDKs in various languages, sent to the OpenTelemetry Collector, preprocessed by Processors, and finally written into Doris via the OpenTelemetry Doris Exporter. Observability visualization tools such as Grafana connect to Doris through the MySQL interface, providing visualized query and analysis capabilities. [![Doris OpenTelemetry Demo](/assets/images/otel-demo-doris-e5b69b639127ec2936bb8154e2738d6f.png)](https://youtu.be/LrR4SNyAlg8) Grafana connects to Doris via the MySQL datasource, offering unified visualization and analysis of Logs, Traces, and Metrics, including cross-analysis between Logs and Traces. * **Log** ![log-visualization](/assets/images/log-visualization-4d229267456f4500ab7773288b00831f.png) * **Trace** ![trace-visualization](/assets/images/trace-visualization-f89d819cc08b54ab4e5844d913ca543d.png) * **Metrics** ![metrics-visualization](/assets/images/metrics-visualization-f333290f209fa044fdafdee89e08659a.png) While Grafana's log visualization and analysis capabilities are relatively basic compared to Kibana, third-party vendors have implemented Kibana-like Discover features. These will soon be integrated into Grafana's Doris datasource, enhancing unified observability visualization. Future enhancements will include Elasticsearch protocol compatibility, enabling native Kibana connections to Doris. For ELK users, replacing Elasticsearch with Doris maintains existing logging and visualization habits while significantly reducing costs and improving efficiency. ![studio-visualization](/assets/images/studio-discover-8c09c2212aa3b8a0662b30ef0816ce0f.jpeg) On This Page * What Is Observability? * Why Observability Is Becoming Increasingly Important * How to Choose an Observability Solution * 1\. **Performance: Includes Write and Query Performance** * 2\. **Cost: Includes Storage and Compute Costs** * 3\. **Openness: Includes Open Source and Multi-Cloud Neutrality** * 4\. **Ease of Use: Includes Manageability and Usability** * Observability Solution Based on Doris * System Architecture * Key Features and Advantages * Demo & Screenshots --- # Source: https://docs.velodb.io/cloud/4.x/user-guide/admin-manual/system-tables/overview Version: 4.x On this page # Overview An Apache Doris cluster has multiple built-in system databases that store metadata about the Doris system itself. ### information_schema​ All tables under the `information_schema` database are virtual tables and do not have physical entities. These system tables contain metadata about the Doris cluster and all its database objects, including databases, tables, columns, permissions, etc. They also include functional status information like Workload Group, Task, etc. There is an `information_schema` database under each Catalog, containing metadata only for the corresponding Catalog's databases and tables. All tables in the `information_schema` database are read-only, and users cannot modify, drop, or create tables in this database. By default, all users have read permissions for all tables in this database, but the query results will vary based on the user's actual permissions. For example, if User A only has permissions for `db1.table1`, querying the `information_schema.tables` table will only return information related to `db1.table1`. ### mysql​ All tables under the `mysql` database are virtual tables and do not have physical entities. These system tables contain information such as permissions and are mainly used for MySQL ecosystem compatibility. There is a `mysql` database under each Catalog, but the contents of the tables are identical. All tables in the `mysql` database are read-only, and users cannot modify, delete, or create tables in this database. ### __internal_schema​ All tables under the `__internal_schema` database are actual tables in Doris, stored similarly to user-created data tables. When a Doris cluster is created, all system tables under this database are automatically created. By default, common users have read-only permissions for tables in this database. However, once granted, they can modify, delete, or create tables under this database. On This Page * information_schema * mysql * __internal_schema --- # Source: https://docs.velodb.io/cloud/4.x/user-guide/compute-storage-decoupled/before-deployment Version: 4.x On this page # Doris Compute-Storage Decoupled Deployment Preparation ## 1\. Overview​ This document describes the deployment preparation work for the Apache Doris compute-storage decoupled mode. The decoupled architecture aims to improve system scalability and performance, making it suitable for large-scale data processing scenarios. ## 2\. Architecture Components​ The Doris compute-storage decoupled architecture consists of three main modules: 1.
**Frontend (FE)** : Handles user requests and manages metadata. 2. **Backend (BE)** : Stateless compute nodes that execute query tasks. 3. **Meta Service (MS)** : Manages metadata operations and data recovery. ## 3\. System Requirements​ ### 3.1 Hardware Requirements​ * Minimum configuration: 3 servers * Recommended configuration: 5 or more servers ### 3.2 Software Dependencies​ * FoundationDB (FDB) version 7.1.38 or higher * OpenJDK 17 ## 4\. Deployment Planning​ ### 4.1 Testing Environment Deployment​ Deploy all modules on a single machine, not suitable for production environments. ### 4.2 Production Deployment​ * Deploy FDB on 3 or more machines * Deploy FE and Meta Service on 3 or more machines * Deploy BE on 3 or more machines When machine configurations are high, consider mixing FDB, FE, and Meta Service, but do not mix disks. ## 5\. Installation Steps​ ### 5.1 Install FoundationDB​ This section provides a step-by-step guide to configure, deploy, and start the FoundationDB (FDB) service using the provided scripts `fdb_vars.sh` and `fdb_ctl.sh`. You can download [doris tools](http://apache-doris-releases.oss- accelerate.aliyuncs.com/apache-doris-3.0.2-tools.tar.gz) and get `fdb_vars.sh` and `fdb_ctl.sh` from `fdb` directory. #### 5.1.1 Machine Requirements​ Typically, at least 3 machines equipped with SSDs are required to form a FoundationDB cluster with dual data replicas and allow for single machine failures. If SSDs are not available, at least standard cloud disks or local disks with a standard POSIX-compliant file system must be used for data storage. Otherwise, FoundationDB may fail to operate properly - for instance, storage solutions like JuiceFS should not be used as the underlying storage for FoundationDB. tip If only for development/testing purposes, a single machine is sufficient. 
#### 5.1.2 `fdb_vars.sh` Configuration​ ##### Required Custom Settings​ Parameter| Description| Type| Example| Notes| `DATA_DIRS`| Specify the data directory for FoundationDB storage| Comma-separated list of absolute paths| `/mnt/foundationdb/data1,/mnt/foundationdb/data2,/mnt/foundationdb/data3`| \- Ensure directories are created before running the script \- SSD and separate directories are recommended for production environments| `FDB_CLUSTER_IPS`| Define cluster IPs| String (comma-separated IP addresses)| `172.200.0.2,172.200.0.3,172.200.0.4`| \- At least 3 IP addresses for production clusters \- The first IP will be used as the coordinator \- For high availability, place machines in different racks| `FDB_HOME`| Define the main directory for FoundationDB| Absolute path| `/fdbhome`| \- Default path is /fdbhome \- Ensure this path is absolute| `FDB_CLUSTER_ID`| Define the cluster ID| String| `SAQESzbh`| \- Each cluster ID must be unique \- Can be generated using `mktemp -u XXXXXXXX`| `FDB_CLUSTER_DESC`| Define the description of the FDB cluster| String| `dorisfdb`| \- It is recommended to change this to something meaningful for the deployment ---|---|---|---|--- ##### Optional Custom Settings​ Parameter| Description| Type| Example| Notes| `MEMORY_LIMIT_GB`| Define the memory limit for FDB processes in GB| Integer| `MEMORY_LIMIT_GB=16`| Adjust this value based on available memory resources and FDB process requirements| `CPU_CORES_LIMIT`| Define the CPU core limit for FDB processes| Integer| `CPU_CORES_LIMIT=8`| Set this value based on the number of available CPU cores and FDB process requirements ---|---|---|---|--- #### 5.1.3 Deploy FDB Cluster​ After configuring the environment with `fdb_vars.sh`, you can deploy the FDB cluster on each node using the `fdb_ctl.sh` script. ./fdb_ctl.sh deploy This command initiates the deployment process of the FDB cluster. ### 5.1.4 Start FDB Service​ Once the FDB cluster is deployed, you can start the FDB service on each node using the `fdb_ctl.sh` script. ./fdb_ctl.sh start This command starts the FDB service, making the cluster operational and obtaining the FDB cluster connection string, which can be used for configuring the MetaService. ### 5.2 Install OpenJDK 17​ 1. Download [OpenJDK 17](https://download.java.net/java/GA/jdk17.0.1/2a2082e5a09d4267845be086888add4f/12/GPL/openjdk-17.0.1_linux-x64_bin.tar.gz) 2. Extract and set the environment variable JAVA_HOME. ## 6\. Next Steps​ After completing the above preparations, please refer to the following documents to continue the deployment: 1. [Deployment](/cloud/4.x/user-guide/compute-storage-decoupled/compilation-and-deployment) 2. [Managing Compute Group](/cloud/4.x/user-guide/compute-storage-decoupled/managing-compute-cluster) 3. [Managing Storage Vault](/cloud/4.x/user-guide/compute-storage-decoupled/managing-storage-vault) ## 7\. Notes​ * Ensure time synchronization across all nodes * Regularly back up FoundationDB data * Adjust FoundationDB and Doris configuration parameters based on actual load * Use standard cloud disks or local disks with a POSIX-compliant file system for data storage; otherwise, FoundationDB may not function properly. * For example, storage solutions like JuiceFS should not be used as FoundationDB's storage backend. ## 8\. References​ * [FoundationDB Official Documentation](https://apple.github.io/foundationdb/index.html) * [Apache Doris Official Website](https://doris.apache.org/) On This Page * 1\. Overview * 2\. Architecture Components * 3\. 
System Requirements * 3.1 Hardware Requirements * 3.2 Software Dependencies * 4\. Deployment Planning * 4.1 Testing Environment Deployment * 4.2 Production Deployment * 5\. Installation Steps * 5.1 Install FoundationDB * 5.1.4 Start FDB Service * 5.2 Install OpenJDK 17 * 6\. Next Steps * 7\. Notes * 8\. References --- # Source: https://docs.velodb.io/cloud/4.x/user-guide/data-operate/export/export-overview Version: 4.x On this page # Export Overview The data export function is used to write the query result set or Doris table data into the specified storage system in the specified file format. The differences between the export function and the data backup function are as follows:

| | Data Export | Data Backup |
| --- | --- | --- |
| Final Storage Location | HDFS, Object Storage, Local File System | HDFS, Object Storage |
| Data Format | Open file formats such as Parquet, ORC, CSV | Doris internal storage format |
| Execution Speed | Moderate (requires reading data and converting to the target data format) | Fast (no parsing and conversion required, directly upload Doris data files) |
| Flexibility | Can flexibly define the data to be exported through SQL statements | Only supports table-level full backup |
| Use Cases | Result set download, data exchange between different systems | Data backup, data migration between Doris clusters |

## Choosing Export Methods​ Doris provides three different data export methods: * **SELECT INTO OUTFILE** : Supports the export of any SQL result set. * **EXPORT** : Supports the export of partial or full table data. * **MySQL DUMP** : Compatible with the MySQL dump command for data export. The similarities and differences between the three export methods are as follows:

| | SELECT INTO OUTFILE | EXPORT | MySQL DUMP |
| --- | --- | --- | --- |
| Synchronous/Asynchronous | Synchronous | Asynchronous (submit EXPORT tasks and check task progress via the SHOW EXPORT command) | Synchronous |
| Supports any SQL | Yes | No | No |
| Export specific partitions | Yes | Yes | No |
| Export specific tablets | Yes | No | No |
| Concurrent export | Supported with high concurrency (depends on whether the SQL statement has operators such as ORDER BY that need to be processed on a single node) | Supported with high concurrency (supports tablet-level concurrent export) | Not supported, single-threaded export only |
| Supported export data formats | Parquet, ORC, CSV | Parquet, ORC, CSV | MySQL Dump proprietary format |
| Supports exporting external tables | Yes | Partially supported | No |
| Supports exporting views | Yes | Yes | Yes |
| Supported export locations | S3, HDFS | S3, HDFS | LOCAL |

### SELECT INTO OUTFILE​ Suitable for the following scenarios: * Data needs to be exported after complex calculations, such as filtering, aggregation, joins, etc. * Suitable for scenarios that require synchronous tasks. ### EXPORT​ Suitable for the following scenarios: * Large-scale single table export, with simple filtering conditions. * Scenarios that require asynchronous task submission. ### MySQL Dump​ Suitable for the following scenarios: * Compatible with the MySQL ecosystem, requires exporting both table structure and data. * Only for development testing or scenarios with very small data volumes.
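As a rough sketch of the two SQL-based methods compared above, the statements below are illustrative only: the `sales` table, partition names, bucket, endpoint, and credentials are placeholders, not values from this guide.

```sql
-- SELECT INTO OUTFILE: synchronously export an arbitrary query result as Parquet.
SELECT order_id, amount
FROM sales
WHERE dt >= '2024-01-01'
INTO OUTFILE "s3://my-bucket/export/result_"
FORMAT AS PARQUET
PROPERTIES (
    "s3.endpoint"   = "s3.us-east-1.amazonaws.com",
    "s3.region"     = "us-east-1",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk"
);

-- EXPORT: asynchronously export selected partitions of a table;
-- task progress is then checked with SHOW EXPORT.
EXPORT TABLE sales PARTITION (p202401, p202402)
TO "s3://my-bucket/export/sales_"
PROPERTIES ("format" = "csv")
WITH S3 (
    "s3.endpoint"   = "s3.us-east-1.amazonaws.com",
    "s3.region"     = "us-east-1",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk"
);

SHOW EXPORT;
```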
## Export File Column Type Mapping​ Parquet and ORC file formats have their own data types. Doris's export function can automatically map Doris's data types to the corresponding data types in Parquet and ORC file formats. The CSV format does not have types, all data is output as text. The following table shows the mapping between Doris data types and Parquet, ORC file format data types:

* ORC

| Doris Type | Orc Type |
| --- | --- |
| boolean | boolean |
| tinyint | tinyint |
| smallint | smallint |
| int | int |
| bigint | bigint |
| largeInt | string |
| date | string |
| datev2 | string |
| datetime | string |
| datetimev2 | timestamp |
| float | float |
| double | double |
| char / varchar / string | string |
| decimal | decimal |
| struct | struct |
| map | map |
| array | array |
| json | string |
| variant | string |
| bitmap | binary |
| quantile_state | binary |
| hll | binary |

* Parquet

When Doris is exported to the Parquet file format, the Doris memory data is first converted to the Arrow memory data format, and then written out to the Parquet file format by Arrow.

| Doris Type | Arrow Type | Parquet Physical Type | Parquet Logical Type |
| --- | --- | --- | --- |
| boolean | boolean | BOOLEAN | |
| tinyint | int8 | INT32 | INT_8 |
| smallint | int16 | INT32 | INT_16 |
| int | int32 | INT32 | INT_32 |
| bigint | int64 | INT64 | INT_64 |
| largeInt | utf8 | BYTE_ARRAY | UTF8 |
| date | utf8 | BYTE_ARRAY | UTF8 |
| datev2 | date32 | INT32 | DATE |
| datetime | utf8 | BYTE_ARRAY | UTF8 |
| datetimev2 | timestamp | INT96/INT64 | TIMESTAMP(MICROS/MILLIS/SECONDS) |
| float | float32 | FLOAT | |
| double | float64 | DOUBLE | |
| char / varchar / string | utf8 | BYTE_ARRAY | UTF8 |
| decimal | decimal128 | FIXED_LEN_BYTE_ARRAY | DECIMAL(scale, precision) |
| struct | struct | | Parquet Group |
| map | map | | Parquet Map |
| array | list | | Parquet List |
| json | utf8 | BYTE_ARRAY | UTF8 |
| variant | utf8 | BYTE_ARRAY | UTF8 |
| bitmap | binary | BYTE_ARRAY | |
| quantile_state | binary | BYTE_ARRAY | |
| hll | binary | BYTE_ARRAY | |

> Note: In versions 2.1.11 and 3.0.7, you can specify the `parquet.enable_int96_timestamps` property to determine whether Doris's datetimev2 type uses Parquet's INT96 storage or INT64. INT96 is used by default. However, INT96 has been deprecated in the Parquet standard and is only used for compatibility with some older systems (such as versions before Hive 4.0).

On This Page * Choosing Export Methods * SELECT INTO OUTFILE * EXPORT * MySQL Dump * Export File Column Type Mapping --- # Source: https://docs.velodb.io/cloud/4.x/user-guide/data-operate/import/load-manual Version: 4.x On this page # Loading Overview Apache Doris offers various methods for importing and integrating data, allowing you to import data from various sources into the database. These methods can be categorized into four types: * **Real-Time Writing** : Data is written into Doris tables in real-time via HTTP or JDBC, suitable for scenarios requiring immediate analysis and querying. * For small amounts of data (once every 5 minutes), you can use [JDBC INSERT](/cloud/4.x/user-guide/data-operate/import/import-way/insert-into-manual). * For higher concurrency or frequency (more than 20 concurrent writes or multiple writes per minute), you can enable [Group Commit](/cloud/4.x/user-guide/data-operate/import/group-commit-manual) and use JDBC INSERT or Stream Load. * For high throughput, you can use [Stream Load](/cloud/4.x/user-guide/data-operate/import/import-way/stream-load-manual) via HTTP. * **Streaming Synchronization** : Real-time data streams (e.g., Flink, Kafka, transactional databases) are imported into Doris tables, ideal for real-time analysis and querying. * You can use Flink Doris Connector to write Flink’s real-time data streams into Doris. * You can use [Routine Load](/cloud/4.x/user-guide/data-operate/import/import-way/routine-load-manual) or Doris Kafka Connector for Kafka’s real-time data streams.
Routine Load pulls data from Kafka to Doris and supports CSV and JSON formats, while Kafka Connector writes data to Doris, supporting Avro, JSON, CSV, and Protobuf formats. * You can use Flink CDC or Datax to write transactional database CDC data streams into Doris. * **Batch Import** : Data is batch-loaded from external storage systems (e.g., Object Storage, HDFS, local files, NAS) into Doris tables, suitable for non-real-time data import needs. * You can use [Broker Load](/cloud/4.x/user-guide/data-operate/import/import-way/broker-load-manual) to write files from Object Storage and HDFS into Doris. * You can use [INSERT INTO SELECT](/cloud/4.x/user-guide/data-operate/import/import-way/insert-into-manual) to synchronously load files from Object Storage, HDFS, and NAS into Doris, and you can perform the operation asynchronously using a [JOB](/cloud/4.x/user-guide/admin-manual/workload-management/job-scheduler). * You can use [Stream Load](/cloud/4.x/user-guide/data-operate/import/import-way/stream-load-manual) or Doris Streamloader to write local files into Doris. * **External Data Source Integration** : Query and partially import data from external sources (e.g., Hive, JDBC, Iceberg) into Doris tables. * You can create a [Catalog](/cloud/4.x/user-guide/lakehouse/lakehouse-overview) to read data from external sources and use [INSERT INTO SELECT](/cloud/4.x/user-guide/data-operate/import/import-way/insert-into-manual) to synchronize this data into Doris, with asynchronous execution via [JOB](/cloud/4.x/user-guide/admin-manual/workload-management/job-scheduler). Each import method in Doris is an implicit transaction by default. For more information on transactions, refer to [Transactions](/cloud/4.x/user- guide/data-operate/transaction). ### Quick Overview of Import Methods​ Doris import process mainly involves various aspects such as data sources, data formats, import methods, error handling, data transformation, and transactions. You can quickly browse the scenarios suitable for each import method and the supported file formats in the table below. 
Import Method| Use Case| Supported File Formats| Import Mode| [Stream Load](/cloud/4.x/user-guide/data-operate/import/import-way/stream-load- manual)| Importing local files or push data in applications via HTTP.| csv, json, parquet, orc| Synchronous| [Broker Load](/cloud/4.x/user-guide/data- operate/import/import-way/broker-load-manual)| Importing from object storage, HDFS, etc.| csv, json, parquet, orc| Asynchronous| [INSERT INTO VALUES](/cloud/4.x/user-guide/data-operate/import/import-way/insert-into- manual)| Writing data via JDBC.| SQL| Synchronous| [INSERT INTO SELECT](/cloud/4.x/user-guide/data-operate/import/import-way/insert-into- manual)| Importing from an external source like a table in a catalog or files in Object Storage, HDFS.| SQL| Synchronous, Asynchronous via Job| [Routine Load](/cloud/4.x/user-guide/data-operate/import/import-way/routine-load- manual)| Real-time import from Kafka| csv, json| Asynchronous| [MySQL Load](/cloud/4.x/user-guide/data-operate/import/import-way/mysql-load-manual)| Importing from local files.| csv| Synchronous| [Group Commit](/cloud/4.x/user- guide/data-operate/import/group-commit-manual)| Writing with high frequency.| Depending on the import method used| - ---|---|---|--- On This Page * Quick Overview of Import Methods --- # Source: https://docs.velodb.io/cloud/4.x/user-guide/data-operate/update/update-overview Version: 4.x On this page # Data Update Overview In today's data-driven decision-making landscape, data "freshness" has become a core competitive advantage for enterprises to stand out in fierce market competition. Traditional T+1 data processing models, due to their inherent latency, can no longer meet the stringent real-time requirements of modern business. Whether it's achieving millisecond-level synchronization between business databases and data warehouses, dynamically adjusting operational strategies, or correcting erroneous data within seconds to ensure decision accuracy, robust real-time data update capabilities are crucial. Apache Doris, as a modern real-time analytical database, has one of its core design goals to provide ultimate data freshness. Through its powerful data models and flexible update mechanisms, it successfully compresses data analysis latency from day-level and hour-level to second-level, providing a solid foundation for users to build real-time, agile business decision-making loops. This document serves as an official guide that systematically explains Apache Doris's data update capabilities, covering its core principles, diverse update and deletion methods, typical application scenarios, and performance best practices under different deployment modes, aiming to help you comprehensively master and efficiently utilize Doris's data update functionality. ## 1\. Core Concepts: Table Models and Update Mechanisms​ In Doris, the **Data Model** of a data table determines its data organization and update behavior. To support different business scenarios, Doris provides three table models: Unique Key Model, Aggregate Key Model, and Duplicate Key Model. Among these, **the Unique Key Model is the core for implementing complex, high-frequency data updates**. ### 1.1. Table Model Overview​ **Table Model**| **Key Features**| **Update Capability**| **Use Cases**| **Unique Key Model**| Built for real-time updates. 
Each data row is identified by a unique Primary Key, supporting row-level UPSERT (Update/Insert) and partial column updates.| Strongest, supports all update and deletion methods.| Order status updates, real-time user tag computation, CDC data synchronization, and other scenarios requiring frequent, real-time changes.| **Aggregate Key Model**| Pre-aggregates data based on specified Key columns. For rows with the same Key, Value columns are merged according to defined aggregation functions (such as SUM, MAX, MIN, REPLACE).| Limited, supports REPLACE-style updates and deletions based on Key columns.| Scenarios requiring real-time summary statistics, such as real-time reports, advertisement click statistics, etc.| **Duplicate Key Model**| Data only supports append-only writes, without any deduplication or aggregation operations. Even identical data rows are retained.| Limited, only supports conditional deletion through DELETE statements.| Log collection, user behavior tracking, and other scenarios that only need appending without updates. ---|---|---|--- ### 1.2. Data Update Methods​ Doris provides two major categories of data update methods: **updating through data load** and **updating through DML statements**. #### 1.2.1. Updating Through Load (UPSERT)​ This is Doris's **recommended high-performance, high-concurrency** update method, primarily targeting the **Unique Key Model**. All load methods (Stream Load, Broker Load, Routine Load, `INSERT INTO`) naturally support `UPSERT` semantics. When new data is loaded, if its primary key already exists, Doris will overwrite the old row data with the new row data; if the primary key doesn't exist, it will insert a new row. ![img](/assets/images/update-by-loading-a4f5f7538d78640a3fbe52eb8035412e.png) #### 1.2.2. Updating Through `UPDATE` DML Statements​ Doris supports standard SQL `UPDATE` statements, allowing users to update data based on conditions specified in the `WHERE` clause. This method is very flexible and supports complex update logic, such as cross-table join updates. ![img](/assets/images/update-self-2789f15899570f8744635f5b9d6ab91d.png) -- Simple update UPDATE user_profiles SET age = age + 1 WHERE user_id = 1; -- Cross-table join update UPDATE sales_records t1 SET t1.user_name = t2.name FROM user_profiles t2 WHERE t1.user_id = t2.user_id; **Note** : The execution process of `UPDATE` statements involves first scanning data that meets the conditions, then rewriting the updated data back to the table. It's suitable for low-frequency, batch update tasks. **High- concurrency operations on** **`UPDATE`** **statements are not recommended** because concurrent `UPDATE` operations involving the same primary keys cannot guarantee data isolation. #### 1.2.3. Updating Through `INSERT INTO SELECT` DML Statements​ Since Doris provides UPSERT semantics by default, using `INSERT INTO SELECT` can also achieve similar update effects as `UPDATE`. ### 1.3. Data Deletion Methods​ Similar to updates, Doris also supports deleting data through both load and DML statements. #### 1.3.1. Mark Deletion Through Load​ This is an efficient batch deletion method, primarily used for the **Unique Key Model**. Users can add a special hidden column `DORIS_DELETE_SIGN` when loading data. When the value of this column for a row is `1` or `true`, Doris will mark the corresponding data row with that primary key as deleted (the principle of delete sign will be explained in detail later). 
// Stream Load load data, delete row with user_id = 2 // curl --location-trusted -u user:passwd -H "columns:user_id, __DORIS_DELETE_SIGN__" -T delete.json http://fe_host:8030/api/db_name/table_name/_stream_load // delete.json content [ {"user_id": 2, "__DORIS_DELETE_SIGN__": "1"} ] #### 1.3.2. Deletion Through `DELETE` DML Statements​ Doris supports standard SQL `DELETE` statements that can delete data based on `WHERE` conditions. * **Unique Key Model** : `DELETE` statements will rewrite the primary keys of rows meeting the conditions with deletion marks. Therefore, its performance is proportional to the amount of data to be deleted. The execution principle of `DELETE` statements on Unique Key Models is very similar to `UPDATE` statements, first reading the data to be deleted through queries, then writing it once more with deletion marks. Compared to `UPDATE` statements, DELETE statements only need to write Key columns and deletion mark columns, making them relatively lighter. * **Duplicate/Aggregate Models** : `DELETE` statements are implemented by recording a delete predicate. During queries, this predicate serves as a runtime filter to filter out deleted data. Therefore, `DELETE` operations themselves are very fast, almost independent of the amount of deleted data. However, note that **high-frequency** **`DELETE`** **operations on Duplicate/Aggregate Models will accumulate many runtime filters, severely affecting subsequent query performance**. DELETE FROM user_profiles WHERE last_login < '2022-01-01'; The following table provides a brief summary of using DML statements for deletion: | **Unique Key Model**| **Aggregate Model**| **Duplicate Model**| Implementation| Delete Sign| Delete Predicate| Delete Predicate| Limitations| None| Delete conditions only for Key columns| None| Deletion Performance| Moderate| Fast| Fast ---|---|---|--- ## 2\. Deep Dive into Unique Key Model: Principles and Implementation​ The Unique Key Model is the cornerstone of Doris's high-performance real-time updates. Understanding its internal working principles is crucial for fully leveraging its performance. ### 2.1. Merge-on-Write (MoW) vs. Merge-on-Read (MoR)​ The Unique Key Model has two data merging strategies: Merge-on-Write (MoW) and Merge-on-Read (MoR). **Since Doris 2.1, MoW has become the default and recommended implementation**. **Feature**| **Merge-on-Write (MoW)**| **Merge-on-Read (MoR) - (Legacy)**| **Core Concept**| Completes data deduplication and merging during data writing, ensuring only one latest record per primary key in storage.| Retains multiple versions during data writing, performs real-time merging during queries to return the latest version.| **Query Performance**| Extremely high. No additional merge operations needed during queries, performance approaches that of non-updated detail tables.| Poor. Requires data merging during queries, taking about 3-10 times longer than MoW and consuming more CPU and memory.| **Write Performance**| Has merge overhead during writing, with some performance loss compared to MoR (about 10-20% for small batches, 30-50% for large batches).| Fast writing speed, approaching detail tables.| **Resource Consumption**| Consumes more CPU and memory during writing and background Compaction.| Consumes more CPU and memory during queries.| **Use Cases**| Most real-time update scenarios. 
Especially suitable for read-heavy, write- light businesses, providing ultimate query analysis performance.| Suitable for write-heavy, read-light scenarios, but no longer mainstream recommended. ---|---|--- The MoW mechanism trades a small cost during the writing phase for tremendous improvement in query performance, perfectly aligning with the OLAP system's "read-heavy, write-light" characteristics. ### 2.2. Conditional Updates (Sequence Column)​ In distributed systems, out-of-order data arrival is a common problem. For example, an order status changes sequentially to "Paid" and "Shipped", but due to network delays, data representing "Shipped" might arrive at Doris before data representing "Paid". To solve this problem, Doris introduces the **Sequence Column** mechanism. Users can specify a column (usually a timestamp or version number) as the Sequence column when creating tables. When processing data with the same primary key, Doris will compare their Sequence column values and **always retain the row with the largest Sequence value** , thus ensuring eventual consistency even when data arrives out of order. CREATE TABLE order_status ( order_id BIGINT, status_name STRING, update_time DATETIME ) UNIQUE KEY(order_id) DISTRIBUTED BY HASH(order_id) PROPERTIES ( "function_column.sequence_col" = "update_time" -- Specify update_time as Sequence column ); -- 1. Write "Shipped" record (larger update_time) -- {"order_id": 1001, "status_name": "Shipped", "update_time": "2023-10-26 12:00:00"} -- 2. Write "Paid" record (smaller update_time, arrives later) -- {"order_id": 1001, "status_name": "Paid", "update_time": "2023-10-26 11:00:00"} -- Final query result, retains record with largest update_time -- order_id: 1001, status_name: "Shipped", update_time: "2023-10-26 12:00:00" ### 2.3. Deletion Mechanism (`DORIS_DELETE_SIGN`) Workflow​ The working principle of `DORIS_DELETE_SIGN` can be summarized as "logical marking, background cleanup". 1. **Execute Deletion** : When users delete data through load or `DELETE` statements, Doris doesn't immediately remove data from physical files. Instead, it writes a new record for the primary key to be deleted, with the `DORIS_DELETE_SIGN` column marked as `1`. 2. **Query Filtering** : When users query data, Doris automatically adds a filter condition `WHERE DORIS_DELETE_SIGN = 0` to the query plan, thus hiding all data marked for deletion from query results. 3. **Background Compaction** : Doris's background Compaction process periodically scans data. When it finds a primary key with both normal records and deletion mark records, it will physically remove both records during the merge process, eventually freeing storage space. This mechanism ensures quick response to deletion operations while asynchronously completing physical cleanup through background tasks, avoiding performance impact on online business. The following diagram shows how `DORIS_DELETE_SIGN` works: ![img](/assets/images/delete-sign-en-34b859e4f09107a0abc78d6a8036e34b.png) ### 2.4 Partial Column Update​ Starting from version 2.0, Doris supports powerful partial column update capabilities on Unique Key Models (MoW). When loading data, users only need to provide the primary key and columns to be updated; unprovided columns will maintain their original values unchanged. This greatly simplifies ETL processes for scenarios like wide table joining and real-time tag updates. 
To enable this functionality, you need to enable Merge-on-Write (MoW) mode when creating Unique Key Model tables and set the `enable_unique_key_partial_update` property to `true`, or configure the `"partial_columns"` parameter during data load. CREATE TABLE user_profiles ( user_id BIGINT, name STRING, age INT, last_login DATETIME ) UNIQUE KEY(user_id) DISTRIBUTED BY HASH(user_id) PROPERTIES ( "enable_unique_key_partial_update" = "true" ); -- Initial data -- user_id: 1, name: 'Alice', age: 30, last_login: '2023-10-01 10:00:00' -- load partial update data through Stream Load, only updating age and last_login -- {"user_id": 1, "age": 31, "last_login": "2023-10-26 18:00:00"} -- Updated data -- user_id: 1, name: 'Alice', age: 31, last_login: '2023-10-26 18:00:00' **Partial Column Update Principle Overview** Unlike traditional OLTP databases, Doris's partial column update is not in- place data update. To achieve better write throughput and query performance in Doris, partial column updates in Unique Key Models adopt an **"load-time missing field completion followed by full-row writing"** implementation approach. Therefore, using Doris's partial column update has **"read amplification"** and **"write amplification"** effects. For example, updating 10 fields in a 100-column wide table requires Doris to complete the missing 90 fields during the write process. Assuming each field has similar size, a 1MB 10-field update will generate approximately 9MB of data reading (completing missing fields) and 10MB of data writing (writing the complete row to new files) in the Doris system, resulting in about 9x read amplification and 10x write amplification. **Partial Column Update Performance Recommendations** Due to read and write amplification in partial column updates, and since Doris is a columnar storage system, the data reading process may generate significant random I/O, requiring high random read IOPS from storage. Since traditional mechanical disks have significant bottlenecks in random I/O, **if you want to use partial column update functionality for high-frequency writes, SSD drives are recommended, preferably NVMe interface** , which can provide the best random I/O support. Additionally, **if the table is very wide, enabling row storage is also recommended to reduce random I/O**. After enabling row storage, Doris will store an additional copy of row-based data alongside columnar storage. Since row-based data stores each row continuously, it can read entire rows with a single I/O operation (columnar storage requires N I/O operations to read all missing fields, such as the previous example of a 100-column wide table updating 10 columns, requiring 90 I/O operations per row to read all fields). ## 3\. Typical Application Scenarios​ Doris's powerful data update capabilities enable it to handle various demanding real-time analysis scenarios. ### 3.1. CDC Real-time Data Synchronization​ Capturing change data (Binlog) from upstream business databases (such as MySQL, PostgreSQL, Oracle) through tools like Flink CDC and writing it in real-time to Doris Unique Key Model tables is the most classic scenario for building real-time data warehouses. * **Whole Database Synchronization** : Flink Doris Connector internally integrates Flink CDC, enabling automated, end-to-end whole database synchronization from upstream databases to Doris without manual table creation and field mapping configuration. 
* **Ensuring Consistency** : Utilizes the Unique Key Model's `UPSERT` capability to handle upstream `INSERT` and `UPDATE` operations, uses `DORIS_DELETE_SIGN` to handle `DELETE` operations, and combines with Sequence columns (such as timestamps in Binlog) to handle out-of-order data, perfectly replicating upstream database states and achieving millisecond-level data synchronization latency. ![img](/assets/images/flink-1bf58914cfc4a8eedcd1617123f50e76.png) ### 3.2. Real-time Wide Table Joining​ In many analytical scenarios, data from different business systems needs to be joined into user-wide tables or product-wide tables. Traditional approaches use offline ETL tasks (such as Spark or Hive) for periodic (T+1) joining, which has poor real-time performance and high maintenance costs. Alternatively, using Flink for real-time wide table join calculations and writing joined data to databases typically requires significant computational resources. Using Doris's **partial column update** capability can greatly simplify this process: 1. Create a Unique Key Model wide table in Doris. 2. Write data streams from different sources (such as user basic information, user behavior data, transaction data, etc.) to this wide table in real-time through Stream Load or Routine Load. 3. Each data stream only updates its relevant fields. For example, user behavior data streams only update `page_view_count`, `last_login_time`, and other fields; transaction data streams only update `total_orders`, `total_amount`, and other fields. This approach not only transforms wide table construction from offline ETL to real-time stream processing, greatly improving data freshness, but also reduces I/O overhead by only writing changed columns, improving write performance. ## 4\. Best Practices​ Following these best practices can help you use Doris's data update functionality more stably and efficiently. ### 4.1. General Performance Practices​ 1. **Prioritize load Updates** : For high-frequency, large-volume update operations, prioritize load methods like Stream Load and Routine Load over `UPDATE` DML statements. 2. **Batch Writes** : Avoid using `INSERT INTO` statements for individual high-frequency writes (such as > 100 TPS), as each `INSERT` incurs transaction overhead. If necessary, consider enabling Group Commit functionality to merge multiple small batch commits into one large transaction. 3. **Use High-frequency** **`DELETE`** **Carefully** : On Duplicate and Aggregate models, avoid high-frequency `DELETE` operations to prevent query performance degradation. 4. **Use** **`TRUNCATE PARTITION`** **for Partition Data Deletion** : If you need to delete entire partition data, use `TRUNCATE PARTITION`, which is much more efficient than `DELETE`. 5. **Execute** **`UPDATE`** **Serially** : Avoid concurrent execution of `UPDATE` tasks that might affect the same data rows. ### 4.2. Unique Key Model Practices in Compute-Storage Separation Architecture​ Doris 3.0 introduces an advanced compute-storage separation architecture, bringing ultimate elasticity and lower costs. In this architecture, since BE nodes are stateless, a global state needs to be maintained through MetaService during the Merge-on-Write process to resolve write-write conflicts between load/compaction/schema change operations. 
The MoW implementation of Unique Key Models relies on a distributed table lock based on Meta Service to ensure write operation consistency, as shown in the following diagram: ![img](/assets/images/cloud-mow-2aa2313f76c4dca805206f3fe6f368d6.png) High-frequency loads and Compaction lead to frequent competition for table locks, so special attention should be paid to the following points: 1. **Control Single Table load Frequency** : It's recommended to control the load frequency of a single Unique Key table to within **60 times/second**. This can be achieved by batching and adjusting load concurrency. 2. **Reasonable Partition and Bucket Design** : 1. **Partitions** : Using time partitioning (such as by day or hour) ensures that single loads only update a few partitions, reducing the scope of lock competition. 2. **Buckets** : The number of buckets (Tablet count) should be reasonably set based on data volume, typically between 8-64. Too many Tablets will intensify lock competition. 3. **Adjust Compaction Strategy** : In scenarios with very high write pressure, Compaction strategies can be appropriately adjusted to reduce Compaction frequency, thereby reducing lock conflicts between Compaction and load tasks. 4. **Upgrade to Latest Version** : The Doris community is continuously optimizing Unique Key Model performance under compute-storage separation architecture. For example, the upcoming 3.1 release significantly optimizes the distributed table lock implementation. **Always recommend using the latest stable version** for optimal performance. ## Conclusion​ Apache Doris, with its powerful, flexible, and efficient data update capabilities centered on the Unique Key Model, truly breaks through the bottleneck of traditional OLAP systems in terms of data freshness. Whether through high-performance loads implementing `UPSERT` and partial column updates, or using Sequence columns to ensure consistency of out-of-order data, Doris provides complete solutions for building end-to-end real-time analytical applications. By deeply understanding its core principles, mastering the applicable scenarios for different update methods, and following the best practices provided in this document, you will be able to fully unleash Doris's potential, making real-time data truly become a powerful engine driving business growth. On This Page * 1\. Core Concepts: Table Models and Update Mechanisms * 1.1. Table Model Overview * 1.2. Data Update Methods * 1.3. Data Deletion Methods * 2\. Deep Dive into Unique Key Model: Principles and Implementation * 2.1. Merge-on-Write (MoW) vs. Merge-on-Read (MoR) * 2.2. Conditional Updates (Sequence Column) * 2.3. Deletion Mechanism (`DORIS_DELETE_SIGN`) Workflow * 2.4 Partial Column Update * 3\. Typical Application Scenarios * 3.1. CDC Real-time Data Synchronization * 3.2. Real-time Wide Table Joining * 4\. Best Practices * 4.1. General Performance Practices * 4.2. Unique Key Model Practices in Compute-Storage Separation Architecture * Conclusion --- # Source: https://docs.velodb.io/cloud/4.x/user-guide/db-connect/database-connect Version: 4.x On this page # Connecting by MySQL Protocol Apache Doris adopts the MySQL network connection protocol. It is compatible with command-line tools, JDBC/ODBC drivers, and various visualization tools within the MySQL ecosystem. Additionally, Apache Doris comes with a built-in, easy-to-use Web UI. This guide is about how to connect to Doris using MySQL Client, MySQL JDBC Connector, DBeaver, and the built-in Doris Web UI. 
## MySQL Client​ Download MySQL Client from the [MySQL official website](https://dev.mysql.com/downloads/mysql/) for Linux. Currently, Doris is primarily compatible with MySQL 5.7 and later clients. Extract the downloaded MySQL client. In the `bin/` directory, find the `mysql` command-line tool. Execute the following command to connect to Doris: # FE_IP represents the listening address of the FE node, while FE_QUERY_PORT represents the port of the MySQL protocol service of the FE. This corresponds to the query_port parameter in fe.conf and it defaults to 9030. mysql -h FE_IP -P FE_QUERY_PORT -u USER_NAME After login, the following message will be displayed. Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 236 Server version: 5.7.99 Doris version doris-2.0.3-rc06-37d31a5 Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. mysql> ## MySQL JDBC Connector​ Download the corresponding JDBC Connector from the official MySQL website. Example of connection code: String user = "user_name"; String password = "user_password"; String newUrl = "jdbc:mysql://FE_IP:FE_PORT/demo?useUnicode=true&characterEncoding=utf8&useTimezone=true&serverTimezone=Asia/Shanghai&useSSL=false&allowPublicKeyRetrieval=true"; try { Connection myCon = DriverManager.getConnection(newUrl, user, password); Statement stmt = myCon.createStatement(); ResultSet result = stmt.executeQuery("show databases"); ResultSetMetaData metaData = result.getMetaData(); int columnCount = metaData.getColumnCount(); while (result.next()) { for (int i = 1; i <= columnCount; i++) { System.out.println(result.getObject(i)); } } } catch (SQLException e) { log.error("get JDBC connection exception.", e); } If you need to initially change session variables when connecting, you can use the following format: jdbc:mysql://FE_IP:FE_PORT/demo?sessionVariables=key1=val1,key2=val2 ## DBeaver​ Create a MySQL connection to Apache Doris: ![database-connect-dbeaver](/assets/images/database-connect- dbeaver-e74120612bdbc9d4a14b79a5819ba6d5.png) Query in DBeaver: ![query-in-dbeaver](/assets/images/query-in- dbeaver-11f3e80e04942de7bd200a685655da3c.png) ## Built-in Web UI of Doris​ Doris FE has a built-in Web UI. It allows users to perform SQL queries and view other related information without the need to install the MySQL client To access the Web UI, simply enter the URL in a web browser: http://fe_ip:fe_port, for example, `http://172.20.63.118:8030`. This will open the built-in Web console of Doris. The built-in Web console is primarily intended for use by the root account of the cluster. By default, the root account password is empty after installation. ![web-login-username-password](/assets/images/web-login-username- password-0e96b0a7f82ba3609666352a6f56b26a.png) For example, you can execute the following command in the Playground to add a BE node. ALTER SYSTEM ADD BACKEND "be_host_ip:heartbeat_service_port"; ![Doris-Web-UI-Playground-en](/assets/images/Doris-Web-UI-Playground-en- ce00cb539e0dc6a110a17e5bd057a10b.png) tip For successful execution of statements that are not related to specific databases/tables in the Playground, it is necessary to randomly select a database from the left-hand database panel. This limitation will be removed later. 
The current built-in web console cannot execute SET type SQL statements. Therefore, the web console does not support statements like SET PASSWORD FOR 'user' = PASSWORD('user_password'). On This Page * MySQL Client * MySQL JDBC Connector * DBeaver * Built-in Web UI of Doris --- # Source: https://docs.velodb.io/cloud/4.x/user-guide/lakehouse/lakehouse-overview Version: 4.x On this page # Lakehouse Overview **The lakehouse is a modern big data solution that combines the advantages of data lakes and data warehouses**. It integrates the low cost and high scalability of data lakes with the high performance and strong data governance capabilities of data warehouses, enabling efficient, secure, and quality- controlled storage and processing analysis of various data in the big data era. Through standardized open data formats and metadata management, it unifies **real-time** and **historical** data, **batch processing** , and **stream processing** , gradually becoming the new standard for enterprise big data solutions. ## Doris Lakehouse Solution​ Doris provides an excellent lakehouse solution for users through an extensible connector framework, a compute-storage decoupled architecture, a high- performance data processing engine, and data ecosystem openness. ![doris lakehouse architecture](/assets/images/lakehouse- arch-1-6ca0925c968f19a4074b52e289e12a99.jpeg) ### Flexible Data Access​ Doris supports mainstream data systems and data format access through an extensible connector framework and provides unified data analysis capabilities based on SQL, allowing users to easily perform cross-platform data queries and analysis without moving existing data. For details, refer to [Catalog Overview](/cloud/4.x/user-guide/lakehouse/catalog-overview) ### Data Source Connectors​ Whether it's Hive, Iceberg, Hudi, Paimon, or database systems supporting the JDBC protocol, Doris can easily connect and efficiently access data. For lakehouse systems, Doris can obtain the structure and distribution information of data tables from metadata services such as Hive Metastore, AWS Glue, and Unity Catalog, perform reasonable query planning, and utilize the MPP architecture for distributed computing. For details, refer to each catalog document, such as [Iceberg Catalog](/cloud/4.x/user-guide/lakehouse/catalogs/iceberg-catalog) #### Extensible Connector Framework​ Doris provides a good extensibility framework to help developers quickly connect to unique data sources within enterprises, achieving fast data interoperability. Doris defines three levels of standard Catalog, Database, and Table, allowing developers to easily map to the required data source levels. Doris also provides standard interfaces for metadata service and storage service accessing, and developers only need to implement the corresponding interface to complete the data source connection. Doris is compatible with the Trino Connector plugin, allowing the Trino plugin package to be directly deployed to the Doris cluster, and with minimal configuration, the corresponding data source can be accessed. Doris has already completed connections to data sources such as [Kudu](/cloud/4.x/user- guide/lakehouse/catalogs/kudu-catalog), [BigQuery](/cloud/4.x/user- guide/lakehouse/catalogs/bigquery-catalog), and [Delta Lake](/cloud/4.x/user- guide/lakehouse/catalogs/delta-lake-catalog). You can also [adapt new plugins yourself](https://doris.apache.org/community/how-to-contribute/trino- connector-developer-guide). 
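To make this concrete, here is a minimal, hedged sketch of declaring and querying such a catalog; the catalog name, Metastore URI, and table names below are placeholders rather than values taken from this document:

```sql
-- Minimal sketch: declare a Hive catalog backed by a Hive Metastore
-- (catalog name, Metastore URI, and table names are placeholders)
CREATE CATALOG hive_catalog PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://hms-host:9083'
);

-- The new catalog appears alongside the internal catalog and can be queried directly
SHOW CATALOGS;
SELECT * FROM hive_catalog.sales_db.orders LIMIT 10;
```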
#### Convenient Cross-Source Data Processing​ Doris supports creating multiple data catalogs at runtime and using SQL to perform federated queries on these data sources. For example, users can associate query fact table data in Hive with dimension table data in MySQL: SELECT h.id, m.name FROM hive.db.hive_table h JOIN mysql.db.mysql_table m ON h.id = m.id; Combined with Doris's built-in [job scheduling](/cloud/4.x/user-guide/admin- manual/workload-management/job-scheduler) capabilities, you can also create scheduled tasks to further simplify system complexity. For example, users can set the result of the above query as a routine task executed every hour and write each result into an Iceberg table: CREATE JOB schedule_load ON SCHEDULE EVERY 1 HOUR DO INSERT INTO iceberg.db.ice_table SELECT h.id, m.name FROM hive.db.hive_table h JOIN mysql.db.mysql_table m ON h.id = m.id; ### High-Performance Data Processing​ As an analytical data warehouse, Doris has made numerous optimizations in lakehouse data processing and computation and provides rich query acceleration features: * Execution Engine The Doris execution engine is based on the MPP execution framework and Pipeline data processing model, capable of quickly processing massive data in a multi-machine, multi-core distributed environment. Thanks to fully vectorized execution operators, Doris leads in computing performance in standard benchmark datasets like TPC-DS. * Query Optimizer Doris can automatically optimize and process complex SQL requests through the query optimizer. The query optimizer deeply optimizes various complex SQL operators such as multi-table joins, aggregation, sorting, and pagination, fully utilizing cost models and relational algebra transformations to automatically obtain better or optimal logical and physical execution plans, greatly reducing the difficulty of writing SQL and improving usability and performance. * Data Cache and IO Optimization Access to external data sources is usually network access, which can have high latency and poor stability. Apache Doris provides rich caching mechanisms and has made numerous optimizations in cache types, timeliness, and strategies, fully utilizing memory and local high-speed disks to enhance the analysis performance of hot data. Additionally, Doris has made targeted optimizations for network IO characteristics such as high throughput, low IOPS, and high latency, providing external data source access performance comparable to local data. * Materialized Views and Transparent Acceleration Doris provides rich materialized view update strategies, supporting full and partition-level incremental refresh to reduce construction costs and improve timeliness. In addition to manual refresh, Doris also supports scheduled refresh and data-driven refresh, further reducing maintenance costs and improving data consistency. Materialized views also have transparent acceleration capabilities, allowing the query optimizer to automatically route to appropriate materialized views for seamless query acceleration. Additionally, Doris's materialized views use high-performance storage formats, providing efficient data access capabilities through column storage, compression, and intelligent indexing technologies, serving as an alternative to data caching and improving query efficiency. As shown below, on a 1TB TPCDS standard test set based on the Iceberg table format, Doris's overall execution of 99 queries is only 1/3 of Trino's. 
![doris-tpcds](/assets/images/tpcds1000-90ad20232767c6c0b4cfc8cd78cbbd52.jpeg) In actual user scenarios, Doris reduces average query latency by 20% and 95th percentile latency by 50% compared to Presto while using half the resources, significantly reducing resource costs while enhancing user experience. ![doris- performance](/assets/images/performance-5a8fe4e6e3a9eb21cfbd6b4079449caa.jpeg) ### Convenient Service Migration​ In the process of integrating multiple data sources and achieving lakehouse transformation, migrating SQL queries to Doris is a challenge due to differences in SQL dialects across systems in terms of syntax and function support. Without a suitable migration plan, the business side may need significant modifications to adapt to the new system's SQL syntax. To address this issue, Doris provides a [SQL Dialect Conversion Service](/cloud/4.x/user-guide/lakehouse/sql-convertor/sql-convertor- overview), allowing users to directly use SQL dialects from other systems for data queries. The conversion service converts these SQL dialects into Doris SQL, greatly reducing user migration costs. Currently, Doris supports SQL dialect conversion for common query engines such as Presto/Trino, Hive, PostgreSQL, and Clickhouse, achieving a compatibility of over 99% in some actual user scenarios. ### Modern Deployment Architecture​ Since version 3.0, Doris supports a cloud-native compute-storage separation architecture. This architecture, with its low cost and high elasticity, effectively improves resource utilization and enables independent scaling of compute and storage. ![compute-storage-decouple](/assets/images/compute-storage- decouple-4c547fb6d1155758f989a23b2fd443ce.png) The above diagram shows the system architecture of Doris's compute-storage separation, decoupling compute and storage. Compute nodes no longer store primary data, and the underlying shared storage layer (HDFS and object storage) serves as the unified primary data storage space, supporting independent scaling of compute and storage resources. The compute-storage separation architecture brings significant advantages to the lakehouse solution: * **Low-Cost Storage** : Storage and compute resources can be independently scaled, allowing enterprises to increase storage capacity without increasing compute resources. Additionally, by using cloud object storage, enterprises can enjoy lower storage costs and higher availability, while still using local high-speed disks for caching relatively low-proportion hot data. * **Single Source of Truth** : All data is stored in a unified storage layer, allowing the same data to be accessed and processed by different compute clusters, ensuring data consistency and integrity, and reducing the complexity of data synchronization and duplicate storage. * **Workload Diversity** : Users can dynamically allocate compute resources based on different workload needs, supporting various application scenarios such as batch processing, real-time analysis, and machine learning. By separating storage and compute, enterprises can more flexibly optimize resource usage, ensuring efficient operation under different loads. In addition, under the storage-computing coupled architecture, [elastic computing nodes](/cloud/4.x/user-guide/lakehouse/compute-node) can still be used to provide elastic computing capabilities in lake warehouse data query scenarios. ### Openness​ Doris not only supports access to open lake table formats but also has good openness for its own stored data. 
Doris provides an open storage API and [implements a high-speed data link based on the Arrow Flight SQL protocol](/cloud/4.x/user-guide/db-connect/arrow-flight-sql-connect), offering the speed advantages of Arrow Flight and the ease of use of JDBC/ODBC. Based on this interface, users can access data stored in Doris using the ADBC clients of Python/Java/Spark/Flink. Compared to open file formats, the open storage API abstracts the specific implementation of the underlying file format, allowing Doris to accelerate data access through advanced features in its storage format, such as rich indexing mechanisms. Additionally, upper-layer compute engines do not need to adapt to changes or new features in the underlying storage format, allowing all supported compute engines to benefit from new features simultaneously.

## Lakehouse Best Practices​

In the lakehouse solution, Doris is mainly used for **lakehouse query acceleration**, **multi-source federated analysis**, and **lakehouse data processing**.

### Lakehouse Query Acceleration​

In this scenario, Doris acts as a **compute engine**, accelerating query analysis on lakehouse data.

![lakehouse query acceleration](/assets/images/query-acceleration-ce044ce39d5c66c1b91a997323588158.jpeg)

#### Cache Acceleration​

For lakehouse systems like Hive and Iceberg, users can configure local disk caching. Local disk caching automatically stores the data files involved in queries in local cache directories and manages cache eviction using the LRU strategy. For details, refer to the [Data Cache](/cloud/4.x/user-guide/lakehouse/data-cache) document.

#### Materialized Views and Transparent Rewrite​

Doris supports creating materialized views for external data sources. Materialized views store pre-computed results in Doris's internal table format based on SQL definition statements. Additionally, Doris's query optimizer supports a transparent rewrite algorithm based on the SPJG (SELECT-PROJECT-JOIN-GROUP-BY) pattern. This algorithm can analyze the structure of a SQL statement, automatically find suitable materialized views for transparent rewrite, and select the optimal materialized view to respond to the query. This feature can significantly improve query performance by reducing runtime computation. It also allows access to data in materialized views through transparent rewrite without business awareness. For details, refer to the [Materialized Views](/cloud/4.x/user-guide/query-acceleration/materialized-view/async-materialized-view/overview) document.

### Multi-Source Federated Analysis​

Doris can act as a **unified SQL query engine**, connecting different data sources for federated analysis and eliminating data silos.

![federated analysis](/assets/images/federation-query-9ff6bf1e41e42a9954486492182e6e36.png)

Users can dynamically create multiple catalogs in Doris to connect different data sources, and use SQL statements to perform arbitrary join queries on data from these sources. For details, refer to the [Catalog Overview](/cloud/4.x/user-guide/lakehouse/catalog-overview).

### Lakehouse Data Processing​

In this scenario, **Doris acts as a data processing engine**, processing lakehouse data.

![lakehouse data processing](/assets/images/data-management-d2ca536b36cb7ccc43200e3d56eba8b1.jpeg)

#### Task Scheduling​

Doris introduces the Job Scheduler feature, enabling efficient and flexible task scheduling and reducing dependency on external systems. Combined with data source connectors, users can achieve periodic processing and storage of external data.
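As a hedged sketch of what such periodic processing might look like (the catalog, database, and table names are hypothetical), a scheduled job can regularly materialize external-catalog data into an internal Doris table, mirroring the `CREATE JOB` syntax shown earlier:

```sql
-- Hypothetical names: hive_catalog.sales_db.orders is an external table,
-- internal.ods.orders_snapshot is a Doris internal table
CREATE JOB sync_orders_snapshot
ON SCHEDULE EVERY 1 DAY
DO
INSERT INTO internal.ods.orders_snapshot
SELECT * FROM hive_catalog.sales_db.orders;
```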
For details, refer to the [Job Scheduler](/cloud/4.x/user- guide/admin-manual/workload-management/job-scheduler). #### Data Modeling​ User typically use data lakes to store raw data and perform layered data processing on this basis, making different layers of data available to different business needs. Doris's materialized view feature supports creating materialized views for external data sources and supports further processing based on materialized views, reducing system complexity and improving data processing efficiency. #### Data Write-Back​ The data write-back feature forms a closed loop of Doris's lakehouse data processing capabilities. Users can directly create databases and tables in external data sources through Doris and write data. Currently, JDBC, Hive, and Iceberg data sources are supported, with more data sources to be added in the future. For details, refer to the documentation of the corresponding data source. On This Page * Doris Lakehouse Solution * Flexible Data Access * Data Source Connectors * High-Performance Data Processing * Convenient Service Migration * Modern Deployment Architecture * Openness * Lakehouse Best Practices * Lakehouse Query Acceleration * Multi-Source Federated Analysis * Lakehouse Data Processing --- # Source: https://docs.velodb.io/cloud/4.x/user-guide/query-acceleration/performance-tuning-overview/tuning-overview Version: 4.x # Tuning Overview Query performance tuning is a systematic process that requires multi-level and multi-dimensional adjustments to the database system. Below is an overview of the tuning process and methodology: 1. Firstly, business personnel and database administrators (DBAs) need to have a comprehensive understanding of the database system being used, including the hardware utilized by the business system, the scale of the cluster, the version of the database software being used, as well as the features provided by the specific software version. 2. Secondly, an effective performance diagnostic tool is a necessary prerequisite for identifying performance issues. Only by efficiently and quickly locating problematic SQL queries or slow SQL queries can subsequent specific performance tuning processes be carried out. 3. After entering the performance tuning phase, a range of commonly used performance analysis tools are indispensable. These include specialized tools provided by the currently running database system, as well as general tools at the operating system level. 4. With these tools in place, specialized tools can be used to obtain detailed information about SQL queries running on the current database system, aiding in the identification of performance bottlenecks. Meanwhile, general tools can serve as auxiliary analysis methods to assist in locating issues. In summary, performance tuning requires evaluating the current system's performance status from a holistic perspective. Firstly, it is necessary to identify business SQL queries with performance issues, then utilize analysis tools to discover performance bottlenecks, and finally implement specific tuning operations. Based on the aforementioned tuning process and methodology, Apache Doris provides corresponding tools at each of these levels. 
The following sections will introduce the performance [diagnostic tools](/cloud/4.x/user-guide/query- acceleration/performance-tuning-overview/diagnostic-tools), [analysis tools](/cloud/4.x/user-guide/query-acceleration/performance-tuning- overview/analysis-tools), and [tuning process](/cloud/4.x/user-guide/query- acceleration/performance-tuning-overview/tuning-process) respectively. --- # Source: https://docs.velodb.io/cloud/4.x/user-guide/query-data/mysql-compatibility Version: 4.x On this page # MySQL Compatibility Doris is highly compatible with MySQL syntax and supports standard SQL. However, there are several differences between Doris and MySQL, as outlined below. ## Data Types​ ### Numeric Types​ Type| MySQL| Doris| Boolean| \- Supported \- Range: 0 represents false, 1 represents true| \- Supported \- Keyword: Boolean \- Range: 0 represents false, 1 represents true| Bit| \- Supported \- Range: 1 to 64| Not supported| Tinyint| \- Supported \- Supports signed and unsigned \- Range: signed range from -128 to 127, unsigned range from 0 to 255 | \- Supported \- Only supports signed \- Range: -128 to 127| Smallint| \- Supported \- Supports signed and unsigned \- Range: signed range from -2^15 to 2^15-1, unsigned range from 0 to 2^16-1| \- Supported \- Only supports signed \- Range: -32768 to 32767| Mediumint| \- Supported \- Supports signed and unsigned \- Range: signed range from -2^23 to 2^23-1, unsigned range from 0 to 2^24-1| \- Not supported| Int| \- Supported \- Supports signed and unsigned \- Range: signed range from -2^31 to 2^31-1, unsigned range from 0 to 2^32-1| \- Supported \- Only supports signed \- Range: -2147483648 to 2147483647| Bigint| \- Supported \- Supports signed and unsigned \- Range: signed range from -2^63 to 2^63-1, unsigned range from 0 to 2^64-1| \- Supported \- Only supports signed \- Range: -2^63 to 2^63-1| Largeint| \- Not supported| \- Supported \- Only supports signed \- Range: -2^127 to 2^127-1| Decimal| \- Supported \- Supports signed and unsigned (deprecated after 8.0.17) \- Default: Decimal(10, 0)| \- Supported \- Only supports signed \- Default: Decimal(9, 0)| Float/Double| -Supported \- Supports signed and unsigned (deprecated after 8.0.17)| \- Supported \- Only supports signed ---|---|--- ### Date Types​ Type| MySQL| Doris| Date| \- Supported \- Range: ['1000-01-01', '9999-12-31'] \- Format: YYYY-MM-DD| \- Supported \- Range: ['0000-01-01', '9999-12-31'] \- Format: YYYY-MM-DD| DateTime| \- Supported \- DATETIME([P]), where P is an optional parameter defined precision \- Range: '1000-01-01 00:00:00.000000' to '9999-12-31 23:59:59.999999' \- Format: YYYY-MM-DD hh:mm.fraction| \- Supported \- DATETIME([P]), where P is an optional parameter defined precision \- Range: ['0000-01-01 00:00:00[.000000]', '9999-12-31 23:59:59[.999999]'] \- Format: YYYY-MM-DD hh:mm.fraction| Timestamp| \- Supported \- Timestamp[(p)], where P is an optional parameter defined precision \- Range: ['1970-01-01 00:00:01.000000' UTC, '2038-01-19 03:14:07.999999' UTC] \- Format: YYYY-MM-DD hh:mm.fraction| \- Not supported| Time| \- Supported \- Time[(p)] \- Range: ['-838:59:59.000000' to '838:59:59.000000'] \- Format: hh:mm.fraction| \- Not supported| Year| \- Supported \- Range: 1901 to 2155, or 0000 \- Format: yyyy| \- Not supported ---|---|--- ### String Types​ Type| MySQL| Doris| Char| -Supported - CHAR[(M)], where M is the character length. 
If omitted, default length is 1 \- Fixed-length \- Range: [0, 255] bytes| \- Supported \- CHAR[(M)], where M is the byte length \- Variable-length \- Range: [1, 255]| Varchar| \- Supported \- VARCHAR(M), where M is the character length \- Range: [0, 65535] bytes| \- Supported \- VARCHAR(M), where M is the byte length \- Range: [1, 65533]| String| \- Not supported| \- Supported \- 1,048,576 bytes (1MB), can be increased to 2,147,483,643 bytes (2GB)| Binary| \- Supported \- Similar to Char| \- Not supported| Varbinary| \- Supported \- Similar to Varchar| \- Not supported| Blob| \- Supported \- TinyBlob, Blob, MediumBlob, LongBlob| \- Not supported| Text| \- Supported \- TinyText, Text, MediumText, LongText| \- Not supported| Enum| \- Supported \- Supports up to 65,535 elements| \- Not supported| Set| \- Supported \- Supports up to 64 elements| \- Not supported ---|---|---

### JSON Type​

Type| MySQL| Doris| JSON| Supported| Supported ---|---|---

### Doris unique data type​

Doris has several unique data types. Here are the details:

* **HyperLogLog** HLL (HyperLogLog) is a data type that cannot be used as a key column. In an aggregate model table, the corresponding aggregation type for HLL is HLL_UNION. The length and default value do not need to be specified. The length is controlled internally based on the data aggregation level. HLL columns can only be queried or used with `HLL_UNION_AGG`, `HLL_RAW_AGG`, `HLL_CARDINALITY`, `HLL_HASH`, and other related functions. HLL is used for approximate (fuzzy) deduplication and performs better than COUNT DISTINCT when dealing with large amounts of data. The typical error rate of HLL is around 1%, sometimes reaching up to 2%.

* **Bitmap** Bitmap is a data type that cannot be used as a key column. In an aggregate model table, the corresponding aggregation type for BITMAP is BITMAP_UNION. Similar to HLL, the length and default value do not need to be specified, and the length is controlled internally based on the data aggregation level. Bitmap columns can only be queried or used with functions like `BITMAP_UNION_COUNT`, `BITMAP_UNION`, `BITMAP_HASH`, `BITMAP_HASH64` and others. Using BITMAP in traditional scenarios may impact loading speed, but it generally performs better than COUNT DISTINCT when dealing with large amounts of data. Please note that in real-time scenarios, using BITMAP without a global dictionary and with the bitmap_hash() function may introduce an error of around 0.1%. If this error is not acceptable, you can use bitmap_hash64 instead.

* **QUANTILE_STATE** QUANTILE_STATE is a data type that cannot be used as a key column. In an aggregate model table, the corresponding aggregation type for QUANTILE_STATE is QUANTILE_UNION. The length and default value do not need to be specified, and the length is controlled internally based on the data aggregation level. QUANTILE_STATE columns can only be queried or used with functions like `QUANTILE_PERCENT`, `QUANTILE_UNION`, `TO_QUANTILE_STATE` and others. QUANTILE_STATE is used for calculating approximate quantile values. During import, it performs pre-aggregation on the same key with different values. When the number of values does not exceed 2048, it stores all the data in detail. When the number of values exceeds 2048, it uses the TDigest algorithm to aggregate (cluster) the data and save the centroids of the clusters.

* **Array** Array is a data type in Doris that represents an array composed of elements of type T. It cannot be used as a key column.
* **MAP ** MAP is a data type in Doris that represents a map composed of elements of types K and V. * **STRUCT ** A structure (STRUCT) is composed of multiple fields. It can also be identified as a collection of multiple columns. * field_name: The identifier of the field, which must be unique. * field_type: The type of field. * **Agg_State** AGG_STATE is a data type in Doris that cannot be used as a key column. During table creation, the signature of the aggregation function needs to be declared. The length and default value do not need to be specified, and the actual storage size depends on the implementation of the function. AGG_STATE can only be used in combination with [STATE](/cloud/4.x/sql- manual/sql-functions/combinators/state) / [MERGE](/cloud/4.x/sql-manual/sql- functions/combinators/merge)/ [UNION](/cloud/4.x/sql-manual/sql- functions/combinators/union) functions from the SQL manual for aggregators. ## Syntax​ ### DDL​ #### 01 Create Table Syntax in Doris​ CREATE TABLE [IF NOT EXISTS] [database.]table ( column_definition_list [, index_definition_list] ) [engine_type] [keys_type] [table_comment] [partition_info] distribution_desc [rollup_list] [properties] [extra_properties] #### 02 Differences with MySQL​ Parameter| Differences from MySQL| Column_definition_list| \- Field list definition: The basic syntax is similar to MySQL but includes an additional operation for aggregate types. \- The aggregate type operation primarily supports Aggregate. \- When creating a table, MySQL allows adding constraints like Index (e.g., Primary Key, Unique Key) after the field list definition, while Doris supports these constraints and computations by defining data models.| Index_definition_list| \- Index list definition: The basic syntax is similar to MySQL, supporting bitmap indexes, inverted indexes, and N-Gram indexes, but Bloom filter indexes are set through properties. \- MySQL supports B+Tree and Hash indexes.| Engine_type| \- Table engine type: Optional. \- The currently supported table engine is mainly the OLAP native engine. \- MySQL supports storage engines such as Innodb, MyISAM, etc.| Keys_type| \- Data model: Optional. \- Supported types include: 1) DUPLICATE KEY (default): The specified columns are sort columns. 2) AGGREGATE KEY: The specified columns are dimension columns. 3) UNIQUE KEY: The specified columns are primary key columns. \- MySQL does not have the concept of a data model.| Table_comment| Table comment| Partition_info| \- Partitioning algorithm: Optional. Doris supported partitioning algorithms include: \- LESS THAN: Only defines the upper bound of partitions. The lower bound is determined by the upper bound of the previous partition. \- FIXED RANGE: Defines left-closed and right-open intervals for partitions. \- MULTI RANGE: Creates multiple RANGE partitions in bulk, defining left- closed and right-open intervals, setting time units and steps. Time units support years, months, days, weeks, and hours. MySQL supports algorithms such as Hash, Range, List, Key. MySQL also supports subpartitions, with only Hash and Key supported for subpartitions.| Distribution_desc| \- Bucketing algorithm: Required. Includes: 1) Hash bucketing syntax: DISTRIBUTED BY HASH (k1[,k2 ...]) [BUCKETS num|auto]. Description: Uses specified key columns for hash bucketing. 2) Random bucketing syntax: DISTRIBUTED BY RANDOM [BUCKETS num|auto]. Description: Uses random numbers for bucketing. 
\- MySQL does not have a bucketing algorithm.| Rollup_list| \- Multiple sync materialized views can be created while creating the table. \- Syntax: `rollup_name (col1[, col2, ...]) [DUPLICATE KEY(col1[, col2, ...])][PROPERTIES("key" = "value")]` \- MySQL does not support this.| Properties| Table properties: They differ from MySQL's table properties, and the syntax for defining table properties also differs from MySQL. ---|--- #### 03 CREATE INDEX​ CREATE INDEX [IF NOT EXISTS] index_name ON table_name (column [, ...],) [USING BITMAP]; * Doris currently supports Bitmap index, Inverted index, and N-Gram index. BloomFilter index are supported as well, but they have a separate syntax for setting them. * MySQL supports index algorithms such as B+Tree and Hash. #### 04 CREATE VIEW​ CREATE VIEW [IF NOT EXISTS] [db_name.]view_name (column1[ COMMENT "col comment"][, column2, ...]) AS query_stmt CREATE MATERIALIZED VIEW (IF NOT EXISTS)? mvName=multipartIdentifier (LEFT_PAREN cols=simpleColumnDefs RIGHT_PAREN)? buildMode? (REFRESH refreshMethod? refreshTrigger?)? (KEY keys=identifierList)? (COMMENT STRING_LITERAL)? (PARTITION BY LEFT_PAREN partitionKey = identifier RIGHT_PAREN)? (DISTRIBUTED BY (HASH hashKeys=identifierList | RANDOM) (BUCKETS (INTEGER_VALUE | AUTO))?)? propertyClause? AS query * The basic syntax is consistent with MySQL. * Doris supports logical view and supports two types of materialized views: synchronous materialized views and asynchronous materialized views * MySQL do not supports asynchronous materialized views. #### 05 ALTER TABLE / ALTER INDEX​ The syntax of Doris ALTER is basically the same as that of MySQL. ### DROP TABLE / DROP INDEX​ The syntax of Doris DROP is basically the same as MySQL. ### DML​ #### INSERT​ INSERT INTO table_name [ PARTITION (p1, ...) ] [ WITH LABEL label] [ (column [, ...]) ] [ [ hint [, ...] ] ] { VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query } The Doris INSERT syntax is basically the same as MySQL. #### UPDATE​ UPDATE target_table [table_alias] SET assignment_list WHERE condition assignment_list: assignment [, assignment] ... assignment: col_name = value value: {expr | DEFAULT} The Doris UPDATE syntax is basically the same as MySQL, but it should be noted that the **`WHERE` condition must be added.** #### Delete​ DELETE FROM table_name [table_alias] [PARTITION partition_name | PARTITIONS (partition_name [, partition_name])] WHERE column_name op { value | value_list } [ AND column_name op { value | value_list } ...]; The syntax can only specify filter predicates DELETE FROM table_name [table_alias] [PARTITION partition_name | PARTITIONS (partition_name [, partition_name])] [USING additional_tables] WHERE condition This syntax can only be used on the UNIQUE KEY model table. The DELETE syntax in Doris is basically the same as in MySQL. However, since Doris is an analytical database, deletions cannot be too frequent. #### SELECT​ SELECT [hint_statement, ...] [ALL | DISTINCT] select_expr [, select_expr ...] [EXCEPT ( col_name1 [, col_name2, col_name3, ...] )] [FROM table_references [PARTITION partition_list] [TABLET tabletid_list] [TABLESAMPLE sample_value [ROWS | PERCENT] [REPEATABLE pos_seek]] [WHERE where_condition] [GROUP BY [GROUPING SETS | ROLLUP | CUBE] {col_name | expr | position}] [HAVING where_condition] [ORDER BY {col_name | expr | position} [ASC | DESC], ...] [LIMIT {[offset_count,] row_count | row_count OFFSET offset_count}] [INTO OUTFILE 'file_name'] The Doris SELECT syntax is basically the same as MySQL. 
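To make the differences above concrete, here is a brief, hedged sketch using hypothetical table, partition, and column names; `UPDATE` applies to Unique Key Model tables and, as noted above, must carry a `WHERE` clause:

```sql
-- UPDATE requires a WHERE clause (hypothetical Unique Key table)
UPDATE orders SET status = 'shipped' WHERE order_id = 1001;

-- Predicate-based DELETE, optionally limited to a single partition
DELETE FROM orders PARTITION p202410 WHERE status = 'canceled';

-- Doris-specific SELECT extension: exclude columns with EXCEPT
SELECT * EXCEPT (internal_note) FROM orders WHERE order_id > 1000;
```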
## SQL Function​ Doris Function covers most MySQL functions. On This Page * Data Types * Numeric Types * Date Types * String Types * JSON Type * Doris unique data type * Syntax * DDL * DROP TABLE / DROP INDEX * DML * SQL Function --- # Source: https://docs.velodb.io/cloud/4.x/user-guide/studio/overview Version: 4.x On this page # Introduce VeloDB Studio VeloDB Studio is a GUI tool tailored for Apache Doris and its compatible databases to simplify data development and management. VeloDB Studio has two versions: Server and Desktop: * Server versions are built-in to provide enterprise-level user services in VeloDB Cloud and Enterprise. * Desktop version is a desktop application that can be installed directly on your computer and supports Mac, Windows and Linux (future). ## Core features​ ### SQL Editor​ A SQL editor specially designed for Apache Doris supports SQL syntax highlighting, automatic completion, formatting and other functions to improve SQL writing efficiency. ### Log Retrieval and Visual Analysis​ Provides log search and visualization functions, and you can use Apache Doris to work with Studio's log search capabilities to replace Elastic Search and Kibana Discover for log storage, querying and visualization, achieving 10 times the cost reduction and more efficient analysis of data. ### Query Audit​ Query Audit is used to audit and analyze query history executed in Doris. It allows you to filter slow queries or filter through users, hosts, SQL statements, etc. to meet audit needs. ### Permission Management​ Visually manage Apache Doris user rights to ensure the security of database data and operations, and meet the needs of enterprise-level applications. ### Multiple connections and SSH tunnels (Desktop version only)​ Supports multi-database connections and provides SSH tunneling function, which facilitates users to remotely manage Doris databases under secure channels, improving compatibility across network environments. ## VeloDB Studio Server Version​ info The server version supports Chrome 90 or above browsers, and it is recommended to use the latest version of the browser. VeloDB Studio Server version is built into VeloDB Cloud and Enterprise, and is provided to enterprise users in the form of a web. ### Features of VeloDB Studio Server Edition​ **1\. Deep integration** : The Server version is deeply integrated in VeloDB Cloud and Enterprise, and has different adaptive functions according to different Manager versions. **2\. Network Isolation** : Enterprise Studio is deployed in your enterprise- level network environment, and VeloDB Cloud versions of Studio are deployed in your VPC to provide a secure network environment. **3\. Higher quality and stability** : The Server version focuses more on stability and has stricter quality requirements for new functions. **4\. Security Updates** : Server version provides more instant security updates and vulnerability responses, and we will update and deliver vulnerabilities and security issues separately. **5\. Enterprise-level support** : The team provides professional technical support and faster feature requests, and problems in the Server version will always be responded to as soon as possible. **6\. Team Collaboration** : The Server version is more suitable for team collaboration, with the same access address, and multiple users can share a Studio. You can also embed Studio into your enterprise management system. 
## VeloDB Studio Desktop Version​

### Why launch a desktop application?​

In the past, we provided the Web version of Studio WebUI in VeloDB Enterprise Manager, VeloDB Cloud, and Alibaba Cloud. However, these versions need to be deployed on a server or fully hosted in the cloud. They are designed for the VeloDB kernel, require logging in to a management-system account, require payment, need complex network permissions, and need administrator permission to update, which makes them better suited to enterprise users. These designs bring a lot of inconvenience to ordinary users.

To facilitate Apache Doris users, we have launched the VeloDB Studio Desktop version, a GUI designed and developed specifically for Apache Doris. It has the following main advantages:

### Features of VeloDB Studio Desktop Edition​

info The Mac version only supports 64-bit macOS 13.0 (Ventura) or later.

info The Windows version only supports 64-bit Windows 10 or later; Windows 8, 8.1, and Windows Server 2012 are not supported.

**1\. No server deployment required**

* You don't need a dedicated server to deploy Studio. Just download the VeloDB Studio Desktop installation package and double-click to use it.
* You don't need to log in to another account; just open the app and enter the connection information to connect to the Doris database.

**2\. Completely free**

* Unlike other versions, VeloDB Studio Desktop is permanently free; no license purchase or payment is required.

**3\. Designed for Apache Doris**

* Other versions of Studio target the VeloDB kernel, so Apache Doris either cannot be used or has limited compatibility.
* VeloDB Studio Desktop is designed for Apache Doris, supports Apache Doris, and supports compatible databases derived from Apache Doris.

**4\. Better user experience**

* More convenient: The desktop application lives on your computer; instead of opening a browser, entering an address, and logging in to an account, you just double-click the icon.
* More efficient: The desktop version offers a richer shortcut-key system and smoother window management. Connections are saved to your computer, so multiple connections can be kept without re-entering connection information every time.

**5\. Native tools that replace Navicat and DBeaver**

* Stronger management capabilities: Unlike more query-focused tools such as Navicat and DBeaver, VeloDB Studio supports more features of Apache Doris, including session management, log retrieval, permission management, query auditing, and more.
* Better user support and response: The VeloDB Studio team can respond faster to your feature requests and problem feedback, and can launch new features based on Apache Doris faster.

On This Page * Core features * SQL Editor * Log Retrieval and Visual Analysis * Query Audit * Permission Management * Multiple connections and SSH tunnels (Desktop version only) * VeloDB Studio Server Version * Features of VeloDB Studio Server Edition * VeloDB Studio Desktop Version * Why launch a desktop application? * Features of VeloDB Studio Desktop Edition

--- # Source: https://docs.velodb.io/cloud/4.x/user-guide/table-design/overview Version: 4.x On this page

# Overview

## Creating tables​

Users can use the CREATE TABLE statement to create a table in Doris. You can also use the CREATE TABLE LIKE or CREATE TABLE AS clause to derive the table definition from another table.

## Table name​

In Doris, table names are case-sensitive by default. You can configure lower_case_table_names to make them case-insensitive during the initial cluster setup.
The default maximum length for table names is 64 bytes, but you can change this by configuring table_name_length_limit. It is not recommended to set this value too high. For syntax on creating tables, please refer to CREATE TABLE. [Dynamic partitions](/cloud/4.x/user-guide/table-design/data- partitioning/dynamic-partitioning) can have these properties set individually. ## Table property​ In Doris, the CREATE TABLE statement can specify table properties, including: * **buckets** : Determines the distribution of data within the table. * **storage_medium** : Controls the storage method for data, such as using HDD, SSD, or remote shared storage. * **replication_num** : Controls the number of data replicas to ensure redundancy and reliability. * **storage_policy** : Controls the migration strategy for cold and hot data separation storage. These properties apply to partitions, meaning that once a partition is created, it will have its own properties. Modifying table properties will only affect partitions created in the future and will not affect existing partitions. For more information about table properties, refer to ALTER TABLE PROPERTY. ## Notes​ 1. **Choose an appropriate data model** : The data model cannot be changed, so you need to select an appropriate [data model](/cloud/4.x/user-guide/table-design/data-model/overview) when creating the table. 2. **Choose an appropriate number of buckets** : The number of buckets in an already created partition cannot be modified. You can modify the number of buckets by [replacing the partition](/cloud/4.x/user-guide/data-operate/delete/table-temp-partition), or you can modify the number of buckets for partitions that have not yet been created in dynamic partitions. 3. **Column addition operations** : Adding or removing VALUE columns is a lightweight operation that can be completed in seconds. Adding or removing KEY columns or modifying data types is a heavyweight operation, and the completion time depends on the amount of data. For large datasets, it is recommended to avoid adding or removing KEY columns or modifying data types. 4. **Optimize storage strategy** : You can use tiered storage to store cold data on HDD or S3/HDFS. On This Page * Creating tables * Table name * Table property * Notes