# Aporia

---

# Source: https://docs.aporia.com/ml-monitoring-as-code/adding-new-models.md

# Adding new models

## Overview

To add a new model to Aporia using the Python SDK, you'll need to:

1. **Define a serving dataset** - This will include the SQL query or path to your serving / inference data.
2. **Define a training dataset (optional)** - This will include the SQL query or path to your model's training set.
3. **Define a model resource** - The model resource will include the display name and type of the model in Aporia, as well as the links to the different versions and their serving / training datasets.

### Initialization

Start by creating a new Python file with the following initialization code:

```python
import datetime
import os

from aporia import Aporia, MetricDataset, MetricParameters, TimeRange
import aporia.as_code as aporia

aporia_token = os.environ["APORIA_TOKEN"]
aporia_account = os.environ["APORIA_ACCOUNT"]
aporia_workspace = os.environ["APORIA_WORKSPACE"]

stack = aporia.Stack(
    host="https://platform.aporia.com",  # or "https://platform-eu.aporia.com"
    token=aporia_token,
    account=aporia_account,
    workspace=aporia_workspace,
)

# stack.apply(yes=True, rollback=False, config_path="config.json")
```

## Defining Datasets

To add a new model to Aporia, start by defining a dataset. Datasets can be used to specify the SQL query or file path for model monitoring.

There are currently two types of datasets in Aporia:

* **Serving dataset** - Includes the features and predictions of your model in production, as well as any other metadata you'd like to add for observability.
  * The serving dataset can also include delayed labels / actuals, and Aporia will make sure to refresh this data when it's updated. This is used to calculate performance metrics such as AUC ROC, nDCG\@k, and so on.
* **Training dataset (optional)** - Includes the features, predictions, and labels of your model during training.

```python
serving_dataset = aporia.Dataset(
    "my-model-serving",

    # Dataset type - can be "serving" or "training"
    type="serving",

    # Data source name from the "Integrations" page in Aporia
    # If you prefer to define data source as code, use the aporia.DataSource(...) API.
    data_source_name="MY_SNOWFLAKE",

    # SQL query or S3 path
    connection_data={
        "query": "SELECT * FROM model_predictions"
    },

    # Column to be used as a unique prediction ID
    id_column="prediction_id",

    # Column to be used as the prediction timestamp
    timestamp_column="prediction_timestamp",

    # Raw inputs are used to represent any metadata about the prediction.
    # Optional
    raw_inputs={
        "prediction_latency": "numeric",
        "raw_text": "text",
    },

    # Features
    features={
        "age": "numeric",
        "gender": "categorical",
        "text_embedding": "embedding",
        "image_embedding": "embedding",
    },

    # Predictions
    predictions={
        "score": "numeric",
    },

    # Delayed labels
    actuals={
        "purchased": "boolean",
    },
    actual_mapping={"purchased": "score"},
)
```

While a dataset represents a specific query or file that's relevant to a specific model, a **data source** includes the connection data (e.g. user, role, etc.). A data source can be shared across many different datasets. A data source is often created once, while datasets are added every time a new model is added.

The name of the data source should be identical to a data source that exists in the Integrations page in Aporia.
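If you also want a training baseline, the optional training dataset is declared with the same `aporia.Dataset(...)` API shown above, just with `type="training"`. The following is a minimal sketch - the data source name, query, and columns are illustrative assumptions and should be replaced with your own:

```python
# A minimal sketch of the optional training dataset - the data source name,
# query, and column names below are illustrative assumptions.
training_dataset = aporia.Dataset(
    "my-model-training",

    # Same API as the serving dataset, but marked as "training"
    type="training",

    # Reuse a data source from the "Integrations" page (assumed name)
    data_source_name="MY_SNOWFLAKE",

    # Query for the training set (illustrative)
    connection_data={
        "query": "SELECT * FROM model_training_set"
    },

    # Training data typically includes labels alongside features and predictions
    features={
        "age": "numeric",
        "gender": "categorical",
    },
    predictions={
        "score": "numeric",
    },
    actuals={
        "purchased": "boolean",
    },
    actual_mapping={"purchased": "score"},
)
```

This `training_dataset` object can later be linked to a model version together with the serving dataset (see "Defining models" below).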
If you're using an SQL-based data source such as Databricks, Snowflake, Postgres, Glue Data Catalog, Athena, Redshift, or BigQuery, the format of `connection_data` should be a dict with a `query` key, as shown in the code example above.

If you're using a file data source like S3, Azure Blob Storage, or Google Cloud Storage, the `connection_data` dictionary should look like this:

```python
aporia.Dataset(
    ...,
    connection_data={
        # Files to read
        "regex": "my-model/v1/*.parquet",

        # Format of the files - can also be "csv" / "delta" / "json"
        "object_format": "parquet",

        # For CSV, read the first line of the file as column names? (optional)
        # "header": True
    }
)
```

### Column Mapping

Aporia uses a simple dictionary format to map column names to features, predictions, raw inputs, and actuals.

Here's a table describing the different types of field groups that exist within Aporia:
| Group | Description | Required |
| --- | --- | --- |
| Features | Inputs to the model | Yes |
| Predictions | Outputs from the model | Yes |
| Raw Inputs | Any metadata about the prediction. Examples: prediction latency, raw text, gender (might not be a feature of the model, but you may still want to monitor it for bias & fairness, so it's a good fit for raw inputs) | No |
| Actuals | Delayed feedback after the prediction, used to calculate performance metrics. | No |
You can specify each of these field groups as a Python dictionary in the `aporia.Dataset(...)` parameters. The key represents the column name from the file / SQL query, and the value represents the data type:

```python
aporia.Dataset(
    ...,
    features={
        # columnName -> dataType
        "age": "numeric",
    }
)
```

### Data Types

Each column can be one of the following data types:
| Data Type | Description | Value examples |
| --- | --- | --- |
| numeric | Any continuous variable (e.g. score, age, and so on) | 53.4, 0.05, 20 |
| categorical | Any discrete variable (e.g. gender, country, state, etc.) | "US", "California", "5" |
| boolean | Any boolean value | true, false, 0, 1 |
| datetime | Any datetime value | timestamp objects |
| text | Raw text | "Hello, how are you?" |
| array | List of discrete categories | ["flight1911", "flight2020"] |
| embedding | Numeric vectors | [0.58201, 0.293948, ...] |
| image_url | Image URLs | https://my-website.com/img.png |
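Putting the column mapping format and the data types together, a dataset's field declarations might look like the following sketch. The column names here are illustrative assumptions, not part of any real model:

```python
aporia.Dataset(
    ...,
    # Illustrative column names - replace with columns from your own query / files
    raw_inputs={
        "prediction_latency": "numeric",   # continuous metadata
        "raw_text": "text",                # free text
    },
    features={
        "age": "numeric",                  # continuous feature
        "country": "categorical",          # discrete feature
        "signup_time": "datetime",         # timestamp
        "profile_embedding": "embedding",  # numeric vector
        "avatar_url": "image_url",         # image URL
    },
    predictions={
        "recommended_items": "array",      # list of discrete categories
        "will_churn": "boolean",           # boolean output
    },
)
```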
### Actuals / Delayed Labels

To calculate performance metrics in Aporia, you can add actuals (or delayed labels) to the prediction data.

While this data is usually not in the same table as the prediction data, you can use a SQL `JOIN` query to merge the feature / prediction data with the actuals. Aporia will take care of refreshing the data when it is updated.

If you don't have actuals for a prediction yet, the value for the actual should be NULL. Therefore, it's very common to use a `LEFT JOIN` query like this:

```sql
SELECT * FROM model_predictions
LEFT JOIN model_actuals USING (prediction_id)
```

Then, you can use the `actuals` and `actual_mapping` parameters when creating a dataset:

```python
serving_dataset = aporia.Dataset(
    predictions={
        "recommended_items": "array",
    },

    actuals={
        "relevant_items": "array",
    },

    actual_mapping={
        # Actual name -> Prediction name
        "relevant_items": "recommended_items"
    },
)
```

## Defining models

Next, to define a model, simply create an `aporia.Model` object with links to the relevant datasets, and add it to your stack:

```python
model_version = aporia.Version(
    "model_version_v1.0.0",
    serving=serving_dataset,
    training=training_dataset,
    name="v1.0.0",
)

model = aporia.Model(
    "My Model",
    type=aporia.ModelType.RANKING,
    versions=[model_version],
)

stack.add(model)
stack.apply(yes=True, rollback=False, config_path="model.json")
```

---

# Source: https://docs.aporia.com/monitors-and-alerts/alerts-consolidation.md

# Alerts Consolidation

In the following guide we'll explain how to consolidate alerts within Aporia in order to avoid unnecessary noise when multiple alerts originate from the same monitor.

In the following example, we have created a monitor to detect drift across all of the model's features. Out of the 15 features in this model, 8 are drifting.
## Consolidating alerts over time

Let's assume that you don't want to be notified that your features are drifting every time the monitor runs, but rather want a new alert only if your features are still drifting after a week has gone by. In that case, tick the cadence limit checkbox as shown in the following image:
That way you'll have 8 alerts, one for each drifting feature, and every new instance of a specific feature alert will be consolidated with its already existing alert. In the following image you can see that each alert has 3 occurrences, as this monitor has been running for 3 weeks.
When clicking on "View all occurrences" you will be able to see all of the consolidated alerts separately.
## Consolidating multiple features/segments/versions into one alert

Let's assume that you don't want to be notified separately for each drifting feature in this monitor, but rather get notified once for all the drifting features. In that case, tick the grouping limit checkbox as shown in the following image:
That way you'll get only one alert if any of the 8 features are drifting, as you can see in the following image:
When clicking on "View all occurrences" you will be able to see the visualization & explanation for each of the drifting features separately.
Please note that consolidation by fields is only supported in addition to consolidation by time, and can't be used without the time consolidation.

---

# Source: https://docs.aporia.com/core-concepts/analyzing-performance.md
# Source: https://docs.aporia.com/v1/core-concepts/analyzing-performance.md

# Analyzing Performance

### Your model's success is your success

Hooray! Your model is running in production, making predictions in order to improve your business KPIs.

Unfortunately, when it encounters real-world pipelines and data, our model might not perform as well as it did during training. That's why we want to analyze our model's performance over time, to make sure we catch possible degradation in time.

![It's Performance Review Time](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FXemV5x5HfbdRr4xymkIH%2Fits-performance-review-time.jpeg?alt=media)

### Measuring Model Performance

To measure how well your model performs in production, you can use a variety of **performance metrics**. Each metric teaches us about a different aspect of our model's performance. While some people might care about not missing potential leads (e.g. focus on **recall** score), others might prefer to reduce dead ends to minimize costs (e.g. focus on **precision** score).

In addition, no matter which use case you are trying to solve with your model, you'll probably want to analyze its activity over time and ensure there are no anomalous events or ongoing trends in the model's usage.

{% hint style="info" %}
#### How often should I carry out performance analysis?

As models can vary dramatically in their purpose, usage, or production pipelines, there is no single answer. However, here are some questions you should consider while deciding: What is the frequency of the predictions? How frequently do we get the actuals? Are concept drifts common in this domain?
{% endhint %}

### Common Performance metrics

Depending on your use case, you might want to use different performance metrics to decide how well your model performs. For example, nDCG is common when you want to understand the quality of your ranking model, while AUC-ROC is useful when you want to evaluate your binary classification model.

You can read more about all the different metrics and the use cases for which they are useful in our [metric glossary](https://docs.aporia.com/v1/api-reference/metrics-glossary).
## Actuals / Ground Truth In some cases, you will have access to the *actual* value of the prediction - the ground truth from real-world data. For example, if your model predicted that a client will buy insurance, and a few days later the client actually does so, then the actual value of that prediction would be `True`. In these scenarios, we can compare our predictions to the actual values and then calculate performance metrics like Precision, Recall, MSE, etc. - just like in training. By connecting Aporia with your actual values, the system will be able to calculate performance metrics in real-time for you. In this example, you can see the Precision metric across two model versions in production:
---

# Source: https://docs.aporia.com/api-reference/api-extended-reference.md

# API Extended Reference

## Create Dataset

Complete API documentation can be found [here](https://platform.aporia.com/api/v1/docs#tag/Datasets/operation/connect_dataset_api_v1__account_name___workspace_name__datasets_post)

The `config` parameter can include the following keys:

* **header**: Applicable only to AWS S3, Google Cloud Storage, and Azure Blob. This flag indicates whether Spark should interpret the first line of the file as column names. Example: `"header": false`. Only relevant for CSV files.
* **infer\_schema**: Applicable only to AWS S3, Google Cloud Storage, and Azure Blob for CSV data. This flag determines whether the column types should be automatically inferred when reading the file. Example: `"infer_schema": false`. In most cases, this can be left out.
* **object\_format**: Applicable only to AWS S3, Google Cloud Storage, and Azure Blob. Defines the format of the data files. Possible values are `parquet`, `csv`, `json`, `delta`. Example: `"object_format": "parquet"`.
* **query**: Specifies the dataset query used to retrieve data. Example: `"query": "SELECT f1, f2 FROM data_table"`. Applicable to all data sources. Note: Some data sources may use different SQL dialects. For blob-storage data sources, this SQL is SparkSQL and you can access the file data using `{data}` as a table name.
* **regex**: Applicable only to AWS S3, Google Cloud Storage, and Azure Blob. Defines a regular expression for the data bucket/path. Example: `"regex": "demo-data/demo-fraud-model-data.parquet"`.

## Create Monitor

Complete API documentation can be found [here](https://platform.aporia.com/api/v1/docs#tag/Monitors-\(Experimental\)/operation/create_monitor_api_v1__account_name___workspace_name__monitors_post)

* **type**: Specifies the monitor type. Possible values are: `"model_activity"`, `"missing_values"`, `"data_drift"`, `"prediction_drift"`, `"values_range"`, `"new_values"`, `"model_staleness"`, `"performance_degradation"`, `"metric_change"`, `"custom_metric"`, `"code_based_metric"`.
* **scheduling**: A CRON string that determines the monitor's execution schedule. For example, setting the scheduling string to `0 0 * * *` will trigger the monitor every day at midnight.
* **is\_active**: A flag indicating whether the monitor is active or not.
* **configuration**: Contains all settings for the monitor. This parameter may include the following keys:
  * **identification**: Specifies the data on which the monitor should run. It may include:
    * **models**: Defines the model and its versions for monitoring. Includes `"id"` for the model ID and `"version"` for the model version ID. To monitor per active version, set the version to `null`; for all versions, use `"all_versions"`; for the latest version, use `"latest"`. Example: `"models": {"id": "929ec979-4108-4397-ba59-ad639b7271e8", "version": "338dc00f-8e56-44b9-af90-f0e6507c9b09"}`. This field must always be set.
    * **segment**: Specifies the data segments to monitor. Includes `group` for the segment group ID and `value` for specific segment values. To include all segment values, set to `null`; for categorical or boolean values, specify the value (e.g., `"NY"`); for numeric values, provide the lower bound (e.g., for age range 18-23, use `"18"`). Example: `"segment": {"group": "f2ad3ba0-9ff8-4b95-8d20-2490a06d026a", "value": 23}`.
    * **features, raw\_inputs, predictions, actuals**: Specifies which features, raw inputs, predictions, or actuals to monitor.
The value should be an array of fields, each represented by an array containing the field's name and type. Example: `"features": [["age", "numeric"], ["is_insured", "boolean"], ["country", "categorical"]]`. For monitors which inspect specific fields (drift, metric change, performance degradation, value ranges, new values), these fields must be set.

**Example of a complete identification configuration**:

```json
{
  "identification": {
    "models": {
      "id": "7cf1a165-f321-452c-b8d7-2062e215cf37",
      "version": null
    },
    "segment": {
      "group": "f2ad3ba0-9ff8-4b95-8d20-2490a06d026a",
      "value": "AZ"
    },
    "features": [
      "age",
      "is_insured",
      "country"
    ]
  }
}
```

  * **configuration**: Used to define the monitor settings. It may include:
    * **focal**: Configurations for the data being analyzed. Potential keys include:
      * **source**: Data source, which can be `"SERVING"`, `"TRAINING"`, or `"TEST"`. Example: `"source": "SERVING"`.
      * **skipPeriod**: A time period, starting from the monitor execution time, for which data calculations are skipped, formatted as a time string (e.g., `"2h"`, `"1w"`). Example: `"skipPeriod": "3h"` will skip all data records from the last 3 hours.
      * **timePeriod**: The time frame for data calculation, formatted as a time string (e.g., `"2h"`, `"1w"`). Example: `"timePeriod": "1w"` recalculates all data within the relevant week.
    * **baseline**: Configurations for the baseline data being compared against. Potential keys include:
      * **source**: Data source, which can be `"SERVING"` or `"TRAINING"`. Example: `"source": "SERVING"`.
      * **skipPeriod**: Similar to `focal`, specifies a time period from NOW for which data calculations are skipped. Example: `"skipPeriod": "3h"`. Note that it is common practice to set the skip period of the baseline equal to the time period of the focal, to match the timeframes.
      * **timePeriod**: Time frame for data calculation, formatted as a time string (e.g., `"2h"`, `"1w"`). The time period starts after the skip period. Example: `"timePeriod": "1w"`.
      * **segmentGroupId**: When comparing data between two segments, this is the segment group ID for the baseline data. Example: `"segmentGroupId": "f2ad3ba0-9ff8-4b95-8d20-2490a06d026a"`.
      * **segmentValue**: When comparing data between segments, this specifies the segment value for the baseline data. For categorical or boolean values, provide the value (e.g., `"NY"`); for numeric values, provide the lower bound (e.g., for an age range of 18-23, use `"18"`). Example: `"segmentValue": 0`.
    * **aggregationPeriod**: Defines the data aggregation period for creating a comparison timeline. Example: `"aggregationPeriod": "1d"` aggregates data into daily buckets.
This is needed for anomaly detection monitors for simple metrics (i.e. activity, performance degradation, metric change, custom metric, code-based metric).
    * **logicEvaluations**: Defines the monitor's logic. Should be an array containing a single object with possible keys:
      * **name**: Name of the detection. Valid options are `APORIA_DRIFT_SCORE`, `MODEL_STALENESS`, `RANGE`, `RATIO`, `TIME_SERIES_ANOMALY`, `TIME_SERIES_RATIO_ANOMALY`, `VALUES_RANGE`. Example: `"name": "APORIA_DRIFT_SCORE"`.
      * **min**/**max**: Relevant for `MODEL_STALENESS`, `RANGE`, `RATIO`, `VALUES_RANGE`. Minimum or maximum threshold for the detection, if applicable. Example: `"min": 0.5`, `"max": 1.5`.
      * **thresholds**: Relevant for `APORIA_DRIFT_SCORE`. Specifies thresholds for drift detection, which can include different thresholds for numeric, categorical, or vector drifts. Example: `"thresholds": {"vector": 0.2, "numeric": 1, "categorical": 1}`.
      * **sensitivity**: Relevant for `TIME_SERIES_ANOMALY`, `TIME_SERIES_RATIO_ANOMALY`. Sensitivity threshold for time series anomaly detection. Example: `"sensitivity": 0.15`.
      * **testOnlyIncrease**: Relevant for `TIME_SERIES_ANOMALY`. A flag to alert only if the anomaly value exceeds the expected range (commonly set for missing values ratio detections). Default is `false`. Example: `"testOnlyIncrease": true`.
      * **new\_values\_count\_threshold**: Relevant for `VALUES_RANGE` detections. Sets the maximum allowed number of new values. Example: `"new_values_count_threshold": 1`.
      * **new\_values\_ratio\_threshold**: Relevant for `VALUES_RANGE` detections. Defines the maximum allowed ratio between new values and previously observed values. Example: `"new_values_ratio_threshold": 0.01`.
      * **distance**: Relevant for `VALUES_RANGE` detections. Defines the maximum gap between focal and baseline minimum & maximum values. Example: `"distance": 0.2`.
      * **values**: Relevant for `VALUES_RANGE` detections. Defines a list of allowed values for categorical fields. Example: `"values": ["a", "b", "c"]`.
    * **metric**: Configurations for the metrics being monitored. Potential keys include:
      * **type**: Specifies the metric type. Acceptable values are `count`, `column_count`, `mean`, `min`, `max`, `sum`, `squared_sum`, `missing_count`, `missing_ratio`, `squared_deviation_sum`, `histogram`, `ks_distance`, `js_distance`, `hellinger_distance`, `accuracy`, `precision`, `recall`, `f1`, `mse`, `rmse`, `mae`, `tp_count`, `fp_count`, `tn_count`, `fn_count`, `custom_metric`, `absolute_sum`, `absolute_error_sum`, `squared_error_sum`, `accuracy_at_k`, `precision_at_k`, `recall_at_k`, `mrr_at_k`, `map_at_k`, `ndcg_at_k`, `psi`, `tp_count_per_class`, `tn_count_per_class`, `fp_count_per_class`, `fn_count_per_class`, `accuracy_per_class`, `precision_per_class`, `recall_per_class`, `f1_per_class`, `min_length`, `max_length`, `mean_length`, `sketch_histogram`, `euclidean_distance`, `unique_values`, `variance`, `median`, `value_count`, `auc_roc`, `code`, `auuc`. Example: `"type": "missing_ratio"`.
      * **id**: Relevant only for `code` and `custom_metric` type metrics. Specifies the ID of the code-based or custom metric. Example: `"id": "929ec979-4108-4397-ba59-ad639b7271e8"`.
      * **threshold**: Applicable for metrics like `accuracy`, `f1`, `precision`, `recall`, `tp_count`, `fp_count`, `tn_count`, `fn_count`. Sets a threshold value for numeric columns. Example: `"threshold": 0.5`.
      * **metricAtK**: Applicable for metrics like `accuracy_at_k`, `precision_at_k`, `recall_at_k`, `mrr_at_k`, `map_at_k`, `ndcg_at_k`.
Specifies the `k` value for the metric. Example: `"metricAtK": 3`.
      * **metricPerClass**: Relevant for metrics like `tp_count_per_class`, `tn_count_per_class`, `fp_count_per_class`, `fn_count_per_class`, `accuracy_per_class`, `precision_per_class`, `recall_per_class`, `f1_per_class`. Specifies the class name to calculate the metric on. Example: `"metricPerClass": "0"`.
      * **average**: Relevant for metrics like `recall`, `precision`, `f1`. Specifies the averaging strategy (micro / macro / weighted).
    * **preConditions**: Defines preconditions that the data must satisfy before the logic evaluation runs. Can include a list of objects, where each is a pre-condition to verify, with potential keys:
      * **name**: Specifies the name of the precondition. Options include:
        * `BASELINE_DATA_VALUE_IN_RANGE` - verifies that the value of the baseline data is within a given range.
        * `FOCAL_DATA_VALUE_IN_RANGE` - verifies that the value of the focal data is within a given range.
        * `BASELINE_MIN_BUCKETS` - verifies that the number of buckets with data in the baseline data exceeds a minimum quantity.
        * `MIN_BASELINE_DATA_POINTS` - verifies that the number of data points in the baseline data exceeds a minimum quantity.
        * `MIN_FOCAL_DATA_POINTS` - verifies that the number of data points in the focal data exceeds a minimum quantity.
        * `IGNORE_TRAILING_ZEROS` - removes trailing empty data buckets from the baseline data and verifies that the number of buckets with data exceeds a minimum quantity.

        Example: `"name": "MIN_FOCAL_DATA_POINTS"`.
      * **min / max**: Applicable for `BASELINE_DATA_VALUE_IN_RANGE`, `FOCAL_DATA_VALUE_IN_RANGE`. Defines the minimum or maximum values for the data. Example: `"min": 0.01`.
      * **value**: Relevant for `BASELINE_MIN_BUCKETS`, `MIN_BASELINE_DATA_POINTS`, `MIN_FOCAL_DATA_POINTS`. Specifies the minimum value for buckets with data, baseline data points, or focal data points. Example: `"value": 20`.
      * **minimumTimeWindowsInBaseline**: Relevant for `IGNORE_TRAILING_ZEROS`. Specifies the minimum required number of buckets with data in the baseline data, ignoring trailing zeros. Example: `"minimumTimeWindowsInBaseline": 3`.
      * **Example of a complete pre-conditions object**: `[{"name": "MIN_FOCAL_DATA_POINTS", "value": 20}, {"min": 0.01, "name": "FOCAL_DATA_VALUE_IN_RANGE"}, {"name": "MIN_BASELINE_DATA_POINTS", "value": 100}]`.
    * **actions**: Defines the action notifications triggered when an alert is detected. This should be an array containing a single element, with potential keys:
      * **alertType**: Specifies the alert type, affecting how it is displayed on Aporia's dashboard. Options include `feature_missing_values_threshold`, `metric_change`, `model_activity_anomaly`, `model_activity_change`, `model_activity_threshold`, `model_staleness`, `new_values`, `prediction_drift_anomaly`, `prediction_drift_segment_change`, `prediction_drift_training`, `values_range`. Example: `"alertType": "prediction_drift_anomaly"`.
      * **alertGroupByEntity**: Flag indicating whether the alert should be grouped by version, data segment, etc., or just by the monitor. Defaults to `true`.
      * **description**: A text template for the alert description displayed in the Aporia dashboard. The description can include HTML tags and placeholders, which will be replaced with actual alert data.
Available placeholders include `model_id`, `model_name`, `model`, `model_version`, `field`, `importance`, `min_threshold`, `max_threshold`, `last_upper_bound`, `last_lower_bound`, `focal_value`, `baseline_value`, `focal_time_period`, `focal_times`, `baseline_time_period`, `baseline_times`, `baseline_segment`, `focal_segment`, `value_thresholds`, `unexpected_values`, `last_deployment_time`, `time_threshold`, `metric`, `drift_score`, `drift_score_text`. Example: `"description": "An anomaly in the value of the '{metric}' of {field} {importance} was detected. The anomaly was observed in the {model} model, in version {model_version} for the last {focal_time_period} ({focal_times}) {focal_segment}. Based on the metric history average and on the defined threshold, the value was expected to be below {max_threshold}, but {focal_value} was received."`.
      * **alertGroupTimeUnit**: Defines the time unit to group alerts by. Possible values are `"h"`, `"d"`, or `"w"`. Example: `"alertGroupTimeUnit": "h"`.
      * **alertGroupTimeQuantity**: Specifies the number of time units for alert grouping (corresponding to `alertGroupTimeUnit`). Example: `"alertGroupTimeQuantity": 24`.
      * **schema**: Should always be `"v1"`.
      * **notification**: Configurations for sending notifications for new alerts. Contains an array of objects, each with possible keys:
        * **type**: Notification type, which can be `"EMAIL"`, `"SLACK"`, `"TEAMS"`, or `"WEBHOOK"`.
        * **integration\_id**: For Slack, Teams and Webhook notifications, this specifies the integration ID. Example: `"integration_id": "a4b67272-2cf3-4fe1-82ff-2ac99c125215"`.
        * **emails**: For email notifications, this lists the email addresses to send the alert to. Example: `"emails": ["ben@org.com", "john@org.com"]`.
        * **Example configuration**: `[{"name": "Aporia - Teams", "type": "TEAMS", "webhook_url": "https://org.webhook.office.com/webhookb2/1234", "integration_id": "deff3bf8-a4bd-465a-887b-72f312f8d511"}, {"type": "EMAIL", "emails": ["jane@org.com"]}]`.
      * **visualization**: Specifies the graph type for visualization on Aporia's dashboard. Options include `null` (for no graph display), `'value_over_time'`, `'range_line_chart'`, `'embedding_drift_chart'`, `'values_candlestick_chart'`, `'distribution_compare_chart'`. Example: `"visualization": "distribution_compare_chart"`.
      * **severity**: Specifies the alert's severity level, which can be `"HIGH"`, `"MEDIUM"`, or `"LOW"`. Example: `"severity": "LOW"`.
      * **alertGroupByTime**: A flag indicating whether to merge similar alerts by time. Defaults to `true`.
      * **maxAlertsPerDay**: If set, defines the maximum number of alerts that can be generated per day from this monitor. Example: `"maxAlertsPerDay": 1`.
      * **type**: Should always be `"ALERT"`.

**Example of a complete configuration object**:
```json
{
  "focal": {
    "source": "SERVING",
    "skipPeriod": "1d",
    "timePeriod": "1d"
  },
  "metric": {
    "type": "js_distance"
  },
  "actions": [
    {
      "type": "ALERT",
      "schema": "v1",
      "severity": "MEDIUM",
      "alertType": "prediction_drift_anomaly",
      "description": "A prediction drift was detected in prediction <b>'{field}'</b> {importance}.{drift_score_text}<br /> The drift was observed in the <b>{model}</b> model, in version <b>{model_version}</b> for the <b>last {focal_time_period} ({focal_times})</b> <b>{focal_segment}</b> compared to the <b>last {baseline_time_period} ({baseline_times})</b>. <br /><br /> Prediction drift indicates a significant change in model's behavior. In some cases, it is a strong indicator for concept drift.<br /><br /> Prediction drift might occur because: <ul><li>Natural changes in data</li><li>Data store / provider schema changes</li><li>Data store / provider issues</li><li>Data processing issues</li></ul>",
      "notification": [
        {
          "type": "SLACK",
          "integration_id": "df81469d-b78d-4010-94a5-8769c4f8ed3b"
        }
      ],
      "visualization": "distribution_compare_chart",
      "alertGroupByTime": true,
      "alertGroupByEntity": false,
      "alertGroupTimeUnit": "d",
      "alertGroupTimeQuantity": 1
    }
  ],
  "baseline": {
    "source": "SERVING",
    "skipPeriod": "2d",
    "timePeriod": "3w"
  },
  "preConditions": [
    {
      "name": "MIN_FOCAL_DATA_POINTS",
      "value": 1000
    }
  ],
  "logicEvaluations": [
    {
      "name": "APORIA_DRIFT_SCORE",
      "thresholds": {
        "numeric": 0.8
      }
    }
  ]
}
```
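To tie this together, here is a rough sketch of creating a monitor with such a configuration through the REST API, following the same base URL and bearer-token pattern used in the code-based metrics guide further below. The exact endpoint path (`/monitors`, inferred from the linked operation ID) and the payload envelope (e.g. the `name` field) are assumptions - consult the complete API documentation linked above for the authoritative schema:

```python
import os

import requests

ACCOUNT = os.environ["APORIA_ACCOUNT"]
WORKSPACE = os.environ["APORIA_WORKSPACE"]
API_KEY = os.environ["APORIA_TOKEN"]

BASE_URL = f"https://platform.aporia.com/api/v1/{ACCOUNT}/{WORKSPACE}"

# Assumed payload envelope: the keys documented above (type, scheduling,
# is_active, configuration) plus an illustrative monitor name.
monitor = {
    "name": "Prediction drift - my model",  # assumption
    "type": "prediction_drift",
    "scheduling": "0 0 * * *",              # every day at midnight
    "is_active": True,
    "configuration": {
        # Paste the "complete configuration object" shown above here
    },
}

# Endpoint path inferred from the linked operation ID - verify against the API docs.
response = requests.post(
    f"{BASE_URL}/monitors",
    json=monitor,
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()
print(response.json())
```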
## Create Model

Complete API documentation can be found [here](https://platform.aporia.com/api/v1/docs#tag/Models/operation/create_model_api_v1__account_name___workspace_name__models_post)

The `recalculation_schedules` parameter defines the periods for data recalculation, allowing for the configuration of multiple schedules. You can set up to 5 recalculation schedules per model. Each schedule consists of the following components:

* **skip\_period**: Specifies the duration, starting from the current time, during which data will be excluded from the calculation. The period is defined using a time string format (e.g., `"2h"` for 2 hours, `"1w"` for 1 week). For example, setting `"skip_period": "3h"` will skip data records from the last 3 hours. For daily models, hours are not applicable.
* **calculation\_window**: Defines the time frame over which the data should be recalculated. This is also specified using a time string format (e.g., `"2h"` for 2 hours, `"1w"` for 1 week). For instance, setting `"calculation_window": "1w"` will trigger a recalculation of all relevant data for the past week. For daily models, hours are not applicable.
* **calculation\_schedule**: A CRON expression that specifies when the recalculation process should be initiated. For example, `"0 0 * * *"` sets the recalculation to start every day at midnight.

For instance, setting the parameters as `{"skip_period": "3h", "calculation_window": "1w", "calculation_schedule": "0 0 * * *"}` results in a recalculation process that is triggered daily at midnight. This process recalculates one week of data, ending 3 hours before the trigger time. So, if the recalculation is triggered on August 20 at midnight, it will process data from August 12 at 21:00 up to August 19 at 21:00 (the last 3 hours are skipped).

---

# Source: https://docs.aporia.com/aporia-docs.md

# Aporia Docs

Data Science and ML teams rely on Aporia to **visualize** their models in production, as well as **detect and resolve** data drift, model performance degradation, and data integrity issues. Aporia offers quick and simple deployment and can monitor billions of predictions with low cloud costs.

We understand that use cases vary and each model is unique. That's why we've cemented **customization** at our core, to allow our users to tailor their dashboards, monitors, metrics, and data segments to their needs.
## Monitor your models in 3 easy steps
| Step | Description | Docs |
| --- | --- | --- |
| Learn | Learn about data drift, measuring model performance in production across various data segments, and other ML monitoring concepts. | why-monitor-ml-models |
| Connect | Connect to an existing database where you already store the predictions of your models. | data-sources |
| Monitor | Build a dashboard to visualize your model in production and create alerts to notify you when something bad happens. | monitors-and-alerts |
--- # Source: https://docs.aporia.com/data-sources/athena.md # Source: https://docs.aporia.com/v1/data-sources/athena.md # Athena This guide describes how to connect Aporia to an Athena data source in order to monitor a new ML Model in production. We will assume that your model inputs, outputs and optionally delayed actuals can be queried with Athena SQL. This data source may also be used to connect to your model's training/test set to be used as a baseline for model monitoring. ### Create a workgroup for Aporia queries Create a workgroup for Aporia to use to perform queries, see instructions [here](https://docs.aws.amazon.com/athena/latest/ug/workgroups-procedure.html). An S3 location (bucket and folder) to which query results will be written must be designated. It is recommended that the bucket be in the same region as the catalog that Athena uses. ### Create a IAM role for Athena access In order to provide access to Athena, create a IAM role with the necessary API permissions. First, create a JSON file on your computer with the following content: ```json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:ListBucket", "s3:GetBucketLocation" ], "Resource": [ "arn:aws:s3:::", "arn:aws:s3:::" ] }, { "Effect": "Allow", "Action": "s3:GetObject", "Resource": [ "arn:aws:s3:::/*", "arn:aws:s3:::/*" ] }, { "Effect": "Allow", "Action": "s3:PutObject", "Resource": [ "arn:aws:s3:::/*" ] }, { "Effect": "Allow", "Action": [ "athena:StartQueryExecution", "athena:StopQueryExecution", "athena:GetQueryResults" ], "Resource": "arn:aws:athena:::workgroup/" }, { "Effect": "Allow", "Action": "athena:ListWorkGroups", "Resource": "*" }, { "Effect": "Allow", "Action": "athena:ListDatabases", "Resource": [ "arn:aws:athena:::datacatalog/*" ] }, { "Effect": "Allow", "Action": "glue:GetDatabases", "Resource": [ "arn:aws:glue:::catalog", "arn:aws:glue:::database/" ] }, { "Effect": "Allow", "Action": [ "athena:GetQueryExecution", "athena:BatchGetQueryExecution", "athena:ListQueryExecutions", "athena:GetWorkGroup" ], "Resource": [ "arn:aws:athena:::workgroup/*", "arn:aws:athena:::datacatalog/*" ] }, { "Effect": "Allow", "Action": [ "glue:GetTables", "glue:GetTable", "glue:GetPartitions", "glue:GetPartition" ], "Resource": [ "arn:aws:glue:::catalog", "arn:aws:glue:::database/", "arn:aws:glue:::table//*" ] } ] } ``` Make sure to replace the following placeholders: * ``: You can specify the Athena AWS region or `*` for the default region. * ``: The Athena AWS account ID. * ``: The S3 bucket storing the data for your Athena tables - if more than one bucket, just add the others to the resource list as well. * ``: You can specify one or more database names or use `*` to give Aporia access to all Athena databases. * ``: The workgroup created on the previous step. * ``: The bucket configured for the workgroup. Next, create a new user in AWS with programmatic access only, and grant it the role you've just created. Create security credentials for it (access and secret keys) and use them in the next section. {% hint style="info" %} **IAM Authentication** For authentication without security credentials (access key and secret key), please contact your Aporia account manager. {% endhint %} ### Creating an Athena data source in Aporia To create a new model to be monitored in Aporia, you can call the `aporia.create_model(...)` API: ```python aporia.create_model("", "") ``` Each model in Aporia contains different **Model Versions**. 
When you (re)train your model, you should create a new model version in Aporia.

```python
apr_model = aporia.create_model_version(
    model_id="",
    model_version="v1",
    model_type="binary",
    raw_inputs={
        "raw_text": "text",
    },
    features={
        "amount": "numeric",
        "owner": "string",
        "is_new": "boolean",
        "embeddings": {"type": "tensor", "dimensions": [768]},
    },
    predictions={
        "will_buy_insurance": "boolean",
        "proba": "numeric",
    },
)
```

Each raw input, feature or prediction is mapped by default to the column of the same name in the Athena query. By creating a feature named `amount` or a prediction named `proba`, for example, the Athena data source will expect a column in the Athena query named `amount` or `proba`, respectively.

Next, create an instance of `AthenaDataSource` and pass it to `apr_model.connect_serving(...)` or `apr_model.connect_training(...)`:

```python
data_source = AthenaDataSource(
    url="jdbc:awsathena://AwsRegion=us-east-1",
    query='SELECT * FROM "my_db"."model_predictions"',
    user="",
    password="",
    s3_output_location="s3://my-athena-bucket",

    # Optional - use the select_expr param to apply additional Spark SQL
    select_expr=["", ...],

    # Optional - use the read_options param to apply any Spark configuration
    # (e.g. custom Spark resources necessary for this model)
    read_options={...}
)

apr_model.connect_serving(
    data_source=data_source,

    # Names of the prediction ID and prediction timestamp columns
    id_column="prediction_id",
    timestamp_column="prediction_timestamp",
)
```

Note that as part of the `connect_serving` API, you are required to specify 2 additional columns:

* `id_column` - A unique ID to represent this prediction.
* `timestamp_column` - A column representing when this prediction occurred.

### What's Next

For more information on:

* Advanced feature / prediction <-> column mapping
* How to integrate delayed actuals
* How to integrate training / test sets

Please see the [Data Sources Overview](https://docs.aporia.com/v1/data-sources) page.

---

# Source: https://docs.aporia.com/storing-your-predictions/batch-models.md
# Source: https://docs.aporia.com/v1/storing-your-predictions/batch-models.md

# Batch Models

If your model runs periodically every X days, we refer to it as a **batch model** (as opposed to a real-time model).

Typically, storing the predictions of batch models is straightforward. The code examples that follow are naive "illustrations" of how to do so.

### Example: Pandas to Parquet on S3

If you use Pandas, you can append any `DataFrame` to a Parquet file on S3 or other cloud storage by using the [fastparquet](https://fastparquet.readthedocs.io/en/latest/) library:

```python
import fastparquet

# Preprocess & predict
X = preprocess(...)
y = model.predict(X)

# Concatenate features, predictions and any other metadata
df = ...

# Store predictions
fastparquet.write(
    filename=f"s3://my-models/{MODEL_ID}/{MODEL_VERSION}/serving.parquet",
    data=df,
    append=True,
)
```

### Example: Pyspark to Delta Lake

This example is especially useful on [Databricks](https://www.databricks.com/), but you can also use it on [Delta Lake](https://delta.io/) + [Spark on K8s operator](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator), for example:

```python
# Predict on SparkML
y = model.transform(X)

# Concatenate features, predictions and any other metadata
df = ...
# Append to a Delta table
df.write.format("delta").mode("append").saveAsTable("my_model_serving")
```

---

# Source: https://docs.aporia.com/data-sources/big-query.md

# BigQuery

This guide describes how to connect Aporia to a BigQuery data source in order to monitor your ML Model in production.

We will assume that your model inputs, outputs and optionally delayed actuals can be queried with SQL. This data source may also be used to connect to your model's training set to be used as a baseline for model monitoring.

### Create a materialization dataset for Aporia queries

Create a materialization dataset for Aporia to use to perform queries (see instructions [here](https://cloud.google.com/bigquery/docs/datasets#create-dataset)). A separate materialization dataset location, to which query results will be written, must be designated for each project from which you want to query.

### Update the Aporia Service Account for BigQuery access

In order to provide access to BigQuery, you'll need to update your Aporia service account with the necessary API permissions.

#### Step 1: Obtain your Aporia service account

Use the same service account used for the Aporia deployment. If someone else on your team has deployed Aporia, please reach out to them to obtain it.

#### Step 2: Grant read access to the relevant project

1. Go to the [IAM console](https://console.cloud.google.com/iam-admin/) and login.
2. Find the Aporia service account you obtained in the previous step and click on 🖋 **Edit Principal**
3. In the "Edit access" window click on **ADD ANOTHER ROLE**
4. Add the `BigQuery Data Viewer` and `BigQuery Job User` roles and click **Save**
#### Step 3: Grant access to the materialization dataset 1. Go to the [BigQuery console](https://console.cloud.google.com/bigquery) and login. 2. In the left-hand panel, expand the relevant project and find the materialization dataset you created in the previous steps.
3. Click on "**...**" by the dataset name, then click on **Share** 4. In the "Share permissions" window click on **Add Principal**
5. In the "New principal" box, enter the email of the Aporia service account you have obtained. Choose the `BigQuery Data Editor` role and click **Save**.
Now Aporia has the permission it needs to connect to the BigQuery datasets and tables you have specified in the policy. ### Create a BigQuery data source in Aporia 1. Go to [Aporia platform](https://platform.aporia.com/) and login to your account. 2. Go to **Integrations** page and click on the **Data Connectors** tab 3. Scroll to **Connect New Data Source** section 4. Click **Connect** on the BigQuery card and follow the instructions Bravo! :clap: now you can use the data source you've created across all your models in Aporia. --- # Source: https://docs.aporia.com/v1/data-sources/bigquery.md # BigQuery This guide describes how to connect Aporia to a BigQuery data source in order to monitor a new ML Model in production. We will assume that your model inputs, outputs and optionally delayed actuals are stored in a BigQuery table, or can be queried with a BigQuery view. The BigQuery data source may also be used to connect to your model's training/test set to be used as a baseline for model monitoring. ### Creating a service account First, create a read-only service account for Aporia: 1. Under *IAM & Admin*, go to the *Service Accounts* section in your Google Cloud Platform console. 2. Click the *Create Service Account* button at the top of the tab. 3. Give the account a name and continue. We recommend naming the account "aporia". 4. Assign the `roles/bigquery.jobUser` role to the service account. 5. Click the *Create Key* button, select JSON as the type and click *Create*. A JSON file will be downloaded – please keep it safe. 6. Click *Done* to complete the creation of Aporia’s service account. Next, add permissions to the relevant tables / views: 1. Go to the BigQuery service in your Google Cloud Platform console. 2. In the *Explorer* panel, expand your project and select a dataset. 3. Expand the dataset and select a table or view. 4. Click *Share*. 5. On the Share tab, Click *Add Principal*. 6. In *New principals*, enter the name of the Service Account you've created for Aporia in the previous step. 7. Select the `roles/bigquery.dataViewer` role. 8. Click *Save* to save the changes for the new user. {% hint style="info" %} **ServiceAccount credentials** For authentication without service account credentials, please contact your Aporia account manager. {% endhint %} ### Creating a BigQuery data source in Aporia To create a new model to be monitored in Aporia, you can call the `aporia.create_model(...)` API: ```python aporia.create_model("", "") ``` Each model in Aporia contains different **Model Versions**. When you (re)train your model, you should create a new model version in Aporia. ```python apr_model = aporia.create_model_version( model_id="", model_version="v1", model_type="binary" raw_inputs={ "raw_text": "text", }, features={ "amount": "numeric", "owner": "string", "is_new": "boolean", "embeddings": {"type": "tensor", "dimensions": [768]}, }, predictions={ "will_buy_insurance": "boolean", "proba": "numeric", }, ) ``` Each raw input, feature or prediction is mapped by default to the column of the same name in the BigQuery table or view. By creating a feature named `amount` or a prediction named `proba`, for example, the BigQuery data source will expect a column in the BigQuery table named `amount` or `proba`, respectively. If your data format does not fit exactly, you can use [BigQuery Views](https://cloud.google.com/bigquery/docs/views) to shape it in any way you want. 
Next, create an instance of `BigQueryDataSource` and pass it to `apr_model.connect_serving(...)` or `apr_model.connect_training(...)`: ```python data_source = BigQueryDataSource( credentials_base64=base64.b64encode(""), # Instead of table, you can also use a BigQuery view for custom queries table="my_model", dataset="", # Optional project="", # Optional parent_project="", # Optional # Optional - use the select_expr param to apply additional Spark SQL select_expr=["", ...], # Optional - use the read_options param to apply any Spark configuration # (e.g custom Spark resources necessary for this model) read_options={...} ) apr_model.connect_serving( data_source=data_source, # Names of the prediction ID and prediction timestamp columns id_column="prediction_id", timestamp_column="prediction_timestamp", ) ``` Note that as part of the `connect_serving` API, you are required to specify additional 2 columns: * `id_column` - A unique ID to represent this prediction. * `timestamp_column` - A column representing when did this prediction occur. ### What's Next For more information on: * Advanced feature / prediction <-> column mapping * How to integrate delayed actuals * How to integrate training / test sets Please see the [Data Sources Overview](https://docs.aporia.com/v1/data-sources) page. --- # Source: https://docs.aporia.com/model-types/binary.md # Source: https://docs.aporia.com/v1/model-types/binary.md # Binary Classification Binary classification models predict a binary outcome (one of two possible classes). In Aporia, these models are represented by the binary model type. Examples of binary classification problems: * Will the customer `buy` this product or `not_buy` this product? * Is this email `spam` or `not_spam`? * Is this review written by a `customer` or a `robot`? Frequently, binary models output not only a yes/no answer, but also a *probability*. ### Example: Boolean Decision without Probability If you have a model with a yes/no decision but without a probability value, then your database may look like the following:
| id | feature1 (numeric) | feature2 (boolean) | decision (boolean) | label (boolean) | timestamp (datetime) |
| --- | --- | --- | --- | --- | --- |
| 1 | 13.5 | True | True | True | 2014-10-19 10:23:54 |
| 2 | -8 | False | False | True | 2014-10-19 10:24:24 |
To monitor this model, we will create a new model version with a schema that includes a `boolean` prediction:

```python
apr_model = aporia.create_model_version(
    model_id="",
    model_version="v1",
    model_type="binary",
    features={
        ...
    },
    predictions={
        "decision": "boolean",
    },
)
```

To connect this model to Aporia from your data source, call the `connect_serving(...)` API:

```python
apr_model.connect_serving(
    data_source=my_data_source,
    id_column="id",
    timestamp_column="timestamp",

    # Map the "label" column as the label for the "decision" prediction.
    labels={
        # Prediction name -> Column name
        "decision": "label"
    }
)
```

Check out the [Data Sources](https://docs.aporia.com/v1/data-sources) section for further reading on the available data sources and how to connect to each one of them.

### Example: Boolean Decision with Probability

If you have a model with a yes/no decision *and* a probability / confidence value for it, then your database may look like the following:
| id | feature1 (numeric) | feature2 (boolean) | proba (numeric) | decision (boolean) | label (boolean) | timestamp (datetime) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 13.5 | True | 0.8 | True | True | 2014-10-19 10:23:54 |
| 2 | -8 | False | 0.5 | False | True | 2014-10-19 10:24:24 |
To monitor this model, it's recommended to create a new model version with a schema that includes the final decision as a `boolean` field, and the probability as a `numeric` field:

```python
apr_model = aporia.create_model_version(
    model_id="",
    model_version="v1",
    model_type="binary",
    features={
        ...
    },
    predictions={
        "decision": "boolean",
        "proba": "numeric",
    },
)
```

To connect the model to Aporia from a data source, call the `connect_serving(...)` API:

```python
apr_model.connect_serving(
    data_source=my_data_source,
    id_column="id",
    timestamp_column="timestamp",

    # Map the "label" column as the label for "decision" and "proba".
    labels={
        # Prediction name -> Column name
        "decision": "label",
        "proba": "label",
    }
)
```

Check out the [Data Sources](https://docs.aporia.com/v1/data-sources) section for further reading on the available data sources and how to connect to each one of them.

### Example: Probability Only

In cases where there is no threshold for your boolean prediction, and the final business result is actually a probability, you may simply omit the `decision` field from the examples in the previous section and only include the `proba` field for your prediction.

{% hint style="info" %}
**Don't want to connect to a database?**

Don't worry - you can [log your predictions directly to Aporia.](https://docs.aporia.com/v1/storing-your-predictions/logging-to-aporia-directly)
{% endhint %}

---

# Source: https://docs.aporia.com/v1/integrations/bodywork.md

# Bodywork

![Bodywork](https://github.com/aporia-ai/docs2/blob/main/images/bodywork.png#center)

[Bodywork](https://bodywork.readthedocs.io/en/latest/) deploys machine learning projects developed in Python to Kubernetes. It helps you:

* serve models as microservices
* execute batch jobs
* run reproducible pipelines

On demand, or on schedule, Bodywork automates repetitive DevOps tasks and frees machine learning engineers to focus on what they do best - solving data problems with machine learning.

### Aporia & Bodywork Integration

This integration enables you to easily monitor models deployed with Bodywork for issues such as model drift, performance degradation, and more.

Check out [the example project](https://github.com/bodywork-ml/bodywork-pipeline-with-aporia-monitoring) on GitHub to learn more.

![Bodywork](https://github.com/aporia-ai/docs2/blob/main/images/bodywork-aporia.png#center)

---

# Source: https://docs.aporia.com/integrations/cisco.md

# Cisco

You can integrate Aporia with Cisco's Full-Stack Observability Platform to receive alerts and notifications directly in the platform, and view your models' health status in a centralized place.

### Setting up the FSOP Integration

1. Create a service principal in Cisco's platform
   1. Log in to Cisco's FSO platform.
   2. Click on **Access Management** and go to **Service Principals**
   3. Click on **Add**
   4. Define the new service principal. Make sure to pick **Basic** for **Authentication Type** and add the **Agent** default role access (under **Edit Role Access**). Then click on **Create**.
   5. Save the output service principal details - they will be needed for the Aporia integration.
2. Log into Aporia's console. On the navbar on the left, click on **Integrations**, switch to the **Applications** tab, and choose **Cisco**.
3. Enter your **Tenant Details** and **Service Principal Details**, as created in the previous step. The Tenant URL should include the scheme with no added URIs (`https://<tenant-name>.observe.appdynamics.com`).
4. Click **Save**. On success, the Save button will become disabled, and you'll be able to test the integration.

**Congratulations: You've now successfully integrated Aporia with Cisco's FSO Platform!**

After integrating Cisco FSOP, any monitors will be automatically configured to send alerts to the platform.
In addition, the workspace state will now be synced to the FSO platform periodically, including models and alerts.

Happy Monitoring!

---

# Source: https://docs.aporia.com/api-reference/code-based-metrics.md

# Code-Based Metrics

Code-based metrics let users define Pyspark-based metrics that allow for computation on raw data and element-wise operations, and that support third-party libraries. In the following guide we will explain how to use code-based metrics in Aporia to gain more flexibility in how a metric is calculated.
## Building the metric code

A code-based metric in Aporia gets a Pyspark data frame as an input and should return a numeric value/NaN as an output. Similar to custom metrics, code-based metrics are defined for a specific model and can be used with all versions/datasets/segments of that model.

Let's take a look at the following example:

```python
import numpy as np


def calc_metric(df):
    """
    My function simply returns the average age, but I can do whatever
    calculation I wish with the data frame
    """
    return np.average([row.age for row in df.select("age").collect()])
```

Supported libraries can be found below.

Code-based metrics are calculated at the same frequency as all other calculation jobs, as specified by your model's aggregation period. The code-based metric will be calculated on the following data frames:

1. all data over your model's retention period (you can filter this data to a specific time period)
2. all segments (separately) over your model's retention period (you can filter this data to a specific time period)

{% hint style="info" %}
Performance-wise, it is best practice to perform the calculation on top of the Pyspark data frame rather than collecting it first using `df.collect()`.
{% endhint %}

## Registering your metric

Once you have your metric ready, you can register it to the relevant Aporia model. Below you will find example code to help you get started:

```python
import requests
from http import HTTPStatus

ACCOUNT = <>
WORKSPACE = <>
MODEL_ID = <>

BASE_URL = f"https://platform.aporia.com/api/v1/{ACCOUNT}/{WORKSPACE}"
BASE_METRICS_URL = f"{BASE_URL}/metrics"

API_KEY = <>
AUTH_HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# First we read the code we prepared for our metric
with open('my_metric.py') as f:
    METRIC_CODE = f.read()

# Then we register it to the relevant Aporia model
metric_creation_body = {
    "model_id": MODEL_ID,
    "name": "my cool metric",
    "code": METRIC_CODE
}

CREATE_METRIC_EP = f"{BASE_METRICS_URL}/code-based-metrics"
response = requests.post(
    url=CREATE_METRIC_EP,
    json=metric_creation_body,
    headers=AUTH_HEADERS
)

# We'll use the metric ID later in order to test it
if response.status_code == HTTPStatus.OK:
    metric_id = response.json().get('id')
    print(f"Successfully created metric, id: {metric_id}")
```

## Testing your metric

Once you have your metric registered, it is time to test it. Testing a code-based metric can be performed on a dataset of your choice.
Below you will find example code for testing your metric on the latest version's serving dataset:

```python
MODEL_VERSIONS_EP = f'{BASE_URL}/model-versions'

# Select which version I want to use for the test
model_version_params = {"model_id": MODEL_ID}
response = requests.get(
    MODEL_VERSIONS_EP,
    params=model_version_params,
    headers=AUTH_HEADERS
)
if response.status_code != HTTPStatus.OK:
    raise Exception(f"Failed getting model versions, error: {response.status_code}")

# We will use the last version returned, but you can choose a different one
versions = response.json()
dataset_id = versions[-1].get('serving_dataset').get('id')

# Test the metric to make sure it works
validate_metric_ep = f"{BASE_METRICS_URL}/code-based-metrics/validate"
body = {"metric_id": metric_id, "dataset_id": dataset_id}
response = requests.post(
    url=validate_metric_ep,
    json=body,
    headers=AUTH_HEADERS
)

while HTTPStatus.OK == response.status_code and "pending" == response.json().get('status'):
    print(f"{response.json().get('progress')}% of metric validation task is completed")
    response = requests.post(
        url=validate_metric_ep,
        json=body,
        headers=AUTH_HEADERS
    )

print(response.status_code)
print(response.json())
```

## Supported 3rd party libraries

* pyspark
* pyspark.sql
* pyspark.sql.functions
* snowflake
* snowflake.snowpark
* snowflake.snowpark.functions
* numpy
* numpy.core.\_methods
* pandas
* math
* scipy
* scipy.stats
* statsmodels
* statsmodels.stats.proportion

You can further explore all available code-based metrics features via REST API in our docs, [here](https://platform.aporia.com/api/v1/docs#tag/Metrics-\(Experimental\)/operation/get_many_code_based_metrics_api_v1__account_name___workspace_name__metrics_code_based_metrics_get).

---

# Source: https://docs.aporia.com/v1/api-reference/custom-metric-definition-language.md

# Custom Metric Definition Language

In Aporia, custom metrics are defined using a syntax that is similar to Python's. There are three building blocks which can be used in order to create a custom metric expression:

* **Constants** - a numeric value (e.g. `2`, `0.5`, ..)
* **Functions** - out of the builtin function collection you can find below (e.g. `sum`, `count`, ...). All these functions return a numeric value.
* **Binary operations** - `+`, `-`, `*`, `/`, `**`. Operands can be either constants or function calls.

### Builtin Functions

Before we dive into each of the supported functions, there are two general concepts you should be familiar with regarding all functions - field expressions and data segment filters.

#### Field Expressions

A field expression can be described in the following format:

```
<field category>.<field name>[<segment filter>]
```

The field category is one of the following: `features` / `raw_inputs` / `predictions` / `actuals`. Note that you can only use categories which you defined in your schema while [creating your model version](https://docs.aporia.com/v1/introduction/quickstart). In addition, don't forget that the `predictions` and `actuals` categories have the same field names.

The segment filter is optional; for further information about filters, read the section below.

#### Data Segment Filters

Data segment filters are boolean expressions, designed to restrict the field on which we perform the function to a specific data segment. Each boolean condition in a segment filter is a comparison between a field and a constant value.
For example:

```
[features.Driving_License == True]  // will filter out records in which Driving_License != True
[raw_inputs.Age <= 35]              // will only include records in which Age <= 35
```

Conditions can be combined using `and` / `or`, and all fields can be checked for missing values using `is None` / `is not None`. The following table describes the supported combinations:

| Type / Operation | == | != | < | > | >= | <= |
| ---------------- | :---------------: | :---------------: | :---------------: | :---------------: | :---------------: | :---------------: |
| Boolean | True/False | True/False | ✖️ | ✖️ | ✖️ | ✖️ |
| Categorical | numeric constants | numeric constants | ✖️ | ✖️ | ✖️ | ✖️ |
| String | numeric constants | numeric constants | ✖️ | ✖️ | ✖️ | ✖️ |
| Numeric | numeric constants | numeric constants | numeric constants | numeric constants | numeric constants | numeric constants |

The table cells indicate the type of constant we can compare to.

**Examples**

```
// Average annual premium of those with a driving license
sum(features.Annual_Premium[features.Driving_License == True]) / prediction_count()

// Three times the number of predictions of those who are under 35 years old and live in CA
prediction_count(raw_inputs.Age <= 35 and raw_inputs.Region_Code == 28) * 3

prediction_count(features.Age > 27) / (sum(features.Annual_Premium) + sum(features.Vintage))
```

#### Supported functions
accuracy

**Parameters**

* **prediction**: prediction field
* **label**: label field
* **threshold**: numeric. Probability threshold according to which we decide if a class is positive
* **filter**: the filter we want to apply on the records before calculating the metric
actuals_count **Parameters** No parameters needed, cannot apply filters on this metric.
actuals_ratio **Parameters** No parameters needed, cannot apply filters on this metric.
auc_roc **Parameters** * **prediction**: prediction probability field * **label**: label field * **filter**: the filter we want to apply on the records before calculating the metric
count **Parameters** No parameters needed, cannot apply filters on this metric.
f1_score

**Parameters**

* **prediction**: prediction probability field
* **label**: label field
* **threshold**: numeric. Probability threshold according to which we decide if a class is positive
* **average**: the average strategy (micro / macro / weighted)
* **top\_k**: consider only top-k items.
* **filter**: the filter we want to apply on the records before calculating the metric

fn_count

**Parameters**

* **prediction**: prediction probability field
* **label**: label field
* **threshold**: numeric. Probability threshold according to which we decide if a class is positive
* **filter**: the filter we want to apply on the records before calculating the metric

fp_count

**Parameters**

* **prediction**: prediction probability field
* **label**: label field
* **threshold**: numeric. Probability threshold according to which we decide if a class is positive
* **filter**: the filter we want to apply on the records before calculating the metric

fp_rate

**Parameters**

* **prediction**: prediction probability field
* **label**: label field
* **threshold**: numeric. Probability threshold according to which we decide if a class is positive
* **filter**: the filter we want to apply on the records before calculating the metric
logloss **Parameters** * **prediction**: prediction field * **label**: label field * **filter**: the filter we want to apply on the records before calculating the metric
mae **Parameters** * **prediction**: prediction field * **label**: label field * **filter**: the filter we want to apply on the records before calculating the metric
mape **Parameters** * **prediction**: prediction field * **label**: label field * **filter**: the filter we want to apply on the records before calculating the metric
max **Parameters** * **field**: numeric or dict. The field for which the metric will be computed. Can be of any category (`feature` / `raw_input` / `prediction` / `actual`) * **filter**: the filter we want to apply on the records before calculating the metric * **keys**: keys to filter in when field type is dict.
mean **Parameters** * **field**: numeric or dict. The field for which the metric will be computed. Can be of any category (`feature` / `raw_input` / `prediction` / `actual`) * **filter**: the filter we want to apply on the records before calculating the metric * **keys**: keys to filter in when field type is dict.
median **Parameters** * **field**: numeric or dict. The field for which the metric will be computed. Can be of any category (`feature` / `raw_input` / `prediction` / `actual`) * **filter**: the filter we want to apply on the records before calculating the metric * **keys**: keys to filter in when field type is dict.
min **Parameters** * **field**: numeric or dict. The field for which the metric will be computed. Can be of any category (`feature` / `raw_input` / `prediction` / `actual`) * **filter**: the filter we want to apply on the records before calculating the metric * **keys**: keys to filter in when field type is dict.
miss_rate

**Parameters**

* **prediction**: prediction probability field
* **label**: label field
* **threshold**: numeric. Probability threshold according to which we decide if a class is positive
* **filter**: the filter we want to apply on the records before calculating the metric
missing_count **Parameters** * **field**: the field for which the metric will be computed. Can be of any category (`feature` / `raw_input` / `prediction` / `actual`) * **filter**: the filter we want to apply on the records before calculating the metric
missing_ratio **Parameters** * **field**: the field for which the metric will be computed. Can be of any category (`feature` / `raw_input` / `prediction` / `actual`) * **filter**: the filter we want to apply on the records before calculating the metric
mse **Parameters** * **prediction**: prediction field * **label**: label field * **filter**: the filter we want to apply on the records before calculating the metric
ndcg **Parameters** * **prediction**: prediction field * **label**: label field * **rank**: the rank position * **filter**: the filter we want to apply on the records before calculating the metric
not_missing_count **Parameters** * **field**: the field for which the metric will be computed. Can be of any category (`feature` / `raw_input` / `prediction` / `actual`) * **filter**: the filter we want to apply on the records before calculating the metric
precision_score

**Parameters**

* **prediction**: prediction probability field
* **label**: label field
* **threshold**: numeric. Probability threshold according to which we decide if a class is positive
* **average**: the average strategy (micro / macro / weighted)
* **top\_k**: consider only top-k items.
* **filter**: the filter we want to apply on the records before calculating the metric

recall_score

**Parameters**

* **prediction**: prediction probability field
* **label**: label field
* **threshold**: numeric. Probability threshold according to which we decide if a class is positive
* **average**: the average strategy (micro / macro / weighted)
* **top\_k**: consider only top-k items.
* **filter**: the filter we want to apply on the records before calculating the metric
rmse **Parameters** * **prediction**: prediction field * **label**: label field * **filter**: the filter we want to apply on the records before calculating the metric
specificity

**Parameters**

* **prediction**: prediction probability field
* **label**: label field
* **threshold**: numeric. Probability threshold according to which we decide if a class is positive
* **filter**: the filter we want to apply on the records before calculating the metric
std **Parameters** * **field**: numeric or dict. The field for which the metric will be computed. Can be of any category (`feature` / `raw_input` / `prediction` / `actual`) * **filter**: the filter we want to apply on the records before calculating the metric * **keys**: keys to filter in when field type is dict
sum **Parameters** * **field**: numeric or dict. The field for which the metric will be computed. Can be of any category (`feature` / `raw_input` / `prediction` / `actual`) * **filter**: the filter we want to apply on the records before calculating the metric * **keys**: keys to filter in when field type is dict
tn_count

**Parameters**

* **prediction**: prediction probability field
* **label**: label field
* **threshold**: numeric. Probability threshold according to which we decide if a class is positive
* **filter**: the filter we want to apply on the records before calculating the metric

tp_count

**Parameters**

* **prediction**: prediction probability field
* **label**: label field
* **threshold**: numeric. Probability threshold according to which we decide if a class is positive
* **filter**: the filter we want to apply on the records before calculating the metric
unique_count **Parameters** * **field**: numeric or dict. The field for which the metric will be computed. Can be of any category (`feature` / `raw_input` / `prediction` / `actual`) * **filter**: the filter we want to apply on the records before calculating the metric * **keys**: keys to filter in when field type is dict.
value_count **Parameters** * **field**: the field for which the metric will be computed. Can be of any category (`feature` / `raw_input` / `prediction` / `actual`) * **value**: the value we want to count * **filter**: the filter we want to apply on the records before calculating the metric * **keys**: keys to filter in when field type is dict.
value_percentage **Parameters** * **field**: the field for which the metric will be computed. Can be of any category (`feature` / `raw_input` / `prediction` / `actual`) * **value**: the value we want to count * **filter**: the filter we want to apply on the records before calculating the metric * **keys**: keys to filter in when field type is dict.
variance **Parameters** * **field**: numeric or dict. The field for which the metric will be computed. Can be of any category (`feature` / `raw_input` / `prediction` / `actual`) * **filter**: the filter we want to apply on the records before calculating the metric * **keys**: keys to filter in when field type is dict.
wape **Parameters** * **prediction**: prediction field * **label**: label field * **filter**: the filter we want to apply on the records before calculating the metric
---

# Source: https://docs.aporia.com/api-reference/custom-metric-syntax.md

# Custom Metric Syntax

In Aporia, custom metrics are defined using syntax that is similar to Python's. There are three building blocks which can be used to create a custom metric expression:

* **Constants** - a numeric value (e.g. `2`, `0.5`, ..)
* **Functions** - out of the builtin function collection you can find below (e.g. `sum`, `count`, ...). All of these functions return a numeric value.
* **Binary operations** - `+`, `-`, `*`, `/`, `**`. Operands can be either constants or function calls.

## Builtin Functions

Before we dive into each of the supported functions, let's take a look at a few examples of custom metric definitions.

```
// Average annual premium of those with a driving license
sum(column="annual_premium") / count()

// Mean predicted probability
mean(column="proba")

// Model revenue
5 * tp_count(column="will_buy_insurance") - 2 * fp_count(column="will_buy_insurance")

// nDCG@4 per step
ndcg_at_k(column="p_views", k=4)
ndcg_at_k(column="p_add_to_cart", k=4)
ndcg_at_k(column="p_purchases", k=4)

// accuracy using custom threshold
accuracy(column="proba", type="numeric", threshold=0.2)

// R-squared - Expanding brackets to use available aggregations
rss = squared_error_sum(column="prediction")
tss = squared_sum(column="actual") - 2*mean(column="actual")*sum(column="actual") + column_count(column="actual")*(mean(column="actual")**2)
1 - rss/tss
```

### Filters within functions

Within Aporia we can always set a [segment](https://docs.aporia.com/core-concepts/tracking-data-segments) on our metrics as a whole, but sometimes this is not enough. Often we need to pass only a segment of our data to a specific function as part of our metric. Aporia supports these cases through an additional function argument called **filter**. With the **filter** argument you can apply any filtering to the data passed in the **column** argument, using the [custom segment syntax](https://docs.aporia.com/api-reference/custom-segment-syntax).

**For example:**

```
// Ratio of the annual premium of people above 70 out of the total premium
sum(column="annual_premium", filter="age > 70") / sum(column="annual_premium")
```

So that you can still apply any of your segments to these metrics as a whole, setting a filter within a metric will create, behind the scenes, the intersection of the filter's segment with each of your existing segments. These intersections are counted as regular segments.

### Supported functions

#### Numerical Measures
absolute_sum

Returns the sum of absolute values for the given column.

**Parameters**

* **column**: the name of the field on which we want to apply the function. Can be a numeric field of any group (`feature` / `raw_input` / `prediction` / `actual`)
count Returns the total number of rows. **Parameters** No parameters needed.
column_count Returns the number of rows with non-null values for the given column. **Parameters** * **column**: the name of the field on which we want to apply the function. Can be any field.
max

Returns the maximum value for the given column.

**Parameters**

* **column**: the name of the field on which we want to apply the function. Can be a numeric field of any group (`feature` / `raw_input` / `prediction` / `actual`)

max_length

Returns the maximum length for the given column (items for arrays/embeddings, characters for text).

**Parameters**

* **column**: the name of the field on which we want to apply the function. Can be a **text/array/numeric array/embedding** field of any group (`feature` / `raw_input` / `prediction` / `actual`)

median

Returns the median value for the given column.

**Parameters**

* **column**: the name of the field on which we want to apply the function. Can be a numeric field of any group (`feature` / `raw_input` / `prediction` / `actual`)

mean

Returns the average value for the given column.

**Parameters**

* **column**: the name of the field on which we want to apply the function. Can be a numeric field of any group (`feature` / `raw_input` / `prediction` / `actual`)

mean_length

Returns the average length for the given column (items for arrays/embeddings, characters for text).

**Parameters**

* **column**: the name of the field on which we want to apply the function. Can be a **text/array/numeric array/embedding** field of any group (`feature` / `raw_input` / `prediction` / `actual`)

min

Returns the minimum value for the given column.

**Parameters**

* **column**: the name of the field on which we want to apply the function. Can be a numeric field of any group (`feature` / `raw_input` / `prediction` / `actual`)

min_length

Returns the minimum length for the given column (items for arrays/embeddings, characters for text).

**Parameters**

* **column**: the name of the field on which we want to apply the function. Can be a **text/array/numeric array/embedding** field of any group (`feature` / `raw_input` / `prediction` / `actual`)
missing_count Returns the number of rows with null values for the given column. **Parameters** * **column**: the name of the field on which we want to apply the function. Can be any field.
missing_ratio Returns the percentage of rows with null values for the given column. **Parameters** * **column**: the name of the field on which we want to apply the function. Can be any field.
sum

Returns the sum for the given column.

**Parameters**

* **column**: the name of the field on which we want to apply the function. Can be a numeric field of any group (`feature` / `raw_input` / `prediction` / `actual`)

squared_sum

Returns the sum of squared values for the given column.

**Parameters**

* **column**: the name of the field on which we want to apply the function. Can be a numeric field of any group (`feature` / `raw_input` / `prediction` / `actual`)

squared_deviation_sum

Returns the sum of squared deviations from the mean for the given column. For a column **x** with mean **m** over all samples, this equals the sum of (x-m)².

**Parameters**

* **column**: the name of the field on which we want to apply the function. Can be a numeric field of any group (`feature` / `raw_input` / `prediction` / `actual`)

value_count

Returns the number of entries where the given column is equal to the given value. For example, value\_count(column="bool", value=True) will return the count of entries where bool=True.

**Parameters**

* **column**: the name of the field on which we want to apply the function. Can be any **boolean/categorical** field.
* **value**: The value of the field to look for.

variance

Returns the variance for the given column.

**Parameters**

* **column**: the name of the field on which we want to apply the function. Can be a numeric field of any group (`feature` / `raw_input` / `prediction` / `actual`)
#### Regression Metrics
absolute_error_sum Returns the sum of absolute errors for the given prediction. For a prediction P and actual A, returns the sum of |P-A|. **Parameters** * **column**: the name of the **numeric prediction** field on which we want to apply the function. Must have a **numeric actual** mapped to it.
mae Calculates MAE for the given prediction. **Parameters** * **column**: the name of the **numeric prediction** field on which we want to apply the function. Must have a **numeric actual** mapped to it.
mse Calculates MSE for the given prediction. **Parameters** * **column**: the name of the **numeric prediction** field on which we want to apply the function. Must have a **numeric actual** mapped to it.
rmse Calculates RMSE for the given prediction. **Parameters** * **column**: the name of the **numeric prediction** field on which we want to apply the function. Must have a **numeric actual** mapped to it.
squared_error_sum Returns the sum of squared errors for the given prediction. For a prediction P and actual A, returns the sum of (P-A)². **Parameters** * **column**: the name of the **numeric prediction** field on which we want to apply the function. Must have a **numeric actual** mapped to it.
#### Binary Classification Metrics
accuracy Calculates accuracy for the given prediction. **Parameters** * **column**: the name of the **numeric/boolean** **prediction** field on which we want to apply the function. Must have a **boolean actual** mapped to it. * **threshold**: probability threshold according to which we decide if a class is positive. Required for **numeric predictions**. * **method**: will define the average strategy to use. Can be: "macro", "micro" or "weighted". Required for **categorical predictions**.
auc_roc Calculates AUC ROC for the given prediction. **Parameters** * **column**: the name of the **numeric** **prediction** field on which we want to apply the function. Must have a **boolean actual** mapped to it.
fn_count Returns the number of False-Negative results. **Parameters** * **column**: the name of the **numeric/boolean** **prediction** field on which we want to apply the function. Must have a **boolean actual** mapped to it. * **threshold**: probability threshold according to which we decide if a class is positive. Required for **numeric predictions**.
fp_count Returns the number of False-Positive results. **Parameters** * **column**: the name of the **numeric/boolean** **prediction** field on which we want to apply the function. Must have a **boolean actual** mapped to it. * **threshold**: probability threshold according to which we decide if a class is positive. Required for **numeric predictions**.
f1 Calculates f1-score for the given prediction. **Parameters** * **column**: the name of the **numeric/boolean** **prediction** field on which we want to apply the function. Must have a **boolean actual** mapped to it. * **threshold**: probability threshold according to which we decide if a class is positive. Required for **numeric predictions.** * **method**: will define the average strategy to use. Can be: "macro", "micro" or "weighted". Required for **categorical predictions**.
precision Calculates precision for the given prediction. **Parameters** * **column**: the name of the **numeric/boolean** **prediction** field on which we want to apply the function. Must have a **boolean actual** mapped to it. * **threshold**: probability threshold according to which we decide if a class is positive. Required for **numeric predictions.** * **method**: will define the average strategy to use. Can be: "macro", "micro" or "weighted". Required for **categorical predictions**.
recall Calculates recall for the given prediction. **Parameters** * **column**: the name of the **numeric/boolean** **prediction** field on which we want to apply the function. Must have a **boolean actual** mapped to it. * **threshold**: probability threshold according to which we decide if a class is positive. Required for **numeric predictions.** * **method**: will define the average strategy to use. Can be: "macro", "micro" or "weighted". Required for **categorical predictions**.
tn_count Returns the number of True-Negative results. **Parameters** * **column**: the name of the **numeric/boolean** **prediction** field on which we want to apply the function. Must have a **boolean actual** mapped to it. * **threshold**: probability threshold according to which we decide if a class is positive. Required for **numeric predictions**.
tp_count Returns the number of True-Positive results. **Parameters** * **column**: the name of the **numeric/boolean** **prediction** field on which we want to apply the function. Must have a **boolean actual** mapped to it. * **threshold**: probability threshold according to which we decide if a class is positive. Required for **numeric predictions**.
#### Multiclass Classification Metrics
accuracy_per_class Calculates accuracy for the given prediction per the specified category class. **Parameters** * **column**: the name of the **categorical** **prediction** field on which we want to apply the function. Must have a **categorical** **actual** mapped to it. * **class\_name**: the class on which we want to calculate the function.
fn_count_per_class Returns the number of False-Negative results per the specified category class. **Parameters** * **column**: the name of the **categorical** **prediction** field on which we want to apply the function. Must have a **categorical** **actual** mapped to it. * **class\_name**: the class on which we want to calculate the function.
fp_count_per_class Returns the number of False-Positive results per the specified category class. **Parameters** * **column**: the name of the **categorical** **prediction** field on which we want to apply the function. Must have a **categorical** **actual** mapped to it. * **class\_name**: the class on which we want to calculate the function.
f1_per_class Calculates f1-score for the given prediction per the specified category class. **Parameters** * **column**: the name of the **categorical** **prediction** field on which we want to apply the function. Must have a **categorical** **actual** mapped to it. * **class\_name**: the class on which we want to calculate the function.
precision_per_class Calculates precision for the given prediction per the specified category class. **Parameters** * **column**: the name of the **categorical** **prediction** field on which we want to apply the function. Must have a **categorical** **actual** mapped to it. * **class\_name**: the class on which we want to calculate the function.
recall_per_class Calculates recall for the given prediction per the specified category class. **Parameters** * **column**: the name of the **categorical** **prediction** field on which we want to apply the function. Must have a **categorical** **actual** mapped to it. * **class\_name**: the class on which we want to calculate the function.
tn_count_per_class Returns the number of True-Negative results per the specified category class. **Parameters** * **column**: the name of the **categorical** **prediction** field on which we want to apply the function. Must have a **categorical** **actual** mapped to it. * **class\_name**: the class on which we want to calculate the function.
tp_count_per_class Returns the number of True-Positive results per the specified category class. **Parameters** * **column**: the name of the **categorical** **prediction** field on which we want to apply the function. Must have a **categorical** **actual** mapped to it. * **class\_name**: the class on which we want to calculate the function.
#### Ranking Metrics
accuracy_at_k

Calculates Accuracy for the given prediction on the top K items.

**Parameters**

* **column**: the name of the **array prediction** field on which we want to apply the function. Must have an **array actual** mapped to it. If using [candidate-level ranking](https://docs.aporia.com/model-types/ranking#integrating-candidate-level-data), can be a **boolean prediction** with a mapped **boolean actual**.
* **k**: an integer between 1 and 12. Only the top-k items will be considered.

map_at_k

Calculates MAP (Mean-Average-Precision) for the given prediction on the top K items.

**Parameters**

* **column**: the name of the **array prediction** field on which we want to apply the function. Must have an **array actual** mapped to it. If using [candidate-level ranking](https://docs.aporia.com/model-types/ranking#integrating-candidate-level-data), can be a **boolean prediction** with a mapped **boolean actual**.
* **k**: an integer between 1 and 12. Only the top-k items will be considered.

mrr_at_k

Calculates MRR (Mean-Reciprocal-Rank) for the given prediction on the top K items.

**Parameters**

* **column**: the name of the **array prediction** field on which we want to apply the function. Must have an **array actual** mapped to it. If using [candidate-level ranking](https://docs.aporia.com/model-types/ranking#integrating-candidate-level-data), can be a **boolean prediction** with a mapped **boolean actual**.
* **k**: an integer between 1 and 12. Only the top-k items will be considered.

ndcg_at_k

Calculates NDCG for the given prediction on the top K items.

**Parameters**

* **column**: the name of the **array prediction** field on which we want to apply the function. Must have an **array actual** mapped to it. If using [candidate-level ranking](https://docs.aporia.com/model-types/ranking#integrating-candidate-level-data), can be a **boolean prediction** with a mapped **boolean actual**.
* **k**: an integer between 1 and 12. Only the top-k items will be considered.

precision_at_k

Calculates Precision for the given prediction on the top K items.

**Parameters**

* **column**: the name of the **array prediction** field on which we want to apply the function. Must have an **array actual** mapped to it. If using [candidate-level ranking](https://docs.aporia.com/model-types/ranking#integrating-candidate-level-data), can be a **boolean prediction** with a mapped **boolean actual**.
* **k**: an integer between 1 and 12. Only the top-k items will be considered.

recall_at_k

Calculates Recall for the given prediction on the top K items.

**Parameters**

* **column**: the name of the **array prediction** field on which we want to apply the function. Must have an **array actual** mapped to it. If using [candidate-level ranking](https://docs.aporia.com/model-types/ranking#integrating-candidate-level-data), can be a **boolean prediction** with a mapped **boolean actual**.
* **k**: an integer between 1 and 12. Only the top-k items will be considered.
---

# Source: https://docs.aporia.com/ml-monitoring-as-code/custom-metrics.md

# Source: https://docs.aporia.com/monitors-and-alerts/custom-metrics.md

# Source: https://docs.aporia.com/v1/monitors/custom-metrics.md

# Custom Metric

In case the monitoring metrics provided by Aporia are insufficient for your use-case, you can define your own custom metric using our custom metric definition language.

### Comparison methods

For this monitor, the following comparison methods are available:

* [Change in percentage](https://docs.aporia.com/v1/monitor-template#comparison-methods)
* [Absolute value](https://docs.aporia.com/v1/monitor-template#comparison-methods)
* [Anomaly detection](https://docs.aporia.com/v1/monitor-template#comparison-methods)

### Customizing your monitor

Configuration may slightly vary depending on the comparison method you choose.

#### STEP 1: choose the metrics you would like to monitor

You can either choose a custom metric you have previously defined or create a new one. If this is your first time creating a custom metric in Aporia, you can read about our custom metric definition language [here](https://docs.aporia.com/v1/api-reference/custom-metric-definition-language).

#### STEP 2: choose inspection period and baseline

For the metrics you chose in the previous step, the monitor will raise an alert if the comparison between the inspection period and the baseline leads to a conclusion outside your threshold boundaries.

#### STEP 3: calibrate thresholds

This step is important to make sure you have the right number of alerts for your needs. For the anomaly detection method, use the monitor preview to help you decide on the appropriate sensitivity level.

---

# Source: https://docs.aporia.com/api-reference/custom-segment-syntax.md

# Custom Segment Syntax

In Aporia, [custom segments](https://docs.aporia.com/core-concepts/tracking-data-segments) are defined using SQL-based syntax. The definition is written as a condition that will be passed to a WHERE clause. The condition should be based only on fields within your model version schema.

The following SQL operators are supported:

```sql
>, <, <=, >=, !=, &, |, =, is, not, or, and
```

{% hint style="warning" %}
Categorical fields are ingested as strings.\
If your segment includes a comparison of a categorical field and a constant value, remember to quote the constant value.
{% endhint %}
*Custom Segments Creation*
## Examples

1. Assume we have a numeric field "age" and a boolean field "is\_customer" in our schema. We can create a segment based on these fields as follows:

```sql
--Segment of customers above age 23
age > 23 and is_customer = True
```

2. Assume we have a categorical field "partner\_type" and a categorical field "deal\_stage" in our schema. We can create a segment based on these fields as follows:

```sql
--Segment of all 'Gold' partners with missing deal stage
partner_type = 'Gold' and deal_stage is null
```

3. Assume we have a categorical field "region\_code" in our schema. We can create a segment based on this field as follows:

```sql
--Segment of all data with region_code different from '123'
region_code != '123'
```

---

# Source: https://docs.aporia.com/ml-monitoring-as-code/dashboards.md

# Dashboards

This guide will show you how to automatically add dashboards to your models using the Python SDK.

{% hint style="info" %}
This functionality is currently only supported through the API SDK, and may be added to the as-code SDK in the future.
{% endhint %}

## Constructing Dashboards

To define a new dashboard, we start by constructing the widgets list. There are various widget types to choose from, each with its own configuration and visualization. Let's start by exploring the main concepts.
### Widget Grid

Each dashboard has a widget grid, in which the widgets reside. Widgets can't overlap; if an overlap does occur, the dashboard will attempt to resolve it, resulting in widgets sliding downwards.

The grid has 12 columns and an unlimited number of rows. The Text widget can be of any size, while other widgets have a minimum size of 4 columns and 5 rows (Width=4, Height=5).

The position of a widget is defined as a tuple of (x, y), where x indicates the X coordinate of the widget, ranging from 0 to 11 (0 - leftmost, 11 - rightmost). Due to size constraints, the X coordinate of non-text widgets ranges from 0 to 8. Y indicates the Y coordinate of the widget, ranging from 0 onwards, with 0 being the first line. There is no constraint on the Y coordinate, but a widget will slide upwards if there is empty space above it with no widgets in between.

The size of a widget is defined in a matching tuple (width, height).

### Metric Configuration

As most widgets are designed to plot metrics, the metric configuration is consolidated into two classes:

#### MetricConfiguration (aporia.sdk.widgets.base.MetricConfiguration)

This is the main configuration class for metrics in the Aporia dashboards. It defines the inspected metric and any relevant parameters for it. Each metric configuration is defined first by the metric (one of the `aporia.sdk.metrics.MetricType` enum values), and then any relevant parameter:

* `field` - Most metrics are calculated over specific fields. This parameter should be of type `aporia.sdk.fields.Field` and indicates the field to use for the calculation. You can use the `model.get_fields()` or `model.get_field_by_name()` functions to retrieve these objects.
* `custom_metric` - When using the `MetricType.CUSTOM_METRIC` metric type, the custom metric object (`aporia.sdk.custom_metrics.CustomMetric`) should be passed in to identify the chosen metric.
* `average` - For some multiclass metrics, the average method should be specified. This should be one of the `aporia.sdk.metrics.AverageMethod` values.
* `threshold` - For confusion-matrix metrics with numeric predictions, the threshold indicates the cutoff threshold to decide whether a prediction is positive or negative. This should be a floating point number between 0 and 1.
* `k` - For ranking metrics, the k parameter, indicating the top K choices being inspected, should be specified. This should be an integer between 1 and 12.
* `class_name` - For some multiclass per-class metrics, the inspected class should also be specified. This should be a string, matching the class as it appears in the database and Aporia.
* `baseline` - For drift metrics, the baseline for the drift should be specified. This should be described by the `aporia.sdk.widgets.base.BaselineConfiguration` class, which contains the following fields:
  * `type` - This is one of the values of `aporia.sdk.widgets.base.BaselineType`:
    * `BaselineType.TRAINING` - Compare the metric to the matching training dataset. No extra parameters are needed.
    * `BaselineType.RELATIVE_TIME_PERIOD` - Compare the metric to the previous time window. The time window should be defined by the `unit` (`aporia.sdk.widgets.base.UnitType`) and `duration` parameters, where duration is a positive integer.

#### MetricOverrideConfiguration (aporia.sdk.widgets.base.MetricOverrideConfiguration)

In some widget configurations, you can override certain parameters of a metric in an alternative view. This lets you edit all of the parameters mentioned above, except for the metric itself and the custom metric identifier, if chosen.
The overrides will be applied, and the other parameters will be maintained.

#### Timeframes

The dashboard is intended to show a picture of a given timeframe, although some widgets may need a timeframe of their own. The dashboard itself has a global timeframe, which is inherited by all widgets unless specified otherwise.

All timeframes are a choice of 2 options:

* Relative timeframe: one of a predefined set of relative timeframes supported by Aporia. It covers the latest X hours/days/weeks/months relative to the current time. The possible values are: "1h", "4h", "1d", "2d", "1w", "2w", "1M", "2M", "3M".
* Fixed timeframe: a specific timeframe, set by specific start and end times. This must be an instance of `aporia.sdk.widgets.base.FixedTimeframe`, which consists of `to` and `from` fields of type `datetime.datetime`. Because `from` is a keyword in Python, initialize the class in the following way: `FixedTimeframe(**{"to": ..., "from": ...})`

### Compare View

Some widgets support a Compare view, which allows you to add an additional plot or data series to the same widget. In most cases, this is limited to the same metric/version/segment as the Data view. This lets you edit most non-constrained parameters. Read about each individual widget to see which overrides are possible.

### Widget Types

All widgets share the following parameters:

* `position` - The position of the widget, as described above
* `size` - The size of the widget, as described above
* `title` (Not applicable for text widgets) - The title of the widget

To create a new widget, call the `widget = WidgetClass.create(...)` function, exported by each widget type (e.g. `MetricWidget.create(...)`). Some widgets support a compare functionality, available through the `widget = widget.compare(...)` function.
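To make the timeframe and widget-creation concepts concrete, here is a minimal sketch of a metric widget pinned to a fixed timeframe. It relies only on the classes referenced above and in the usage examples later on this page; the dates, position, size, and title are placeholders:

```python
import datetime

from aporia.sdk.metrics import MetricType
from aporia.sdk.widgets import MetricWidget
from aporia.sdk.widgets.base import FixedTimeframe, MetricConfiguration

# A fixed timeframe covering a specific month (placeholder dates).
# Because "from" is a Python keyword, the class is initialized via keyword unpacking.
january_2024 = FixedTimeframe(**{
    "from": datetime.datetime(2024, 1, 1),
    "to": datetime.datetime(2024, 2, 1),
})

# A metric widget showing the total prediction count for that period,
# instead of inheriting the dashboard's global timeframe.
volume_widget = MetricWidget.create(
    position=(0, 1),
    size=(4, 5),
    title="Prediction volume - January 2024",
    metric=MetricConfiguration(metric=MetricType.COUNT),
    timeframe=january_2024,
)
```

In most dashboards you will simply rely on the global timeframe and only pin individual widgets this way when they need to show a different period.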
Text Widget

The **Text** widget displays a line of text, usually to split the dashboard into parts.

**Parameters:**

* **`text`** - The text to display in the widget. This can also include emojis.
* `font` (default - 24) - The font size of the widget. We recommend leaving this as the default, unless an extra-large text widget is desired.
Metric Widget The **metric** widget lets you plot a metric as a number, to highlight very important information quickly, such as the bottom-line revenue numbers. **Parameters:** * **`metric`** - A MetricConfiguration class describing the metric to plot * `timeframe`(Default - None) - An alternative timeframe for the metric. Pass `None` to inherit the dashboard global timeframe. * `phase` (Default - DatasetType.SERVING) - The version dataset to calculate metric for (DatasetType.SERVING or DatasetType.TRAINING) * `version`(Default - None) - The Aporia version identifier to plot metric for. This can be either a specific `aporia.sdk.versions.Version` object or `None` to indicate "All Versions" * `segment`(Default - None) - The Aporia segment identifier to plot metric for. This can be either a specific `aporia.sdk.segments.Segment`object or `None` to indicate "All Data" * `segment_value`(Default - None) - If a segment was chosen, this is one of the segment values to plot for. For categorical and custom segments, it should be the specific value. For numeric segments, it should be the lower bound number. **Compare Parameters:** * `metric_overrides` (Default - None) - metric overrides to apply on the original metric in the compare view. * `timeframe` (Default - None) - An alternative timeframe for the metric. Pass `None` to inherit the dashboard global timeframe. * `phase` (Default - DatasetType.SERVING) - The version dataset to calculate metric for (DatasetType.SERVING or DatasetType.TRAINING) * `version` (Default - None) - The Aporia version identifier to plot metric for. This can be either a specific `aporia.sdk.versions.Version` object or `None` to indicate "All Versions" * `segment` (Default - None) - The Aporia segment identifier to plot metric for. This can be either a specific `aporia.sdk.segments.Segment`object or `None` to indicate "All Data" * `segment_value` - If a segment was chosen, this is one of the segment values to plot for. For categorical and custom segments, it should be the specific value. For numeric segments, it should be the lower bound number.
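As a hypothetical sketch of the Compare view with overrides, the snippet below plots the average of a numeric prediction field and compares it against the average of its mapped actual by overriding only the metric's field. The field names are placeholders, `model` is assumed to be a model object retrieved through the API SDK (as in the example at the end of this guide), and the `field=` keyword on `MetricOverrideConfiguration` is an assumption based on the parameter list described earlier:

```python
from aporia.sdk.metrics import MetricType
from aporia.sdk.widgets import MetricWidget
from aporia.sdk.widgets.base import MetricConfiguration, MetricOverrideConfiguration

# Placeholder field names - replace with fields from your own schema.
prediction_field = model.get_field_by_name("predicted_price")
actual_field = model.get_field_by_name("actual_price")

avg_prediction_widget = MetricWidget.create(
    position=(4, 1),
    size=(4, 5),
    title="Average prediction vs. average actual",
    metric=MetricConfiguration(metric=MetricType.MEAN, field=prediction_field),
).compare(
    # Assumed keyword: override only the field, keeping the MEAN metric itself.
    metric_overrides=MetricOverrideConfiguration(field=actual_field),
)
```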
Metric by Segment Widget The **metric by segment** widget lets you plot a metric over multiple segment values as a bar plot. This is useful to highlight areas of interest across different segments. **Parameters:** * **`metric`** - A MetricConfiguration class describing the metric to plot * `timeframe`(Default - None) - An alternative timeframe for the metric. Pass `None` to inherit the dashboard global timeframe. * `phase` (Default - DatasetType.SERVING) - The version dataset to calculate metric for (DatasetType.SERVING or DatasetType.TRAINING) * `version`(Default - None) - The Aporia version identifier to plot metric for. This can be either a specific `aporia.sdk.versions.Version` object or `None` to indicate "All Versions" * **`segment`**- The Aporia segment identifier to plot metric for. This should be a specific `aporia.sdk.segments.Segment`object. This is inherited by the Compare view, if used. * **`segment_values`** - A list of the segment values to plot. For categorical and custom segments, it should be the specific values. For numeric segments, it should be the lower bound numbers of each value. **Compare Parameters:** * `metric` (Default - None) - Alternative MetricConfiguration object to plot for the compare view. Mutually exclusive with `metric_overrides`. * `metric_overrides` (Default - None) - metric overrides to apply on the original metric in the compare view. Mutually exclusive with `metric`. * `timeframe` (Default - None) - An alternative timeframe for the metric. Pass `None` to inherit the dashboard global timeframe. * `phase` (Default - DatasetType.SERVING) - The version dataset to calculate metric for (DatasetType.SERVING or DatasetType.TRAINING) * `version` (Default - None) - The Aporia version identifier to plot metric for. This can be either a specific `aporia.sdk.versions.Version` object or `None` to indicate "All Versions"
Metric Correlation Widget

The **metric correlation** widget lets you plot two metrics over multiple segments or versions to look for correlation between the two. You can choose to split the view between different versions or between different segments, but not both. The Compare view allows adding a different view selection.

**Parameters:**

* **`x_axis_metric`** - A MetricConfiguration class describing the metric to plot on the X axis
* **`y_axis_metric`** - A MetricConfiguration class describing the metric to plot on the Y axis
* `timeframe` (Default - None) - An alternative timeframe for the metric. Pass `None` to inherit the dashboard global timeframe.
* `phase` (Default - DatasetType.SERVING) - The version dataset to calculate the metric for (DatasetType.SERVING or DatasetType.TRAINING)
* `version` (Default - None) - The Aporia version identifier to plot the metric for. This can be a specific `aporia.sdk.versions.Version` object or `None`. Mutually exclusive with `versions`. If both are `None`, this is equivalent to a single version of "All Versions".
* `versions` (Default - None) - A list of Aporia version identifiers to plot the metric for. This can be a list of specific `aporia.sdk.versions.Version` objects, and may include `None` as a list member to indicate "All Versions". Mutually exclusive with `version`. If both are `None`, this is equivalent to a single version of "All Versions".
* `segment` (Default - None) - The Aporia segment identifier to plot the metric for. This can be either a specific `aporia.sdk.segments.Segment` object or `None` to indicate "All Data"
* `segment_value` - If a segment was chosen, this is one of the segment values to plot for. For categorical and custom segments, it should be the specific value. For numeric segments, it should be the lower bound number. This is only applicable if the `versions` parameter is used.
* `segment_values` - If a segment was chosen, this is a list of the segment values to plot. For categorical and custom segments, it should be the specific values. For numeric segments, it should be the lower bound numbers of each value. This is only applicable if the `version` parameter is used, or both `version` and `versions` are `None`. If `segment_value` and `segment_values` are both `None` and a single version is used, `segment_values` is automatically set to all values of `segment`.

**Compare Parameters:**

* `x_axis_metric_overrides` (Default - None) - metric overrides to apply on the original X axis metric in the compare view.
* `y_axis_metric_overrides` (Default - None) - metric overrides to apply on the original Y axis metric in the compare view.
* `timeframe` (Default - None) - An alternative timeframe for the metric. Pass `None` to inherit the dashboard global timeframe.
* `phase` (Default - DatasetType.SERVING) - The version dataset to calculate the metric for (DatasetType.SERVING or DatasetType.TRAINING)
* `version` (Default - None) - The Aporia version identifier to plot the metric for. This can be a specific `aporia.sdk.versions.Version` object or `None`. Mutually exclusive with `versions`. If both are `None`, this is equivalent to a single version of "All Versions".
* `versions` (Default - None) - A list of Aporia version identifiers to plot the metric for. This can be a list of specific `aporia.sdk.versions.Version` objects, and may include `None` as a list member to indicate "All Versions". Mutually exclusive with `version`. If both are `None`, this is equivalent to a single version of "All Versions".
* `segment` (Default - None) - The Aporia segment identifier to plot the metric for. This can be either a specific `aporia.sdk.segments.Segment` object or `None` to indicate "All Data"
* `segment_value` - If a segment was chosen, this is one of the segment values to plot for. For categorical and custom segments, it should be the specific value. For numeric segments, it should be the lower bound number. This is only applicable if the `versions` parameter is used.
* `segment_values` - If a segment was chosen, this is a list of the segment values to plot. For categorical and custom segments, it should be the specific values. For numeric segments, it should be the lower bound numbers of each value. This is only applicable if the `version` parameter is used, or both `version` and `versions` are `None`. If `segment_value` and `segment_values` are both `None` and a single version is used, `segment_values` is automatically set to all values of `segment`.
Distribution Widget The **distribution** widget lets you plot a histogram of any column. **Parameters:** * **`field`**- The Aporia field to plot the distribution for. This parameter should be of type `aporia.sdk.fields.Field` * `timeframe`(Default - None) - An alternative timeframe for the distribution. Pass `None` to inherit the dashboard global timeframe. * `phase` (Default - DatasetType.SERVING) - The version dataset to calculate distribution for (DatasetType.SERVING or DatasetType.TRAINING) * `version`(Default - None) - The Aporia version identifier to plot distribution for. This can be either a specific `aporia.sdk.versions.Version` object or `None` to indicate "All Versions" * `segment`(Default - None) - The Aporia segment identifier to plot distribution for. This can be either a specific `aporia.sdk.segments.Segment`object or `None` to indicate "All Data" * `segment_value`(Default - None) - If a segment was chosen, this is one of the segment values to plot for. For categorical and custom segments, it should be the specific value. For numeric segments, it should be the lower bound number. **Compare Parameters:** * `timeframe` (Default - None) - An alternative timeframe for the distribution. Pass `None` to inherit the dashboard global timeframe. * `phase` (Default - DatasetType.SERVING) - The version dataset to calculate distribution for (DatasetType.SERVING or DatasetType.TRAINING) * `version` (Default - None) - The Aporia version identifier to plot distribution for. This can be either a specific `aporia.sdk.versions.Version` object or `None` to indicate "All Versions" * `segment` (Default - None) - The Aporia segment identifier to plot distribution for. This can be either a specific `aporia.sdk.segments.Segment`object or `None` to indicate "All Data" * `segment_value` - If a segment was chosen, this is one of the segment values to plot for. For categorical and custom segments, it should be the specific value. For numeric segments, it should be the lower bound number.
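For instance, here is a minimal sketch that compares a field's distribution over the dashboard's global timeframe against its distribution over the last month, using one of the relative timeframe values listed in the Timeframes section. The field name is a placeholder, and `model` is assumed to be a model object retrieved through the API SDK (as in the example at the end of this guide):

```python
from aporia.sdk.widgets import DistributionWidget

# Placeholder field name - replace with a field from your own schema.
score_field = model.get_field_by_name("score")

score_distribution_widget = DistributionWidget.create(
    position=(8, 1),
    size=(4, 5),
    title="Score distribution - current vs. last month",
    field=score_field,
).compare(
    timeframe="1M",  # one of the relative timeframe values supported by Aporia
)
```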
Time-Series Widget

The **time-series** widget lets you plot a metric over multiple segments or versions as a line plot over time. This is useful for tracking interesting metrics for changes or peaks.

**Parameters:**

* **`metric`** - A MetricConfiguration class describing the metric to plot
* `version` (Default - None) - The Aporia version identifier to plot the metric for. This can be a specific `aporia.sdk.versions.Version` object or `None`. Mutually exclusive with `versions`. If both are `None`, this is equivalent to a single version of "All Versions".
* `versions` (Default - None) - A list of Aporia version identifiers to plot the metric for. This can be a list of specific `aporia.sdk.versions.Version` objects, and may include `None` as a list member to indicate "All Versions". Mutually exclusive with `version`. If both are `None`, this is equivalent to a single version of "All Versions".
* `segment` (Default - None) - The Aporia segment identifier to plot the metric for. This can be either a specific `aporia.sdk.segments.Segment` object or `None` to indicate "All Data"
* `segment_value` (Default - None) - If a segment was chosen, this is one of the segment values to plot for. For categorical and custom segments, it should be the specific value. For numeric segments, it should be the lower bound number. This is only applicable if the `versions` parameter is used.
* `segment_values` (Default - None) - If a segment was chosen, this is a list of the segment values to plot. For categorical and custom segments, it should be the specific values. For numeric segments, it should be the lower bound numbers of each value. This is only applicable if the `version` parameter is used, or both `version` and `versions` are `None`. If `segment_value` and `segment_values` are both `None` and a single version is used, `segment_values` is automatically set to all values of `segment`.
* `granularity` (Default - None) - The timeframe granularity to plot the line in. This can be `1d`, `1w`, or `1M` to indicate daily, weekly, or monthly, respectively. Pass `None` to indicate Aporia should divide the timeframes automatically for the best presentation.

**Compare Parameters:**

* `metric_overrides` (Default - None) - metric overrides to apply on the original metric in the compare view.
* `version` (Default - None) - The Aporia version identifier to plot the metric for. This can be a specific `aporia.sdk.versions.Version` object or `None`. Mutually exclusive with `versions`. If both are `None`, this is equivalent to a single version of "All Versions".
* `versions` (Default - None) - A list of Aporia version identifiers to plot the metric for. This can be a list of specific `aporia.sdk.versions.Version` objects, and may include `None` as a list member to indicate "All Versions". Mutually exclusive with `version`. If both are `None`, this is equivalent to a single version of "All Versions".
* `segment` (Default - None) - The Aporia segment identifier to plot the metric for. This can be either a specific `aporia.sdk.segments.Segment` object or `None` to indicate "All Data"
* `segment_value` (Default - None) - If a segment was chosen, this is one of the segment values to plot for. For categorical and custom segments, it should be the specific value. For numeric segments, it should be the lower bound number. This is only applicable if the `versions` parameter is used.
* `segment_values` (Default - None) - If a segment was chosen, this is a list of the segment values to plot. For categorical and custom segments, it should be the specific values. For numeric segments, it should be the lower bound numbers of each value. This is only applicable if the `version` parameter is used, or both `version` and `versions` are `None`. If `segment_value` and `segment_values` are both `None` and a single version is used, `segment_values` is automatically set to all values of `segment`.
Histogram over Time Widget The **histogram-over-time** widget lets you plot an area chart of a column over time. Currently only categorical and boolean columns are supported. **Parameters:** * **`field`**- The Aporia field to plot the distribution for. This parameter should be of type `aporia.sdk.fields.Field` * `version`(Default - None) - The Aporia version identifier to plot distribution for. This can be either a specific `aporia.sdk.versions.Version` object or `None` to indicate "All Versions" * `segment`(Default - None) - The Aporia segment identifier to plot distribution for. This can be either a specific `aporia.sdk.segments.Segment`object or `None` to indicate "All Data" * `segment_value`(Default - None) - If a segment was chosen, this is one of the segment values to plot for. For categorical and custom segments, it should be the specific value. For numeric segments, it should be the lower bound number. * `granularity` (Default - None) - The timeframe granularity to plot the line in. This can be `1d`, `1w`, `1M` to indicate daily, weekly or monthly, respectively. Pass `None` to indicate Aporia should divide the timeframes automatically for the best presentation.
Data Health Table Widget The **data-health table** widget lets you plot a table of the most drifting/most missing features of your model. This easily highlights the main features needed to investigate or track. A common practice is to compare to training or a previous timeframe. **Parameters:** * `timeframe`(Default - None) - An alternative timeframe for the distribution. Pass `None` to inherit the dashboard global timeframe. * `phase` (Default - DatasetType.SERVING) - The version dataset to calculate distribution for (DatasetType.SERVING or DatasetType.TRAINING) * **`version`** - The Aporia version identifier to plot distribution for. This must be a specific `aporia.sdk.versions.Version` object * `segment`(Default - None) - The Aporia segment identifier to plot distribution for. This can be either a specific `aporia.sdk.segments.Segment`object or `None` to indicate "All Data" * `segment_value`(Default - None) - If a segment was chosen, this is one of the segment values to plot for. For categorical and custom segments, it should be the specific value. For numeric segments, it should be the lower bound number. * `sort_by` (Default - SortType.DRIFT\_SCORE) - Whether to highlight the top drifted features or features with highest missing ratio. Must be a value of `aporia.sdk.widgets.data_health_table_widget.SortType` * `sort_direction` (Default - SortDirection.DESCENDING) - Whether to sort in an ascending or descending direction. Must be a value of `aporia.sdk.widgets.data_health_table_widget.SortDirection` **Compare Parameters:** * `timeframe` (Default - None) - An alternative timeframe for the distribution. Pass `None` to inherit the dashboard global timeframe. * `phase` (Default - DatasetType.SERVING) - The version dataset to calculate distribution for (DatasetType.SERVING or DatasetType.TRAINING) * `segment` (Default - None) - The Aporia segment identifier to plot distribution for. This can be either a specific `aporia.sdk.segments.Segment`object or `None` to indicate "All Data" * `segment_value` - If a segment was chosen, this is one of the segment values to plot for. For categorical and custom segments, it should be the specific value. For numeric segments, it should be the lower bound number.
Usage example:

```python
from aporia.sdk.metrics import MetricType
from aporia.sdk.widgets import MetricWidget, TextWidget
from aporia.sdk.widgets.base import MetricConfiguration

widgets = [
    TextWidget.create(position=(0, 0), size=(12, 1), text="This is my new dashboard!"),
    MetricWidget.create(
        position=(0, 1),
        size=(4, 5),
        title="Model activity",
        metric=MetricConfiguration(metric=MetricType.COUNT),
    ).compare(),  # This will display the same number twice
]
```

### Dashboard Configuration

The next step is to define the dashboard itself. A dashboard is built from a list of widgets and global filters. The global filters include the dashboard timeframe and other specific filters, and are defined by `aporia.sdk.dashboards.DashboardGlobalFilters`. The fields are:

* **`timeframe`** - The global dashboard timeframe. See [#timeframes](#timeframes "mention") for more information.
* `version_id` - An optional version ID to filter by. All widget views configured for "All Versions" will use this version instead. Other views aren't affected.
* `data_segment_id` - An optional segment ID to filter by. All widget views configured for "All Data" will use this segment instead. Other views showing other segments will show the intersection of the two segments, and views showing data for a different value of the same segment won't be affected. If this parameter is passed, `data_segment_value` must also be set.
* `data_segment_value` - The segment value to filter by, if `data_segment_id` is not `None`.

```python
from aporia.sdk.dashboards import DashboardConfiguration, DashboardGlobalFilters

dashboard_configuration = DashboardConfiguration(
    widgets=widgets,
    global_filters=DashboardGlobalFilters(
        timeframe="7d",  # Display data for last week
    ),
)
```

## Usage

### Reading & Updating Dashboards

You can use the `model.get_dashboards()` function to list all dashboards for a model. You will get back a list of objects that you can filter by name, ID, or other variables to find the relevant dashboard. You can then edit these dashboards using the `update` function.

### Creating Dashboards

You can create new dashboards using the `model.create_dashboard(...)` function.

### Usage with as\_code SDK

In order to use the dashboard functionality with the as-code objects, you can call the `stack.get_resource_id()` function on the as-code resource name to get its resource ID. You can then use this ID with the API SDK to get the relevant object.

### Example

```python
import os

from aporia import Aporia
from aporia.sdk.dashboards import DashboardConfiguration, DashboardGlobalFilters
from aporia.sdk.datasets import DatasetType
from aporia.sdk.metrics import MetricType
from aporia.sdk.widgets import DistributionWidget, MetricWidget, TextWidget, TimeSeriesWidget
from aporia.sdk.widgets.base import MetricConfiguration

apr = Aporia(
    account_name=os.environ["APORIA_ACCOUNT"],
    workspace_name=os.environ["APORIA_WORKSPACE"],
    token=os.environ["APORIA_TOKEN"],
)

MODEL_ID = "..."
# This can be retrieved through the model URL or through the as-code get_resource_id function
model = [model for model in apr.get_models() if model.id == MODEL_ID][0]
version = [version for version in model.get_versions() if version.name == "My version"][0]
field = model.get_field_by_name("my_prediction")

dashboard = model.create_dashboard(
    "My SDK dashboard",
    definition=DashboardConfiguration(
        global_filters=DashboardGlobalFilters(timeframe="7d"),
        widgets=[
            TextWidget.create(position=(0, 0), size=(12, 1), text="🔎 Model Overview"),
            # Show prediction volume of all versions compared to "My version"
            MetricWidget.create(
                position=(0, 1),
                size=(4, 5),
                title="Model Usage Volume by Version",
                metric=MetricConfiguration(metric=MetricType.COUNT),
            ).compare(version=version),
            # Plot prediction distribution of my_prediction in serving compared to training datasets across all versions
            DistributionWidget.create(
                position=(4, 1),
                size=(4, 5),
                title="Prediction Distribution - Serving VS Training",
                field=field,
            ).compare(phase=DatasetType.TRAINING),
            TimeSeriesWidget.create(
                position=(8, 1),
                size=(4, 5),
                title="Average Prediction",
                metric=MetricConfiguration(
                    metric=MetricType.MEAN,
                    field=field,
                ),
                granularity="1d",
                # Plot one line for "All Versions" and one line for "My version"
                versions=[None, version],
            ),
        ],
    ),
)
```

---

# Source: https://docs.aporia.com/monitors-and-alerts/data-drift.md
# Source: https://docs.aporia.com/v1/monitors/data-drift.md

# Data Drift

### Why Monitor Data Drift?

Data drift is one of the top reasons why model accuracy degrades over time. Data drift is a change in model input data that leads to model performance degradation. Monitoring data drift helps detect these model performance issues.

Causes of data drift include:

* **Upstream process changes**, such as a sensor being replaced that changes the units of measurement from inches to centimeters.
* **Data quality issues**, such as a broken sensor always reading 0.
* **Natural drift in the data**, such as mean temperature changing with the seasons.
* **Change in relation between features**, or covariate shift.

### Comparison methods

For this monitor, the following comparison methods are available:

* [Anomaly detection](https://docs.aporia.com/v1/monitor-template#comparison-methods)
* [Compared to segment](https://docs.aporia.com/v1/monitor-template#comparison-methods)
* [Compared to training](https://docs.aporia.com/v1/monitor-template#comparison-methods)

### Customizing your monitor

Configuration may slightly vary depending on the baseline you choose.

#### STEP 1: choose the fields you would like to monitor

You may select as many fields as you want 😊 Note that the monitor will run on each selected field separately.

#### STEP 2: choose inspection period and baseline

For the fields you chose in the previous step, the monitor will compare the inspection period distribution with the baseline distribution. An alert will be raised if the monitor finds a drift between these two distributions.

#### STEP 3: calibrate thresholds

Use the monitor preview to help you choose the right threshold and make sure you get the number of alerts that fits your needs. The threshold for categorical fields is different from the one for numeric fields. Make sure to calibrate them both if relevant.

### How are drifts calculated?

For numeric fields, Aporia detects drifts based on the [Jensen–Shannon](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence) divergence metric.
For categorical fields, drifts are detected using [Hellinger distance](https://en.wikipedia.org/wiki/Hellinger_distance). If you need to use other metrics, please contact us.

---

# Source: https://docs.aporia.com/ml-monitoring-as-code/data-segments.md

# Data Segments

This guide will show you how to automatically add data segments to your model from code using the Python SDK. For more information on data segments, see the [tracking-data-segments](https://docs.aporia.com/core-concepts/tracking-data-segments "mention") documentation.

## Defining Data Segments

To add new data segments:

```python
platform_segment = aporia.Segment(
    "Platform",
    field="platform",
    values=["desktop", "mobile"]
)

country_segment = aporia.Segment(
    "Country",
    field="country",
    values=["US", "IL", "DE", "FR", "GB", "DK"]
)
```

In this example, we're adding two new data segments - **platform** and **country**.

To add the segments to your model, pass them to the model object:

```python
model = aporia.Model(
    "My Model",
    type=aporia.ModelType.RANKING,
    versions=[model_version],
    segments=[platform_segment, country_segment]
)
```

---

# Source: https://docs.aporia.com/data-sources/delta-lake.md
# Source: https://docs.aporia.com/v1/data-sources/delta-lake.md

# Delta Lake

This guide describes how to connect Aporia to a [Delta Lake](https://delta.io/) data source in order to monitor a new ML Model in production.

We will assume that your model inputs, outputs, and optionally delayed actuals are stored in Delta Lake. This data source may also be used to connect to your model's training/test set to be used as a baseline for model monitoring.

### Create an IAM role for S3 access

In order to provide access to S3, create an IAM role with the necessary API permissions.

First, create a JSON file on your computer with the following content:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject*"
            ],
            "Resource": [
                "arn:aws:s3:::<BUCKET_NAME>",
                "arn:aws:s3:::<BUCKET_NAME>/*"
            ]
        }
    ]
}
```

Make sure to replace `<BUCKET_NAME>` with the name of the relevant S3 bucket.

### Creating an S3 data source in Aporia

To create a new model to be monitored in Aporia, you can call the `aporia.create_model(...)` API:

```python
aporia.create_model("<MODEL_ID>", "<MODEL_NAME>")
```

Each model in Aporia contains different **Model Versions**. When you (re)train your model, you should create a new model version in Aporia.

```python
apr_model = aporia.create_model_version(
    model_id="<MODEL_ID>",
    model_version="v1",
    model_type="binary",
    raw_inputs={
        "raw_text": "text",
    },
    features={
        "amount": "numeric",
        "owner": "string",
        "is_new": "boolean",
        "embeddings": {"type": "tensor", "dimensions": [768]},
    },
    predictions={
        "will_buy_insurance": "boolean",
        "proba": "numeric",
    },
)
```

Each raw input, feature or prediction is mapped by default to the column of the same name in the file. By creating a feature named `amount` or a prediction named `proba`, for example, the S3 data source will expect a column in the file named `amount` or `proba`, respectively.
Next, create an instance of `S3DataSource` and pass it to `apr_model.connect_serving(...)` or `apr_model.connect_training(...)`:

```python
data_source = S3DataSource(
    object_path="s3://my-bucket/my-file.parquet",
    object_format="delta",

    # Optional - use the select_expr param to apply additional Spark SQL
    select_expr=["", ...],

    # Optional - use the read_options param to apply any Spark configuration
    # (e.g custom Spark resources necessary for this model)
    read_options={...}
)

apr_model.connect_serving(
    data_source=data_source,

    # Names of the prediction ID and prediction timestamp columns
    id_column="prediction_id",
    timestamp_column="prediction_timestamp",
)
```

Note that as part of the `connect_serving` API, you are required to specify 2 additional columns:

* `id_column` - A unique ID to represent this prediction.
* `timestamp_column` - A column representing when this prediction occurred.

### What's Next

For more information on:

* Advanced feature / prediction <-> column mapping
* How to integrate delayed actuals
* How to integrate training / test sets

Please see the [Data Sources Overview](https://docs.aporia.com/v1/data-sources) page.

---

# Source: https://docs.aporia.com/nlp/example-question-answering.md
# Source: https://docs.aporia.com/v1/nlp/example-question-answering.md

# Example: Question Answering

**Question answering models can retrieve the answer to a question from a given text**, which is useful for searching for an answer in a document.

Throughout the guide, we will use a simple question answering model based on 🤗 [HuggingFace](https://huggingface.co/):thumbsup:

```python
>>> from transformers import pipeline
>>> qa_model = pipeline("question-answering")
```

This downloads a default pretrained model and tokenizer for Question Answering. Now you can use the `qa_model` on your target question / context:

```python
qa_model(
    question="Where are the best cookies?",
    context="The best cookies are in Aporia's office."
)
# ==> {'score': 0.8362494111061096,
#      'start': 24,
#      'end': 39,
#      'answer': "Aporia's office"}
```

## Extract Embeddings

To extract embeddings from the model, we'll first need to do two things:

1. Pass `output_hidden_states=True` to our model params.
2. When we call `pipeline(...)` it does a lot of things for us - preprocessing, inference, and postprocessing. **We'll need to break all this**, so we can interfere in the middle and get embeddings [😉](https://emojipedia.org/winking-face/)

In other words:

```python
from transformers import QuestionAnsweringPipeline

qa_model = pipeline("question-answering", model_kwargs={"output_hidden_states": True})

# Preprocess
model_inputs = next(qa_model.preprocess(QuestionAnsweringPipeline.create_sample(
    question="Where are the best cookies?",
    context="The best cookies are in Aporia's office."
)))

# Inference
model_output = qa_model.model(input_ids=model_inputs["input_ids"])

# Postprocessing
start, end = model_output[:2]
qa_model.postprocess([{"start": start, "end": end, **model_inputs}])
# ==> {'score': 0.8362494111061096, 'start': 24, 'end': 39, 'answer': "Aporia's office"}
```

And finally, to extract embeddings for this prediction:

```python
import torch

embeddings = torch.mean(model_output.hidden_states[-1], dim=1).squeeze()
```

## Storing your Predictions

The next step would be to store your predictions in a data store, including the embeddings themselves. For more information on storing your predictions, please check out the [Storing Your Predictions](https://docs.aporia.com/v1/storing-your-predictions) section.
For example, you could use a Parquet file on S3 or a Postgres table that looks like this:
| id | question | context | embeddings | answer | score | timestamp |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Where are the best cookies? | The best cookies are in... | [0.77, 0.87, 0.94, ...] | Aporia's Office | 0.982 | 2021-11-20 13:41:00 |
| 2 | Where is the best hummus? | The best hummus is in... | [0.97, 0.82, 0.13, ...] | Another Place | 0.881 | 2021-11-20 13:45:00 |
| 3 | Where is the best burger? | The best burger is in... | [0.14, 0.55, 0.66, ...] | Blablabla | 0.925 | 2021-11-20 13:49:00 |
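As a rough sketch of the Parquet-on-S3 option (assuming pandas with the optional `pyarrow` and `s3fs` dependencies installed, and a hypothetical `my-predictions-bucket`), each batch of predictions could be written as a new Parquet file:

```python
import datetime
import uuid

import pandas as pd

# Hypothetical prediction record matching the table above
record = pd.DataFrame([{
    "id": str(uuid.uuid4()),
    "question": "Where are the best cookies?",
    "context": "The best cookies are in Aporia's office.",
    "embeddings": [0.77, 0.87, 0.94],  # e.g. the embedding vector extracted earlier
    "answer": "Aporia's office",
    "score": 0.836,
    "timestamp": datetime.datetime.utcnow(),
}])

# Each call writes a new Parquet file; "my-predictions-bucket" is a placeholder
record.to_parquet(
    f"s3://my-predictions-bucket/qa-model/v1/{uuid.uuid4()}.parquet",
    index=False,
)
```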
## Integrate to Aporia

Now let’s add some monitoring to this model 🚀

To monitor this model in Aporia, the first step is to create a model version:

```python
apr_model = aporia.create_model_version(
    model_id="<MODEL_ID>",
    model_version="v1",
    model_type="multiclass",
    raw_inputs={
        "question": "text",
        "context": "text"
    },
    features={
        "embeddings": {"type": "tensor", "dimensions": [768]}
    },
    predictions={
        "answer": "string",
        "score": "numeric"
    },
)
```

Next, we can log predictions directly to Aporia:

```python
import uuid

import torch
from transformers import QuestionAnsweringPipeline, pipeline

qa_model = pipeline(
    task="question-answering",
    model_kwargs={"output_hidden_states": True}
)


def predict(question: str, context: str):
    # Preprocess
    model_inputs = next(qa_model.preprocess(QuestionAnsweringPipeline.create_sample(
        question=question,
        context=context
    )))

    # Inference
    model_output = qa_model.model(input_ids=model_inputs["input_ids"])

    # Postprocessing
    start, end = model_output[:2]
    result = qa_model.postprocess([{"start": start, "end": end, **model_inputs}])

    # Extract embeddings
    embeddings = torch.mean(model_output.hidden_states[-1], dim=1).squeeze().tolist()

    # Log prediction to Aporia
    apr_model.log_prediction(
        id=str(uuid.uuid4()),
        raw_inputs={
            "question": question,
            "context": context
        },
        features={
            "embeddings": embeddings
        },
        predictions={
            "answer": result["answer"],
            "score": result["score"]
        }
    )

    return result
```

Alternatively, connect Aporia to a data source. For more information, see [Data Sources - Overview](https://docs.aporia.com/v1/data-sources):

```python
apr_model.connect_serving(
    data_source=...,
    id_column="id",
    timestamp_column="timestamp"
)
```

Your model should now be integrated with Aporia! 🎉

---

# Source: https://docs.aporia.com/nlp/example-text-classification.md
# Source: https://docs.aporia.com/v1/nlp/example-text-classification.md

# Example: Text Classification

For an example of a HuggingFace-based text classification model, please see [Intro to NLP Monitoring](https://docs.aporia.com/v1/nlp/intro-to-nlp-monitoring).

---

# Source: https://docs.aporia.com/nlp/example-token-classification.md
# Source: https://docs.aporia.com/v1/nlp/example-token-classification.md

# Example: Token Classification

Token classification is a natural language understanding task in which a label is assigned to some tokens in a text. **Named Entity Recognition (NER)** and **Part-of-Speech (PoS)** tagging are two popular token classification subtasks. NER models could be trained to recognize specific entities in a text, such as dates, individuals, and locations, while PoS tagging would identify which words in a text are verbs, nouns, and punctuation marks.

This guide will walk you through an example of NER model monitoring using spaCy. Let's start by creating a dummy model:

```python
import spacy

NER = spacy.load("en_core_web_sm")
```

And let’s assume this is what our prediction function looks like (maybe it’s part of an HTTP server, for example):

```python
def predict(request_id: str, raw_text: str):
    return {
        entity.text: entity.label_
        for entity in NER(raw_text).ents
    }
```

Now let’s add some monitoring to this function 🚀 But before that, let’s create a new model in Aporia:

```python
apr_model = aporia.create_model_version(
    model_id="<MODEL_ID>",
    model_version="v1",
    model_type="multiclass",
    raw_inputs={
        "entity_text": "text",
    },
    features={
        "embeddings": {"type": "tensor", "dimensions": [96]},
    },
    predictions={
        "entity_label": "string"
    }
)
```

This is a `multiclass` model, as each entity can be classified as one of two or more entity types.
Now, we can change the `predict` function to log predictions to Aporia: ```python def predict(request_id: str, raw_text: str): entities = NER(raw_text).ents for i, entity in enumerate(entities): apr_model.log_prediction( id=f"{request_id}_{i}", raw_inputs={"entity_text": entity.text}, features={"embeddings": entity.vector}, predictions={"entity_label": entity.label_}, ) return { entity.text: entity.label_ for entity in entities } ``` Now, here are some sample monitors you can define: * Make sure the distribution of the different entity labels doesn’t drift across time * Make sure the distribution of the embedding vector doesn’t drift across time **General Metadata** But this is just the very beginning. Here, you can get really creative and start adding more information to each Aporia prediction. First, if you have any general metadata of your prediction request (unrelated to the NER model itself), you can go ahead and log this metadata as raw inputs. This will let you make sure the model doesn’t drift or bias specific segments of your data (e.g gender, company type, etc.). **Entity-specific Metadata** Let’s start with an example. For each entity, you can log the word count of that entity. Then, you’ll be able to monitor drift in the word count between different labels. For example, you might expect `country` entities to be 1-2 words, but `organization` entities to have a distribution of 1-5 words, with most organizations having 2-3 words. If suddenly you see an `organization` with 10 words - it is an outlier and probably not really an organization :) But word count is just a simple example, and depending on your application, you can add various types of metadata to make monitoring really great :tada:
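For example, here is a minimal sketch of logging the word count as an extra raw input. This is not part of the official example, and it assumes the model version schema above is extended with an `entity_word_count` field of type `numeric` in `raw_inputs`:

```python
def predict(request_id: str, raw_text: str):
    entities = NER(raw_text).ents

    for i, entity in enumerate(entities):
        apr_model.log_prediction(
            id=f"{request_id}_{i}",
            raw_inputs={
                "entity_text": entity.text,
                # Hypothetical extra metadata - requires adding
                # "entity_word_count": "numeric" to the raw_inputs schema above
                "entity_word_count": len(entity.text.split()),
            },
            features={"embeddings": entity.vector},
            predictions={"entity_label": entity.label_},
        )

    return {entity.text: entity.label_ for entity in entities}
```

A drift or metric change monitor on `entity_word_count`, segmented by `entity_label`, would then surface outliers like the 10-word "organization" described above.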
--- # Source: https://docs.aporia.com/v1/core-concepts/explainability.md # Explainability **"My model is working perfectly! But why?"** This is what explainability is all about - the ability to tell *why* your model predicted what it actually predicted. Or, in other words, what is the impact of each feature on the final prediction?

*Explainability in Action*

### Why Explainability?

There are many reasons why you would need explainability for your models, some examples:

* **Trust:** Models can be viewed as a black box that generates predictions; the ability to explain these predictions increases trust in the model.
* **Debugging:** Being able to explain predictions based on different inputs is a powerful debugging tool for identifying errors.
* **Bias and Fairness:** The ability to see the effect of each feature can aid in identifying unintentional biases that may affect the model's fairness.

For further reading on the subject, check out [our blog about explainability](https://www.aporia.com/blog/explainable-ai/).

### Integrating Explainability in Aporia

Aporia lets you explain each prediction by visualizing the impact of each feature on the final prediction. This can be done by clicking on the **Explain** button near each prediction in the "Data Points" page of your model.

You can also interactively change any feature value, click **Re-Explain** and see the impact on a theoretical prediction.

**Make sure your feature schema in the model version is *ordered***

When creating your model version, you'll need to make sure that the order of the features is identical to your model artifact features. Instead of passing a normal `dict` as the features schema, you'll need to pass an `OrderedDict`. For example:

```python
from collections import OrderedDict

# Build feature schema by order - you can use model.columns for this of course :)
features = OrderedDict()
features["sepal_length"] = "numeric"
features["sepal_width"] = "numeric"
features["petal_length"] = "numeric"
features["petal_width"] = "numeric"

apr_model = aporia.create_model_version(
    model_id="<MODEL_ID>",
    model_version="v1",
    model_type="multiclass",
    features=features,
    predictions={
        "variety": "categorical"
    }
)
```

**Log Training + Serving data**

Training data is required for Explainability. Please check out [Data Sources - Overview](https://docs.aporia.com/v1/data-sources) for more information.

**Upload Model Artifact in ONNX format**

[ONNX](https://onnx.ai/) is an open format for Machine Learning models. Models from all popular ML libraries (XGBoost, Sklearn, Tensorflow, Pytorch, etc.) can be converted to ONNX.

To upload your model artifact, you'll need to execute:

```python
apr_model.upload_model_artifact(
    artifact_type="onnx",
    model_artifact=onnx_model.SerializeToString()
)
```

Here are quick snippets and references that may help you with converting your model.
#### XGBoost

```python
import onnxmltools
from onnxmltools.convert.common.data_types import FloatTensorType

initial_types = [('features', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = onnxmltools.convert_xgboost(xgb_model, initial_types=initial_types, target_opset=9)
```

#### LightGBM

```python
import onnxmltools
from onnxmltools.convert.common.data_types import FloatTensorType

initial_types = [('features', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = onnxmltools.convert_lightgbm(lgb_model, initial_types=initial_types, target_opset=9)
```

#### Catboost

```python
import onnxmltools
from onnxmltools.convert.common.data_types import FloatTensorType

initial_types = [('features', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = onnxmltools.convert_catboost(catboost_model, initial_types=initial_types, target_opset=9)
```

#### Scikit Learn

```python
import onnxmltools
from onnxmltools.convert.common.data_types import FloatTensorType

initial_types = [('features', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = onnxmltools.convert_sklearn(skl_model, initial_types=initial_types, target_opset=9)
```

#### Keras

```python
import onnxmltools

onnx_model = onnxmltools.convert_keras(keras_model, target_opset=9)
```

#### Tensorflow

```python
import onnxmltools

onnx_model = onnxmltools.convert_tensorflow(tf_model, target_opset=9)
```

#### Pytorch

```python
# Please see https://pytorch.org/tutorials/advanced/super_resolution_with_onnxruntime.html
```

---

# Source: https://docs.aporia.com/ml-monitoring-as-code/getting-started.md

# Getting started

{% hint style="info" %}
**BETA FEATURE**

**Monitoring-as-code is in experimental beta and details may change**
{% endhint %}

Aporia's Python SDK is a powerful tool designed to streamline ML monitoring and observability. Define your models, monitors, dashboards, segments, custom metrics, and other ML Observability resources *as code*, just like in Terraform or Pulumi. The SDK also enables you to query metrics from Aporia to integrate with other platforms.

## Key Features

* **ML Monitoring as Code:** Make it easier to manage and track changes by managing your models, dashboards, segments, and other ML Observability resources as code.
* **CI/CD Integration:** Integrate with your CI/CD pipeline to automatically monitor all your models with Aporia.
* **Query Metrics:** Fetch metrics directly from Aporia's platform to inform decisions or to use in other applications.
* **Data Source Integration:** You can define and integrate multiple types of data sources, like S3, Snowflake, Glue Data Catalog, Databricks, and others. This allows your models to leverage a wide range of data for training and inference.
* **Pythonic Interface:** Use the familiar Python programming paradigm to interact with Aporia.

## Installation

You can install the Aporia SDK using pip:

```bash
pip install aporia --upgrade
```

Please make sure you have Python 3.8+.

## Use-cases

### Define models as code

A common use-case of the SDK is to define models, monitors, dashboards, custom metrics and other Aporia resources as code.
```python
import datetime
import os

from aporia import Aporia, MetricDataset, MetricParameters, TimeRange
import aporia.as_code as aporia

aporia_token = os.environ["APORIA_TOKEN"]
aporia_account = os.environ["APORIA_ACCOUNT"]
aporia_workspace = os.environ["APORIA_WORKSPACE"]

stack = aporia.Stack(
    token=aporia_token,
    account=aporia_account,
    workspace=aporia_workspace,
)

# Your model definition code goes here

stack.apply(yes=True, rollback=False, config_path="config.json")
```

Similarly to frameworks like Pulumi and Terraform, resources are defined ***declaratively***. This means that if you run the script twice, models or monitors won't be created twice. Instead, the SDK diffs the current state vs. the desired state, and makes sure to apply the changes in Aporia.

This is especially useful if you have a CI/CD pipeline that deploys models to staging / production: you can add the model to Aporia for monitoring as an additional step.

{% hint style="info" %}
If you are using Aporia's European cluster, please make sure to add the following argument:

`aporia.Stack(host="https://platform-eu.aporia.com", ...)`
{% endhint %}

### Query Metrics using the SDK

This example shows how you can use the Aporia SDK to query metrics from a model. It can be used to integrate data from Aporia to your internal systems:

```python
import os
from datetime import datetime, timedelta

from aporia import (
    Aporia,
    MetricDataset,
    MetricParameters,
    TimeRange,
    DatasetType,
)

aporia_token = os.environ["APORIA_TOKEN"]
aporia_account = os.environ["APORIA_ACCOUNT"]
aporia_workspace = os.environ["APORIA_WORKSPACE"]

aporia_client = Aporia(
    token=aporia_token,
    account_name=aporia_account,
    workspace_name=aporia_workspace,
)

model_id = "<MODEL_ID>"

last_week_dataset = MetricDataset(
    dataset_type=DatasetType.SERVING,
    time_range=TimeRange(
        start=datetime.now() - timedelta(days=7),
        end=datetime.now(),
    ),
)

metrics = aporia_client.query_metrics(
    model_id=model_id,
    metrics=[
        MetricParameters(
            dataset=last_week_dataset,
            name="count",
        ),
    ],
)

print(f"The model had {metrics[0]} predictions last week")
```
* **Adding new models** - Monitor models automatically by creating models in Aporia and connecting them to your data.
* **Data Segments** - Automatically monitor various slices of the data in production.
* **Custom Metrics** - Extend Aporia's monitoring capabilities by adding your own custom metrics.
* **Querying metrics** - Use the SDK to fetch metrics from Aporia and bridge ML Observability with other systems in your organization.
---

# Source: https://docs.aporia.com/data-sources/glue-data-catalog.md
# Source: https://docs.aporia.com/v1/data-sources/glue-data-catalog.md

# Glue Data Catalog

This guide describes how to use the Glue Data Catalog data source in order to monitor a new ML Model in production.

We will assume that your model inputs, outputs and optionally delayed actuals can be queried as tables in Glue Data Catalog. This data source may also be used to connect to your model's training/test set to be used as a baseline for model monitoring.

### Create an IAM role for Glue access

#### Step 1: Create Role

1. Log into your AWS Console and go to the **IAM** console.
2. Click the **Roles** tab in the sidebar.
3. Click **Create role**.
4. In **Select type of trusted entity**, click the **Web Identity** tile.
5. Under **Identity Provider**, click on **Create New**. 6. Under **Provider Type**, click the **OpenID Connect** tile. 7. In the **Provider URL** field, enter the Aporia cluster OIDC URL. 8. In the Audience field, enter "sts.amazonaws.com". 9. Click the **Add provider** button. 10. Close the new tab 11. Refresh the **Identity Provider** list. 12. Select the newly created identity provider. 13. In the **Audience** field, select “sts.amazonaws.com”. 14. Click the **Next** button. 15. Click the **Next** button. 16. In the **Role name** field, enter a role name.\
#### Step 2: Create an access policy 1. In the list of roles, click the role you created. 2. Add an inline policy. 3. On the Permissions tab, click **Add permissions** then click **Create inline policy**.
4. In the policy editor, click the **JSON** tab.\
5. Copy the following access policy, and make sure to fill your correct region, account ID and restrict access to specific databases and tables if necessary. ```json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "glue:GetConnections" ], "Resource": [ "arn:aws:glue:::catalog", "arn:aws:glue:::connection/*" ] }, { "Effect": "Allow", "Action": [ "glue:GetDatabase", "glue:GetDatabases" ], "Resource": [ "arn:aws:glue:::catalog", "arn:aws:glue:::database/default", "arn:aws:glue:::database/global_temp", "arn:aws:glue:::database/*" ] }, { "Effect": "Allow", "Action": [ "glue:GetTable", "glue:GetTables", "glue:GetPartitions", "glue:GetPartition", "glue:SearchTables" ], "Resource": [ "arn:aws:glue:::catalog", "arn:aws:glue:::database/*", "arn:aws:glue:::table/*" ] }, { "Effect": "Allow", "Action": [ "glue:GetUserDefinedFunctions" ], "Resource": [ "*" ] }, { "Effect": "Allow", "Action": [ "glue:CreateDatabase" ], "Resource": [ "arn:aws:glue:::catalog", "arn:aws:glue:::database/default", "arn:aws:glue:::database/global_temp" ] } ] } ``` 6. Click **Review Policy**. 7. In the **Name** field, enter a policy name. 8. Click **Create policy**. 9. If you use Service Control Policies to deny certain actions at the AWS account level, ensure that `sts:AssumeRoleWithWebIdentity` is allowlisted so Aporia can assume the cross-account role. 10. In the role summary, copy the **Role ARN**. Next, please provide your Aporia account manager with the Role ARN for the role you've just created. ### Creating a Glue Data Catalog data source in Aporia To create a new model to be monitored in Aporia, you can call the `aporia.create_model(...)` API: ```python aporia.create_model("", "") ``` Each model in Aporia contains different **Model Versions**. When you (re)train your model, you should create a new model version in Aporia. ```python apr_model = aporia.create_model_version( model_id="", model_version="v1", model_type="binary" raw_inputs={ "raw_text": "text", }, features={ "amount": "numeric", "owner": "string", "is_new": "boolean", "embeddings": {"type": "tensor", "dimensions": [768]}, }, predictions={ "will_buy_insurance": "boolean", "proba": "numeric", }, ) ``` Each raw input, feature or prediction is mapped by default to the column of the same name in the Glue table. By creating a feature named `amount` or a prediction named `proba`, for example, Aporia will expect a column in the table named `amount` or `proba`, respectively. If your data format does not fit exactly, you can use Spark SQL queries to shape it in any way you want. Next, create an instance of `GlueDataSource` and pass it to `apr_model.connect_serving(...)` or `apr_model.connect_training(...)`: ```python apr_model.connect_serving( data_source=GlueDataSource( query="SELECT * FROM model_db.model_predictions", ), # Names of the prediction ID and prediction timestamp columns id_column="prediction_id", timestamp_column="prediction_timestamp", ) ``` Note that as part of the `connect_serving` API, you are required to specify additional 2 columns: * `id_column` - A unique ID to represent this prediction. * `timestamp_column` - A column representing when did this prediction occur. ### What's Next For more information on: * Advanced feature / prediction <-> column mapping * How to integrate delayed actuals * How to integrate training / test sets Please see the [Data Sources Overview](https://docs.aporia.com/v1/data-sources) page. 
--- # Source: https://docs.aporia.com/data-sources/google-cloud-storage.md # Google Cloud Storage This guide describes how to connect Aporia to a Google Cloud Storage (GCS) data source in order to monitor your ML Model in production. We will assume that your model inputs, outputs, and optionally delayed actuals are stored in a file in GCS. Currently, the following file formats are supported: * `parquet` * `json` This data source may also be used to connect to your model's training dataset to be used as a baseline for model monitoring.
### Grant bucket access to Aporia Dataproc Worker Service Account In order to provide access to GCS, you'll need to update your Aporia Dataproc worker service account with the necessary API permissions. Go to the [Cloud Storage buckets page](https://console.cloud.google.com/storage/browser). 1. Select the buckets where your data is stored. 2. Click on the permissions button:
On the Permissions tab, click on the Add Principal button.
On the Grant access page, do the following:

1. Add the Aporia Dataproc Worker Service Account as a principal.
2. Assign the Storage Object Viewer role.
3. Click Save.
Now Aporia has the read permissions it needs to connect to the GCS buckets you have granted access to.

### Create a GCS data source in Aporia

1. Go to the [Aporia platform](https://platform.aporia.com/) and log in to your account.
2. Go to the **Integrations** page and click on the **Data Connectors** tab.
3. Scroll to the **Connect New Data Source** section.
4. Click **Connect** on the GCS card and follow the instructions.

Bravo! :clap: Now you can use the data source you've created across all your models in Aporia.

---

# Source: https://docs.aporia.com/nlp/intro-to-nlp-monitoring.md
# Source: https://docs.aporia.com/v1/nlp/intro-to-nlp-monitoring.md

# Intro to NLP Monitoring

This guide will walk you through the core concepts of NLP model monitoring. Soon, you'll be able to detect drift and measure model performance for your NLP models 🚀

Throughout the guide, we will use a simple sentiment analysis model based on 🤗 [HuggingFace](https://huggingface.co/):

```python
>>> from transformers import pipeline
>>> classifier = pipeline("sentiment-analysis")
```

This downloads a default pretrained model and tokenizer for Sentiment Analysis. Now you can use the `classifier` on your target text:

```python
>>> classifier("I love cookies and Aporia")
[{'label': 'POSITIVE', 'score': 0.9997883439064026}]
```

## Extract Embeddings

To effectively detect drift in NLP models, we use *embeddings*.

{% hint style="info" %}
**But... what are embeddings?**

Textual data is complex, high-dimensional, and free-form. Embeddings represent text as *low-dimensional vectors*.

Various language models, such as [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) and transformer-based models like [BERT](https://en.wikipedia.org/wiki/BERT_\(language_model\)), are used to obtain embeddings for NLP models. In the case of BERT, embeddings are usually vectors of size 768.
{% endhint %}

To get embeddings for our HuggingFace model, we'll need to do two things:

1. Pass `output_hidden_states=True` to our model params.
2. When we call `pipeline(...)` it does a lot of things for us - preprocessing, inference, and postprocessing. **We'll need to break all this**, so we can interfere in the middle and get embeddings [😉](https://emojipedia.org/winking-face/)

In other words:

```python
classifier = pipeline(
    task="sentiment-analysis",
    model_kwargs={"output_hidden_states": True}
)

# Preprocessing
model_input = classifier.preprocess("I love cookies and Aporia")

# Inference
model_output = classifier.forward(model_input)

# Postprocessing
classifier.postprocess(model_output)
# ==> {'label': 'POSITIVE', 'score': 0.9998340606689453}
```

And finally, to extract embeddings for this prediction:

```python
import torch

embeddings = torch.mean(model_output.hidden_states[-1], dim=1).squeeze()
```

## Storing your Predictions

The next step would be to store your predictions in a data store, including the embeddings themselves. For more information on storing your predictions, please check out the [Storing Your Predictions](https://docs.aporia.com/v1/storing-your-predictions) section.

For example, you could use a Parquet file on S3 or a Postgres table that looks like this:
| id | raw_text | embeddings | prediction | score | timestamp |
| --- | --- | --- | --- | --- | --- |
| 1 | I love cookies and Aporia | [0.77, 0.87, 0.94, ...] | POSITIVE | 0.98 | 2021-11-20 13:41:00 |
| 2 | This restaurant was really bad | [0.97, 0.82, 0.13, ...] | NEGATIVE | 0.88 | 2021-11-20 13:45:00 |
| 3 | Hummus is the tastiest thing ever | [0.14, 0.55, 0.66, ...] | POSITIVE | 0.92 | 2021-11-20 13:49:00 |
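For the Postgres option, a rough sketch (assuming pandas and SQLAlchemy with a Postgres driver installed, and using a placeholder connection string and table name) could look like this:

```python
import datetime
import json
import uuid

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string - replace with your own Postgres instance
engine = create_engine("postgresql://user:password@localhost:5432/monitoring")

record = pd.DataFrame([{
    "id": str(uuid.uuid4()),
    "raw_text": "I love cookies and Aporia",
    # Stored as a JSON string here for simplicity; an array column also works
    "embeddings": json.dumps([0.77, 0.87, 0.94]),
    "prediction": "POSITIVE",
    "score": 0.98,
    "timestamp": datetime.datetime.utcnow(),
}])

# Appends the record to a hypothetical "predictions" table (created if missing)
record.to_sql("predictions", engine, if_exists="append", index=False)
```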
## Integrate to Aporia

Now let’s add some monitoring to this model 🚀

To monitor this model in Aporia, the first step is to create a model version:

```python
apr_model = aporia.create_model_version(
    model_id="<MODEL_ID>",
    model_version="v1",
    model_type="multiclass",
    raw_inputs={
        "raw_text": "text",
    },
    features={
        "embeddings": {"type": "tensor", "dimensions": [768]}
    },
    predictions={
        "prediction": "string",
        "score": "numeric"
    },
)
```

Next, we can log predictions directly to Aporia:

```python
import uuid

import torch
from transformers import pipeline

classifier = pipeline(
    task="sentiment-analysis",
    model_kwargs={"output_hidden_states": True}
)


def predict(raw_text: str):
    # Run model pipeline
    model_input = classifier.preprocess(raw_text)
    model_output = classifier.forward(model_input)
    result = classifier.postprocess(model_output)

    # Extract embeddings
    embeddings = torch.mean(model_output.hidden_states[-1], dim=1).squeeze().tolist()

    # Log prediction to Aporia
    apr_model.log_prediction(
        id=str(uuid.uuid4()),
        raw_inputs={
            "raw_text": raw_text
        },
        features={
            "embeddings": embeddings
        },
        predictions={
            "prediction": result["label"],
            "score": result["score"]
        }
    )

    return result
```

Alternatively, connect Aporia to a data source. For more information, see [Data Sources - Overview](https://docs.aporia.com/v1/data-sources):

```python
apr_model.connect_serving(
    data_source=...,
    id_column="id",
    timestamp_column="timestamp"
)
```

Your model should now be integrated with Aporia! 🎉

## Next steps

* **Create a custom dashboard for your model in Aporia** - Drag & drop widgets to show different performance metrics, top drifted features, etc.
* **Visualize NLP drift using Aporia's Embeddings Projector** - Use the UMap tool to visually see drift between different datasets in production.
* **Set up alerts to get notified for ML issues** - Including data integrity issues, model performance degradation, and model drift.

---

# Source: https://docs.aporia.com/v1/integrations/jira.md

# JIRA

You can easily integrate Aporia with JIRA to create JIRA issues from Aporia alerts.

Integrations can be found on the "Integrations" page, accessible through the sidebar:

![All Integrations](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FogmjRuZ5XbmPSJwXoBPm%2Fall_integrations.png?alt=media)

### Setting up the JIRA Integration

After clicking the JIRA integration you will be redirected to JIRA, where you will need to allow Aporia to create issues in your project. Clicking "Accept" will redirect you back to Aporia.

You can now go to the [Alerts](https://app.aporia.com/alerts) page and click the JIRA icon in order to create a JIRA issue from an alert:

![Create Issue](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FeqLCA6g1a8ANXD8TJxwJ%2Fjira_create_ticket.png?alt=media)

---

# Source: https://docs.aporia.com/storing-your-predictions/kserve.md
# Source: https://docs.aporia.com/v1/storing-your-predictions/kserve.md

# Kubeflow / KServe

If you are using [Kubeflow](https://www.kubeflow.org/) or [KServe](https://github.com/kserve/kserve) for model serving, you can store the predictions of your models using InferenceDB.

[InferenceDB](https://github.com/aporia-ai/inferencedb) is an open-source cloud-native tool that connects to KServe and streams predictions to a data lake, based on Kafka.

{% hint style="warning" %}
**WARNING: InferenceDB is still experimental!**

InferenceDB is an open-source project developed by Aporia.
It is still experimental, and not yet ready for production! {% endhint %} This guide will explain how to deploy a simple scikit-learn model using KServe, and log its inferences to a Parquet file in S3. ### Requirements * [**KServe**](https://kserve.github.io/website/0.8/) * [**KNative Eventing**](https://knative.dev/docs/eventing/) - with the [Kafka broker](https://knative.dev/docs/eventing/broker/kafka-broker/) * [**Kafka**](https://kafka.apache.org/) - with Schema Registry, Kafka Connect, and [Confluent S3 Sink connector](https://docs.confluent.io/kafka-connect-s3-sink/current/overview.html) plugin To get started as quickly as possible, see the [environment preperation tutorial](https://github.com/aporia-ai/inferencedb/wiki/KServe-Requirements), which shows how to set up a full environment in minutes. ### Step 1: Kafka Broker First, we will need a Kafka broker to collect all KServe inference requests and responses: ```yaml apiVersion: eventing.knative.dev/v1 kind: Broker metadata: name: sklearn-iris-broker namespace: default annotations: eventing.knative.dev/broker.class: Kafka spec: config: apiVersion: v1 kind: ConfigMap name: inferencedb-kafka-broker-config namespace: knative-eventing --- apiVersion: v1 kind: ConfigMap metadata: name: inferencedb-kafka-broker-config namespace: knative-eventing data: # Number of topic partitions default.topic.partitions: "8" # Replication factor of topic messages. default.topic.replication.factor: "1" # A comma separated list of bootstrap servers. (It can be in or out the k8s cluster) bootstrap.servers: "kafka-cp-kafka.default.svc.cluster.local:9092" ``` ### Step 2: InferenceService Next, we will serve a simple sklearn model using KServe: ```yaml apiVersion: serving.kserve.io/v1beta1 kind: InferenceService metadata: name: sklearn-iris spec: predictor: logger: mode: all url: http://kafka-broker-ingress.knative-eventing.svc.cluster.local/default/sklearn-iris-broker sklearn: protocolVersion: v2 storageUri: gs://seldon-models/sklearn/iris ``` Note the `logger` section - you can read more about it in the [KServe documentation](https://kserve.github.io/website/0.8/modelserving/logger/logger/). ### Step 3: InferenceLogger Finally, we can log the predictions of our new model using InferenceDB: ```yaml apiVersion: inferencedb.aporia.com/v1alpha1 kind: InferenceLogger metadata: name: sklearn-iris namespace: default spec: # NOTE: The format is knative-broker-- topic: knative-broker-default-sklearn-iris-broker events: type: kserve config: {} destination: type: confluent-s3 config: url: s3://aporia-data/inferencedb format: parquet # Optional - Only if you want to override column names schema: type: avro config: columnNames: inputs: [sepal_width, petal_width, sepal_length, petal_length] outputs: [flower] ``` ### Step 4: Send requests First, we will need to port-forward the Istio service so we can access it from our local machine: ``` kubectl port-forward --namespace istio-system svc/istio-ingressgateway 8080:80 ``` Prepare a payload in a file called `iris-input.json`: ```json { "inputs": [ { "name": "input-0", "shape": [2, 4], "datatype": "FP32", "data": [ [6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6] ] } ] } ``` And finally, you can send some inference requests: ``` SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3) curl -v \ -H "Host: ${SERVICE_HOSTNAME}" \ -H "Content-Type: application/json" \ -d @./iris-input.json \ http://localhost:8080/v2/models/sklearn-iris/infer ``` ### Step 5: Success! 
If everything was configured correctly, these predictions should have been logged to a Parquet file in S3. ```python import pandas as pd df = pd.read_parquet("s3://aporia-data/inferencedb/default-sklearn-iris/") print(df) ``` [See the full example here.](https://github.com/aporia-ai/inferencedb/tree/main/examples/kserve/kafka-broker) --- # Source: https://docs.aporia.com/v1/storing-your-predictions/logging-to-aporia-directly.md # Logging to Aporia directly This section will teach you how to integrate Aporia using [Python SDK](https://aporia-sdk-ref.netlify.app/), but you can also use our [REST API](https://docs.aporia.com/v1/api-reference/rest-api) or the integrate to your own DB. ## Get Started To get started, install the Aporia SDK: ```bash pip3 install aporia --upgrade ``` And then initialize it in your code: ```python import aporia aporia.init(token="", environment="", # e.g "production" verbose=True, raise_errors=True) ``` ### Create Model To create a new model to be monitored in Aporia, you can call the `aporia.create_model(...)` API: ```python aporia.create_model("", "") ``` This API would not recreate the model if the model ID already exists. You can also specify color, icon, tags, and model owner: ```python aporia.create_model( model_id="fraud-detection", model_name="Fraud Detection", color=ModelColor.ARCTIC_BLUE, icon=ModelIcon.FRAUD_DETECTION, tags={ "framework": "xgboost", "coolness_level": 10 }, owner="fred@aporia.com", # Integrates with your enterprise auth system ;) ) ``` ### Create Model Version Each model in Aporia contains different **Model Versions**. When you (re)train your model, you should create a new model version in Aporia. **Manual** ```python apr_model = aporia.create_model_version( model_id="", model_version="v1", model_type="binary", features={ "amount": "numeric", "owner": "string", "is_new": "boolean", "created_at": "datetime", }, predictions={ "will_buy_insurance": "boolean", "proba": "numeric", }, # Optional feature_importance={ "amount": 80, "owner": 10, "is_new": 70, "created_at": 20, } ) ``` **Inferring from Pandas DataFrame** ```python # Example DataFrames, each one with one row features_df = pd.DataFrame([[12.3, "John", True, pd.Timestamp.now()]], columns=["amount", "owner", "is_new", "created_at"]) predictions_df = pd.DataFrame([[True, 0.7]], columns=["will_buy_insurance", "proba"]) # Create a model version by inferring schemas from pandas DataFrames apr_model = aporia.create_model_version( model_id="", model_version="v1", model_type="binary", features=aporia.pandas.infer_schema_from_dataframe(features_df), predictions=aporia.pandas.infer_schema_from_dataframe(predictions_df), # Optional feature_importance={ "amount": 80, "owner": 10, "is_new": 70, "created_at": 20, } ) ``` Model version parameter can be any string - you can use the model file's hash, git commit hash, experiment/run ID from MLFlow or anything else. Model type can be [regression](https://docs.aporia.com/v1/model-types/regression), [binary](https://docs.aporia.com/v1/model-types/binary), [multiclass](https://docs.aporia.com/v1/model-types/multiclass-classification), [multi-label](https://docs.aporia.com/v1/model-types/multi-label-classification), or [ranking](https://docs.aporia.com/v1/model-types/ranking). Please refer to the relevant documentation on each model type for more info. 
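As mentioned above, the version string can be derived from your git commit. Here is a minimal sketch (assuming the training job runs inside a git checkout; `<MODEL_ID>` and the minimal schema are placeholders):

```python
import subprocess

import aporia

# Use the short git commit hash of the training code as the model version
git_commit = subprocess.check_output(
    ["git", "rev-parse", "--short", "HEAD"], text=True
).strip()

apr_model = aporia.create_model_version(
    model_id="<MODEL_ID>",
    model_version=f"git-{git_commit}",
    model_type="binary",
    features={"amount": "numeric"},
    predictions={"will_buy_insurance": "boolean"},
)
```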
#### Field Types * `numeric` - valid examples: 1, 2.87, 0.53, 300.13 * `boolean` - valid examples: True, False * `categorical` - a categorical field with integer values * `string` - a categorical field with string values * `datetime` - contains either python datetime objects, or an ISO-8601 timestamp string * `text` - freeform text * `dict` - dictionaries - at the moment keys are strings and values are numeric * `tensor` - useful for unstructured data, must specify shape, e.g. `{"type": "tensor", "dimensions": [768]}` * `vector` - useful for arrays that can be different in sizes #### Get a reference to an existing version If you already created a version, for example during your training, and you want to use it again, you can receive a reference to the version. ```python apr_model = aporia.Model("", "v1") ``` ## Logging Training / Test Sets To log the training or test sets of your model, you can use the `apr_model.log_training_set` or `apr_model.log_test_set` functions, respectively. For example, if we have the following training set: ```python import pandas as pd training_features = pd.DataFrame({ "Age": [31, 20, 53], "Annual_Premium": [11234, 534534, 859403], "Previously_Insured": [False, True, True], "Vehicle_Age_LT_1_Year": [False, True, False], "Vehicle_Age_GT_2_Years": [True, False, True], "Vehicle_Damage_Yes": [True, False, False], }) training_predictions = pd.DataFrame({ "will_buy_insurance": [True, True, False], }) training_labels = pd.DataFrame({ "will_buy_insurance": [True, False, True], }) ``` Then you can run: ``` apr_model.log_training_set( features=training_features, predictions=training_predictions, labels=training_labels, ) ``` And similarly, you can use the `apr_model.log_test_set` to log your test set. In both functions, you can pass `raw_inputs` to log the raw inputs of your training / test sets. ## Logging Serving Data ### Log Predictions Use the `apr_model.log_prediction` API to log a new prediction. ```python apr_model.log_prediction( id=, features={ "amount": 15.3, "owner": "Joe", "is_new": True, "created_at": datetime.now(), }, predictions={ "will_buy_insurance": True, "proba": 0.55, }, ) ``` Note that for each prediction you must specify an ID. This ID can later be used to log the *actual* value of the prediction. If you don't care about actuals, you can simply pass `str(uuid.uuid4())` as prediction ID. After logging your first prediction you'll be able to get into your model page on the dashboard. To log multiple predictions in one call, check out [Batching](#batching). ### Raw Inputs Raw inputs are the inputs of the model *before* preprocessing, and they're used to construct the features. Logging them is optional but can help you detect issues in your data pipeline. 
**Example: Log raw inputs separately** ```python apr_model.log_raw_inputs( id="a4dfcd4c-356c-4eed-8b93-b129b64fd55c", raw_inputs={ "Age": 27, "Vehicle_Damage": "Yes", "Annual_Premium": 12345, "Vehicle_Age": ">2 years" }, ) ``` **Example: Log raw inputs in `log_prediction`** ```python apr_model.log_prediction( id="a4dfcd4c-356c-4eed-8b93-b129b64fd55c", features={ "Age": 27, "Vehicle_Damage_Yes": True, "Annual_Premium": 12345, "Vehicle_Age_LT_1_Year": False, "Vehicle_Age_GT_2_Years": True, }, predictions={ "will_buy_insurance": True, }, raw_inputs={ "Age": 27, "Vehicle_Damage": "Yes", "Annual_Premium": 12345, "Vehicle_Age": ">2 years" }, ) ``` ### Actuals In some cases, you will have access to the [actual value](https://github.com/aporia-ai/docs2/blob/main/core-concepts/actuals-ground-truth/README.md) of the prediction, based on real-world data. For example, if your model predicted that a client will buy insurance, and a day later the client actually makes a purchase, then the actual value of that prediction is `True` **Example: Log actuals separately** ```python apr_model.log_actuals( id="a4dfcd4c-356c-4eed-8b93-b129b64fd55c", actuals={ "will_buy_insurance": True, }, ) ``` **Example: Log actuals in `log_prediction`** ```python apr_model.log_prediction( id="a4dfcd4c-356c-4eed-8b93-b129b64fd55c", features={ "Age": 27, "Vehicle_Damage_Yes": True, "Annual_Premium": 12345, "Vehicle_Age_LT_1_Year": False, "Vehicle_Age_GT_2_Years": True, }, predictions={ "will_buy_insurance": True, }, actuals={ "will_buy_insurance": True, }, ) ``` ### Batching All of the function above log a single prediction. If you wish to log multiple predictions in one large batch, you can use the `log_batch_*` functions. Each of these functions receive a list of dictionaries, such that each dict contains the parameters of the singular version of the function. 
**Example: Logging batch predictions** ```python apr_model.log_batch_prediction( [ { "id":"a4dfcd4c-356c-4eed-8b93-b129b64fd55c", "features": { "Age": 27, "Vehicle_Damage_Yes": True, "Annual_Premium": 12345, "Vehicle_Age_LT_1_Year": False, "Vehicle_Age_GT_2_Years": True, }, "predictions": { "will_buy_insurance": True, }, }, { "id":"f2d1dccb-1aef-4955-a274-69e1acb8772f", "features": { "Age": 54, "Vehicle_Damage_Yes": False, "Annual_Premium": 54324, "Vehicle_Age_LT_1_Year": True, "Vehicle_Age_GT_2_Years": False, }, "predictions": { "will_buy_insurance": False, }, }, ] ) ``` **Example: Logging batch actuals** ```python apr_model.log_batch_actuals( [ { "id":"a4dfcd4c-356c-4eed-8b93-b129b64fd55c", "actuals": { "will_buy_insurance": True, }, }, { "id":"f2d1dccb-1aef-4955-a274-69e1acb8772f", "actuals": { "will_buy_insurance": False, }, }, ] ) ``` **Example: Logging batch raw inputs** ```python apr_model.log_batch_raw_inputs( [ { "id":"a4dfcd4c-356c-4eed-8b93-b129b64fd55c", "raw_inputs": { "Age": 27, "Vehicle_Damage": "Yes", "Annual_Premium": 12345, "Vehicle_Age": ">2 years" }, }, { "id":"f2d1dccb-1aef-4955-a274-69e1acb8772f", "raw_inputs": { "Age": 54, "Vehicle_Damage": "No", "Annual_Premium": 54324, "Vehicle_Age": "<1 Year" }, }, ] ) ``` ### Logging Pandas DataFrame / Series If the data you wish to log is stored in a Pandas Series or DataFrame (with a single row), you can use the `aporia.pandas` utility API: ```python from aporia.pandas.pandas_utils import pandas_to_dict apr_model.log_prediction( id="a4dfcd4c-356c-4eed-8b93-b129b64fd55c", features=pandas_to_dict(features_dataframe), predictions={ "will_buy_insurance": True, }, ) ``` ### Asynchronous logging All of the logging functions described above log the data asynchronously to avoid blocking your program. If you wish to wait for the data to be sent, you can use the `flush` method: ```python apr_model.flush() ``` ### Troubleshooting By default, the Aporia SDK is very silent: **it doesn't raise exceptions and doesn't write debug logs.** This was done because we never want to interrupt your application! However, when first playing with the Aporia SDK, we highly recommend using the verbose argument, e.g: ```python aporia.init(..., verbose=True) ``` This will print errors in a convenient way to make integration easier to debug. You can also pass `throw_errors=True`, which will make sure you aren't missing any errors. If you have any further issues, please [contact us](mailto:support@aporia.com). **Important:** Make sure to remove `throw_errors=True` before uploading to staging / production! {% hint style="danger" %} **Prediction isn't sent?** If your application exits immediately after logging a prediction, the prediction might get discarded. The reason for this is that predictions are added to a queue and are sent asynchronously. In order to fix this, use the following API: `apr_model.flush()` {% endhint %} ## Pyspark To log a Pyspark DataFrames directly, you can use the: * `apr_model.log_batch_pyspark_prediction` for serving data * `apr_model.log_pyspark_training_set` for training set * `apr_model.log_pyspark_test_set` for test set The API of these functions is similar to the `connect_serving` API (see [Data Sources - Overview](https://docs.aporia.com/v1/data-sources)). 
Example:

```python
import aporia

aporia.init(
    host="<HOST>",
    token="<TOKEN>",
    environment="<ENVIRONMENT>",
    verbose=True,
    raise_errors=True,
)

# Create a new model + model version in Aporia
model_id = aporia.create_model("my-model", "My Model")

apr_model = aporia.create_model_version(
    model_id=model_id,
    model_version="v1",
    model_type="binary",
    features={
        "f1": "numeric",
        "f2": "numeric",
        "f3": "numeric",
    },
    predictions={
        "score": "boolean",
    },
)

# Log training set
# We'll assume that there is a column in the dataframe for each feature / prediction
df_train = spark.sql("SELECT * FROM ...")
apr_model.log_pyspark_training_set(df_train)

# Load & log production data to Aporia
# We'll assume that there is a column in the dataframe for each feature / prediction
df = spark.sql("SELECT * FROM <>")
apr_model.log_batch_pyspark_prediction(
    data=df,

    # Names of the "ID" and "occurred_at" columns
    id_column="id",
    timestamp_column="occurred_at",

    # Map a prediction (from the schema) to a label
    labels={
        "<PREDICTION_NAME>": "<LABEL_COLUMN_NAME>",
    },
)
```

---

# Source: https://docs.aporia.com/monitors-and-alerts/metric-change.md
# Source: https://docs.aporia.com/v1/monitors/metric-change.md

# Metric Change

### Why Monitor Metric Change

Monitoring and measuring changes in features / raw inputs metrics allows for early detection of basic problems or changes in the model's input data. For example - we can monitor and detect a deviation of more than 20% in the average of the feature 'age', compared to the average the monitor was trained with.

### Comparison methods

For this monitor, the following comparison methods are available:

* [Change in percentage](https://docs.aporia.com/v1/monitor-template#comparison-methods)
* [Absolute value](https://docs.aporia.com/v1/monitor-template#comparison-methods)
* [Anomaly detection](https://docs.aporia.com/v1/monitor-template#comparison-methods)
* [Compared to segment](https://docs.aporia.com/v1/monitor-template#comparison-methods)
* [Compared to training](https://docs.aporia.com/v1/monitor-template#comparison-methods)

### Customizing your monitor

Configuration may slightly vary depending on the comparison method you choose.

#### STEP 1: choose the metrics you would like to monitor

You may select as many fields as you want (from raw inputs / features) 😊 The monitor will run on each selected field separately.

Our metric change monitor supports the following metrics:

* Missing count
* Average
* Minimum
* Maximum
* Sum
* Variance
* STD

#### STEP 2: choose inspection period and baseline

For the fields you chose in the previous step, the monitor will raise an alert if the comparison between the inspection period and the baseline leads to a conclusion outside your threshold boundaries.

#### STEP 3: calibrate thresholds

This step is important to make sure you have the right amount of alerts that fits your needs. For the anomaly detection method, use the monitor preview to help you decide what is the appropriate sensitivity level.

---

# Source: https://docs.aporia.com/api-reference/metrics-glossary.md
# Source: https://docs.aporia.com/v1/api-reference/metrics-glossary.md

# Metrics Glossary

Here you can find information about all the performance metrics supported by Aporia.

Can't find what you are looking for? :hushed: We are constantly expanding our metrics support, but in the meantime you can always define your own custom metric :raised\_hands:.
## Statistic metrics ### Missing Count This metric counts the amount of records that didn't report a specific field while logging the data.\ It can be useful for surfacing data pipeline or infrastructure problems that may affect your model. ### Average This metric calculates the average value of the given data. It can be applied on any numeric field. ### Minimum This metric finds the minimal value out of the given data. It can be applied on ant numeric field. ### Maximum This metric finds the maximal value out of the given data. It can be applied on ant numeric field. ### Sum This metric calculates the sum of all values of the given data. It can be applied on any numeric field. ## Performance metrics ### Variance Variance is the [expectation](https://en.wikipedia.org/wiki/Expected_value) of the squared [deviation](https://en.wikipedia.org/wiki/Deviation_\(statistics\)) of a [random variable](https://en.wikipedia.org/wiki/Random_variable) from its [sample mean](https://en.wikipedia.org/wiki/Sample_mean). For sample variables, it is calculated using the following formula: $$ Var(x) = \frac{\sum{(x\_i-\mu)^2}}{n-1} $$ ### Standard Deviation (STD) The standard deviation is a statistical metric that measures the amount of variation or [dispersion](https://en.wikipedia.org/wiki/Statistical_dispersion) of a set of values. STD is calculated using the following formula: $$ \sigma = \sqrt{\frac{\sum{(x\_i-\mu)^2}}{N}} $$ ### Mean Squared Error (MSE) Mean squared error is an estimator which measures the average squared difference between the estimated value and the actual value.\ MSE is calculated using the following formula: $$ MSE = \frac{1}{n}\sum\_{i=1}^{n}(y\_i-x\_i)^2 $$ ### Root Mean Squared Error (RMSE) Root mean squared error is the root of MSE.\ RMSE is calculated using the following formula: $$ RMSE = \sqrt{\sum\_{i=1}^n\frac{(y\_i - x\_i)^2}{n}} $$ ### Mean Absolute Error (MAE) Mean absolute error is an estimator which measures the average absolute difference between the estimated value and the actual value.\ MAE is calculated using the following formula: $$ MAE = \frac{\sum\_{i-1}^{n} |y\_i - x\_i|}{n} $$ ### Confusion matrix ![](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FypnRoY57h1NST9zLS43V%2Fimage.png?alt=media\&token=f6a12cf7-ce8c-4c71-aa1c-377aedd76bbb) #### True Positive Count (TP) This metric measures the amount of correctly predicted to be positive for a specific characteristic. It is commonly used in classification problems. #### True Negative Count (TN) This metric measures the amount of correctly predicted to be negative for a specific characteristic. It is commonly used in classification problems. #### False Positive Count (FP) This metric measures the amount of incorrectly predicted to be positive for a specific characteristic. It is commonly used in classification problems. #### False Negative Count (FN) This metric measures the amount of incorrectly predicted to be negative for a specific characteristic. It is commonly used in classification problems. ### Precision This metric measures the percentage of our correctly predicted positive for a specific class, out of all of the positive predictions. The higher score we get, the more concise our classification is. Precision is useful to measure when the cost of a False Positive is high. For example, let's say that your model predicts whether an email is spam (positive) or not (negative). 
The cost of classifying an email as spam when it's not (FP) is high, so we would like to monitor that our model's precision score remains high to avoid bad business impact.

Precision is calculated using the following formula:

$$ Precision = \frac{TP}{TP + FP} $$

### Recall

This metric measures the percentage of correctly predicted positives for a specific class, out of all the actual positives. The higher the score, the fewer positives we missed.

Recall is useful to measure when the cost of a False Negative is high. For example, let's say that your model predicts whether a certain seller is fraudulent (positive) or not (negative). The cost of missing a fraudulent seller (FN) is high, so we would like to monitor that our model's recall score remains high to avoid bad business impact.

Recall is calculated using the following formula:

$$ Recall = \frac{TP}{TP + FN} $$

### Accuracy

This metric measures the percentage of correct predictions out of all predictions. The higher the score, the "closer to reality" our classifications are.

Accuracy is useful when we have a balanced class distribution and we want to give more weight to the business value of the TP and TN.

Accuracy is calculated using the following formula:

$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$

### F1

This metric balances the precision and recall metrics. It fits when we have an uneven class distribution and we want to give more weight to the business cost of the FP and FN.

F1 is calculated using the following formula:

$$ F1 = 2\cdot\frac{Precision\times Recall}{Precision + Recall} $$

### Normalized Discounted Cumulative Gain (nDCG)

This metric measures the quality of ranking. The DCG metric assumes two things: first, that a highly relevant object produces more gain when it appears at a higher rank; second, that at the same rank, an object with higher relevance produces more gain.

DCG is calculated using the following formula:

$$ DCG\_p = \sum\_{i=1}^{p}\frac{2^{rel\_i}-1}{log\_2(i+1)} $$

where rel\_i is the relevance of the object at position i.

The normalized version of the metric (nDCG) gives you the ability to compare between two rankings of different lengths. nDCG is calculated using the following formula:
$$ nDCG = \frac{DCG\_p}{IDCG\_p} $$

where IDCG is the ideal DCG, calculated by:

$$ IDCG\_p = \sum\_{i=1}^{|REL\_p|}\frac{2^{rel\_i}-1}{log\_2(i+1)} $$

---

# Source: https://docs.aporia.com/monitors-and-alerts/missing-values.md

# Source: https://docs.aporia.com/v1/monitors/missing-values.md

# Missing Values

### Why Monitor Missing Values?

In real world data, there are often cases where a particular data element is missing. It is important to monitor the changes in missing values in order to spot and handle cases the model has not been trained to deal with.

Causes of missing values include:

* Serving environment fault
* Data store / provider schema changes
* Changes in internal API
* Changes in model subject input

### Comparison methods

For this monitor, the following comparison methods are available:

* [Change in percentage](https://docs.aporia.com/v1/monitor-template#comparison-methods)
* [Absolute value](https://docs.aporia.com/v1/monitor-template#comparison-methods)
* [Anomaly detection](https://docs.aporia.com/v1/monitor-template#comparison-methods)
* [Compared to segment](https://docs.aporia.com/v1/monitor-template#comparison-methods)

### Customizing your monitor

Configuration may slightly vary depending on the comparison method you choose.

#### STEP 1: choose the fields you would like to monitor

You may select as many fields as you want (from features/raw inputs) 😊 Note that the monitor will run on each selected field separately.

#### STEP 2: choose inspection period and baseline

For the fields you chose in the previous step, the monitor will raise an alert if the comparison between the inspection period and the baseline leads to a conclusion outside your threshold boundaries.

#### STEP 3: calibrate thresholds

This step is important to make sure you have the right amount of alerts that fits your needs. For the anomaly detection method, use the monitor preview to help you decide what is the appropriate sensitivity level.

---

# Source: https://docs.aporia.com/monitors-and-alerts/model-activity.md

# Source: https://docs.aporia.com/v1/monitors/model-activity.md

# Model Activity

### Why Monitor Model Activity?

In many cases, the number of model predictions is within a predictable range. Identifying deviations from the range can indicate underlying problems, anomalous events, or an ongoing trend that is worth noting.

Causes of change in the number of predictions include:

* Natural increase in model invocations
* Serving environment fault
* Malicious attempt to analyze model behavior

### Comparison methods

For this monitor, the following comparison methods are available:

* [Change in percentage](https://docs.aporia.com/v1/monitor-template#comparison-methods)
* [Absolute value](https://docs.aporia.com/v1/monitor-template#comparison-methods)
* [Anomaly detection](https://docs.aporia.com/v1/monitor-template#comparison-methods)

### Customizing your monitor

Configuration may slightly vary depending on the comparison method you choose.

#### STEP 1: choose the predictions you would like to monitor

You may select as many prediction fields as you want 😊 Note that the monitor will run on each selected field separately.

#### STEP 2: choose inspection period and baseline

For the fields you chose in the previous step, the monitor will raise an alert if the amount of predictions in the inspection period exceeds your threshold boundaries compared to the baseline's amount of predictions.
#### STEP 3: calibrate thresholds This step is important to make sure you have the right amount of alerts that fits your needs. For anomaly detection method, use the monitor preview to help you decide what is the appropriate sensitivity level. --- # Source: https://docs.aporia.com/monitors-and-alerts/model-staleness.md # Source: https://docs.aporia.com/v1/monitors/model-staleness.md # Model Staleness ### Why Monitor Model Staleness? Monitoring the last time a model version was deployed helps track models that do not meet the organization's policy, or require high attention to track metrics and changes. ### Configuring your monitor The monitor will raise an alert when the model version is older than the specified time period. You can choose time granularity to be hour, day, week or month. --- # Source: https://docs.aporia.com/core-concepts/model-versions.md # Source: https://docs.aporia.com/v1/core-concepts/model-versions.md # Models & Versions ### Model In Aporia, a `model` is any system that can make predictions and can be improved through the use of data. We use this broad definition in order to support a large number of use cases. Some examples of a model include: * a simple Pytorch model * an ensemble of 15 XGBoost models, 37 LightGBM models, and a few deterministic algorithms * or even an evolutionary algorithm Aporia models usually serve specific business use cases: Fraud Detection, Credit Risk, Patient Diagnosis, Churn Prediction, LTV, etc. ### Model Version Each `model` in Aporia can have different `version`. When you (re)train your `model` or update a model's `schema` you should create a new model version in Aporia (via the **Versions page** or **SDK**). When creating a new model version in Aporia, you'll be able to specify the model version's `schema` - a definition of the inputs and outputs of the model. --- # Source: https://docs.aporia.com/monitors-and-alerts/monitor-overview.md # Overview By now, you probably understand why monitoring your model is essential to keeping it healthy and up-to-date in production. In the following section, you will learn how to setup relevant monitors for your model and customize them for your needs. If this is your first time creating a monitor in Aporia, feel free to quickly go over the following basic monitoring concepts. ### Monitor types In general, monitors can be divided into four sections of interest: * **Integrity** - credible data is basic to maintaining a successful model. Monitoring the appearance of new values, amount of missing ones and that all values are within a reasonable range can help you assure that and detect problems early. * **Performance** - depending on your use-case and KPIs, you can use different performance metric to assess how productive your model is and decide when it's best to retrain it. * **Drift** - drift of features or predictions can result in model performance degradation. Monitoring them both is useful to notice such trends early and take the proper action before it affects your business. * **Activity** - it's great to know that after all your hard work your model is out there making real world decisions. Monitoring your activity can help you reflect that to others and notice any surprising changes in volume that needs further investigation ### Comparison methods Aporia provides you with several comparison methods: * **Absolute values** - thresholds or boundaries are defined by specific predefined values. The inspection data is a serving data segment of your choice. 
* **Change in percentage** - thresholds or boundaries are defined by a change in percentage compared to baseline. Both baseline and inspection data are of the same serving data segment. * **Anomaly detection** - detects anomalies in pattern compared to the baseline. Both baseline and inspection data are of the same serving data segment. * **Compared to segment** - thresholds or boundaries are defined by a change in percentage compared to baseline. Inspection data and baseline data can be of deferent serving data segments. * **Compared to training** - thresholds or boundaries are defined by a change in percentage compared to baseline. Baseline data includes all the training data reported, filtered by the same data segment as the inspection data's. ### It's time to create your own monitor! 🎬 --- # Source: https://docs.aporia.com/v1/monitors/monitor-template.md # Overview By now, you probably understand why monitoring your model is essential to keeping it healthy and up-to-date in production. In the following section, you will learn how to setup relevant monitors for your model and customize them for your needs. If this is your first time creating a monitor in Aporia, feel free to quickly go over the following basic monitoring concepts. ### Monitor types In general, monitors can be divided into four sections of interest: * **Integrity** - credible data is basic to maintaining a successful model. Monitoring the appearance of new values, amount of missing ones and that all values are within a reasonable range can help you assure that and detect problems early. * **Performance** - depending on your use-case and KPIs, you can use different performance metric to assess how productive your model is and decide when it's best to retrain it. * **Drift** - drift of features or predictions can result in model performance degradation. Monitoring them both is useful to notice such trends early and take the proper action before it affects your business. * **Activity** - it's great to know that after all your hard work your model is out there making real world decisions. Monitoring your activity can help you reflect that to others and notice any surprising changes in volume that needs further investigation ### Comparison methods Aporia provides you with several comparison methods: * **Absolute values** - thresholds or boundaries are defined by specific absolute values. The inspection data is a serving data segment of your choice. * **Change in percentage** - thresholds or boundaries are defined by a change in percentage compared to baseline. Both baseline and inspection data are of the same serving data segment. * **Anomaly detection** - detects anomalies in pattern compared to the baseline. Both baseline and inspection data are of the same serving data segment. * **Compared to segment** - thresholds or boundaries are defined by a change in percentage compared to baseline. Inspection data and baseline data can be of deferent serving data segments. * **Compared to training** - thresholds or boundaries are defined by a change in percentage compared to baseline. Baseline data includes all the training data reported, filtered by the same data segment as the inspection data's. ### It's time to create your own monitor! 🎬 --- # Source: https://docs.aporia.com/ml-monitoring-as-code/monitors.md # Monitors This guide will show you how to automatically add monitors to your models from code using the Python SDK. 
For more information on the various types of monitors in Aporia, [please refer to the documentation on monitors](https://docs.aporia.com/monitors-and-alerts).

## Defining monitors

To define a new monitor, create an `aporia.Monitor(...)` object.

### Step 1: Choose monitor type and detection method

When designing a new monitor, the first decisions you have to make are:

* **What's the monitor type you'd like to create?**
  * Examples: Data drift, Missing values, Performance degradation, etc.
  * This step is similar to the corresponding step in the UI.
* **What's the detection method you'd like to use?**
  * Examples: Anomaly Detection over Time, Change in Percentage, Absolute values, Compared to Training, Compared to Segment, etc.
  * This step is similar to the corresponding step in the UI.
To begin, create a monitor object with your chosen `monitor_type` and `detection_method`, and add it to your model. See [#detection-methods-overview](#detection-methods-overview "mention") and [#supported-monitor-types-detection-methods](#supported-monitor-types-detection-methods "mention") for an overview of the different monitor types and their supported detection methods.
```python
import aporia.as_code as aporia

stack = aporia.Stack(...)

data_drift = aporia.Monitor(
    "Data Drift - Last week compared to Training",
    monitor_type=aporia.MonitorType.DATA_DRIFT,
    detection_method=aporia.DetectionMethod.COMPARED_TO_TRAINING,
    ...
)

my_model = aporia.Model(
    "My Model",
    type=aporia.ModelType.RANKING,
    icon=aporia.ModelIcon.RECOMMENDATIONS,
    ...,

    monitors=[data_drift],
)

stack.add(my_model)
```
### Step 2: Choose focal and baseline datasets

The next step is to choose the dataset on which your monitor will be evaluated - this is called the **focal** dataset. In most detection methods, you'll also need to provide a **baseline** dataset.

For example, if you want to create a data drift monitor to compare the distribution of a feature from the last week to the training set, then the focal will be "last week in serving", and the baseline will be "training set".
```python
data_drift = aporia.Monitor(
    "Data Drift - Last week compared to Training",
    monitor_type=aporia.MonitorType.DATA_DRIFT,
    detection_method=aporia.DetectionMethod.COMPARED_TO_TRAINING,
    focal=aporia.FocalConfiguration(
        # Last week in serving
        timePeriod=aporia.TimePeriod(count=1, type=aporia.PeriodType.WEEKS)
    ),
    baseline=aporia.BaselineConfiguration(
        # Training dataset
        source=aporia.SourceType.TRAINING
    ),
    ...
)
```
A baseline is required for any monitor that has a "Compared to" field, and more generally for any detection method other than `DetectionMethod.ABSOLUTE`.
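Conversely, a monitor that uses `DetectionMethod.ABSOLUTE` only needs a focal dataset and fixed thresholds. Here's a minimal sketch using the missing-values parameters listed in the table further below (the feature names here are illustrative):

```python
# No baseline is needed when detection_method is ABSOLUTE
missing_values_absolute = aporia.Monitor(
    "Missing Values - Absolute Threshold",
    monitor_type=aporia.MonitorType.MISSING_VALUES,
    detection_method=aporia.DetectionMethod.ABSOLUTE,
    focal=aporia.FocalConfiguration(
        # Last day in serving
        timePeriod=aporia.TimePeriod(count=1, type=aporia.PeriodType.DAYS)
    ),
    # Alert if more than 20% of the values of these features are missing
    max=20,
    features=["age", "gender"],  # illustrative feature names
    ...
)
```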
Here's an example for focal / baseline in an anomaly detection over time monitor:
```python
activity_anomaly_detection = aporia.Monitor(
    "Activity Anomaly Detection",
    monitor_type=aporia.MonitorType.MODEL_ACTIVITY,
    detection_method=aporia.DetectionMethod.ANOMALY,
    focal=aporia.FocalConfiguration(
        # Last day
        timePeriod=aporia.TimePeriod(count=1, type=aporia.PeriodType.DAYS)
    ),
    baseline=aporia.BaselineConfiguration(
        # Last 2 weeks *before* the last day
        source=aporia.SourceType.SERVING,
        timePeriod=aporia.TimePeriod(count=2, type=aporia.PeriodType.WEEKS),
        skipPeriod=aporia.TimePeriod(count=1, type=aporia.PeriodType.DAYS)
    ),
    ...
)
```
### Step 3: Configure monitor

Next, it is time to configure some important parameters for the monitor. For example:
```python
activity_anomaly_detection = aporia.Monitor(
    "Activity Anomaly Detection",
    monitor_type=aporia.MonitorType.MODEL_ACTIVITY,
    detection_method=aporia.DetectionMethod.ANOMALY,
    focal=aporia.FocalConfiguration(
        # Last day
        timePeriod=aporia.TimePeriod(count=1, type=aporia.PeriodType.DAYS)
    ),
    baseline=aporia.BaselineConfiguration(
        # Last 2 weeks *before* the last day
        source=aporia.SourceType.SERVING,
        timePeriod=aporia.TimePeriod(count=2, type=aporia.PeriodType.WEEKS),
        skipPeriod=aporia.TimePeriod(count=1, type=aporia.PeriodType.DAYS)
    ),
    sensitivity=0.3,
    ...
)
```
The following table describes the required parameters for each monitor type:

| Monitor | Detection Method | Required Parameters |
| --- | --- | --- |
| Model Activity | Anomaly Detection | `sensitivity` (0-1), `baseline` |
| Model Activity | Change in Percentage | `percentage` (0-100), `baseline` |
| Model Activity | Absolute values | `min` (optional), `max` (optional) |
| Data Drift | Anomaly Detection | `thresholds` (aporia.ThresholdConfiguration), `features` (list[str]), `raw_inputs` (list[str]), `baseline` |
| Data Drift | Compared to Segment | `thresholds` (aporia.ThresholdConfiguration), `features` (list[str]), `raw_inputs` (list[str]), `baseline` (on segment) |
| Data Drift | Compared to Training | `thresholds` (aporia.ThresholdConfiguration), `features` (list[str]), `raw_inputs` (list[str]), `baseline` (on training) |
| Prediction Drift | Anomaly Detection | `thresholds` (aporia.ThresholdConfiguration), `predictions` (list[str]), `baseline` |
| Prediction Drift | Compared to Segment | `thresholds` (aporia.ThresholdConfiguration), `predictions` (list[str]), `baseline` (on segment) |
| Prediction Drift | Compared to Training | `thresholds` (aporia.ThresholdConfiguration), `predictions` (list[str]), `baseline` (on training) |
| Missing Values | Anomaly Detection | `sensitivity` (0-1), `min` (0-100, optional), `raw_inputs` (list[str]), `features` (list[str]), `predictions` (list[str]), `baseline`, `testOnlyIncrease` (optional) |
| Missing Values | Change in Percentage | `percentage`, `min` (0-100, optional), `raw_inputs` (list[str]), `features` (list[str]), `predictions` (list[str]), `baseline` |
| Missing Values | Absolute values | `min` (0-100, optional), `max` (0-100, optional), `raw_inputs` (list[str]), `features` (list[str]), `predictions` (list[str]) |
| Missing Values | Compared to Segment | `percentage`, `min`, `raw_inputs` (list[str]), `features` (list[str]), `predictions` (list[str]), `baseline` (on segment) |
| Model Staleness | Absolute | `staleness_period` (aporia.TimePeriod) |
| New Values | Percentage | `new_values_count_threshold` (optional), `new_values_ratio_threshold` (optional), `baseline` |
| New Values | Compared to Segment | `new_values_count_threshold` (optional), `new_values_ratio_threshold` (optional), `baseline` (on segment) |
| New Values | Compared to Training | `new_values_count_threshold` (optional), `new_values_ratio_threshold` (optional), `baseline` (on training) |
| Values Range | Percentage | `distance`, `baseline` |
| Values Range | Absolute | `min` (optional), `max` (optional) |
| Values Range | Compared to Segment | `distance`, `baseline` (on segment) |
| Values Range | Compared to Training | `distance`, `baseline` (on training) |
| Performance Degradation | Anomaly Detection | `metric`, `sensitivity`, `baseline`, metric-specific parameters |
| Performance Degradation | Absolute | `metric`, `min` (optional), `max` (optional), metric-specific parameters |
| Performance Degradation | Percentage | `metric`, `percentage`, `baseline`, metric-specific parameters |
| Performance Degradation | Compared to Segment | `metric`, `percentage`, `baseline` (on segment), metric-specific parameters |
| Performance Degradation | Compared to Training | `metric`, `percentage`, `baseline` (on training), metric-specific parameters |
| Metric Change | Anomaly Detection | `metric`, `sensitivity`, `baseline`, metric-specific parameters |
| Metric Change | Absolute | `metric`, `min` (optional), `max` (optional), metric-specific parameters |
| Metric Change | Percentage | `metric`, `percentage`, `baseline`, metric-specific parameters |
| Metric Change | Compared to Segment | `metric`, `percentage`, `baseline` (on segment), metric-specific parameters |
| Metric Change | Compared to Training | `metric`, `percentage`, `baseline` (on training), metric-specific parameters |
| Custom Metric | Anomaly Detection | `custom_metric`/`custom_metric_id`, `sensitivity`, `baseline` |
| Custom Metric | Absolute | `custom_metric`/`custom_metric_id`, `min` (optional), `max` (optional), `baseline` |
| Custom Metric | Percentage | `custom_metric`/`custom_metric_id`, `percentage`, `baseline` |
#### Metric-specific parameters

* `k`: Used for ranking metrics, such as NDCG, MRR, MAP, etc.
* `prediction_threshold`: Used for binary confusion matrix metrics, such as accuracy, tp\_count, recall, etc. Used with a numeric prediction (0-1) and a boolean actual.
* `prediction_class`: The class for which to calculate per-class metrics, such as accuracy-per-class.
* `average_method`: Used for precision/recall/f1\_score on multiclass predictions. Values are of the `aporia.AverageMethod` enum (`MICRO`/`MACRO`/`WEIGHTED`).

### Step 4: Configure monitor action (e.g. alert)

Finally, every monitor requires a `severity` parameter to describe the severity of the alerts generated by this monitor (low / medium / high). You can also add an `emails` parameter to receive alerts by email, or the `messaging` parameter for integration with Webhooks, Datadog, Slack, Teams, etc.
```python
activity_anomaly_detection = aporia.Monitor(
    "Activity Anomaly Detection",
    monitor_type=aporia.MonitorType.MODEL_ACTIVITY,
    detection_method=aporia.DetectionMethod.ANOMALY,
    focal=aporia.FocalConfiguration(
        # Last day
        timePeriod=aporia.TimePeriod(count=1, type=aporia.PeriodType.DAYS)
    ),
    baseline=aporia.BaselineConfiguration(
        # Last 2 weeks *before* the last day
        source=aporia.SourceType.SERVING,
        timePeriod=aporia.TimePeriod(count=2, type=aporia.PeriodType.WEEKS),
        skipPeriod=aporia.TimePeriod(count=1, type=aporia.PeriodType.DAYS)
    ),
    sensitivity=0.3,
    severity=aporia.Severity.MEDIUM,
    emails=[<EMAIL_LIST>],
    messaging={"WEBHOOK": [WEBHOOK_INTEGRATION_ID], "SLACK": [SLACK_INTEGRATION_ID]}
)
```
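Once configured, the monitor is wired up like any other: attach it to a model, add the model to the stack, and apply. A short recap sketch, reusing the model definition from Step 1:

```python
# Attach the configured monitor to a model and deploy it with the stack
my_model = aporia.Model(
    "My Model",
    type=aporia.ModelType.RANKING,
    icon=aporia.ModelIcon.RECOMMENDATIONS,
    ...,
    monitors=[activity_anomaly_detection],
)

stack.add(my_model)
stack.apply(yes=True, rollback=False, config_path="config.json")
```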
## Detection Methods Overview
| Detection Method | Enum value | Description | Example |
| --- | --- | --- | --- |
| Anomaly Detection over Time | `DetectionMethod.ANOMALY` | Trains an anomaly detection model and raises an alert if there is an anomaly in the metric value with respect to a certain baseline | Missing value ratio of last week compared to the week before |
| Change in Percentage | `DetectionMethod.PERCENTAGE` | Detects a change in percentage in the metric value | Standard deviation changed by >20% |
| Absolute values | `DetectionMethod.ABSOLUTE` | Raises an alert when the metric value is larger or smaller than a certain value | Accuracy is lower than 0.9 |
| Compared to Segment | `DetectionMethod.COMPARED_TO_SEGMENT` | Detects changes in metric value between two data segments | Data drift between gender=male and gender=female |
| Compared to Training | `DetectionMethod.COMPARED_TO_TRAINING` | Detects changes in metric value compared to the training set | Prediction drift of last month in serving compared to training |
## Supported Monitor Types / Detection Methods

The following table describes the various monitor types and their supported detection methods:

| Monitor Type | Enum value | Supported detection methods |
| --- | --- | --- |
| Model Activity | `MonitorType.MODEL_ACTIVITY` | `DetectionMethod.ANOMALY`, `DetectionMethod.PERCENTAGE`, `DetectionMethod.ABSOLUTE` |
| Data Drift | `MonitorType.DATA_DRIFT` | `DetectionMethod.ANOMALY`, `DetectionMethod.COMPARED_TO_SEGMENT`, `DetectionMethod.COMPARED_TO_TRAINING` |
| Prediction Drift | `MonitorType.PREDICTION_DRIFT` | `DetectionMethod.ANOMALY`, `DetectionMethod.COMPARED_TO_SEGMENT`, `DetectionMethod.COMPARED_TO_TRAINING` |
| Missing Values | `MonitorType.MISSING_VALUES` | `DetectionMethod.ANOMALY`, `DetectionMethod.PERCENTAGE`, `DetectionMethod.ABSOLUTE`, `DetectionMethod.COMPARED_TO_SEGMENT` |
| Performance Degradation | `MonitorType.PERFORMANCE_DEGRADATION` | `DetectionMethod.ANOMALY`, `DetectionMethod.PERCENTAGE`, `DetectionMethod.ABSOLUTE`, `DetectionMethod.COMPARED_TO_SEGMENT`, `DetectionMethod.COMPARED_TO_TRAINING` |
| Metric Change | `MonitorType.METRIC_CHANGE` | `DetectionMethod.ANOMALY`, `DetectionMethod.PERCENTAGE`, `DetectionMethod.ABSOLUTE`, `DetectionMethod.COMPARED_TO_SEGMENT`, `DetectionMethod.COMPARED_TO_TRAINING` |
| Custom Metric | `MonitorType.CUSTOM_METRIC_MONITOR` | `DetectionMethod.ANOMALY`, `DetectionMethod.PERCENTAGE`, `DetectionMethod.ABSOLUTE` |
| Model Staleness | `MonitorType.MODEL_STALENESS` | `DetectionMethod.ABSOLUTE` |
| Value Range | `MonitorType.VALUE_RANGE` | `DetectionMethod.PERCENTAGE`, `DetectionMethod.ABSOLUTE`, `DetectionMethod.COMPARED_TO_SEGMENT`, `DetectionMethod.COMPARED_TO_TRAINING` |
| New Values | `MonitorType.NEW_VALUES` | `DetectionMethod.PERCENTAGE`, `DetectionMethod.COMPARED_TO_SEGMENT`, `DetectionMethod.COMPARED_TO_TRAINING` |
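Putting the two tables together, here's a hedged sketch of one of the simplest combinations — a Model Staleness monitor with absolute detection, which per the parameter table only takes a `staleness_period` (any additional parameters your SDK version may require are not covered here):

```python
# Alert when the deployed model version is older than ~4 weeks
model_staleness = aporia.Monitor(
    "Model Staleness - retrain at least monthly",
    monitor_type=aporia.MonitorType.MODEL_STALENESS,
    detection_method=aporia.DetectionMethod.ABSOLUTE,
    staleness_period=aporia.TimePeriod(count=4, type=aporia.PeriodType.WEEKS),
    severity=aporia.Severity.LOW,
)
```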
---

# Source: https://docs.aporia.com/data-sources/mssql.md

# Microsoft SQL Server

This guide describes how to connect Aporia to an MSSQL data source in order to monitor your ML Model in production. We will assume that your model inputs, outputs and optionally delayed actuals can be queried with SQL. This data source may also be used to connect to your model's training set to be used as a baseline for model monitoring.

### Access MSSQL using user/password authentication

In order to provide access to MSSQL, create a read-only user for Aporia in MSSQL. Please use the SQL snippet below to create the user for Aporia. Before using the snippet, you will need to populate the following:

* ``: The name of the user you want to create
* ``: Strong password to be used by the user

```sql
CREATE USER WITH PASSWORD '';
ALTER ROLE db_datareader ADD MEMBER ;
```

### Access MSSQL using Azure AD authentication

{% hint style="info" %}
This authentication method is currently supported for databricks deployments only. Need it for another deployment type? Let us know!
{% endhint %}

#### Step 1: Create a new application for Aporia access in your Azure Active Directory

1. Go to the Azure Active Directory portal and log in
2. Click on **+ Add** and choose **App registration**
3. Insert a display name for the Aporia app and click on **Register** 4. Create a new secret for the newly created application

Click on "Add a certificate or secret"

Click on "+ New client secret"

Save the newly created secret for later

#### Step 2: Create corresponding secrets in your databricks account In order to enable authentication using Azure AD, create the following secrets in the same databricks account where Aporia is deployed: * `aporia-client-secret` - The application secret value you created in the previous step * `aporia-client-id` - Client ID of the application created in the previous step * `aporia-tenant-id` - Tenant ID of the application created in the previous step

Client ID & Tenant ID can be found in the application page

#### Step 3: Create a read-only user for MSSQL access In order to provide access to MSSQL, create a read-only user for Aporia in MSSQL. Please use the SQL snippet below to create the user for Aporia. Before using the snippet, you will need to populate the following: * ``: The name of the application you have created in the previous step ```sql CREATE USER FROM EXTERNAL PROVIDER; ALTER ROLE db_datareader ADD MEMBER ; ``` {% hint style="info" %} Make sure that the Aporia data plane IP can access your Microsoft SQL Server {% endhint %} ### Create a MSSQL data source in Aporia 1. Go to [Aporia platform](https://platform.aporia.com/) and login to your account. 2. Go to **Integrations** page and click on the **Data Connectors** tab 3. Scroll to **Connect New Data Source** section 4. Click **Connect** on the MSSQL card and follow the instructions 1. Note that the provided URL should be in the following format `jdbc:mssqlsql://`. Bravo! :clap: now you can use the data source you've created across all your models in Aporia. --- # Source: https://docs.aporia.com/model-types/multi-label-classification.md # Source: https://docs.aporia.com/v1/model-types/multi-label-classification.md # Multi-Label Classification Multi-label classification models predict multiple outcomes. In Aporia, these models are represented with the `multi-label` model type. Examples of multi-label classification problems: * Is this song sad, happy, funny, rock, jazz, or all simultaneously? * Does this movie belong to one or more of the 'romantic', 'comedy', 'documentary', 'thriller' categories, or all simultaneously? ### Integration To monitor a multi-label model, create a new model version with a `dict` field where keys are different labels and values are the probabilities for each label: ```python apr_model = aporia.create_model_version( model_id="", model_version="v1", model_type="multi-label" features={ ... }, predictions={ "genres": "dict" }, ) ``` Next, connect to a data source or manually log predictions like so: ```python apr_model.log_prediction( id="", features={ ... }, predictions={ "genres": { "action": 0.8, "horror": 0.7, "thriller": 0.9, "drama": 0.2, ... } }, ) ``` If you don't have probabilities for each label, you can log zeros and ones instead. To log actuals for this prediction: ```python apr_model.log_actuals( id="", actuals={ "genres": { "action": 1.0, "horror": 1.0, "thriller": 1.0, "drama": 0.0, ... } }, ) ``` You can also log multiple `dict` fields if you have a multi-multi-label model :) --- # Source: https://docs.aporia.com/model-types/multiclass-classification.md # Source: https://docs.aporia.com/v1/model-types/multiclass-classification.md # Multiclass Classification Multiclass classification models predict one of more than two outcomes. In Aporia, these models are represented with the `multiclass` model type. Examples of multiclass classification problems: * Is this product a book, movie, or clothing? * Is this movie a romantic comedy, documentary, or thriller? * Which category of products is most interesting to this customer? Frequently, multiclass models output a confidence value or a score for each class. ### Integration To monitor a multiclass model, create a new model version with a `string` field representing the predicted class, and optionally a `dict` field with the probabilities for all classes: ```python apr_model = aporia.create_model_version( model_id="", model_version="v1", model_type="multiclass" features={ ... 
}, predictions={ "product_type": "string", "proba": "dict" }, ) ``` Next, connect to a data source or manually log predictions like so: ```python apr_model.log_prediction( id="", features={ ... }, predictions={ "product_type": "book", "proba": { "book": 0.8, "movie": 0.1, "clothing": 0.1 } }, ) ``` To log actuals for this prediction: ```python apr_model.log_actuals( id="", actuals={ "product_type": "book", "proba": { "book": 1.0, "movie": 0.0, "clothing": 0.0, }, }, ) ``` If you don't need to monitor probabilities, you may omit the `proba` field. --- # Source: https://docs.aporia.com/v1/integrations/new-relic.md # New Relic Aporia allows you to connect alerts generated from Aporia’s monitors to New Relic’s Incident Intelligence engine and the predictions data in order to create a comprehensive monitoring dashboard in New Relic for your models in production. ### Integrate New Relic with Aporia 1. Log into Aporia’s console. On the navbar on the left, click on **Integrations** and choose **New Relic**.\
2. Log into your New Relic account, and click on **+ Add more data**.
3. In the search bar type **Aporia** (or scroll down to the **MLOps Integrations** section) and click on the **Aporia** icon.
1. Under **Prediction data**, click the **Select or create API key** to create a new API key or use an existing one.
1. After creating the token, click on the copy symbol to copy the token. 2. Then go back to the Aporia dashboard and paste the token under **New Relic Insert Token**.
3. Return to the New Relic dashboard. Copy the account ID.
4. Go back to the Aporia dashboard and paste it under **New Relic Account ID**. ![New Relic Integration](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2F5nKHUu0U4ExgKtLmQJKn%2Fnr_09.png?alt=media) 5. In the Aporia dashboard, click on the **Verify Tokens** button to verify both tokens are working correctly. Green check marks or red error marks should appear to indicate the status. ![New Relic Integration](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FTwWR6b1cdzfKe12sXgYr%2Fnr_10.png?alt=media) 6. Once everything is set, click on the **Save** button. 7. Return to the New Relic dashboard and click on the **See your data** button. This will redirect you to a dashboard displaying data reported to Aporia in New Relic. ![New Relic Integration](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FLMtK9WSdDQFfVlfWMiXw%2Fnr_11.png?alt=media) **Congratulations: You’ve now successfully integrated Aporia with New Relic!** #### Easy data filtering – Monitoring Platform for Machine Learning Models You can make it easy to filter the data, on both the **Most Active Models** chart and the **Most Active Model Versions** chart by making the adjustments shown below: 1. Click on the **…** symbol and click **edit**. ![New Relic Integration](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2Fl5j3OMKy5yJ6tRw4qBmt%2Fnr_12.png?alt=media) 2. On the right navbar, under **User as filter** activate **Filter the current dashboard** and click **Save**.![New Relic Integration](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FQYkBzcIBRCk9KqlVkbGz%2Fnr_13.png?alt=media) ### Additional ML Graphs Additional graphs display statistics over the predictions reported: 1. The **Model Inferences** graph displays the number of unique inferences reported for each model and version. 2. The **Average Numeric Inferences** graph displays the average value numeric inferences reported for each model and version. 3. The **Numeric Inferences Heatmaps** graph displays heatmaps of the numeric inferences values reported for each model and version. 4. The **Categorical Inferences** graph displays the different unique values and their frequencies of categorical predictions reported for each model and version. ![New Relic Integration](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FP69icAZ5ECSyC6ZwFxeg%2Fnr_14.png?alt=media) ### Alerts and Applied Intelligence for ML models When a new alert is detected by Aporia, it will be reported to New Relic’s Incident Intelligence engine. To view these alerts in New Relic, click on **Alerts & AI** and on the left navbar click on **Issues & activity**. ![New Relic Integration](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2F0n0QoQmsZE4WWNJx4Hn6%2Fnr_15.png?alt=media) On this page you will be able to see the correlated alerts. Clicking on an issue will open a screen with additional data, including all the related activities to the issue and their payloads. 
![New Relic Integration](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FfTaD1UWVQ6n29Iuncbe2%2Fnr_16.png?alt=media) Newly created alerts will now be correlated with your New Relic alerts and you should be able to see data about newly reported predictions. Happy Monitoring! --- # Source: https://docs.aporia.com/monitors-and-alerts/new-values.md # Source: https://docs.aporia.com/v1/monitors/new-values.md # New Values ### Why Monitor New Values? Monitoring new values of **categorical fields** helps to locate and examine changes in the model's input. For example, setting the monitor for a feature named `state` will help us discover a new region for which the model is asked to predict results. ### Comparison methods For this monitor, the following comparison methods are available: * [Change in percentage](https://docs.aporia.com/v1/monitor-template#comparison-methods) * [Compared to segment](https://docs.aporia.com/v1/monitor-template#comparison-methods) * [Compared to training](https://docs.aporia.com/v1/monitor-template#comparison-methods) ### Customizing your monitor Configuration may slightly vary depending on the comparison method you choose. #### STEP 1: choose the fields you would like to monitor You may select as many fields as you want (from features/raw inputs) 😊 Note that the monitor will run on each selected field separately. #### STEP 2: choose inspection period and baseline For the fields you chose in the previous step, the monitor will raise an alert if the amount of new values in the inspection period compared to the baseline's values exceeds your threshold. #### STEP 3: calibrate thresholds This step is important to make sure you have the right amount of alerts that fits your needs. You can always readjust it later if needed. --- # Source: https://docs.aporia.com/data-sources/oracle.md # Oracle This guide describes how to connect Aporia to an Oracle data source in order to monitor your ML Model in production. We will assume that your model inputs, outputs and optionally delayed actuals can be queried with SQL. This data source may also be used to connect to your model's training set to be used as a baseline for model monitoring. ### Create a read-only user for Oracle access In order to provide access to Oracle, create a read-only user for Aporia in Oracle. Please use the SQL snippet below to create the user for Aporia. Before using the snippet, you will need to populate the following: * ``: The user name to create. * ``: Strong password to be used by the user. * ``: The resources to which we want to granted access to the new user. ```sql -- Create user and grant access CREATE USER IDENTIFIED BY ''; -- Grant access to DB and schema GRANT CONNECT TO ; -- Grant access to multiple tables GRANT SELECT ON schema_name.table1 TO ; GRANT SELECT ON schema_name.table2 TO ; GRANT SELECT ON schema_name.table3 TO ; ``` ### Create a Oracle data source in Aporia 1. Go to [Aporia platform](https://platform.aporia.com/) and login to your account. 2. Go to **Integrations** page and click on the **Data Connectors** tab 3. Scroll to **Connect New Data Source** section 4. Click **Connect** on the Oracle card and follow the instructions 1. Note that the provided URL should be in the following format `jdbc:oracle:thin:@hostname:port_number:instance_name`. Bravo! :clap: now you can use the data source you've created across all your models in Aporia. 
--- # Source: https://docs.aporia.com/dashboards/overview.md # Source: https://docs.aporia.com/data-sources/overview.md # Source: https://docs.aporia.com/storing-your-predictions/overview.md # Source: https://docs.aporia.com/v1/data-sources/overview.md # Source: https://docs.aporia.com/v1/storing-your-predictions/overview.md # Overview **Monitoring your Machine Learning models begins with storing their inputs and outputs in production.** Oftentimes, this data is used not just for model monitoring, but also for retraining, auditing, and other purposes; therefore, it is crucial that you have complete control over it. Aporia monitors your models by connecting directly to *your* data, in *your* format. This section discusses the fundamentals of storing model predictions. {% hint style="info" %} If you are not storing your predictions today, you can also [log your predictions directly to Aporia](https://docs.aporia.com/v1/storing-your-predictions/logging-to-aporia-directly), although storing your predictions in your own database is highly recommended. {% endhint %} ### Storage Depending on your existing enterprise data lake infrastructure, performance requirements, and cloud costs constraints, storing your predictions can be done in a variety of data stores. Here are some common options: * [BigQuery](https://cloud.google.com/bigquery) * [Delta Lake](https://delta.io/) / [Databricks Lakehouse](https://www.databricks.com/) * [Snowflake](https://www.snowflake.com/) * [Elasticsearch](https://www.elastic.co/) / [OpenSearch](https://opensearch.org/) * Parquet files on S3 / GCS / ABS * If you choose this option, a metastore such as [Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html) is recommended. ### Directory Structure When storing your predictions, it's highly recommended to adopt a standardized directory structure (or SQL table structure) across all of your organization's models. With a standardized structure, you'll be able to get all models onboarded to the monitoring system automatically. Here is a very basic example: ``` s3://myorg-models/ ├── my-model/ ├── v1/ │ ├── train.parquet │ ├── test.parquet │ ├── serving.parquet │ ├── artifact.onnx ├── v2/ │ ├── train.parquet │ ├── test.parquet │ └── serving.parquet │ └── artifact.onnx ``` {% hint style="info" %} Even though this section focuses on the storage of *predictions*, you should also consider saving the **training** and **test sets** of your models. They can serve as a monitoring baseline. {% endhint %} ### Data Structure Recommendations: * One row per prediction. * One column per feature, prediction or raw input. * Use a prefix for column names to identify their group (e.g `features.`, `raw_inputs.`, `predictions.`, `actuals.`, etc.) * For serving, add ID and prediction timestamp columns. Example: ``` +-----+----------------------+-------------------+---------------+----------------+-------------------+-------------------------+--------------+----------------------+------------------------+ | id | timestamp | predictions.score | actuals.score | raw_inputs.age | raw_inputs.gender | features.my_embeddings | features.age | features.gender_male | features.gender_female | +-----+----------------------+-------------------+---------------+----------------+-------------------+-------------------------+--------------+----------------------+------------------------+ | 1 | 2022-10-19T14:21:08Z | 0.58 | 0.59 | 64 | male | [0.58, 0.19, 0.38, ...] 
| 64 | 1 | 0 | | 2 | 2022-10-19T14:21:08Z | 0.64 | 0.66 | 62 | woman | [0.48, 0.20, 0.42, ...] | 62 | 0 | 1 | | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | +-----+----------------------+-------------------+---------------+----------------+-------------------+-------------------------+--------------+----------------------+------------------------+ ``` --- # Source: https://docs.aporia.com/monitors-and-alerts/performance-degradation.md # Source: https://docs.aporia.com/v1/monitors/performance-degradation.md # Performance Degradation ### Why Monitor Performance Degradation? ML models performance often unexpectedly degrade when they are deployed in real-world domains. It is very important to track the true model performance metrics from real-world data and react in time, to avoid the consequences of poor model performance. Causes of model's performance degradation include: * Input data changes (various reasons) * Concept drift ### Comparison methods For this monitor, the following comparison methods are available: * [Change in percentage](https://docs.aporia.com/v1/monitor-template#comparison-methods) * [Absolute value](https://docs.aporia.com/v1/monitor-template#comparison-methods) * [Anomaly detection](https://docs.aporia.com/v1/monitor-template#comparison-methods) * [Compared to segment](https://docs.aporia.com/v1/monitor-template#comparison-methods) * [Compared to training](https://docs.aporia.com/v1/monitor-template#comparison-methods) ### Customizing your monitor Configuration may slightly vary depending on the comparison method you choose. #### STEP 1: choose the predictions & metrics you would like to monitor You may select as many prediction fields as you want 😊 the monitor will run on each selected field separately. Our performance degradation monitor supports a large variety of metrics that can measure the performance of your model's predictions given their corresponding actuals. You can check the full list of metric supported by Aporia in our [glossary](https://docs.aporia.com/v1/api-reference/metrics-glossary). #### STEP 2: choose inspection period and baseline For the fields you chose in the previous step, the monitor will raise an alert if the comparison between the inspection period and the baseline leads to a conclusion outside your threshold boundaries. #### STEP 3: calibrate thresholds This step is important to make sure you have the right amount of alerts that fits your needs. For anomaly detection method, use the monitor preview to help you decide what is the appropriate sensitivity level. --- # Source: https://docs.aporia.com/data-sources/postgresql.md # Source: https://docs.aporia.com/v1/data-sources/postgresql.md # PostgreSQL This guide describes how to connect Aporia to an PostgreSQL data source in order to monitor a new ML Model in production. We will assume that your model inputs, outputs and optionally delayed actuals can be queried with SQL. This data source may also be used to connect to your model's training/test set to be used as a baseline for model monitoring. ### Create a read-only user for PostgreSQL access In order to provide access to PostgreSQL, read-only user for Aporia in PostgreSQL. Please use the SQL snippet below to create a user for Aporia. Before using the snippet, you will need to populate the following: * ``: Strong password to be used by the user. * ``: PostgreSQL database with your ML training / inference data. * ``: PostgreSQL schema with your ML training / inference data. 
```sql CREATE USER aporia WITH PASSWORD ''; -- Grant access to DB and schema GRANT CONNECT ON DATABASE database_name TO username; GRANT USAGE ON SCHEMA TO username; -- Grant access to multiple tables GRANT SELECT ON table1 TO username; GRANT SELECT ON table2 TO username; GRANT SELECT ON table3 TO username; ``` ### Creating an PostgreSQL data source in Aporia To create a new model to be monitored in Aporia, you can call the `aporia.create_model(...)` API: ```python aporia.create_model("", "") ``` Each model in Aporia contains different **Model Versions**. When you (re)train your model, you should create a new model version in Aporia. ```python apr_model = aporia.create_model_version( model_id="", model_version="v1", model_type="binary" raw_inputs={ "raw_text": "text", }, features={ "amount": "numeric", "owner": "string", "is_new": "boolean", "embeddings": {"type": "tensor", "dimensions": [768]}, }, predictions={ "will_buy_insurance": "boolean", "proba": "numeric", }, ) ``` Each raw input, feature or prediction is mapped by default to the column of the same name in the PostgreSQL query. By creating a feature named `amount` or a prediction named `proba`, for example, the PostgreSQL data source will expect a column in the PostgreSQL query named `amount` or `proba`, respectively. Next, create an instance of `PostgresJDBCDataSource` and pass it to `apr_model.connect_serving(...)` or `apr_model.connect_training(...)`: ```python data_source = PostgresJDBCDataSource( url="jdbc:postgresql:///", query='SELECT * FROM "my_db"."model_predictions"', user="", password="", # Optional - use the select_expr param to apply additional Spark SQL select_expr=["", ...], # Optional - use the read_options param to apply any Spark configuration # (e.g custom Spark resources necessary for this model) read_options={...} ) apr_model.connect_serving( data_source=data_source, # Names of the prediction ID and prediction timestamp columns id_column="prediction_id", timestamp_column="prediction_timestamp", ) ``` Note that as part of the `connect_serving` API, you are required to specify additional 2 columns: * `id_column` - A unique ID to represent this prediction. * `timestamp_column` - A column representing when did this prediction occur. ### What's Next For more information on: * Advanced feature / prediction <-> column mapping * How to integrate delayed actuals * How to integrate training / test sets Please see the [Data Sources Overview](https://docs.aporia.com/v1/data-sources) page. --- # Source: https://docs.aporia.com/monitors-and-alerts/prediction-drift.md # Source: https://docs.aporia.com/v1/monitors/prediction-drift.md # Prediction Drift ### Why Monitor Prediction Drift? Prediction drift allows you to monitor a change in the distribution of the predicted label or value. For example, a larger proportion of credit-worthy applications when your product was launched in a more affluent area. Your model still holds, but your business may be unprepared for this scenario. ### Comparison methods For this monitor, the following comparison methods are available: * [Anomaly detection](https://docs.aporia.com/v1/monitor-template#comparison-methods) * [Compared to segment](https://docs.aporia.com/v1/monitor-template#comparison-methods) * [Compared to training](https://docs.aporia.com/v1/monitor-template#comparison-methods) ### Customizing your monitor Configuration may slightly vary depending on the baseline you choose. 
#### STEP 1: choose the predictions you would like to monitor

You may select as many prediction fields as you want 😊 Note that the monitor will run on each selected field separately.

#### STEP 2: choose inspection period and baseline

For the predictions you chose in the previous step, the monitor will compare the inspection period distribution with the baseline distribution. An alert will be raised if the monitor finds a drift between these two distributions.

#### STEP 3: calibrate thresholds

Use the monitor preview to help you choose the right threshold and make sure you have the amount of alerts that fits your needs. The threshold for categorical predictions is different from the one for numeric predictions. Make sure to calibrate them both if relevant.

### How are drifts calculated?

For numeric predictions, Aporia detects drifts based on the [Jensen–Shannon](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence) divergence metric. For categorical predictions, drifts are detected using [Hellinger distance](https://en.wikipedia.org/wiki/Hellinger_distance). If you need to use other metrics, please contact us.

---

# Source: https://docs.aporia.com/ml-monitoring-as-code/querying-metrics.md

# Querying metrics

To query metrics from Aporia, initialize a new client and call the `query_metrics` API:

```python
import os
from datetime import datetime, timedelta

from aporia import (
    Aporia,
    MetricDataset,
    MetricParameters,
    TimeRange,
)
from aporia.sdk.datasets import DatasetType

aporia_token = os.environ["APORIA_TOKEN"]
aporia_account = os.environ["APORIA_ACCOUNT"]
aporia_workspace = os.environ["APORIA_WORKSPACE"]

aporia_client = Aporia(
    base_url="https://platform.aporia.com",  # or "https://platform-eu.aporia.com"
    token=aporia_token,
    account_name=aporia_account,
    workspace_name=aporia_workspace,
)

last_week_dataset = MetricDataset(
    dataset_type=DatasetType.SERVING,
    time_range=TimeRange(
        start=datetime.now() - timedelta(days=7),
        end=datetime.now(),
    ),
)

res = aporia_client.query_metrics(
    model_id=model_id,  # ID of the model to query
    metrics=[
        MetricParameters(
            dataset=last_week_dataset,
            name="count",
        ),
    ],
)

print(f"The model had {res[0]} predictions last week")
```

## Parameters

The `query_metrics` API has the following parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| `model_id` | str | Model ID to query metrics for. |
| `metrics` | List[MetricParameters] | List of metrics to query. |
The API can request values for multiple metrics concurrently.

### MetricParameters

Here are the different fields of the `MetricParameters` object:
| Field | Type | Description |
| --- | --- | --- |
| `name` | str | Metric name (see #supported-functions). Required. |
| `dataset` | MetricDataset | Specifies what data to query (training / serving), what segment, and what timeframe. Required. |
| `column` | str | Name of the column to calculate the metric for. Required except for the count metric. For performance metrics, this should be the name of the prediction, not the actual. |
| `k` | int | K value for ranking metrics such as nDCG. Required only for ndcg_at_k, map_at_k, mrr_at_k, accuracy_at_k, precision_at_k, and recall_at_k. |
| `threshold` | float | Threshold to use when calculating binary performance metrics. Required only if the prediction is numeric and the actual is boolean, and the metric is a binary performance metric such as accuracy, recall, precision, f1_score, etc. |
| `custom_metric_id` | str | Custom metric ID. Required only if you want to query a custom metric. |
| `baseline` | MetricDataset | Specifies what data to use as baseline. Required only for statistical distances such as js_distance, ks_distance, psi, and hellinger_distance. |
### MetricDataset

The `MetricDataset` object contains the following fields:
| Field | Type | Description |
| --- | --- | --- |
| `dataset_type` | DatasetType | Can be either DatasetType.SERVING or DatasetType.TRAINING. Required. |
| `time_range` | TimeRange | Time range (contains start and end fields). Do not pass this for training. |
| `model_version` | str | Model version to filter by. Optional. |
| `segment` | MetricSegment | Used to query metrics in a specific data segment. Contains id and value fields. |
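For example, here's a hedged sketch that builds on the client and imports from the snippet above and queries the average of a prediction over the last month, plus its drift compared to training. The `"average"` metric name is an assumption (see #supported-functions); `js_distance` and the field names come from the tables above:

```python
# Assumes aporia_client, MetricDataset, MetricParameters, TimeRange, DatasetType,
# datetime and timedelta from the snippet above.
serving_last_month = MetricDataset(
    dataset_type=DatasetType.SERVING,
    time_range=TimeRange(start=datetime.now() - timedelta(days=30), end=datetime.now()),
    model_version="v1",
)
training = MetricDataset(dataset_type=DatasetType.TRAINING)  # no time_range for training

res = aporia_client.query_metrics(
    model_id=model_id,  # ID of the model to query
    metrics=[
        # Average value of the "score" prediction column over the last month
        # ("average" is an assumed metric name - adjust to your SDK version)
        MetricParameters(name="average", dataset=serving_last_month, column="score"),
        # Drift of "score" between serving and training - a baseline is required
        # for statistical distances such as js_distance
        MetricParameters(
            name="js_distance",
            dataset=serving_last_month,
            column="score",
            baseline=training,
        ),
    ],
)
print(res)
```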
--- # Source: https://docs.aporia.com/introduction/quickstart.md # Source: https://docs.aporia.com/v1/introduction/quickstart.md # Quickstart With just a few lines of code, any Machine Learning model can be integrated and monitored in production with Aporia. ![Quickstart](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FjFCV3GyHcgQt0GwHanVe%2Fquickstart.gif?alt=media) In this guide, we will use Aporia's Python API to create a model in Aporia and log its predictions. ### Install the Aporia SDK To get started, install the Aporia Python library: ``` pip3 install aporia --upgrade ``` Next, import and initialize the Aporia library: ```python import aporia aporia.init(token="", environment="", # e.g prod verbose=True, raise_errors=True) ``` ### Create Model To create a new model to be monitored in Aporia, you can call the `aporia.create_model(...)` API: ```python aporia.create_model("", "") ``` This API will not recreate the model if the model ID already exists. You can also specify color, icon, tags and model owner: ```python aporia.create_model( model_id="fraud-detection", model_name="Fraud Detection", color=ModelColor.ARCTIC_BLUE, icon=ModelIcon.FRAUD_DETECTION, tags={ "framework": "xgboost", "coolness_level": 10 }, owner="fred@aporia.com", # Integrates with your enterprise auth system ;) ) ``` ### Create Model Version Each model in Aporia contains different **Model Versions**. When you (re)train your model, you should create a new model version in Aporia. ```python apr_model = aporia.create_model_version( model_id="", model_version="v1", model_type="binary" features={ "amount": "numeric", "owner": "string", "is_new": "boolean", "created_at": "datetime", "embeddings": {"type": "tensor", "dimensions": [768]}, }, predictions={ "will_buy_insurance": "boolean", "proba": "numeric", }, ) ``` Model version parameter can be any string - you can use the model file's hash, git commit hash, experiment/run ID from MLFlow or anything else. Model type can be [regression](https://docs.aporia.com/v1/model-types/regression), [binary](https://docs.aporia.com/v1/model-types/binary), [multiclass](https://docs.aporia.com/v1/model-types/multiclass-classification), [multi-label](https://docs.aporia.com/v1/model-types/multi-label-classification), or [ranking](https://docs.aporia.com/v1/model-types/ranking). Please refer to the relevant documentation on each model type for more info. #### Field Types * `numeric` - valid examples: 1, 2.87, 0.53, 300.13 * `boolean` - valid examples: True, False * `categorical` - a categorical field with integer values * `string` - a categorical field with string values * `datetime` - contains either python datetime objects, or an ISO-8601 timestamp string * `text` - freeform text * `dict` - dictionaries - at the moment keys are strings and values are numeric * `tensor` - useful for unstructured data, must specify shape, e.g. `{"type": "tensor", "dimensions": [768]}` ### Log Predictions Next, we will log some predictions to the newly created model version. These predictions will be kept in an Aporia-managed database. In production, **we strongly recommend** [**storing your model's predictions in your own database**](https://docs.aporia.com/v1/storing-your-predictions) **that you have complete control over**- we've seen many of our customers do this anyway for retraining, auditing, and other purposes. 
Aporia can then connect to your data directly and use it for model observability, stripping away the need for data duplication. However, this quickstart assumes you have no database and would simply like to log model inferences: ```python apr_model.log_prediction( id=, features={ "amount": 15.3, "owner": "Joe", "is_new": True, "created_at": datetime.now(), "embeddings": [...], }, predictions={ "will_buy_insurance": True, "proba": 0.55, }, ) ``` You must specify an ID for each prediction. This ID can later be used to log the prediction's actual value. If you don't care about this, just pass `str(uuid.uuid4())` as the prediction ID. Both of these APIs are entirely asynchronous. This was done to avoid blocking your application, which may handle a large number of predictions per second. You can now access Aporia and see your model, as well as create dashboards and monitors for it! --- # Source: https://docs.aporia.com/model-types/ranking.md # Source: https://docs.aporia.com/v1/model-types/ranking.md # Ranking Ranking models are often used in recommendation systems, ads, search engines, etc. In Aporia, these models are represented with the `ranking` model type. ### Integration If you have a ranking or recommendations model, then your database may look like the following:
| id | feature1 (numeric) | feature2 (boolean) | scores (array) | relevance (array) | timestamp (datetime) |
| -- | ------------------ | ------------------ | -------------- | ----------------- | -------------------- |
| 1 | 13.5 | True | [9, 8, 10, ...] | [2, 0, 1, ...] | 2014-10-19 10:23:54 |
| 2 | -8 | False | [4.5, 8.7, 9, ...] | [0, 1, 2, ...] | 2014-10-19 10:24:24 |
To monitor a ranking model, create a new model version with an `array` field(s): ```python apr_model = aporia.create_model_version( model_id="", # You will need to create a model with this MODEL_ID in advance model_version="v1", model_type="ranking" features={ ... }, predictions={ "scores": "array" }, ) ``` To connect your data source to this model in Aporia, please call the `connect_serving(...)` API: ```python apr_model.connect_serving( data_source=my_data_source, id_column="id", timestamp_column="timestamp", predictions={ # Prediction name -> Column name representing "relevance": "scores" } ) ``` Check out the [Data Sources](https://docs.aporia.com/v1/data-sources) section for further reading on the available data sources and how to connect to each one of them. --- # Source: https://docs.aporia.com/administration/rbac.md # Role Based Access Control (RBAC) Aporia supports full role based access control. Using account-level and workspace-level permissions, users will only have access to the data and actions for which they are permitted. ## Aporia roles & permissions ### Account level roles An **`account`** is the highest level entity in Aporia and represents your company/organization. Every user in Aporia must be associated with an **`account`**. There are two types of roles at the **`account`** level: | Account Role | Permissions | | -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | account.admin | Account admins can manage users (invite new users & update their associated permissions/roles). Account admins are admins of all workspaces in their account (top-down inheritance). | | account.member | Account members (the default permission of a user) can access the Aporia platform in your organization's account. | An **`account.admin`** may create multiple **`workspaces`** within your Aporia **`account`.** ### Workspace level roles A **`workspace`** is a silo created for separate teams. Each **`workspace`** represents a team/group within your organization and acts as an independent entity, i.e. a workspace encapsulate models, dashboards, monitors, etc. and these entities are ***not*** shared between workspaces. There are three types of roles at the **`workspace`** level: | Workspace Role | Permissions | | ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | workspace.admin |
Manage user permissions of existing workspace users and ability to invite existing account members to the respective workspace. Edit & view permissions across all Aporia entities within the workspace, i.e. models, versions, dashboards, monitors, segments, & metrics. |
| workspace.editor | Edit & view permissions across all Aporia entities within a workspace, i.e. models, versions, dashboards, monitors, segments, & metrics. |
| workspace.viewer | View-only permissions for all entities in the workspace. |

## Roles management

#### Via Aporia platform

As an account admin you can manage roles as follows:

1. Log in to the Aporia platform
2. Go to the `Teams` page
3. Manage your account and workspace level permissions by adding, removing, and editing roles.

*Account management via Aporia Platform*

--- # Source: https://docs.aporia.com/storing-your-predictions/real-time-models-kafka.md # Source: https://docs.aporia.com/v1/storing-your-predictions/real-time-models-kafka.md # Real-time Models (Kafka) For high-throughput, real-time models (e.g models with an HTTP endpoint such as `POST /predict` and billions of predictions per day), you can stream predictions to [Kafka](https://kafka.apache.org/) or other message brokers, and then have a separate process to store them in a persistent storage. Using a message broker such as Kafka lets you store predictions of real-time models with low latency. {% hint style="info" %} **Don't have billions of predictions?** If you are not dealing with billions of predictions per day, you should consider a simpler solution. Please see the guide on [real-time models with Postgres](https://docs.aporia.com/v1/storing-your-predictions/real-time-models-postgres). {% endhint %} ### Step 1: Deploy Kafka You can deploy Kafka in various ways: * If you are using Kubernetes, you can deploy the [Confluent Helm charts](https://github.com/confluentinc/cp-helm-charts) or the [Strimzi operator](https://strimzi.io/). * Deploy a managed Kafka service in your cloud provider, e.g [AWS MSK](https://aws.amazon.com/msk/). * Use a managed service such as [Confluent](https://www.confluent.io/). ### Step 2: Write predictions to Kafka Writing messages to a Kafka queue is very simple in Python and other languages. Here are examples for Flask and FastAPI, which are commonly used to serve ML models. #### Flask With Flask, you can use the [kafka-python](https://kafka-python.readthedocs.io/en/master/) library. Example: ```python producer = KafkaProducer(bootstrap_servers="kafka-cp-kafka:9092") @app.route("/predict", methods=["POST"]) def predict(): ... producer.send("my-model", json.dumps({ "id": str(uuid.uuid4()), "model_name": "my-model", "model_version": "v1", "inputs": { "age": 38, "previously_insured": True, }, "outputs": { "will_buy_insurance": True, "confidence": 0.98, }, }).encode("ascii")) ``` #### FastAPI With async FastAPI, you can use the [aiokafka](https://aiokafka.readthedocs.io/en/stable/) library. First, initialize a new Kafka producer: ```python aioproducer = None @app.on_event("startup") async def startup_event(): global aioproducer aioproducer = AIOKafkaProducer(bootstrap_servers="my-kafka:9092") await aioproducer.start() @app.on_event("shutdown") async def shutdown_event(): await aioproducer.stop() ``` Then, whenever you have a new prediction you can publish it to a Kafka topic: ```python @app.post("/predict") async def predict(request: PredictRequest): ... await aioproducer.send("my-model", json.dumps({ "id": str(uuid.uuid4()), "model_name": "my-model", "model_version": "v1", "inputs": { "age": 38, "previously_insured": True, }, "outputs": { "will_buy_insurance": True, "confidence": 0.98, }, }).encode("ascii")) ``` ### Step 3: Stream to a Persistent Storage Now, you can stream predictions from Kafka to a persistent storage such as S3. There are different ways to achieve this - we'll cover here [Kafka Connect](https://docs.confluent.io/platform/current/connect/index.html) and [Spark Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html). #### Spark Streaming Spark Streaming is an extension of the core Spark API that allows you to process real-time data from various sources including Kafka. This processed data can be pushed out to file systems and databases. 
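The streaming example below assumes a running `SparkSession` and the usual PySpark imports. Here is a minimal setup sketch; the app name and the `S3_BASE_URL` value are illustrative assumptions:

```python
# Minimal setup assumed by the streaming example below.
# The app name and S3_BASE_URL are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, FloatType, IntegerType

spark = SparkSession.builder.appName("my-model-serving-stream").getOrCreate()
S3_BASE_URL = "s3://my-bucket/aporia"  # base path for the Delta table and checkpoints
```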
In this example, we will process messages from the `my-model` topic and store them in a Delta lake table: ```python # Create stream with Kafka source df = spark.readStream.format("kafka") \ .option("kafka.bootstrap.servers", "my-kafka:9092") \ .option("subscribe", "my-model") \ .option("startingOffsets", "earliest") \ .option("failOnDataLoss", "false") \ .load() # Parse JSON from Kafka schema = StructType() \ .add("sepal_length", FloatType()) \ .add("sepal_width", FloatType()) \ .add("petal_length", FloatType()) \ .add("petal_width", FloatType()) \ .add("prediction", IntegerType()) \ .add("confidence", FloatType()) df = df.withColumn("json", F.from_json(F.col("value").cast("string"), schema)) df = df.select(F.col("json.*")) # Write to Delta Lake df.writeStream \ .format("delta") \ .outputMode("append") \ .option("mergeSchema", "true") \ .option("checkpointLocation", f"{S3_BASE_URL}/my-model/serving/_checkpoints/kafka") \ .start(f"{S3_BASE_URL}/my-model/serving") \ .awaitTermination() ``` #### Kafka Connect Kafka Connect makes it easy to quickly define connectors to move data between Kafka and other data systems, such as S3, Elasticsearch, and others. As a prerequisite to Kafka Connect, you'll need [Schema Registry](https://docs.confluent.io/platform/current/schema-registry/index.html), which is a tool to manage schemas for Kafka topics. Here is an example of a connector to stream messages from the `my-model` topic to Parquet file on S3: ```json PUT /connectors/my-model-connector/config { "connector.class": "io.confluent.connect.s3.S3SinkConnector", "storage.class": "io.confluent.connect.s3.storage.S3Storage", "s3.region": "us-east-1", "s3.bucket.name": "myorg-models", "topics.dir": "my-model/serving", "flush.size": "2", "rotate.schedule.interval.ms": "20000", "auto.register.schemas": "false", "tasks.max": "1", "s3.part.size": "5242880", "timezone": "UTC", "parquet.codec": "snappy", "topics": "my-model", "s3.credentials.provider.class": "com.amazonaws.auth.DefaultAWSCredentialsProviderChain", "format.class": "parquet", "value.converter": "org.apache.kafka.connect.json.JsonConverter", "key.converter": "org.apache.kafka.connect.storage.StringConverter", "schema.registry.url": "http://my-schema-registry", "value.converter.schema.registry.url": "http://my-schema-registry" } ``` --- # Source: https://docs.aporia.com/storing-your-predictions/real-time-models-postgres.md # Source: https://docs.aporia.com/v1/storing-your-predictions/real-time-models-postgres.md # Real-time Models (Postgres) For real-time models with mid-level throughput (e.g models with an HTTP endpoint such as `POST /predict`), you can insert predictions to a database such as [Postgres](https://www.postgresql.org/), [MySQL](https://www.mysql.com/), or even [Elasticsearch](https://www.elastic.co/). If you are dealing with billions of predictions, this solution might not be sufficient for you. {% hint style="warning" %} **Dealing with billions of predictions?** If you are dealing with billions of predictions, this solution might not be sufficient for you. Please consider the guide on [real-time models with Kafka](https://docs.aporia.com/v1/storing-your-predictions/real-time-models-kafka). {% endhint %} ### Example: FastAPI + SQLAlchemy If you are serving models with Flask or FastAPI, and don't have an extremely high throughput, you can simply insert predictions to a standard database. 
Here, we'll use [SQLAlchemy](https://www.sqlalchemy.org/), a Python ORM that replaces hand-written SQL `INSERT` statements with something a bit nicer. Please see the [FastAPI + SQLAlchemy tutorial](https://fastapi.tiangolo.com/tutorial/sql-databases/) for more details.

First, we can define the structure of our database table using Pydantic:

```python
class IrisModelPrediction(BaseModel):
    id: str
    timestamp: datetime

    # Features
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float

    # Predictions
    prediction: int
    confidence: float
```

And here is a sample implementation of the `POST /predict` endpoint:

```python
@app.post("/predict")
def predict(request: PredictRequest):
    # Preprocess & predict
    df = pd.DataFrame(
        columns=['sepal.length', 'sepal.width', 'petal.length', 'petal.width'],
        data=[[request.sepal_length, request.sepal_width, request.petal_length, request.petal_width]],
    )

    y, confidence = model.predict(df)

    # Insert prediction to DB
    prediction = IrisModelPrediction(
        id=str(uuid.uuid4()),
        timestamp=datetime.now(),
        sepal_length=request.sepal_length,
        sepal_width=request.sepal_width,
        petal_length=request.petal_length,
        petal_width=request.petal_width,
        prediction=y,
        confidence=confidence,
    )

    db.add(prediction)
    db.commit()

    return {"prediction": y}
```

---
# Source: https://docs.aporia.com/data-sources/redshift.md
# Source: https://docs.aporia.com/v1/data-sources/redshift.md

# Redshift

This guide describes how to connect Aporia to a Redshift data source in order to monitor a new ML Model in production.

We will assume that your model inputs, outputs and optionally delayed actuals can be queried with Redshift SQL. This data source may also be used to connect to your model's training/test set to be used as a baseline for model monitoring.

### Create an IAM role for Redshift access

In order to provide access to Redshift, create an IAM role with the necessary API permissions. First, create a JSON file on your computer with the following content:

```json
{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": "redshift:GetClusterCredentials",
    "Resource": "arn:aws:redshift:<region>:<account-id>:dbuser:<cluster-name>/<db-user>"
  }
}
```

Make sure to replace the following placeholders:

* `<region>`: You can specify the Redshift AWS region or `*` for the default region.
* `<account-id>`: The Redshift AWS account ID.
* `<cluster-name>`: The Redshift cluster name.
* `<db-user>`: Name of the Redshift user.

For more information, see [Using IAM authentication to generate database user credentials](https://docs.aws.amazon.com/redshift/latest/mgmt/generating-user-credentials.html).

Next, create a new user in AWS with programmatic access only, and grant it the role you've just created. Create security credentials for it (access and secret keys) and use them in the next section.

{% hint style="info" %}
**IAM Authentication**

For authentication without security credentials (access key and secret key), please contact your Aporia account manager.
{% endhint %}

### Creating a Redshift data source in Aporia

To create a new model to be monitored in Aporia, you can call the `aporia.create_model(...)` API:

```python
aporia.create_model("<MODEL_ID>", "<MODEL_NAME>")
```

Each model in Aporia contains different **Model Versions**. When you (re)train your model, you should create a new model version in Aporia.
```python apr_model = aporia.create_model_version( model_id="", model_version="v1", model_type="binary" raw_inputs={ "raw_text": "text", }, features={ "amount": "numeric", "owner": "string", "is_new": "boolean", "embeddings": {"type": "tensor", "dimensions": [768]}, }, predictions={ "will_buy_insurance": "boolean", "proba": "numeric", }, ) ``` Each raw input, feature or prediction is mapped by default to the column of the same name in the Redshift query. By creating a feature named `amount` or a prediction named `proba`, for example, the Redshift data source will expect a column in the Redshift query named `amount` or `proba`, respectively. Next, create an instance of `RedshiftDataSource` and pass it to `apr_model.connect_serving(...)` or `apr_model.connect_training(...)`: ```python data_source = JDBCDataSource( url="jdbc:redshift:iam://:5439/company?AccessKeyID=&SecretAccessKey=&DbUser=&ssl=true&tcpKeepAlive=true", query="SELECT * FROM model_predictions", # Optional - use the select_expr param to apply additional Spark SQL select_expr=["", ...], # Optional - use the read_options param to apply any Spark configuration # (e.g custom Spark resources necessary for this model) read_options={...} ) apr_model.connect_serving( data_source=data_source, # Names of the prediction ID and prediction timestamp columns id_column="prediction_id", timestamp_column="prediction_timestamp", ) ``` Note that as part of the `connect_serving` API, you are required to specify additional 2 columns: * `id_column` - A unique ID to represent this prediction. * `timestamp_column` - A column representing when did this prediction occur. ### What's Next For more information on: * Advanced feature / prediction <-> column mapping * How to integrate delayed actuals * How to integrate training / test sets Please see the [Data Sources Overview](https://docs.aporia.com/v1/data-sources) page. --- # Source: https://docs.aporia.com/model-types/regression.md # Source: https://docs.aporia.com/v1/model-types/regression.md # Regression Regression models predict a `numeric` value. In Aporia, these models are represented with the `regression` model type. Examples of regression problems: * What will the temperature be in Seattle tomorrow? * For product X, how many units will sell? * How many days until this customer stops using the application? * What price will this house sell for? ### Integration Regression predictions are usually represented in a database with a `numeric` column. For example:
| id | feature1 (numeric) | feature2 (boolean) | predicted\_temperature (numeric) | actual\_temperature (numeric) | timestamp (datetime) |
| -- | ------------------ | ------------------ | -------------------------------- | ----------------------------- | -------------------- |
| 1 | 13.5 | True | 22.83 | 24.12 | 2017-01-01 12:00:00 |
| 2 | 123 | False | 26.04 | 25.99 | 2017-01-01 12:01:00 |
| 3 | 42 | True | 29.01 | 11.12 | 2017-01-01 12:02:00 |
To monitor this model, we will create a new model version with a schema that includes a `numeric` prediction:

```python
apr_model = aporia.create_model_version(
    model_id="<MODEL_ID>",  # You will need to create a model with this MODEL_ID in advance
    model_version="v1",
    model_type="regression",

    features={
        ...
    },

    predictions={
        "predicted_temperature": "numeric",
    },
)
```

To connect this model to Aporia from your data source, call the `connect_serving(...)` API:

```python
apr_model.connect_serving(
    data_source=my_data_source,
    id_column="id",
    timestamp_column="timestamp",

    # Map the actual_temperature column as the label for the
    # predicted_temperature.
    labels={
        # Prediction name -> Column name
        "predicted_temperature": "actual_temperature"
    }
)
```

Check out the [data sources section](https://docs.aporia.com/v1/data-sources) for more information about how to connect all other available data sources.

{% hint style="info" %}
**Don't want to connect to a database?**

Don't worry - you can [log your predictions directly to Aporia](https://docs.aporia.com/v1/storing-your-predictions/logging-to-aporia-directly).
{% endhint %}

---
# Source: https://docs.aporia.com/release-notes/release-notes-2023.md

# Release Notes 2023

Welcome 2023! :tada: We are extremely excited for the year ahead as we continuously enhance our platform to ensure that you and your team can observe your models in production, detect issues and improve their performance as efficiently as possible. In this page, you'll be able to find a constantly-growing list of some of our most impactful new features and enhancements that we release every month.

## October 2023

* Deactivate versions - There comes a time when a version gets old and we want to be able to observe it but stop syncing new data. For these scenarios, we are happy to introduce the ability to deactivate an existing version. Don't worry, you can always make it active again.
* Multiple integrations of the same type - Different teams use different Slack channels for notifications or want to create different webhook automations once an alert is fired. For this reason, we now support creation of multiple integrations of the same type for all of our available integrations.
* "My Workspaces" view - You can now enjoy our new pinned view where you can observe all models, monitors, alerts and investigation cases, across all their respective workspaces in a single view.
* Global filters for dashboards - Sometimes we want to view the same insights for different segments of our data without needing to rebuild our dashboard. For those cases, you can now mark specific segment groups for global filtering. Once marked as such, you will be able to use those segments as global filters in dashboards.

## September 2023

* Code based metrics - This new advanced feature allows users to define PySpark-based metrics that support computation on raw data, element-wise operations, and third-party libraries. For usage examples, explore our [Code-based metrics guide](https://docs.aporia.com/api-reference/code-based-metrics).
* New version actions - Deletion of versions / datasets is now supported via both the UI & REST API.
* Resolve multiple alerts - You can now resolve multiple alerts in one click. Just filter the alerts you wish to dispose of and click “Resolve All”!
* Extended REST API support - Creating and deleting data sources is now supported via REST API. For more details read our [REST API docs](https://platform.aporia.com/api/v1/docs#tag/Data-Sources/operation/create_data_source_api_v1__account_name___workspace_name__data_sources_post).

## August 2023

* New DDC - For those of you who store your data in [Microsoft SQL Server](https://docs.aporia.com/data-sources/mssql), you can now directly and easily integrate it using our new data connectors.
* Recalculate metrics API - sync your data on demand without waiting for the upcoming scheduled calculation job. We've added [recalculating metrics on demand](https://platform.aporia.com/api/v1/docs#tag/Metrics-\(Experimental\)/operation/recalculate_metrics_api_v1__account_name___workspace_name__metrics_recalculate_post) via API. * Extended REST API support - Edit segments is now supported via REST API. Whether it's part of your CI/CD pipeline, triggered by a new value monitor alert, or just to ease building your Dashboards. For more details read our [REST API docs](https://platform.aporia.com/api/v1/docs#tag/Data-Segments-\(Experimental\)/operation/edit_data_segment_api_v1__account_name___workspace_name__data_segments__identifier__put). * Bug fixes & performance improvements! ## July 2023 * New dashboard widgets - Visualizations are a powerful tool for analyzing your model performance, business impact, data behavior, etc. For this reason we are happy to introduce three new widgets
  * Metric correlation & Metric by segment
  * Histogram over time
* API keys - Account admins can now simply manage their API keys by going to `Account Management > API Keys`.
* Drift monitors - Different users might like to monitor drift using different metrics to fit their use case best. For this reason, we added support for choosing the monitored drift metric. Available metrics can be found [here](https://docs.aporia.com/api-reference/metrics-glossary#statistical-distances).
* Metrics on actuals - All our statistical metrics can now be used on actual fields as well. You can easily monitor & visualize them all across the platform. More information about our available metrics can be found [here](https://docs.aporia.com/api-reference/custom-metric-syntax#supported-functions).
* SQL for blob storage - You can now transform data originating in S3 / Azure Blob Storage / Google Cloud Storage using a SQL query.

## June 2023

* Cross model dashboards - Sometimes we need to get insights on multiple models in one place, whether to observe model orchestration or just to get an overall status at a glance. For this reason, we added support for creating [cross model dashboards](https://docs.aporia.com/dashboards/overview#cross-model-dashboards).
* Alert consolidation - Monitors can create unnecessary noise when thresholds are not yet fine-tuned or your team is handling an ongoing issue. Learn how to keep notifications meaningful by [consolidating your alerts within Aporia](https://docs.aporia.com/monitors-and-alerts/alerts-consolidation).
* Extended error information - You will be able to see a full traceback for errors detected while trying to retrieve your datasets/metrics.
* Default investigation case - Depending on your monitor type, a default investigation case will automatically be created to provide you with the most relevant tools and tips to quickly get to the root cause of the issue detected. Default investigation cases are now available for prediction drift, data drift and performance degradation monitors.
## May 2023 * Workspaces - For enterprises and organizations which require silos for separate teams/models/data integrations... Aporia introduces workspaces (team silos) managed by Aporia account admins. For more information read our [RBAC docs](https://docs.aporia.com/administration/rbac). * Role Based Access Control - Full [role based access control](https://docs.aporia.com/administration/rbac) is now available in Aporia! Using account-level and workspace-level permissions, users will only have access to the data and actions for which they are permitted. * New integrations - We've expanded our integration support and now you can receive alert notifications via your organization Teams & Webhook. * New data source action - Deleting a data source is available by clicking on the actions button in the data connectors page

*Delete existing data connector*

* Filters in custom metrics - In order to build your custom metric you may need to apply different data filtering in different parts of the calculation. For those cases, Aporia supports custom filtering in custom metrics. For more information and examples read our docs. * Custom segments - Grouping segments with common logic is now available when creating / editing custom segments. Learn how to use it with our updated [Custom Segment Syntax examples](https://docs.aporia.com/api-reference/custom-segment-syntax). * New DDCs - For those of you who store your data in [BigQuery](https://docs.aporia.com/data-sources/big-query) / Azure Blob Storage, you can now directly and easily integrate it using our new data connectors. * Value range monitor - You can now create value range monitor to get alerted when your inference data exceeds the desired range.
* Dashboard widgets - You can now control the granularity with which to plot your time series widgets. ## April 2023 * Multiple dashboards per model - Different users might like to get different insights on the same model. For this reason, we added support for creating [multiple dashboards per model](https://docs.aporia.com/release-notes/broken-reference). * Error detection for datasets - You will be able to see an indication in the relevant places across the platform in case we detected any error while trying to retrieve your datasets. * Performance improvements - Resolved various performance bottlenecks and dramatically increased performance at scale. * Edit versions - You can now edit existing stages by clicking on "edit" in the model versions page

*Edit Training / Serving*

## March 2023 * REST API - For those of you who would like to create automations for model integration, monitors creation, schema validation, etc. For more information read the [REST API documentation](https://platform.aporia.com/api/v1/docs). * Default dashboard - Depending on your model type, a default dashboard will automatically be created to provide you a quick overview and insights on your first integrated version. * Snowflake data source - For those of you who store your data in snowflake, you can now directly and easily integrate it using our new [snowflake data connector.](https://docs.aporia.com/data-sources/snowflake) * Bug fixes - Resolved errors raised by using special characters in version schema. * Custom segments - You can now create custom segments using a SQL-based syntax, that empowers you to create that exact segment you wish. For more info, check out our [docs](https://docs.aporia.com/api-reference/custom-segment-syntax). * New custom metrics actions - Deleting a custom metric is available by clicking on the actions button in the custom metrics page
* Cross-version monitoring - We added the ability to use "all versions" in the monitors configurations. This way you can create monitors to detect issues across the unification of all model versions. ## February 2023 * Azure AD authentication for Postgres data source - In addition to using username & password, you can now configure your Postgres data source to use Azure AD authentication. This is available for accounts using SSO integration with Azure AD. * New model actions - You can now rename and delete your models. Just click on the actions button in the models management page
* Ranking metrics support - Accuracy\@k, MRR\@k and nDCG\@k are natively supported in Aporia platform and you can use them in monitors, widgets, custom metrics, etc. * Databricks deployment over Azure - Aporia deployment over Databricks is now supported for clients using Azure as their cloud provider. * Bug fixes & performance improvements! ## January 2023 Direct Data Connectors - we are happy to introduce you with our transformative technology that empowers ML teams to effortlessly monitor and track their ML models by seamlessly integrating Aporia with their production database. By directly accessing your existing data lake, you can effortlessly monitor billions of predictions at minimal cloud costs (never duplicate your data!). For more details read the full [announcement post](https://www.aporia.com/blog/aporia-introduces-direct-data-connectors-monitoring-large-scale-data-made-easy/). --- # Source: https://docs.aporia.com/release-notes/release-notes-2024.md # Release Notes 2024 Welcome 2024! :tada: We are extremely excited for the year ahead as we continuously enhance our platform to ensure that you and your team can observe your models in production, detect issues and improve their performance as efficiently as possible. In this page, you'll be able to find a constantly-growing list of some of our most impactful new features and enhancements that we release every month. ## October 2024 * **Custom Metric Monitors:** Added the ability to alert on a specific segment from the Aporia UI and run comparisons to segment or training data. * **Improved error messages for model integration:** Resolved an issue where users received JSON-formatted error messages in the "test-dataset/check compatibility" flow. The error messages are now human-readable. * **Global Dashboards edits are now saved:** Fixed an issue with global dashboards, all edits are now properly saved once you click on ‘save changes’ on each widget. * **Move workspace:** Fix the error with the move workspace feature. ## August - September 2024 * **Dashboard Creation via SDK:** Users can now build dashboards programmatically using the SDK. * **Monitor Support for Current Day Runs:** Users can now run monitors for the current day, in addition to previous days. * **Global Version Filters in Dashboards:** Introduced global filters in dashboards, allowing users to change the configuration of all widgets together. * **API for Monitor Runs and Calculation Status:** * **Run Monitors API:** A new external API to trigger monitor runs by ID, with success status response. * **Calculation Status API:** Reflects the calculation status of Models, Versions, and Datasets we have in the dashboard. ## June - July 2024 * **API for Dashboards:** CRUD for Dashboards: Support an external API to create, update and delete dashboards. * **Teams Connector support Power Automate:** Support for Power Automate workflows integration for Teams Connectors to work after retirement of old teams connector support by Microsoft. * **Aporia UI Resilience:** Aporia UI and models will now continue to work even in cases of dataplane major dataplane errors (EMR/Databricks/etc clusters unreachable). * **Dataplane Control:** Users can now reconfigure & restart their dataplane in case of errors using an API. * **Embedding Projector UMAP Bug:** Data Points in the embedding projector should now work properly. * **UI Overflow Issue in Segment Editing:** Addressed an issue where the UI overflowed when editing segments with many dependencies, making the confirm button unclickable. 
* **Negative Values for Monitors:** Fixed an issue that prevented users from setting negative values when configuring thresholds for monitors using the Absolute Values detection method. Negative values are now supported. * **Optional Absolute Value Thresholds:** Previously, users could not nullify the upper and lower bounds of absolute value thresholds once set. Either bound can now be nullified, but not both. * **Data Health Table Version Comparison:** Fixed an issue where the "Compare" version in the data health table did not update correctly when switching versions. The "Compare" tab version now automatically matches the "Data" tab version. * **Grouped Alert Resolution:** Resolving a main alert in an alert group now resolves all associated child alerts. ## April - May 2024 * **Cumulative view for time series widgets** - Aporia now allows you to configure time series widgets to display cumulative value of the chosen metric. Cumulation is computed since the start time of the series. * **Sort workspaces list alphabetically**. * **Move models between workspaces** - Remove visibility of “Move Workspace” action for non-admin users. * **Deactivate version bug fix** - Aporia will now exclude deactivated versions from the following: * Recalculation schedules * Activity monitors * **Training baseline in investigation room cells** - You can now configure training as baseline for relevant IR cells. * **Min threshold for percentage change monitors** - you can now set both min and max thresholds for percentage change monitors. * **Data sources UI fix** - When editing data sources, configured optional SQL transformations will now appear. ## February - March 2024 * **Move models between workspaces** - Aporia now allows account admins to move models between different workspaces. Models will be moved along with all their related resources such as monitors, alerts, dashboards, etc. In case one of the model’s datasets / monitors are using a data source / integration (respectively) that does not exist in the new workspace, a new data source / integration will automatically be created in the new workspace with the same configuration. * **Using the same SSO integration for different Aporia accounts** - For organizations that have multiple Aporia accounts, it is now possible to use the same SSO integration in order to connect them both. The default account users log into is constant for all users and can be changed by your Aporia account representative. ## December - January 2024 * **Data Points cell** - Improving performance of the data points cell in the Investigation Room. Only filtered columns will be loaded automatically. You can still change the field's filtering as you wish. * **Bug fix -** Fixing an issue with dates displayed at the activity histogram in the models overview page for hourly-granularity models. * **Oracle data source** - Oracle database can now be used as an Aporia data source to directly connect your model’s data. ## November 2023 * **Code Based metrics** - Added support for sklearn as a 3rd party library. * **Ability to set metrics recalculation schedule** - Aporia now allows you to set the cadence and extent at which you would like to recalculate metrics per each model. This can be useful to handle delayed actuals / inputs. Customizing recalculation according to your data updates schedules can be used to reduce costs. --- # Source: https://docs.aporia.com/v1/api-reference/rest-api.md # REST API Aporia provides a REST API, which is currently in beta. 
### Using the REST API The API is accessible thorough `https://app.aporia.com/v1beta`. To use the API, you must pass your token in the authorization header of each request: ``` Authorization: Bearer ``` ### Endpoints #### Create Model Creates a new [model](https://docs.aporia.com/v1/core-concepts/model-versions#model). ``` POST https://app.aporia.com/v1beta/models { "id": "my-model", "name": "My Model", "description": "My awesome model", "color": "turquoise", "icon": "fraud-detection", "owner": "owner@example.com", "tags": { "foo": "bar" } } ``` ``` { "id": "my-model" } ``` **Request Parameters** | Parameter | Type | Required | Description | | ----------- | --------------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | id | str | False | A unique identifier for the new model, which will be used in all future operations. If this parameter is not passed, an id will be generated from the `name` parameter | | name | str | True | A name for the new model, which will be displayed in Aporia's dashboard | | description | str | False | A description of the model | | color | ModelColor | False | A color to distinguish the model in Aporia's dashboard. Defaults to `blue` | | icon | ModelIcon | False | An icon that indicates the model's designation. Defaults to `general` | | owner | str | False | The email of the model owner (must be a registered aporia user) | | tags | Dict\[str, str] | False | A mapping of tag keys to tag values | **ModelColor options:** `blue`,`arctic_blue`, `green`, `turquoise`, `pink`, `purple`, `yellow`, `red` **ModelIcon options**: `general`, `churn-and-retention`, `conversion-predict`, `anomaly`, `dynamic-pricing`, `email-filtering`, `demand-forecasting`, `ltv`, `personalization`, `fraud-detection`, `credit-risk`, `recommendations` **Response** | Value | Type | Description | | ----- | ---- | --------------------------------- | | id | str | The id of the newly created model | #### Delete Model Deletes a model. ``` DELETE https://app.aporia.com/v1beta/models// ``` **Path Parameters** | Parameter | Description | | --------- | ------------------------------ | | model\_id | The ID of the model to delete. | #### Get Model Versions Returns all model versions and their creation date. ``` GET https://app.aporia.com/v1beta/models//versions ``` ``` [ { "id": "4dc246a2-0fd4-4342-8e30-95c2b43e8b63", "name": "v1", "model_type": "regression", "created_at": "2021-10-03T10:23:00.913784+00:00" }, { "id": "21a6ee3f-8102-4e54-90bd-5809cff409cd", "name": "v2", "model_type": "regression", "created_at": "2021-10-03T10:33:54.073001+00:00" } ] ``` **Path Parameters** | Parameter | Description | | --------- | ----------------------------------------------------- | | model\_id | The ID of the model whose versions you wish to fetch. | **Response** A List of VersionDetails objects, each with the following format: | Value | Type | Description | | ----------- | ---- | ----------------------------------------------------------------------- | | id | str | Version id. | | name | str | Version name. | | model\_type | str | The type of the model created by the version (regression, binary, etc). | | created\_at | str | The creation date of the version. | #### Create Model Version Defines a new version for an existing model. 
``` POST https://app.aporia.com/v1beta/models//versions { "name": "v1", "model_type": "binary", "version_schema": { "features": { "amount": "numeric", "owner": "string", "is_new": "boolean", "created_at": "datetime" }, "predictions": { "approved": "boolean", "another_output_field": "numeric" } }, "feature_importance" : { "amount": 100, "owner": 20, "is_new": 50, "created_at": 10 } } ``` ``` { "id": "d84a497b-6a13-49e3-91f0-b01117f49ac7" } ``` **Path Parameters** | Parameter | Description | | --------- | -------------------------------------------------------------- | | model\_id | The ID of the model for which the new version is being defined | **Request Parameters** | Parameter | Type | Required | Description | | ------------------- | ----------------- | --------------------------------------------------------------------- | ------------------------------------------------ | | name | str | True | A unique name for the new model version | | model\_type | ModelType | True | Model type | | version\_schema | object | The schema for the new version, mapping various fields to their types | | | feature\_importance | Dict\[str, float] | False | Mapping between feature name to it's importance. | **Notes** * **ModelType options:** `binary`, `multiclass`, `multi-label`, `regression` * **Feature positions:** When reporting a model schema, there is an optional argument called feature\_positions. This argument provides mapping of feature names to feature positions in the dataframe which the model receives. Feature Positions are required for Explainability capabilities. In the console, to explain a data point, go to Model Overview -> Investigation Toolbox -> Data points and click Explain on a specific data point. For example: ``` "feature_positions":{ "Age":1, "Gender:2 } ``` **Response** | Value | Type | Description | | ----- | ---- | ------------------------------- | | id | UUID | The id of the new model version | #### Create Monitor Creates a new monitor. The documentation for each monitor contains an example of creating that monitor using the REST API. ``` POST https://app.aporia.com/v1beta/monitors { "name": "Hourly Predictions > 100", "type": "model_activity", "scheduling": "*/5 * * * *", "configuration": { "configuration": { "focal": { "source": "SERVING", "timePeriod": "1h" }, "metric": { "type": "count", "field": "_id" }, "actions": [ { "type": "ALERT", "schema": "v1", "severity": "MEDIUM", "alertType": "model_activity_threshold", "description": "An anomaly in the number of total predictions within the defined limits was detected.
The anomaly was observed in the {model} model, in version {model_version} for the last {focal_time_period} ({focal_times}) {focal_segment}.

Based on defined limits, the count was expected to be above {min_threshold}, but {focal_value} was received.
", "notification": [ { "type": "EMAIL", "emails": [ "dev@aporia.com" ] } ], "visualization": "value_over_time" } ], "logicEvaluations": [ { "max": null, "min": 100, "name": "RANGE" } ] }, "identification": { "models": { "id": "seed-0000-5wfh" }, "segment": { "group": null }, "environment": null } } } ``` ``` { "id": "a5d11808-0a42-4d25-84fa-0cc71173044c" } ``` **Request Parameters** | Parameter | Type | Required | Description | | -------------------------- | ----------- | -------- | ---------------------------------------------------------------------------------------------- | | name | str | True | A name for the new monitor, which will be displayed in Aporia's dashboard | | type | MonitorType | True | The type of monitor to create | | scheduling | str | True | A cron expression that indicates how often the monitor will run | | configuration | object | True | The monitor's configuration | | is\_active | bool | False | True if the new monitor should be created as active, False if it should be created as inactive | | custom\_alert\_description | str | False | A custom description for the alerts generated by this monitor | **MonitorType options:** `model_activity`, `missing_values`, `data_drift`, `prediction_drift`, `values_range`, `new_values`, `model_staleness`, `performance_degradation`, `metric_change`, `custom_metric` **Response** | Value | Type | Description | | ----- | ---- | ----------------------------------- | | id | UUID | The id of the newly created monitor | #### Delete Monitor Deletes a monitor. ``` DELETE https://app.aporia.com/v1beta/monitors// ``` **Path Parameters** | Parameter | Description | | ----------- | -------------------------------- | | monitor\_id | The ID of the monitor to delete. | #### Get Existing Environments Return the defined environments. ``` GET https://app.aporia.com/v1beta/environments ``` ``` { "environments": [ { "id": "12345678-1234-1234-1234-1234567890abc", "name": "local-dev" } ] } ``` **Request Parameters** No parameters required for the request. **Response** Return "environments" list of objects with the following fields: | Value | Type | Description | | ----- | ---- | --------------------------- | | id | UUID | The id of the environment | | name | str | The name of the environment | #### Get Model Tags Returns all of the tags that were defined for a model. ``` GET https://app.aporia.com/v1beta/models//tags ``` ``` { "tags": { "foo": "bar", "tag_key": "tag_value" } } ``` **Path Parameters** | Parameter | Description | | --------- | ------------------------------------------------- | | model\_id | The ID of the model whose tags you wish to fetch. | **Response** | Value | Type | Description | | ----- | --------------- | ----------------------------------- | | tags | Dict\[str, str] | A mapping of tag keys to tag values | #### Delete Model Tag Deletes a single model tag. ``` DELETE https://app.aporia.com/v1beta/models//tags/ ``` **Path Parameters** | Parameter | Description | | --------- | ------------------------------------------------- | | model\_id | The ID of the model whose tags you wish to fetch. | | tag\_key | The key of the tag to delete. | #### Create Model Tags Creates or updates model tags. ``` POST https://app.aporia.com/v1beta/models//tags { "tags": { "tag_1": "value_1", "foo": "bar", "my tag key": "my-tag-value!" } } ``` **Path Parameters** | Parameter | Description | | --------- | ------------------------------------------------- | | model\_id | The ID of the model whose tags you wish to fetch. 
| **Request Parameters** | Parameter | Type | Required | Description | | --------- | --------------- | -------- | ----------------------------------- | | tags | Dict\[str, str] | True | A mapping of tag keys to tag values | **Notes** * Each model is restricted to 10 tags * Tag keys are restricted to 15 characters, and may only contain letters, numbers, spaces, '-' and '\_'. * Tag values are restricted to 100 characters, and may only contain letters, numbers and special characters * If a tag key already exists, you can use this enpoint to update its value #### Update Model Owner Update the owner of an existing model. ``` POST https://app.aporia.com/v1beta/models//owner { "owner": "owner@example.com" } ``` **Path Parameters** | Parameter | Description | | --------- | ---------------------------------------------------------------- | | model\_id | The ID of the model for which you would like to update an owner. | **Request Parameters** | Parameter | Type | Required | Description | | --------- | ---- | -------- | ---------------------------------------------------------------- | | owner | str | True | The email of the model owner (must be a registered aporia user). | **Response** | Value | Type | Description | | --------- | ---- | ------------------------------------- | | model\_id | str | The ID of the model that was updated. | | owner | str | The email of new model's owner. | #### Update Feature Positions Update feature positions for an existing model version. Feature Positions are required for Explainability capabilities. In the console, to explain a datapoint, go to Model Overview -> Investigation Toolbox -> Datapoints and click Explain on a specific datapoint. ``` POST https://app.aporia.com/v1beta/models/{model_id}/versions/{model_version}/feature_positions { "feature_positions":{ "Age": 1, "Gender: 2 } } ``` **Path Parameters** | Parameter | Description | | -------------- | --------------------------------------------------------------------------- | | model\_id | The ID of the model for which you would like to update features' positions. | | model\_version | The version for which you would like to update an features' positions. | **Request Parameters** | Parameter | Type | Required | Description | | ------------------ | ---- | -------- | ---------------------------------------------------------------------------------------- | | feature\_positions | dict | True | Mapping of feature names to feature positions in the dataframe which the model receives. | **Notes** * Features should be identical to the model schema. #### Update Feature Importance Update feature importance for an existing model version. ``` POST https://app.aporia.com/v1beta/models/{model_id}/versions/{model_version}/feature_importance { "feature_importance":{ "Age": 100, "Gender: 50 } } ``` **Path Parameters** | Parameter | Description | | -------------- | ---------------------------------------------------------------------------- | | model\_id | The ID of the model for which you would like to update features' importance. | | model\_version | The version for which you would like to update an features' importance. | **Request Parameters** | Parameter | Type | Required | Description | | ------------------- | ---- | -------- | ----------------------------------------------- | | feature\_importance | dict | True | Mapping of feature names to feature importance. | **Notes** * Mapping of features from the scema and their importance is expected. Partial mappings are also supported. 
* When using the API call, all previous reported feature importance values will be overridden. --- # Source: https://docs.aporia.com/data-sources/s3.md # Source: https://docs.aporia.com/v1/data-sources/s3.md # Amazon S3 This guide describes how to connect Aporia to an S3 data source in order to monitor a new ML Model in production. We will assume that your model inputs, outputs and optionally delayed actuals are stored in a file in S3. Currently, the following file formats are supported: * `parquet` * `json` * `csv` * `delta` This data source may also be used to connect to your model's training/test set to be used as a baseline for model monitoring. ### Create a IAM role for S3 access In order to provide access to S3, create a IAM role with the necessary API permissions. #### Step 1: Create Role 1. Log into your AWS Console and go to the **IAM** console. 2. Click the **Roles** tab in the sidebar. 3. Click **Create role**. 4. In **Select type of trusted entity**, click the **Web Identity** tile.
5. Under **Identity Provider**, click on **Create New**. 6. Under **Provider Type**, click the **OpenID Connect** tile. 7. In the **Provider URL** field, enter the Aporia cluster OIDC URL. 8. In the Audience field, enter "sts.amazonaws.com". 9. Click the **Add provider** button. 10. Close the new tab 11. Refresh the **Identity Provider** list. 12. Select the newly created identity provider. 13. In the **Audience** field, select “sts.amazonaws.com”. 14. Click the **Next** button. 15. Click the **Next** button. 16. In the **Role name** field, enter a role name.
#### Step 2: Create an access policy 1. In the list of roles, click the role you created. 2. Add an inline policy. 3. On the Permissions tab, click **Add permissions** then click **Create inline policy**.\
4. In the policy editor, click the **JSON** tab.
5. Copy the following access policy, and make sure to fill in your correct bucket name.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:List*"
      ],
      "Resource": [
        "arn:aws:s3:::<BUCKET_NAME>",
        "arn:aws:s3:::<BUCKET_NAME>/*"
      ]
    }
  ]
}
```

6. Click **Review Policy**.
7. In the **Name** field, enter a policy name.
8. Click **Create policy**.
9. If you use Service Control Policies to deny certain actions at the AWS account level, ensure that `sts:AssumeRoleWithWebIdentity` is allowlisted so Aporia can assume the cross-account role.
10. In the role summary, copy the **Role ARN**.

Next, please provide your Aporia account manager with the Role ARN for the role you've just created.

### Creating an S3 data source in Aporia

To create a new model to be monitored in Aporia, you can call the `aporia.create_model(...)` API:

```python
aporia.create_model("<MODEL_ID>", "<MODEL_NAME>")
```

Each model in Aporia contains different **Model Versions**. When you (re)train your model, you should create a new model version in Aporia.

```python
apr_model = aporia.create_model_version(
    model_id="<MODEL_ID>",
    model_version="v1",
    model_type="binary",

    raw_inputs={
        "raw_text": "text",
    },

    features={
        "amount": "numeric",
        "owner": "string",
        "is_new": "boolean",
        "embeddings": {"type": "tensor", "dimensions": [768]},
    },

    predictions={
        "will_buy_insurance": "boolean",
        "proba": "numeric",
    },
)
```

Each raw input, feature or prediction is mapped by default to the column of the same name in the file. By creating a feature named `amount` or a prediction named `proba`, for example, the S3 data source will expect a column in the file named `amount` or `proba`, respectively.

Next, create an instance of `S3DataSource` and pass it to `apr_model.connect_serving(...)` or `apr_model.connect_training(...)`:

```python
data_source = S3DataSource(
    object_path="s3://my-bucket/my-file.parquet",
    object_format="parquet",  # other options: csv, json, delta

    # Optional - use the select_expr param to apply additional Spark SQL
    select_expr=["", ...],

    # Optional - use the read_options param to apply any Spark configuration
    # (e.g custom Spark resources necessary for this model)
    read_options={...}
)

apr_model.connect_serving(
    data_source=data_source,

    # Names of the prediction ID and prediction timestamp columns
    id_column="prediction_id",
    timestamp_column="prediction_timestamp",
)
```

Note that as part of the `connect_serving` API, you are required to specify 2 additional columns:

* `id_column` - A unique ID to represent this prediction.
* `timestamp_column` - A column representing when this prediction occurred.

### What's Next

For more information on:

* Advanced feature / prediction <-> column mapping
* How to integrate delayed actuals
* How to integrate training / test sets

Please see the [Data Sources Overview](https://docs.aporia.com/v1/data-sources) page.

---
# Source: https://docs.aporia.com/explainability/shap-values.md

# SHAP values

In the following guide we will explain how one can visualize SHAP values in Aporia to gain better explainability for their model’s predictions and increase trust.
### Ingest your SHAP values

Ingesting your SHAP values in Aporia can be done by adding a column with the naming convention `<feature_name>_shap`. For example, the SHAP column corresponding to a feature `featureX` would be `featureX_shap`.

Please note:

1. The SHAP column should not be mapped to the version schema, but you must include it in your SQL query when integrating your training/serving dataset.
2. `_shap` must be lowercase, and `<feature_name>` must match the case of the feature name in Aporia.

For those of you who use Snowflake, note that if the value is read directly from a table using `SELECT *`, the casing of the column name is preserved. Otherwise, you can force Snowflake to preserve case by using double quotes in the query. For example, `SELECT 1 AS a, 2 AS "b"` would return a table with 2 columns: `A` and `b`.

### Explain your predictions

Exploring SHAP values can be done via our Data Points cell as part of an Investigation Case. When clicking on Explain, you'll be able to view all the available SHAP values, as well as get a textual business explanation which you can share with stakeholders.
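Tying this back to the ingestion convention above, a serving query that exposes one extra `_shap` column per feature might look like the sketch below; the table and column names here are purely illustrative assumptions:

```python
# Illustrative sketch only: the table and column names are assumptions.
# Each feature column gets a sibling "<feature_name>_shap" column; the *_shap
# columns stay out of the version schema but remain part of the query.
serving_query = """
SELECT
    prediction_id,
    prediction_timestamp,
    age,
    age_shap,        -- SHAP value for the "age" feature
    amount,
    amount_shap,     -- SHAP value for the "amount" feature
    proba
FROM model_predictions
"""
```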

*Click on Explain to view the SHAP values of the chosen prediction*

*Copy the business explanation to share with stakeholders*

--- # Source: https://docs.aporia.com/integrations/slack-integration.md # Source: https://docs.aporia.com/v1/integrations/slack-integration.md # Slack You can integrate Aporia with Slack to receive alerts and notifications directly to your Slack workspace. Integrations can be found in the "Integrations" page, accessible through the sidebar: ![All Integrations](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FogmjRuZ5XbmPSJwXoBPm%2Fall_integrations.png?alt=media) ### Setting up the Slack Integration After clicking the Slack integration, you will be redirected to Slack, where you will need to allow Aporia to post to a channel in your Slack workspace: ![Authorize Slack Integration](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2Fzx5CRTQtyBcanZJUIcMs%2Fslack_authorize.png?alt=media) Choosing a channel will redirect you back to Aporia: ![Slack Success](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FRD2AQUqguIwF3vUqMGBE%2Fslack_success.png?alt=media) You can then send a test message, or remove the integration, through the Slack integration page: ![Slack Manage](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FL7k5qz2g3sr6BtsvkLE5%2Fslack_manage.png?alt=media) ### Sending Alerts to Slack After setting up the Slack integration, you can configure monitors to send a message to your chosen slack channel when an anomaly is detected: ![Slack in Monitor Config](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2Fr5q60CsDDUN3sUXqTJWM%2Fslack_monitor.png?alt=media) ### Tagging Users in Slack You can easily tag users in the Slack notifications using an alert's custom description. Get the user id from Slack: ![Get Slack User ID](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2F2qvAvDpDRdLipyQ21rPS%2F1.png?alt=media) Insert it in the custom description:
The user tag should be in the form of `<@user_id>`. Save the monitor.

Now, whenever you receive a Slack alert, the user will be tagged in the message:

![Alert Custom Description](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FKqZab93b61LqhA0NtobO%2F3.png?alt=media)

---

# Source: https://docs.aporia.com/data-sources/snowflake.md
# Source: https://docs.aporia.com/v1/data-sources/snowflake.md

# Snowflake

This guide describes how to connect Aporia to a Snowflake data source in order to monitor a new ML Model in production.

We will assume that your model inputs, outputs, and optionally delayed actuals can be queried with Snowflake SQL. This data source may also be used to connect to your model's training/test set to be used as a baseline for model monitoring.

### Create a Service Account for Snowflake access

In order to provide access to Snowflake, create a read-only service account for Aporia in Snowflake.

Please use the SQL snippet below to create a service account for Aporia. Before using the snippet, you will need to populate the following:

* `<APORIA_PASSWORD>`: Strong password to be used by the service account user.
* `<YOUR_DB_NAME>`: Snowflake database with your ML training / inference data.

```sql
-- Configuration
set aporia_username='APORIA';
set aporia_password='<APORIA_PASSWORD>';
set aporia_role_name='APORIA_ROLE';
set dbname='<YOUR_DB_NAME>';

-- Set role for grants
USE ROLE ACCOUNTADMIN;

-- Create the role Aporia will use
CREATE ROLE IF NOT EXISTS identifier($aporia_role_name);

-- Create Aporia's user and grant access to role
CREATE USER IF NOT EXISTS identifier($aporia_username)
    PASSWORD=$aporia_password
    DEFAULT_ROLE=$aporia_role_name;
GRANT ROLE identifier($aporia_role_name) TO USER identifier($aporia_username);

-- Grant read-only privileges to the database
GRANT SELECT ON ALL TABLES IN DATABASE identifier($dbname) TO ROLE identifier($aporia_role_name);
GRANT SELECT ON ALL VIEWS IN DATABASE identifier($dbname) TO ROLE identifier($aporia_role_name);

USE DATABASE identifier($dbname);
```

### Creating a Snowflake data source in Aporia

To create a new model to be monitored in Aporia, you can call the `aporia.create_model(...)` API:

```python
aporia.create_model("<MODEL_ID>", "<MODEL_NAME>")
```

Each model in Aporia contains different **Model Versions**. When you (re)train your model, you should create a new model version in Aporia.

```python
apr_model = aporia.create_model_version(
    model_id="<MODEL_ID>",
    model_version="v1",
    model_type="binary",
    raw_inputs={
        "raw_text": "text",
    },
    features={
        "amount": "numeric",
        "owner": "string",
        "is_new": "boolean",
        "embeddings": {"type": "tensor", "dimensions": [768]},
    },
    predictions={
        "will_buy_insurance": "boolean",
        "proba": "numeric",
    },
)
```

Each raw input, feature, or prediction is mapped by default to the column of the same name in the Snowflake query. By creating a feature named `amount` or a prediction named `proba`, for example, the Snowflake data source will expect a column in the Snowflake query named `amount` or `proba`, respectively.
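If your physical column names in Snowflake don't match the schema above, you can alias them in the query so the default name-based mapping still works. A minimal sketch, assuming hypothetical physical columns such as `TXN_AMOUNT` and `MODEL_SCORE` (double quotes keep the aliases lowercase, since Snowflake uppercases unquoted identifiers):

```python
# Hypothetical physical columns aliased to the schema names declared above;
# remaining columns (e.g. raw inputs) are omitted for brevity.
query = """
SELECT
    PREDICTION_ID      AS "prediction_id",
    PREDICTION_TS      AS "prediction_timestamp",
    TXN_AMOUNT         AS "amount",
    ACCOUNT_OWNER      AS "owner",
    IS_NEW_CUSTOMER    AS "is_new",
    WILL_BUY_INSURANCE AS "will_buy_insurance",
    MODEL_SCORE        AS "proba"
FROM "my_db"."model_predictions"
"""
```

You would then pass this string as the `query` argument of the `SnowflakeDataSource` shown next.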
Next, create an instance of `SnowflakeDataSource` and pass it to `apr_model.connect_serving(...)` or `apr_model.connect_training(...)`:

```python
data_source = SnowflakeDataSource(
    url="<SNOWFLAKE_URL>",
    query='SELECT * FROM "my_db"."model_predictions"',
    user="APORIA",
    password="<APORIA_PASSWORD>",
    database="<YOUR_DB_NAME>",
    schema="<YOUR_SCHEMA>",
    warehouse="<YOUR_WAREHOUSE>",  # Optional

    # Optional - use the select_expr param to apply additional Spark SQL
    select_expr=["<SPARK_SQL_EXPRESSION>", ...],

    # Optional - use the read_options param to apply any Spark configuration
    # (e.g custom Spark resources necessary for this model)
    read_options={...}
)

apr_model.connect_serving(
    data_source=data_source,

    # Names of the prediction ID and prediction timestamp columns
    id_column="prediction_id",
    timestamp_column="prediction_timestamp",
)
```

Note that as part of the `connect_serving` API, you are required to specify two additional columns:

* `id_column` - A unique ID that represents this prediction.
* `timestamp_column` - A column representing when this prediction occurred.

### What's Next

For more information on:

* Advanced feature / prediction <-> column mapping
* How to integrate delayed actuals
* How to integrate training / test sets

Please see the [Data Sources Overview](https://docs.aporia.com/v1/data-sources) page.

---

# Source: https://docs.aporia.com/integrations/sso-saml-integration.md
# Source: https://docs.aporia.com/v1/integrations/sso-saml-integration.md

# Single Sign On (SAML)

You can easily give your team access to Aporia using your favorite SAML IdP.

The integration can be found on the "Integrations" page, accessible through the sidebar:

![All Integrations](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FogmjRuZ5XbmPSJwXoBPm%2Fall_integrations.png?alt=media)

### Setting up the SAML integration

After clicking the **Connect** button inside the **SAML Single sign on** card (only available for *Professional* users), you will be redirected to the "Integrations" page.

![Exchange data](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FStxU1ERCg8tp56OoPvnG%2Fsaml_exchange_data.png?alt=media)

### Integrate with your IdP

Create a new application in your favorite SAML IdP and fill in the relevant details under the **For you** title. Here's a demonstration of the process using OKTA:

1. Sign in to your OKTA dev account.
2. In the sidebar, click on **Applications -> Applications**.
3. Click on the **Create App Integration** button.
4. Choose **SAML 2.0** and click **Next**. You should now be in step 1 of the creation wizard, named **General settings**.
5. Fill in the **App name** as **Aporia** and click **Next**. This moves you to step 2, **Configure SAML**.
6. Fill in the **Single sign on URL** and **Audience URI** according to the fields in the **SAML integration** page in Aporia.
7. Scroll to the **Attribute Statements** section. Fill in the data as follows:

   | Name | Name format   | Value          |
   | ---- | ------------- | -------------- |
   |      | URI Reference | user.email     |
   |      | URI Reference | user.firstName |
   |      | URI Reference | user.lastName  |

   Click on the **Add Another** button to add a new attribute.

8. Scroll down and click **Next**. In step 3, fill in the requested data as you see fit and click **Finish**.

### Integrate with Aporia

1. Inside your OKTA application page, click on the **Sign On** tab.
2. Scroll down and click the **View Setup Instructions** button.
3. Copy the value under **Identity Provider Single Sign-On URL** and download the **X.509 Certificate**.
4. In Aporia, under the **For us** title, fill in the data you gathered from step 3 and click on **Connect**.
5. You'll be redirected to the **Integration success page** where you'll be able to see and edit your connection data.
You can now go and test your connection using the IdP-initiated login link.

---

# Source: https://docs.aporia.com/introduction/support.md
# Source: https://docs.aporia.com/v1/introduction/support.md

# Support

Need help? Want something more? Reach out! 📧

### Email Support

Email us and we'll (usually) respond within a few hours, at most 24 hours. 😅

### Schedule a Call

Schedule a call with one of our team members. We'd be happy to walk you through the platform and help you onboard your first model! 🚀

[Schedule a call](https://www.aporia.com/request-a-demo/)

---

# Source: https://docs.aporia.com/integrations/teams.md

# Teams

You can integrate Aporia with Microsoft Teams to receive alerts and notifications directly in your Teams channels.

### Setting up the Teams Integration

1. Create an incoming webhook for the desired Teams channel according to the following guidance: [Microsoft Support](https://support.microsoft.com/en-us/office/post-a-workflow-when-a-webhook-request-is-received-in-microsoft-teams-8ae491c7-0394-4861-ba59-055e33f75498)
2. Log into Aporia's console. On the navbar on the left, click on **Integrations**, switch to the **Applications** tab, and choose **Teams**:

Teams Integration

3. Enter your **Integration Name** and **Webhook URL**. The URL should include the scheme (http/https).

Integration Configuration

4. Click **Save**. On success, the Save button will become disabled, and you'll be able to **Test** or **Remove** the integration.

**Congratulations: You've now successfully added your Teams integration to Aporia!**

After integrating Teams, you'll be able to select your Teams channel as an alert destination in the **Custom mode** of the monitor builder.

Monitor configuration

### Alert's format

The alert will be sent to your Teams channel with a link to the alert in Aporia.

Alert

Happy Monitoring!

---

# Source: https://docs.aporia.com/core-concepts/tracking-data-segments.md
# Source: https://docs.aporia.com/v1/core-concepts/tracking-data-segments.md

# Tracking Data Segments

Sometimes looking at our data as a whole doesn't give us enough insight to decide what to do next. We need the ability to break our data into smaller pieces to reach valuable, sharp insights. This is exactly where data segmentation comes to our help!

Zooming into a specific data segment can help us understand whether an overall performance degradation originates in just that segment or whether we have a wider problem. Comparing two different segments can help us decide which of them is more valuable to invest in for our next campaign.

### Tracking Data Segments

There are infinite ways to segment your data. Let's say we want to segment our subjects by their age. What interval between bins should we choose? Should that interval be constant, or should it correspond to a real-world segmentation?

Don't be tempted to create them all. Think about which segmentation can help you answer truly valuable questions that may influence the actions you'll take. For example, gender is often just raw data and not a feature, but slicing your data by gender can help you surface performance differences or even biases. In such cases, you should even consider monitoring specific issues per segment.

---

# Source: https://docs.aporia.com/core-concepts/understanding-data-drift.md
# Source: https://docs.aporia.com/v1/core-concepts/understanding-data-drift.md

# Understanding Data Drift

### What is Data Drift?

Data drift occurs when the distribution of *production data* is different from a certain baseline (e.g *training data*). The model isn't designed to deal with this change in the feature space, and so its predictions may not be reliable.

Drift can be caused by changes in the real world or by data pipeline issues - missing data, new values, changes to the schema, etc. It's important to look at the data that has drifted and follow it back through its pipeline to find out **when** and **where** the drift started.

{% hint style="info" %}
**When should I retrain my model?**

As the data begins to drift, we may not notice significant degradation in our model's performance immediately. However, this is an excellent opportunity to retrain before the drift has a negative impact on performance.
{% endhint %}

### Measuring Data Drift

To measure how distributions differ from each other, you can use a **statistical distance**. This is a metric that quantifies the distance between two distributions, and it is extremely useful. There are many different statistical distances for different scenarios.
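To build intuition for what such a statistical distance looks like, here is a standalone sketch (illustrative only, not Aporia's implementation) that computes the Hellinger distance between a training distribution and a few hypothetical production distributions of a categorical feature:

```python
import math

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions: 0 = identical, 1 = completely disjoint."""
    return math.sqrt(sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))) / math.sqrt(2)

# Baseline (training) distribution of a categorical feature, e.g. 100% dog / 0% cat
training = [1.0, 0.0]

# A few hypothetical production (serving) distributions
for serving in ([0.0, 1.0], [0.5, 0.5], [0.6, 0.4], [1.0, 0.0]):
    print(serving, round(hellinger_distance(training, serving), 2))

# Prints 1.0, 0.54, 0.47 and 0.0 respectively
```

For a categorical feature, these values line up with the Drift Score intuition example below.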

Is there a data drift here? :)

By default, Aporia calculates a metric called **Drift Score**, which is a smart combination of statistical distances such as [Hellinger Distance](https://en.wikipedia.org/wiki/Hellinger_distance) for categorical variables and [Jensen-Shannon Divergence](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence) for numeric variables.

Besides the default drift score, you can customize and add your own statistical distances.

### Intuition to Drift Score

Let's say we have a categorical feature called `pet_type` with 2 possible values:

* 🐶 Dog
* 🐱 Cat

In our training set, the distribution of this feature was **100% 🐶** + **0% 🐱**. This means that when we trained our model, we only had dogs and no cats.

Now, let's evaluate different scenarios in production, and see what would be the drift score:

* If the current distribution is **0% 🐶** + **100% 🐱**, the drift score would be **1.0**.
  * Tons of drift!
* If the current distribution is **50% 🐶** + **50% 🐱**, the drift score would be **0.54**.
* If the current distribution is **60% 🐶** + **40% 🐱**, the drift score would be **0.47**.
* If the current distribution is **100% 🐶** + **0% 🐱**, the drift score would be **0.0**.
  * No drift at all!

---

# Source: https://docs.aporia.com/monitors-and-alerts/value-range.md
# Source: https://docs.aporia.com/v1/monitors/value-range.md

# Value Range

### Why Monitor Value Range?

Monitoring changes in the value range of numeric fields helps to locate and examine anomalies in the model's input. For example, setting the monitor for a feature named `hour_sin` with the range `-1 <= x <= 1` will help us discover issues in the model's input.

### Comparison methods

For this monitor, the following comparison methods are available:

* [Change in percentage](https://docs.aporia.com/v1/monitor-template#comparison-methods)
* [Absolute value](https://docs.aporia.com/v1/monitor-template#comparison-methods)
* [Compared to segment](https://docs.aporia.com/v1/monitor-template#comparison-methods)
* [Compared to training](https://docs.aporia.com/v1/monitor-template#comparison-methods)

### Customizing your monitor

Configuration may vary slightly depending on the comparison method you choose.

#### STEP 1: choose the fields you would like to monitor

You may select as many fields as you want (from features/raw inputs) 😊

Note that the monitor will run on each selected field separately.

#### STEP 2: choose inspection period and baseline

For the fields you chose in the previous step, the monitor will raise an alert if the value range in the inspection period exceeds your threshold boundaries compared to the baseline's value range.

#### STEP 3: calibrate thresholds

This step is important to make sure you get the right amount of alerts for your needs. You can always readjust it later if needed.

---

# Source: https://docs.aporia.com/integrations/webhook.md
# Source: https://docs.aporia.com/v1/integrations/webhook.md

# Webhook

Aporia allows you to send alerts generated by Aporia's monitors to any system using webhooks.

### Add a Webhook integration

1. Log into Aporia's console. On the navbar on the left, click on **Integrations** and choose **Webhook**.
2. Enter your **Integration Name**, **Webhook URL**, and **Custom Headers** (optional). The URL should include the scheme (http/https).
3. Click **Save**. On success, the Save button will become disabled, and you'll be able to **Test** or **Remove** the integration.

**Congratulations: You've now successfully added your webhook integration to Aporia!**

After integrating your webhook, you'll be able to select your webhook as an alert destination in the **Custom mode** of the monitor builder.

![Webhook Integration](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FTt77HwIpGD9M0GsGEq1Q%2Fwebhook-action.png?alt=media)

### Alert's format

The alert will be sent by a **POST** request to the URL defined in the integration, as a JSON in the following format:

| Key                 | Description                                                          |
| ------------------- | -------------------------------------------------------------------- |
| alert\_id           | The ID of the alert.                                                  |
| monitor\_type       | The type of the monitor that raised the alert.                        |
| monitor\_id         | The ID of the monitor that raised the alert.                          |
| monitor\_name       | The name of the monitor that raised the alert.                        |
| model\_id           | The ID of the model the monitor was created on.                       |
| model\_name         | The name of the model the monitor was created on.                     |
| severity            | The severity of the alert as defined when building the monitor.       |
| environment         | The model environment in which the alert was raised.                  |
| pretty\_description | A short, pretty summary of the specific alert.                        |
| dashboard\_link     | A link to the alert in Aporia's dashboard for further investigation.  |

You'll be able to see an example alert by clicking on **Test** in the Webhook Integration page mentioned in the previous section.

Happy Monitoring!

---

# Source: https://docs.aporia.com/v1/welcome-to-aporia.md

# Welcome to Aporia!

Data Science and ML teams rely on Aporia to **visualize** their models in production, as well as **detect and resolve** data drift, model performance degradation, and data integrity issues. Aporia offers quick and simple deployment and can monitor billions of predictions with low cloud costs.

We understand that use cases vary and each model is unique; that's why we've cemented **customization** at our core, allowing our users to tailor their dashboards, monitors, metrics, and data segments to their needs.
## Monitor your models in 3 easy steps
* **Learn** - Learn about data drift, measuring model performance in production across various data segments, and other ML monitoring concepts. (see `why-monitor-ml-models`)
* **Connect** - Connect to an existing database where you already store the predictions of your models. (see `data-sources`)
* **Monitor** - Build a dashboard to visualize your model in production and create alerts to notify you when something bad happens. (see `monitors`)
---

# Source: https://docs.aporia.com/core-concepts/why-monitor-ml-models.md
# Source: https://docs.aporia.com/v1/core-concepts/why-monitor-ml-models.md

# Why Monitor ML Models?

You spent *months* working on a sophisticated model, and finally deployed it to production.

8 months later, and the model is still running. Making amazing predictions. Increasing business KPIs by a ton - boss is happy. Satisfied with the results, you move on to this next-gen super cool deep learning computer vision project.

**Sounds like a dream?**

![To Production and Beyond](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2F9gzmVNCmvR26i4areXJs%2Fto-production-and-beyond.jpg?alt=media)

***

### The Real Work Begins

Even though we spend a lot of time training and testing our models, *the real work begins when we deploy them to production.*

It's one of the most fundamental differences between ML and traditional software engineering. With traditional software, most of the work is done during the development phase, and once the system is up and running - as long as we've tested it thoroughly - it usually works the way we planned.

With Machine Learning, it *doesn't matter* how well we test our models after training them. **When models run in production, they are exposed to data that's different from what they've been trained on.** Naturally, their performance degrades over time.

### Simple Workflow for ML in Production

Don't panic! Even though models in production do degrade over time, it doesn't mean you'll have to actively take care of each one of them every single minute they're in production.

With two simple principles, you'll be able to move on to that super cool next-gen computer vision project, while knowing your production models are in safe hands:

#### 1. Build a Custom Dashboard

Each one of your models should have a customized production dashboard where you can easily see *the most important metrics* about it.

**Put something on your calendar**, and take a look at these dashboards from time to time, to make sure your models are on track!

![Custom Dashboards](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FidvRA9LWpa0iR2EPITs4%2Fcustom-dashboards.png?alt=media)

**Bonus points** if you put your dashboard on a big TV screen in the office!

#### 2. Set up important alerts

You should also set up alerts to detect drift, performance degradation, data integrity issues, anomalies in your custom metrics, etc.

To avoid false positives and alert fatigue, make sure to customize the alerts so they only trigger when something important happens.

![Monitor Builder](https://1009457926-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiBXs570GNM7Jbx4EBQ9%2Fuploads%2FPw4zlFSBxKwWZwDfZ0d2%2Fmonitor-builder.png?alt=media)