> HOST_IP=127.0.0.1 docker compose -f docker/docker-compose.yml --profile prod up
> HOST_IP=127.0.0.1 docker compose -f docker/docker-compose.yml --profile dev up

To configure HAMSTRING according to your needs, use the provided config.yaml.
The most relevant settings are related to your specific log line format, the model you want to use, and possibly infrastructure.
The section pipeline.log_collection.collector.logline_format has to be adjusted to reflect your specific input log
line format. Using the flexible log line configuration, you can rename, reorder, and fully configure each
field of a valid log line. Fields can be defined as timestamps, RegEx patterns, lists, or IP addresses. For example, your
configuration might look as follows:
- [ "timestamp", Timestamp, "%Y-%m-%dT%H:%M:%S.%fZ" ]
- [ "status_code", ListItem, [ "NOERROR", "NXDOMAIN" ], [ "NXDOMAIN" ] ]
- [ "client_ip", IpAddress ]
- [ "dns_server_ip", IpAddress ]
- [ "domain_name", RegEx, '^(?=.{1,253}$)((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+[A-Za-z]{2,63}$' ]
- [ "record_type", ListItem, [ "A", "AAAA" ] ]
- [ "response_ip", IpAddress ]
- [ "size", RegEx, '^\d+b$' ]

The options pipeline.data_inspection and pipeline.data_analysis are relevant for configuring the model. The section
environment can be fine-tuned to prevent naming collisions for Kafka topics and adjust addressing in your environment.
For more in-depth information on your options, have a look at our official documentation, where we provide tables explaining all values in detail.
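To make the field types above (Timestamp, ListItem, RegEx, IpAddress) more concrete, here is a minimal validation sketch in Python. The space-separated line format, the reduced field list, and all function names are illustrative assumptions, not HAMSTRING's actual parser.

```python
import ipaddress
import re
from datetime import datetime

# A reduced version of the config.yaml field specs above
# (illustrative only, not the real HAMSTRING configuration).
FIELDS = [
    ("timestamp", "Timestamp", "%Y-%m-%dT%H:%M:%S.%fZ"),
    ("status_code", "ListItem", ["NOERROR", "NXDOMAIN"]),
    ("client_ip", "IpAddress", None),
    ("record_type", "ListItem", ["A", "AAAA"]),
    ("size", "RegEx", r"^\d+b$"),
]

def validate_field(kind, arg, value):
    """Return True if `value` is valid for the given field kind."""
    if kind == "Timestamp":
        try:
            datetime.strptime(value, arg)
            return True
        except ValueError:
            return False
    if kind == "ListItem":
        return value in arg
    if kind == "RegEx":
        return re.match(arg, value) is not None
    if kind == "IpAddress":
        try:
            ipaddress.ip_address(value)
            return True
        except ValueError:
            return False
    raise ValueError(f"unknown field kind: {kind}")

def parse_log_line(line):
    """Split a space-separated log line and validate each field in order."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    record = {}
    for (name, kind, arg), value in zip(FIELDS, parts):
        if not validate_field(kind, arg, value):
            return None
        record[name] = value
    return record

print(parse_log_line("2024-05-01T12:00:00.000Z NXDOMAIN 10.0.0.5 A 120b"))
```

A line that fails any field check (wrong timestamp format, a status code outside the list, a malformed IP) is rejected as a whole, which mirrors the idea of only accepting fully valid log lines into the pipeline.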
If you want to ingest data into the pipeline, you can do so via the Zeek container. Either select the interface Zeek should listen on in the config.yaml and set static_analysis: false, or provide PCAPs to Zeek by adding them to the data/test_pcaps directory, which is mounted by default for Zeek to ingest static data.
To monitor the system and observe its real-time behavior, multiple Grafana dashboards have been set up.
Have a look at the following pictures showing examples of how these dashboards might look at runtime.
Overview dashboard
Contains the most relevant information on the system's runtime behavior, its efficiency, and its effectiveness.
Latencies dashboard
Presents information on latencies, including comparisons between the modules and more detailed, stand-alone metrics.
Log Volumes dashboard
Presents information on the fill levels of each module, i.e. the number of entries currently queued in the module for processing. Includes comparisons between the modules, more detailed stand-alone metrics, and total numbers of logs entering the pipeline or being marked as fully processed.
Alerts dashboard
Presents details on the number of logs detected as malicious, including the IP addresses responsible for those alerts.
Dataset dashboard
This dashboard is only active in datatest mode. Users who want to test their own models can use this mode to inspect confusion matrices on testing data.
This feature is in a very early development stage.
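To illustrate what the Dataset dashboard aggregates, the confusion matrix over testing data can be sketched as a simple pair count; the benign/malicious labels and the function below are assumptions for illustration, not HAMSTRING code.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels=("benign", "malicious")):
    """Count (true, predicted) label pairs into a matrix.

    Rows are true labels, columns are predicted labels,
    both in the order given by `labels`.
    """
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

y_true = ["benign", "benign", "malicious", "malicious", "malicious"]
y_pred = ["benign", "malicious", "malicious", "malicious", "benign"]
print(confusion_matrix(y_true, y_pred))  # [[1, 1], [1, 2]]
```

The diagonal holds correct classifications; off-diagonal cells are false positives (benign predicted malicious) and false negatives (malicious predicted benign).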
For testing purposes, you can ingest PCAPs or tap network interfaces using the Zeek-based sensor integrated into the docker-compose file. For more information on the sensor, please refer to the documentation.
Important
This is only a brief wrap-up of a custom training process. We highly encourage you to have a look at the documentation for a full description and explanation of the configuration parameters.
We feature two trained models:
- XGBoost (src/train/model.py#XGBoostModel) and
- RandomForest (src/train/model.py#RandomForestModel).
After installing the requirements, use src/train/train.py:
> python -m venv .venv
> source .venv/bin/activate
> pip install -r requirements/requirements.train.txt
> python src/train/train.py
Usage: train.py [OPTIONS] COMMAND [ARGS]...
Options:
-h, --help Show this message and exit.
Commands:
explain
test
  train

Setting up the dataset directories (and adding the code for your model class, if applicable) lets you start the training process by running the following commands:
> python src/train/train.py train --dataset <dataset_type> --dataset_path <path/to/your/datasets> --model <model_name>

The results will be saved to ./results by default, if not configured otherwise.
> python src/train/train.py test --dataset <dataset_type> --dataset_path <path/to/your/datasets> --model <model_name> --model_output_path <path_to_model_version>
> python src/train/train.py explain --dataset <dataset_type> --dataset_path <path/to/your/datasets> --model <model_name> --model_path <path_to_model_version>

The explain command creates a rules.txt file containing the internals of the model, explaining the rules it created.
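To give an idea of what such a rule export looks like, here is a sketch that walks a toy decision tree and emits one human-readable rule per root-to-leaf path. The tree, the feature names, and the rule format are hypothetical illustrations, not the actual output of train.py explain.

```python
# Minimal tree representation: internal nodes test feature <= threshold,
# leaves carry a class label. Feature names are made up for this sketch.
TREE = {
    "feature": "nxdomain_ratio", "threshold": 0.7,
    "left":  {"label": "benign"},
    "right": {
        "feature": "query_entropy", "threshold": 3.5,
        "left":  {"label": "benign"},
        "right": {"label": "malicious"},
    },
}

def extract_rules(node, conditions=()):
    """Yield one readable rule per root-to-leaf path of the tree."""
    if "label" in node:
        cond = " AND ".join(conditions) or "always"
        yield f"IF {cond} THEN {node['label']}"
        return
    f, t = node["feature"], node["threshold"]
    yield from extract_rules(node["left"], conditions + (f"{f} <= {t}",))
    yield from extract_rules(node["right"], conditions + (f"{f} > {t}",))

for rule in extract_rules(TREE):
    print(rule)
```

For an ensemble such as a RandomForest or XGBoost model, an export like this would be repeated per tree, which is why inspecting the generated rules file is a practical way to sanity-check what a trained model has actually learned.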
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
Distributed under the EUPL License. See LICENSE.txt for more information.
