Cribl LogStream – Docs

Download entire manual as PDF – v.3.1.2

Managing Large Lookups

This page offers a general approach to managing large lookup files. While LogStream's Git integration normally helps manage configuration changes, large lookups are exceptions. In many cases, you might want to exclude these files from Git, to reduce excessive deploy traffic. This approach can also prevent Git Push commands from encountering large file errors.

Good scenarios for this approach are:

  • Large binary files – like databases – which don't benefit from Git's typical efficient storage of only the deltas between versions. (With binary files, Git must replace the whole file for each new version.)

  • Files updated frequently and/or files updated independently of LogStream.

  • Files replicated on many Worker Nodes.

👍

The steps below assume access to a command line and (more importantly) to your OS' filesystem. Where you lack such access – for example, in a LogStream Cloud deployment – load lookup files of all sizes via LogStream's UI, as outlined in Lookups Library.

About the MaxMind GeoLite Example

We'll illustrate this with an example that often combines all three conditions: setting up the free, popular MaxMind GeoLite2 City database to support LogStream's GeoIP lookup Function. This example anticipates a LogStream production distributed deployment, where the GeoLite database is updated nightly across multiple Workers.

This example includes complete instructions for this particular setup. However, you can generalize the example to other MaxMind databases, and to other large lookup files, including large .csv files that similarly receive frequent updates.

Reducing Deploy Traffic

The general approach for handling large lookups is:

  • Do not place these files in the standard $CRIBL_HOME/data/lookups.

  • Instead, place them in a $CRIBL_HOME subdirectory that's excluded from Git version control by being listed in the $CRIBL_HOME/.gitignore file. You will then need to copy the files manually to the Leader Node and to all desired Workers, both for the initial deployment and for each subsequent update.

The example below uses the $CRIBL_HOME/state subdirectory, which is already listed in the default .gitignore file that ships with LogStream.

📘

If you prefer, you can use a different path, including a path outside $CRIBL_HOME. If you choose this alternative, be sure to add that path to .gitignore.

However, Cribl recommends using a $CRIBL_HOME subdirectory like $CRIBL_HOME/state, because this inherits appropriate permissions and simplifies backup/restore operations.
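If you do choose a different subdirectory, the .gitignore step can be scripted. The following is a minimal sketch, not a Cribl-provided tool: the function name `ensure_gitignored` and the directory name `large-lookups` are hypothetical, and the sketch assumes the directory sits directly under $CRIBL_HOME.

```shell
# Sketch: make sure a lookup directory under $CRIBL_HOME is listed in .gitignore.
# ensure_gitignored and "large-lookups" are hypothetical names used for illustration.
ensure_gitignored() {
  cribl_home="$1"   # e.g. /opt/cribl
  subdir="$2"       # directory relative to $CRIBL_HOME, e.g. large-lookups
  ignore_file="$cribl_home/.gitignore"

  # Create the file if it does not exist yet.
  touch "$ignore_file"

  # Already listed in one of the common spellings? Then there is nothing to do.
  if grep -qxF -e "$subdir" -e "$subdir/" -e "/$subdir" -e "/$subdir/" "$ignore_file"; then
    return 0
  fi

  # Start a fresh line if the file does not end with a newline.
  if [ -n "$(tail -c 1 "$ignore_file")" ]; then
    printf '\n' >> "$ignore_file"
  fi
  printf '%s/\n' "$subdir" >> "$ignore_file"
}
```

You could call this as `ensure_gitignored "$CRIBL_HOME" large-lookups`, and then confirm the result with `git -C "$CRIBL_HOME" check-ignore -v large-lookups/<filename>`, which prints the matching .gitignore rule when the path is ignored.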

Let's move on to the MaxMind GeoLite specifics.

Download and Extract the Database

To enable the GeoIP Function using the MaxMind GeoLite2 City database, your first steps are:

  1. Create a free MaxMind account, at the page linked above.

  2. Log in to your MaxMind account portal and select Download Databases.

  3. On the Download page, look for the database you want. (In this example, you'd locate the GeoLite2 City section.) Note the Format: GeoIP2 Binary, and select Download GZIP.

GeoLite2 City database: Download binary GZIP

  4. Extract the archive to your local system.

  5. Change to the directory created when you extracted the archive. This directory's name will correspond to the date you downloaded the file, so in the above 2020-10-06 example, you would use: $ cd GeoLite2-City_20201006
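The extract-and-change-directory steps can be sketched on the command line as follows. This assumes the download is a gzipped tar archive (for example, GeoLite2-City_20201006.tar.gz) whose first entry is the dated top-level directory; `extract_geolite` is a hypothetical helper name, not part of LogStream or MaxMind tooling.

```shell
# Sketch: unpack a downloaded GeoLite2 archive and print the directory it created.
# extract_geolite is a hypothetical helper; a .tar.gz archive is assumed.
extract_geolite() {
  archive="$1"        # e.g. GeoLite2-City_20201006.tar.gz
  dest="${2:-.}"      # where to unpack; defaults to the current directory

  tar -xzf "$archive" -C "$dest" || return 1

  # Assumption: the first entry in the listing is the dated top-level directory.
  tar -tzf "$archive" | head -n 1 | cut -d/ -f1
}
```

For example, `cd "$(extract_geolite GeoLite2-City_20201006.tar.gz)"` would unpack the archive into the current directory and then change into the directory it created.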

Copy the Database File to the Leader and Worker Nodes (Recommended)

In distributed deployments, Cribl recommends copying the MaxMind database separately to the Leader and all Worker Nodes, e.g., placing it in the $CRIBL_HOME/state path. This will minimize the Git commit/deploy overhead around nightly updates to the binary database file.

Once in the database's directory, execute commands of this form:

$ scp *.mmdb <user>@<leader-node>:
$ scp *.mmdb <user>@<worker-node>:

🚧

Copy the file to each Worker in the Worker Group where you intend to use LogStream's GeoIP Function.

The above commands copy the .mmdb database file into your user's home directory on each Node. Next, we’ll move it to $CRIBL_HOME/state on each Node. Execute these commands on both the Leader and Worker Nodes:

$ sudo mv ~/*.mmdb <$CRIBL_HOME>/state/
$ sudo chown -R cribl:cribl <$CRIBL_HOME>/state/
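With many Nodes, the copy, move, and chown commands above can be wrapped in a loop. The sketch below is illustrative only: `deploy_mmdb`, `run`, and the `DRY_RUN` variable are hypothetical names, and it assumes that your user can log in to every Node over SSH and run sudo there without an interactive password prompt.

```shell
# Sketch: repeat the scp / mv / chown steps for a list of Nodes.
# deploy_mmdb, run, and DRY_RUN are hypothetical names used for illustration.

# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

deploy_mmdb() {
  user="$1"          # SSH user on the Nodes
  cribl_home="$2"    # value of $CRIBL_HOME on the Nodes, e.g. /opt/cribl
  shift 2            # remaining arguments: Leader and Worker hostnames

  for node in "$@"; do
    # Copy the database from the current directory to the user's home directory.
    run scp ./*.mmdb "$user@$node:" || return 1
    # Move it into place and fix ownership; "~" is expanded on the remote Node.
    run ssh "$user@$node" \
      "sudo mv ~/*.mmdb '$cribl_home/state/' && sudo chown -R cribl:cribl '$cribl_home/state/'" || return 1
  done
}
```

Run it from the extracted database directory, first with `DRY_RUN=1` to review the commands it would issue, for example: `DRY_RUN=1 deploy_mmdb <user> /opt/cribl <leader-node> <worker-node>`.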

Now that the database is in place, your Pipelines can use the GeoIP Function to enrich data. In the Function's GeoIP file (.mmdb) field, insert the complete $CRIBL_HOME/state/<filename>.mmdb file path.

Copy the Database File Only to the Leader (Alternative)

In smaller deployments, you might choose to copy this MaxMind database only to the Leader Node, and to let Workers receive updates via Git commit/deploy. In this case, the final commands above might look like this:

$ sudo cp ~/*.mmdb /opt/cribl/groups/<group-name>/data/lookups/
$ cd /opt/cribl/groups/<group-name>/data/lookups/
$ sudo chown cribl:cribl *.mmdb

Automatic Updates to the MaxMind Database

To set up automatic updates, see MaxMind's Automatic Updates for GeoIP2 and GeoIP Legacy Databases documentation. You'll need two modifications specific to LogStream:

  • This must be set up on the Leader, and on each Worker in any Group that uses GeoIP lookups.

  • The default setting in GeoIP.conf writes output to /usr/local/share/GeoIP. You must change this setting to the path where your databases actually reside. If you're using the recommended architecture above, you'd set: DatabaseDirectory <$CRIBL_HOME>/state/
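The DatabaseDirectory change can be made by hand in a text editor, or scripted so that it is applied identically on every Node. The following is a minimal sketch under stated assumptions: `set_database_directory` is a hypothetical helper, the location of GeoIP.conf depends on how MaxMind's updater was installed, and /opt/cribl stands in for your actual $CRIBL_HOME.

```shell
# Sketch: point GeoIP.conf's DatabaseDirectory setting at the LogStream state directory.
# set_database_directory is a hypothetical helper used for illustration.
set_database_directory() {
  conf="$1"   # path to GeoIP.conf (location varies by installation)
  dir="$2"    # e.g. /opt/cribl/state

  tmp="$(mktemp)" || return 1

  # Keep every line except an active DatabaseDirectory setting
  # (commented-out lines start with '#', so they are preserved).
  grep -v '^[[:space:]]*DatabaseDirectory[[:space:]]' "$conf" > "$tmp"

  # Append the new setting, then write the result back in place,
  # which keeps the original file's ownership and permissions.
  printf 'DatabaseDirectory %s\n' "$dir" >> "$tmp"
  cat "$tmp" > "$conf" && rm -f "$tmp"
}
```

For example: `set_database_directory /path/to/GeoIP.conf /opt/cribl/state`. The user that runs the updater also needs write access to that directory.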

Memory Considerations

Storage aside, large lookup files can also require additional RAM on each Worker Node that processes the lookups. For details, see Memory Sizing for Large Lookups.

