This page offers a general approach to managing large lookup files. While LogStream's Git integration normally helps manage configuration changes, large lookups are exceptions. In many cases, you might want to exclude these files from Git, to reduce excessive deploy traffic. This approach can also prevent Git Push commands from encountering large file errors.
Good scenarios for this approach are:
Large binary files – like databases – which don't benefit from Git's typical efficient storage of only the deltas between versions. (With binary files, Git must replace the whole file for each new version.)
Files updated frequently and/or files updated independently of LogStream.
Files replicated on many Worker Nodes.
We'll illustrate this with an example that often combines all three conditions: setting up the free, popular MaxMind GeoLite2 City database to support LogStream's GeoIP lookup Function. This example anticipates a LogStream production distributed deployment, where the GeoLite database is updated nightly across multiple Workers.
This example includes complete instructions for this particular setup. However, you can generalize the example to other MaxMind databases, and to other large lookup files – including large .csv files that similarly receive frequent updates.
The general approach for handling large lookups is:
Do not place these files in the standard
Instead, place them in a $CRIBL_HOME subdirectory that's excluded from Git version control, through inclusion in the $CRIBL_HOME/.gitignore file.
You must deploy these files manually to the Master Node and all desired Workers, both for the initial deployment and for subsequent updates.
The example below uses the $CRIBL_HOME/state subdirectory, which is already listed in the default .gitignore file that ships with LogStream.
If you prefer, you can use a different path, including a path outside
$CRIBL_HOME. If you choose this alternative, be sure to add that path to
However, Cribl recommends using a
$CRIBL_HOME/state, because this inherits appropriate permissions and simplifies backup/restore operations.
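Before copying a large lookup into place, it can help to confirm that Git really does ignore the target path. The following is a minimal sketch, assuming $CRIBL_HOME is a Git repository and that the git CLI is available on the Node. The helper name is_git_ignored and the example file name are hypothetical, not part of LogStream.

```shell
#!/usr/bin/env bash
# Sketch: report whether a lookup path is excluded from Git in a given repo.
# is_git_ignored is a hypothetical helper name, not a LogStream command.
is_git_ignored() {
  local repo_dir="$1"   # repository root, e.g. your $CRIBL_HOME
  local rel_path="$2"   # path relative to repo_dir, e.g. state/GeoLite2-City.mmdb
  # `git check-ignore -q` exits 0 when the path matches an ignore rule,
  # and 1 when no ignore rule matches.
  git -C "$repo_dir" check-ignore -q "$rel_path"
}

# Example usage (paths are assumptions; adjust to your install):
# is_git_ignored "$CRIBL_HOME" "state/GeoLite2-City.mmdb" && echo "ignored by Git"
```

If the check fails for a custom path, that path is still tracked and would be included in commit/deploy traffic.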
Let's move on to the MaxMind GeoLite specifics.
To enable the GeoIP Function using the MaxMind GeoLite2 City database, your first steps are:
Create a free MaxMind account, at the page linked above.
Log in to your MaxMind account portal and select Download Databases.
On the Download page, look for the database you want. (In this example, you'd locate the GeoLite2 City section.) Note the Format: GeoIP2 Binary, and select Download GZIP.
Extract the archive to your local system.
Change to the directory created when you extracted the archive. This directory's name will correspond to the date you downloaded the file, so in the above 2020-10-06 example, you would use:
$ cd GeoLite2-City_20201006
In distributed deployments, Cribl recommends copying the MaxMind database separately to the Master and all Worker Nodes, e.g., placing it in the $CRIBL_HOME/state path. This will minimize the Git commit/deploy overhead around nightly updates to the binary database file.
Once in the database's directory, execute commands of this form:
$ scp *.mmdb <user>@<master-node>:
$ scp *.mmdb <user>@<worker-node>:
Copy the file to each Worker in the Worker Group where you intend to use LogStream's GeoIP Function.
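With more than a few Workers, repeating the scp command by hand gets tedious. The following is a rough sketch of a loop over Nodes, assuming key-based SSH access to every host. The function name copy_mmdb_to_nodes and the DRY_RUN variable are hypothetical conveniences, not LogStream features. By default the function only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: copy every .mmdb file in the current directory to a list of Nodes.
# copy_mmdb_to_nodes and DRY_RUN are illustrative names, not LogStream settings.
copy_mmdb_to_nodes() {
  local remote_user="$1"; shift   # remaining arguments are host names
  local node
  for node in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Dry run (the default): only show the command that would be executed.
      echo "scp *.mmdb ${remote_user}@${node}:"
    else
      # Real run: copy into the remote user's home directory, as in the commands above.
      scp ./*.mmdb "${remote_user}@${node}:"
    fi
  done
}

# Example usage (user and host names are placeholders):
# DRY_RUN=0 copy_mmdb_to_nodes <user> <master-node> <worker-node>
```

Defaulting to a dry run is a deliberate choice here, so that a mistyped host list does not trigger any transfers.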
The above commands copy the .mmdb database file into your user's home directory on each Node. Next, we'll move it to $CRIBL_HOME/state on each Node. Execute these commands on both the Master and Worker Nodes:
$ sudo mv ~/*.mmdb <$CRIBL_HOME>/state/
$ sudo chown -R cribl:cribl <$CRIBL_HOME>/state/
Now that the database is in place, your Pipelines can use the GeoIP Function to enrich data. In the Function's GeoIP file (.mmdb) field, insert the complete $CRIBL_HOME/state/<filename>.mmdb file path.
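Because the database lives outside Git, a Node with a missing or stale copy will not be flagged by commit/deploy. The following is a minimal sketch of a per-Node freshness check, assuming the file sits directly in the state directory. The helper name mmdb_is_fresh and the two-day default threshold are assumptions chosen for a nightly update schedule.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if a directory holds a recently modified .mmdb file.
# mmdb_is_fresh is a hypothetical helper name; the 2-day default is an assumption.
mmdb_is_fresh() {
  local dir="$1"             # e.g. "$CRIBL_HOME/state"
  local max_days="${2:-2}"   # maximum acceptable age, in days
  # -mtime -N matches files modified less than N*24 hours ago.
  # The test succeeds when find prints at least one matching file.
  [ -n "$(find "$dir" -maxdepth 1 -type f -name '*.mmdb' -mtime "-${max_days}" 2>/dev/null)" ]
}

# Example usage (path is an assumption; adjust to your install):
# mmdb_is_fresh "$CRIBL_HOME/state" || echo "GeoIP database missing or stale" >&2
```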
In smaller deployments, you might choose to copy this MaxMind database only to the Master Node, and to let Workers receive updates via Git commit/deploy. In this case, the final commands above might look like this:
$ sudo cp ~/*.mmdb /opt/cribl/groups/<group-name>/data/lookups/
$ cd /opt/cribl/groups/<group-name>/data/lookups/
$ sudo chown cribl:cribl *.mmdb
To set up automatic updates, see MaxMind's Automatic Updates for GeoIP2 and GeoIP Legacy Databases documentation. You'll need two modifications specific to LogStream:
This must be set up on the Master, and on each Worker in any Group using GeoIP lookups.
The default setting in GeoIP.conf writes output to /usr/local/share/GeoIP. You must change this setting to the path where your databases actually reside. If you're using the recommended architecture above, you'd set:
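As a sanity check after editing GeoIP.conf, the sketch below reads back the configured output directory so you can compare it with the path where your databases reside. It assumes the setting in question is the DatabaseDirectory directive used by MaxMind's geoipupdate tool. The helper name geoip_database_dir is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the active DatabaseDirectory value from a geoipupdate config file.
# Assumption: the output path is controlled by the `DatabaseDirectory` directive.
# geoip_database_dir is a hypothetical helper name.
geoip_database_dir() {
  local conf="$1"   # path to GeoIP.conf
  # Match only lines whose first field is exactly the directive name, so
  # commented-out lines (starting with '#') are not picked up.
  # Print the remainder of the first matching line, then stop.
  awk '$1 == "DatabaseDirectory" { $1 = ""; sub(/^[ \t]+/, ""); print; exit }' "$conf"
}

# Example usage (both paths are assumptions; adjust to your install):
# [ "$(geoip_database_dir /usr/local/etc/GeoIP.conf)" = "$CRIBL_HOME/state" ] \
#   || echo "GeoIP.conf does not point at the LogStream state directory" >&2
```

An empty result means no uncommented directive was found, in which case the default output path described above would still apply.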
Storage aside, large lookup files can also require additional RAM on each Worker Node that processes the lookups. For details, see Memory Sizing for Large Lookups.