Overview

The Regex plugin creates a single “union” table in a database from any files matching a regular expression (RegEx) by leveraging any of the flat file connectors (CSV, Excel, JSON, etc) as a base.

See the Union plugin if you are trying to "union" multiple database tables together into single database table.

Create a RegEx job

Create a job using any of the flat file connectors:

Edit the job.

Rename the input’s use to base. base is how the Regex plugin knows which underlying flat file connector to use.

Add a new input use key with the value flatfile.iov2#RegexInput.

Remove the input source and add regex with the regular expression value.

Update the steps with the job's steps to these steps:

    "steps": [
        {
            "column": "__mtime__",
            "use": "mitto.iov2.steps#MaxTimestamp"
        },
        {
            "use": "mitto.iov2.steps.upsert#SetUpdatedAt"
        },
        {
            "transforms": [
                {
                    "use": "mitto.iov2.transform#ExtraColumnsTransform"
                },
                {
                    "use": "mitto.iov2.transform#ColumnsTransform"
                }
            ],
            "use": "mitto.iov2.steps#Input"
        },
        {
            "use": "mitto.iov2.steps#CreateTable"
        },
        {
            "use": "mitto.iov2.steps.upsert#CreateTempTable"
        },
        {
            "transforms": [
                {
                    "use": "mitto.iov2.transform#FlattenTransform"
                }
            ],
            "use": "mitto.iov2.steps#Output"
        },
        {
            "key": "__source__",
            "use": "mitto.iov2.steps.upsert#SyncTempTable"
        },
        {
            "use": "mitto.iov2.steps#CollectMeta"
        }
    ]

The Regex connector uses all the input parameters (e.g. delimiter, encoding, skip, etc) from the initial job's plugin.

Output table

The resulting database table has three additional columns:

  • __index__ - the index of each row in each separate file
  • __source__ - the source file the rows originate from
  • __mtime__ - the modified time of the source file

Upsert

Upsert is supported for RegEx jobs by using the __source__ as the primary key and __mtime__ as the last modified time.

Every time the RegEx job is run, it checks the output table and finds the maximum __mtime__. Only files with modified times greater than or equal to this __mtime__ will be upsert on each run.

If you ever need to "reset" or do a full table refresh, you can drop the output table and rerun the RegEx job.

Example

We have 3 CSV files we want to union together into a single database table.

Here's what the initial CSV files look like:

Here's the resulting database table for regex_1.csv after using the XSV plugin:

idname__index__
trueJustin1

And here's the resulting database table after using the RegEx plugin:

idname__index____source____mtime__
3Birdie1regex_3.csv2020-02-19 17:30:20.680760
2Bear1regex_2.csv2020-02-19 17:30:20.628760
1Justin1regex_1.csv2020-02-19 17:30:20.572761

Here's the job config for the original XSV job:

{
    "input": {
        "delimiter": ",",
        "encoding": "ASCII",
        "includes_header": true,
        "source": "/var/mitto/data/regex_1.csv",
        "use": "xsv.iov2#XsvInput2"
    },
    "output": {
        "dbo": "postgresql://localhost/analytics",
        "schema": "test",
        "tablename": "regex_1",
        "use": "call:mitto.iov2.db#todb"
    },
    "sdl": {
        "columns": [
            {
                "__header__": "id",
                "__jpath__": "$.'id'",
                "name": "id",
                "type": "Boolean"
            },
            {
                "__header__": "name",
                "__jpath__": "$.'name'",
                "length": 30,
                "name": "name",
                "type": "String"
            }
        ],
        "name": "regex_1"
    },
    "steps": [
        {
            "transforms": [
                {
                    "use": "mitto.iov2.transform#ExtraColumnsTransform"
                },
                {
                    "use": "mitto.iov2.transform#ColumnsTransform"
                }
            ],
            "use": "mitto.iov2.steps#Input"
        },
        {
            "use": "mitto.iov2.steps#CreateTable"
        },
        {
            "transforms": [
                {
                    "use": "mitto.iov2.transform#FlattenTransform"
                }
            ],
            "use": "mitto.iov2.steps#Output"
        },
        {
            "use": "mitto.iov2.steps#CollectMeta"
        }
    ]
}

And here's the modified job config using the RegEx plugin:

{
    "input": {
        "delimiter": ",",
        "encoding": "ASCII",
        "includes_header": true,
        "regex": "^regex_.+\\.csv$",
        "base": "xsv.iov2#XsvInput2",
        "use": "flatfile.iov2#RegexInput"
    },
    "output": {
        "dbo": "postgresql://localhost/analytics",
        "schema": "test",
        "tablename": "regex",
        "use": "call:mitto.iov2.db#todb"
    },
    "sdl": {
        "columns": [
            {
                "__header__": "id",
                "__jpath__": "$.'id'",
                "name": "id",
                "type": "Boolean"
            },
            {
                "__header__": "name",
                "__jpath__": "$.'name'",
                "length": 30,
                "name": "name",
                "type": "String"
            }
        ],
        "name": "regex_1"
    },
    "steps": [
        {
            "column": "__mtime__",
            "use": "mitto.iov2.steps#MaxTimestamp"
        },
        {
            "use": "mitto.iov2.steps.upsert#SetUpdatedAt"
        },
        {
            "transforms": [
                {
                    "use": "mitto.iov2.transform#ExtraColumnsTransform"
                },
                {
                    "use": "mitto.iov2.transform#ColumnsTransform"
                }
            ],
            "use": "mitto.iov2.steps#Input"
        },
        {
            "use": "mitto.iov2.steps#CreateTable"
        },
        {
            "use": "mitto.iov2.steps.upsert#CreateTempTable"
        },
        {
            "transforms": [
                {
                    "use": "mitto.iov2.transform#FlattenTransform"
                }
            ],
            "use": "mitto.iov2.steps#Output"
        },
        {
            "key": "__source__",
            "use": "mitto.iov2.steps.upsert#SyncTempTable"
        },
        {
            "use": "mitto.iov2.steps#CollectMeta"
        }
    ]
}

The input's use was renamed to base. A new use (for the RegEx plugin) was added to the input. The input's source was removed and regex was added. The output tablename was updated. The steps were completely replaced with the RegEx steps.

The input's regex for this particular job is ^regex_.+\\.csv$. We are looking for any files that start with (^)  regex_, have any characters in the middle (.+), and end with ($) .csv.

This regex will feed in the 3 files we want to union:

  • regex_1.csv
  • regex_2.csv
  • regex_3.csv

Regular Expression syntax:

One helpful tool for testing regular expressions is RegExr.

One important thing to remember, because a job's config is stored in JSON, when using escape characters (\), you need to double escape them (\\).