RegEx

The Regex plugin creates a single “union” table in a database from any files matching a regularexpression (RegEx) by leveraging any of the flat file connectors (CSV, Excel) as a base.

union flat files together

See the Union plugin if you are trying to “union” multiple database tables together into single database table.

Create a RegEx job

Create a job using any of the flat file connectors:

Edit the job.

Input changes

Rename the input’s use to base. base is how the Regex plugin knows which underlying flat file connector to use.

Add a new input use key with the value flatfile.iov2#RegexInput.

Remove the input source and add regex with the regular expression value.

SDL changes

If there is an sdl section, remove it entirely.

Steps changes

Update the steps with the job’s steps to these steps:

"steps": [
    {
        "column": "__mtime__",
        "use": "mitto.iov2.steps#MaxTimestamp"
    },
    {
        "use": "mitto.iov2.steps.upsert#SetUpdatedAt"
    },
    {
        "transforms": [
            {
                "use": "mitto.iov2.transform#ExtraColumnsTransform"
            },
            {
                "use": "mitto.iov2.transform#ColumnsTransform"
            }
        ],
        "use": "mitto.iov2.steps#Input"
    },
    {
        "use": "mitto.iov2.steps#CreateTable"
    },
    {
        "use": "mitto.iov2.steps.upsert#CreateTempTable"
    },
    {
        "transforms": [
            {
                "use": "mitto.iov2.transform#FlattenTransform"
            }
        ],
        "use": "mitto.iov2.steps#Output"
    },
    {
        "key": "__source__",
        "use": "mitto.iov2.steps.upsert#SyncTempTable"
    },
    {
        "use": "mitto.iov2.steps#CollectMeta"
    }
]

The Regex connector uses all the input parameters (e.g. delimiter, encoding, skip, etc) from the initial job’s plugin.

Output table

The resulting database table has three additional columns:

  • __index__ - the index of each row in each separate file

  • __source__ - the source file the rows originate from

  • __mtime__ - the modified time of the source file

Upsert

Upsert is supported for RegEx jobs by using the __source__ as the primary key and __mtime__ as the last modified time.

Every time the RegEx job is run, it checks the output table and finds the maximum __mtime__. Only files with modified times greater than or equal to this __mtime__ will be upsert on each run.

If you ever need to “reset” or do a full table refresh, you can drop the output table and rerun the RegEx job.

Example

We have 3 CSV files we want to union together into a single database table.

Here’s what the initial CSV files look like:

../_images/regex__initial_csv.png

Initial CSV

Here’s the resulting database table for regex_1.csv after using the XSV plugin:

id

name

index

1

Justin

1

And here’s the resulting database table after using the RegEx plugin:

id

name

index

source

mtime

3

Birdie

1

regex_3.csv

2020-02-19 17:30:20.680760

2

Bear

1

regex_2.csv

2020-02-19 17:30:20.628760

1

Justin

1

regex_1.csv

2020-02-19 17:30:20.572761

Here’s the job config for the original XSV job:

{
    "input": {
        "delimiter": ",",
        "encoding": "ASCII",
        "includes_header": true,
        "source": "/var/mitto/data/regex_1.csv",
        "use": "xsv.iov2#XsvInput2"
    },
    "output": {
        "dbo": "postgresql://dbo/analytics",
        "schema": "test",
        "tablename": "regex_1",
        "use": "call:mitto.iov2.db#todb"
    },
    "sdl": {
        "columns": [
            {
                "__header__": "id",
                "__jpath__": "$.'id'",
                "name": "id",
                "type": "Boolean"
            },
            {
                "__header__": "name",
                "__jpath__": "$.'name'",
                "length": 30,
                "name": "name",
                "type": "String"
            }
        ],
        "name": "regex_1"
    },
    "steps": [
        {
            "transforms": [
                {
                    "use": "mitto.iov2.transform#ExtraColumnsTransform"
                },
                {
                    "use": "mitto.iov2.transform#ColumnsTransform"
                }
            ],
            "use": "mitto.iov2.steps#Input"
        },
        {
            "use": "mitto.iov2.steps#CreateTable"
        },
        {
            "transforms": [
                {
                    "use": "mitto.iov2.transform#FlattenTransform"
                }
            ],
            "use": "mitto.iov2.steps#Output"
        },
        {
            "use": "mitto.iov2.steps#CollectMeta"
        }
    ]
}

And here’s the modified job config using the RegEx plugin:

{
    "input": {
        "delimiter": ",",
        "encoding": "ASCII",
        "includes_header": true,
        "regex": "^regex_.+\\.csv$",
        "base": "xsv.iov2#XsvInput2",
        "use": "flatfile.iov2#RegexInput"
    },
    "output": {
        "dbo": "postgresql://dbo/analytics",
        "schema": "test",
        "tablename": "regex",
        "use": "call:mitto.iov2.db#todb"
    },
    "steps": [
        {
            "column": "__mtime__",
            "use": "mitto.iov2.steps#MaxTimestamp"
        },
        {
            "use": "mitto.iov2.steps.upsert#SetUpdatedAt"
        },
        {
            "transforms": [
                {
                    "use": "mitto.iov2.transform#ExtraColumnsTransform"
                },
                {
                    "use": "mitto.iov2.transform#ColumnsTransform"
                }
            ],
            "use": "mitto.iov2.steps#Input"
        },
        {
            "use": "mitto.iov2.steps#CreateTable"
        },
        {
            "use": "mitto.iov2.steps.upsert#CreateTempTable"
        },
        {
            "transforms": [
                {
                    "use": "mitto.iov2.transform#FlattenTransform"
                }
            ],
            "use": "mitto.iov2.steps#Output"
        },
        {
            "key": "__source__",
            "use": "mitto.iov2.steps.upsert#SyncTempTable"
        },
        {
            "use": "mitto.iov2.steps#CollectMeta"
        }
    ]
}

The input’s use was renamed to base. A new use (for the RegEx plugin) was added to the input. The input’s source was removed and regex was added. The output tablename was updated. The steps were completely replaced with the RegEx steps.

The input’s regex for this particular job is ^regex_.+\\.csv$. We are looking for any files that start with (^)  regex_, have any characters in the middle (.+), and end with ($) .csv.

This regex will feed in the 3 files we want to union:

  • regex_1.csv

  • regex_2.csv

  • regex_3.csv

Regular Expression Syntax

One helpful tool for testing regular expressions is RegExr.

One important thing to remember, In older versions of Zuar Runner (pre-2.9.x) because a job’s config is stored in JSON, when using escape characters (\), you need to double escape them (\\).

PRO-TIP: Testing in the Files Page

You can test what your regular expression will match by using the search input on the Files Page.

files page search input