

Recently I had the opportunity to work on a tool in charge of processing gigabytes of data over the wire. The end goal was to download that data, process the values and finally insert them into persistent storage in batches.
This is the first of a series of posts covering the different pieces involved in building the final tool.
Succinctly, this solution will consist of three processes:

1. A Data Producer Process that downloads and decompresses the input file, emitting raw bytes.
2. A Data Consumer Process that parses those bytes into typed values.
3. A PostgreSQL Batcher that persists those values into the database in batches.
This is the classic problem solved using Pipelines. The biggest difference between that classic post and this new series is how cancellation comes into play when working with multiple goroutines. This means defining rules for the expected behavior when anything fails, all of it handled using two great Go packages: context and errgroup.
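To make those cancellation rules concrete, here is a minimal, self-contained sketch (not part of the final tool) of how errgroup.WithContext ties goroutines together: the first goroutine to return an error cancels the shared context, and g.Wait reports that error.

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"golang.org/x/sync/errgroup"
)

func main() {
	// errgroup.WithContext returns a Group plus a Context that is canceled
	// as soon as any goroutine in the group returns a non-nil error.
	g, ctx := errgroup.WithContext(context.Background())

	values := make(chan int)

	// Producer: emits values until it fails or the context is canceled.
	g.Go(func() error {
		defer close(values)
		for i := 0; i < 10; i++ {
			if i == 5 {
				return errors.New("producer failed") // cancels ctx for everyone
			}
			select {
			case values <- i:
			case <-ctx.Done():
				return ctx.Err()
			}
		}
		return nil
	})

	// Consumer: drains the channel until it is closed.
	g.Go(func() error {
		for v := range values {
			fmt.Println("got", v)
		}
		return nil
	})

	// Wait blocks until all goroutines finish and returns the first error.
	if err := g.Wait(); err != nil {
		fmt.Println("pipeline stopped:", err)
	}
}
```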
For our example we will be using a file that is part of IMDb's datasets. Those files are gzipped, tab-separated values (TSV) formatted in the UTF-8 character set. The specific file to use will be name.basics.tsv.gz, which defines the following fields:
|-------------------|-----------|----------------------------------------------------|
| Field             | Data Type | Description                                        |
|-------------------|-----------|----------------------------------------------------|
| nconst            | string    | alphanumeric unique identifier of the name/person  |
| primaryName       | string    | name by which the person is most often credited    |
| birthYear         | string    | in YYYY format                                     |
| deathYear         | string    | in YYYY format if applicable, else '\N'            |
| primaryProfession | []string  | the top-3 professions of the person                |
| knownForTitles    | []string  | titles the person is known for                     |
|-------------------|-----------|----------------------------------------------------|
Because this input file is an HTTP resource and is gzip-compressed, our Data Producer Process will request the file using net/http, uncompress the received values using compress/gzip, and send them to the Data Consumer Process as raw []byte.
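A minimal sketch of what that producer could look like follows. The function name, the url parameter and the choice of a channel of raw lines are assumptions for illustration; the real producer is built incrementally in this series.

```go
import (
	"bufio"
	"compress/gzip"
	"context"
	"fmt"
	"net/http"
)

// produce downloads the gzipped file at url, decompresses it on the fly and
// sends each raw line as a []byte on the out channel. It stops early when
// the context is canceled, which is how errgroup propagates failures.
func produce(ctx context.Context, url string, out chan<- []byte) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	res, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer res.Body.Close()

	if res.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", res.Status)
	}

	// Decompress the response body as it streams in.
	gz, err := gzip.NewReader(res.Body)
	if err != nil {
		return err
	}
	defer gz.Close()

	sc := bufio.NewScanner(gz)
	for sc.Scan() {
		// Copy the scanner's buffer, since it is reused on the next Scan call.
		line := append([]byte(nil), sc.Bytes()...)
		select {
		case out <- line:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return sc.Err()
}
```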
Those raw []byte values will be read as TSV records using encoding/csv, and from there they will be converted into values of a new struct type, Name, that the next step in our pipeline can understand; a sketch of that conversion follows the type definition below.
The following type will hold the values that are eventually persisted in the database:
```go
type Name struct {
	NConst             string
	PrimaryName        string
	BirthYear          string
	DeathYear          string
	PrimaryProfessions []string
	KnownForTitles     []string
}
```
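As a rough sketch of the consumer side, one way to turn a raw line into a Name is to feed it through a csv.Reader configured for tabs. The helper name parseName and the per-line reader are assumptions for illustration, not the final consumer implementation.

```go
import (
	"bytes"
	"encoding/csv"
	"fmt"
	"strings"
)

// parseName converts one raw TSV line into a Name value.
func parseName(line []byte) (Name, error) {
	r := csv.NewReader(bytes.NewReader(line))
	r.Comma = '\t'      // the IMDb files are tab separated
	r.LazyQuotes = true // tolerate stray quote characters in fields

	record, err := r.Read()
	if err != nil {
		return Name{}, err
	}
	if len(record) != 6 {
		return Name{}, fmt.Errorf("expected 6 fields, got %d", len(record))
	}

	return Name{
		NConst:             record[0],
		PrimaryName:        record[1],
		BirthYear:          record[2],
		DeathYear:          record[3],
		PrimaryProfessions: strings.Split(record[4], ","),
		KnownForTitles:     strings.Split(record[5], ","),
	}, nil
}
```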
We will be using PostgreSQL as the relational database for this, and specifically github.com/jackc/pgx will be imported for storing the values in batches.
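To preview the idea, here is a minimal sketch of queuing inserts with pgx's Batch API, assuming pgx v4 and a hypothetical names table; the actual Batcher is the subject of the next post.

```go
import (
	"context"

	"github.com/jackc/pgx/v4"
)

// insertNames queues one INSERT per Name and sends them in a single batch.
func insertNames(ctx context.Context, conn *pgx.Conn, names []Name) error {
	batch := &pgx.Batch{}
	for _, n := range names {
		batch.Queue(
			`INSERT INTO names (nconst, primary_name, birth_year, death_year, primary_professions, known_for_titles)
			 VALUES ($1, $2, $3, $4, $5, $6)`,
			n.NConst, n.PrimaryName, n.BirthYear, n.DeathYear, n.PrimaryProfessions, n.KnownForTitles,
		)
	}

	// SendBatch ships all queued queries to the server in one round trip.
	br := conn.SendBatch(ctx, batch)
	defer br.Close()

	// Read the result of each queued query to surface any errors.
	for range names {
		if _, err := br.Exec(); err != nil {
			return err
		}
	}
	return nil
}
```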
The next blog post will cover the implementation of the PostgreSQL Batcher. As we progress in the series, we will continuously connect all the pieces together to eventually complete our final tool.