Complex Pipelines in Go (Part 3): Transforming Data to Tab Separated Values
Aug 12, 2020

This post is part 3 in a series:

  1. Part 1 - Introduction
  2. Part 2 - Storing Values in Batches
  3. Part 3 - Transforming Data to Tab Separated Values (this post)
  4. Part 4 - Sane Coordination and Cancellation
  5. Part 5 - Putting it All Together

Transforming Data to Tab Separated Values

In part 2 we focused on persisting previously transformed data; in this post we will implement that transformation step. Specifically, we will take the data we receive and shape it so the downstream pipeline can handle it as easily as possible. This component is a key piece of the Data Producer Process.

Minimum Requirements

All the code relevant to this post is on GitHub; feel free to explore it for more details. The following is the minimum required to run the example:

  • Go 1.14

Parsing Values

Selecting the right package for parsing values depends on the format being used; common examples include JSON, CSV, and Protocol Buffers. Knowing the data format is important, but so is knowing the specific standard and version being used.

In the end, the package to use is determined by the original data source. With that in mind, and recalling what we covered in part 1, we know for sure our origin data source will be IMDB; the concrete file is gzipped and contains tab-separated values. By the way, we are not covering the gunzipping step in this post; that will be discussed in a future post.

So what is exactly needed for this?

We can be tempted to use encoding/csv and set up our reader with a configuration similar to:

cr := csv.NewReader(br)
cr.Comma = '\t'
cr.LazyQuotes = true

However, this will not work: the original data source does not actually follow the format this package implements. I mention this explicitly because it is really important to determine this in advance; the file may look like a CSV (using tabs instead of commas), but it really is not.
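To see the mismatch concretely, here is a minimal sketch using a made-up line (the ID and field values are illustrative, not real IMDB data). Because IMDB does not quote its fields, a field that merely starts with a double quote makes encoding/csv treat it as the opening of a quoted field, swallowing the tab separators that follow:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseWithCSV parses a single TSV line with the configuration shown above.
func parseWithCSV(line string) ([]string, error) {
	cr := csv.NewReader(strings.NewReader(line))
	cr.Comma = '\t'
	cr.LazyQuotes = true
	return cr.Read()
}

func main() {
	// Three tab-separated fields; the second one starts with a double
	// quote, which encoding/csv interprets as a quoted field, so the
	// remaining tabs become part of that field instead of separators.
	record, err := parseWithCSV("tt0000001\t\"Hello\" World\tshort\n")
	fmt.Println(len(record), err) // fewer than the 3 fields we expected
}
```

With LazyQuotes the reader does not even return an error here; the line is silently collapsed into fewer fields, which is exactly the kind of quiet corruption we want to avoid.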

In reality, parsing IMDB's TSV file is simpler than that. We can use bufio and do something like the following:

br := bufio.NewReader(f)

for {
	// ReadString returns the line including the trailing '\n'.
	line, err := br.ReadString('\n')
	if err == io.EOF {
		// A final line without a trailing '\n' would arrive here
		// together with io.EOF; IMDB's files end with a newline,
		// so it is safe to stop.
		return
	}

	if err != nil {
		log.Fatalf("Reading TSV %s", err)
	}

	// Drop the trailing newline, then split the line on tabs.
	fmt.Printf("%#v\n", strings.Split(strings.Trim(line, "\n"), "\t"))
}

This gives us what we expect: a data structure (in this case a slice of strings) that can easily be handled by our downstream pipeline.
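From there, mapping each slice onto a typed record is straightforward. Here is a minimal sketch, assuming a hypothetical Title type with three fields; the real IMDB files carry more columns, so the indexes below are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// Title is a hypothetical record type; the actual column layout depends
// on which IMDB dataset file is being parsed.
type Title struct {
	ID          string
	Type        string
	PrimaryName string
}

// parseLine splits one raw TSV line into its fields, dropping the
// trailing newline, exactly as in the loop above.
func parseLine(line string) []string {
	return strings.Split(strings.Trim(line, "\n"), "\t")
}

func main() {
	fields := parseLine("tt0000001\tshort\tCarmencita\n")
	t := Title{ID: fields[0], Type: fields[1], PrimaryName: fields[2]}
	fmt.Printf("%+v\n", t) // {ID:tt0000001 Type:short PrimaryName:Carmencita}
}
```

Keeping the split logic in a small function like parseLine makes it trivial to unit test the transformation independently of the rest of the pipeline.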

What’s next?

The next blog post will cover the phenomenal errgroup package! We are so close to connecting everything together.

