Manually implementing your own data loading functions is hard work and can result in bugs. The ImageDataGenerator function, while a perfectly fine option, wasn’t the fastest method either. The TensorFlow v2 API has gone through a number of changes, and arguably one of the biggest/most important changes is the introduction of the tf.data module.

The tf.data API enables you to build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training. The pipeline for a text model might involve extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths. The tf.data API makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations.

Working with data is now significantly easier using tf.data - and as we’ll see, it’s also worlds faster and more efficient than relying on the old ImageDataGenerator class.

Is tf.data more efficient for building data pipelines?

Figure 2: The “tf.data” module is significantly faster than the “ImageDataGenerator” class due to an optimized producer/consumer relationship (image source).

The short answer is yes: using tf.data is significantly faster and more efficient than using ImageDataGenerator. As the results of this tutorial will show, we’re able to obtain a ≈6.1x speedup when working with in-memory datasets and a ≈38x increase in efficiency when working with image data residing on disk. The “secret sauce” of tf.data lies in TensorFlow’s multi-threading/multi-processing implementation and, more specifically, the concept of “autotuning.”
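That producer/consumer relationship is worth seeing in code. Below is a minimal, hypothetical sketch (the zero-filled "images" and the preprocess() function are illustrative stand-ins, not part of this tutorial's project) showing where autotuning enters a tf.data pipeline:

```python
import tensorflow as tf

# Hypothetical stand-in for real per-image preprocessing work.
def preprocess(image):
    return tf.cast(image, tf.float32) / 255.0

# AUTOTUNE tells tf.data to tune thread counts and buffer sizes at runtime.
AUTOTUNE = tf.data.AUTOTUNE

# 256 fake 28x28 grayscale "images" held in memory, purely for illustration.
images = tf.zeros([256, 28, 28], dtype=tf.uint8)

dataset = (
    tf.data.Dataset.from_tensor_slices(images)
    .map(preprocess, num_parallel_calls=AUTOTUNE)  # parallel producer
    .batch(32)
    .prefetch(AUTOTUNE)  # overlap data production with model consumption
)

print(next(iter(dataset)).shape)  # (32, 28, 28)
```

The prefetch() call is what decouples the producer (the preprocessing threads) from the consumer (the training loop) - the relationship Figure 2 measures.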
Users of the PyTorch library are likely familiar with the Dataset and DataLoader classes - they make loading and preprocessing data incredibly easy, efficient, and fast. Up until TensorFlow v2, Keras and TensorFlow users would have to either:

1. Manually define their own data loading functions.
2. Utilize Keras’ ImageDataGenerator function for working with image datasets too large to fit into memory and/or when data augmentation needed to be applied.

Figure 1: The “tf.data” module can be used to build fast, efficient data pipelines using Keras and TensorFlow (image source).

All opinions and examples are highly appreciated.

The optional output_shapes argument of tf.data.Dataset.from_generator() allows you to specify the shapes of the values yielded from your generator. There are two constraints on its type that define how it should be specified:

The output_shapes argument is a "nested structure" (e.g. a tuple, a tuple of tuples, a dict of tuples, etc.) that must match the structure of the value(s) yielded by your generator. In your program, _generator() contains the statement yield feats, labels. Therefore the "nested structure" is a tuple of two elements (one for each array).

Each component of the output_shapes structure should match the shape of the corresponding tensor. The shape of an array is always a tuple of dimensions. (The shape of a tf.Tensor is more general: see this Stack Overflow question for a discussion.) Let's look at the actual shape of feats:

    > SIZE = 10

Therefore the output_shapes argument should be a 2-element tuple, where each element is (SIZE,):

    shapes = ((SIZE,), (SIZE,))

Finally, you will need to provide a little more information about shapes to the tf.feature_column.numeric_column() and tf.estimator.LinearRegressor() APIs:

    x_col = tf.feature_column.numeric_column(key='x', shape=(SIZE,))
    es = tf.estimator.LinearRegressor(feature_columns=[x_col])
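The two constraints above can be checked with plain numpy before ever touching TensorFlow. This sketch mirrors the question's _generator() (the random data is illustrative):

```python
import numpy as np

SIZE = 10

# Mirrors the question's generator: yields a (feats, labels) 2-tuple
# of rank-1 arrays.
def _generator():
    while True:
        feats = np.random.rand(SIZE)
        labels = np.random.rand(SIZE)
        yield feats, labels

feats, labels = next(_generator())

# Constraint 1: the yielded value is a 2-tuple, so output_shapes is a 2-tuple.
# Constraint 2: each component matches the corresponding array's shape.
shapes = ((SIZE,), (SIZE,))
assert feats.shape == shapes[0] and labels.shape == shapes[1]
print(feats.shape)  # (10,)
```

This shapes tuple is what gets passed to tf.data.Dataset.from_generator() as its output_shapes argument, alongside a matching output_types tuple such as (tf.float64, tf.float64).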
Trying to build a simple model just to figure out how to deal with tf.data.Dataset.from_generator. I can not understand how to set the output_shapes argument; I tried several combinations, including not specifying it, but still receive some errors due to shape mismatch of the tensors. The idea is just to yield two numpy arrays with SIZE = 10 and run linear regression with them. Here is the code:

    SIZE = 10

    dataset = tf.data.Dataset.from_generator(generator=_generator, ...)
    iterator = dataset.make_one_shot_iterator()
    features_tensors, labels = iterator.get_next()

    x_col = tf.feature_column.numeric_column(key='x', ...)
    es = tf.estimator.LinearRegressor(feature_columns=...)

Another question is whether it is possible to use this functionality to provide data for feature columns which are tf.feature_column.crossed_column. The overall goal is to use the from_generator functionality in batch training, where data is loaded in chunks from a database in cases when the data does not fit in memory.
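For completeness, here is one way the whole example might be assembled under the TensorFlow 2.x API, where an output_signature tuple replaces the output_types/output_shapes pair; the chunk count and batch size are made-up values, and the estimator training step from the question is omitted:

```python
import numpy as np
import tensorflow as tf

SIZE = 10

def _generator():
    # Stand-in for loading chunks from a database; 100 is arbitrary.
    for _ in range(100):
        feats = np.random.rand(SIZE).astype(np.float32)
        labels = np.random.rand(SIZE).astype(np.float32)
        yield feats, labels

# Each TensorSpec plays the role of one (SIZE,) entry in output_shapes.
dataset = tf.data.Dataset.from_generator(
    _generator,
    output_signature=(
        tf.TensorSpec(shape=(SIZE,), dtype=tf.float32),
        tf.TensorSpec(shape=(SIZE,), dtype=tf.float32),
    ),
).batch(5)

feats, labels = next(iter(dataset))
print(feats.shape, labels.shape)  # (5, 10) (5, 10)
```

Under TF 2.x the one-shot iterator is gone as well: the dataset is iterated directly, as above.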