Ticket #123 (closed enhancement: fixed)

Opened 5 years ago

Last modified 16 months ago

Implement filters

Reported by: vreixo
Owned by: vreixo
Priority: major
Milestone: libisofs-0.6.4
Component: libisofs
Version: libisofs-0.6.3
Keywords:
Cc:

Description

The idea is to implement the concept of a Filter, i.e., the possibility to "filter" file contents before writing them to the image. This filtering process can consist of:

  • Cut off some parts of the file
  • Transform file contents: encoding, compression, encryption...

The idea is that a FilterStream or similar takes care of applying the Filter. The question is whether each filter implementation must create its own Stream (i.e. GzStream, EncryptStream...), or whether we can just provide a generic FilterStream that takes a reference to an IsoFilter interface, which each filter implements. In this second case, the Filter can be shared among several nodes:

	IsoFilter *filter = iso_filter_gz_create(...);
	iso_file_add_filter(file1, filter);
	iso_file_add_filter(file2, filter);
	...

i.e., the filter is a place where we can store configuration options for the Filter (encryption algorithm, key, ...). In this case, the FilterStream read function should be something like

    int filter_stream_read(Stream *s, buffer, ...)
    {
        FilterStreamData *data = s->data;
        Stream *source = data->source;
        Filter *f = data->filter;

        /* pseudocode: read raw bytes, then filter them into the caller's buffer */
        source->read(tmpbuffer);
        f->filter(tmpbuffer into buffer);
    }

However, it seems the filter->filter() function is not trivial to implement, as different filters may need different data chunks.
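To make the generic FilterStream idea concrete, here is a minimal sketch for the easy case only: a filter whose output has the same size as its input (a trivial XOR transformation). All names (Stream, Filter, FilterStreamData, MemData, xor_filter) are hypothetical and are not the libisofs API. A filter that changes the data size, such as gzip, would need its own buffering and state, which is exactly the difficulty mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types for illustration; not the libisofs API. */
typedef struct Stream Stream;
struct Stream {
    /* Reads up to count bytes into buf; returns bytes read, 0 at end of data. */
    size_t (*read)(Stream *s, unsigned char *buf, size_t count);
    void *data;
};

/* A filter shared among several streams: configuration plus a transformation. */
typedef struct Filter Filter;
struct Filter {
    /* Transforms n bytes from in to out (output size equals input size). */
    void (*filter)(const Filter *f, const unsigned char *in,
                   unsigned char *out, size_t n);
    unsigned char key;              /* example of a shared option */
};

typedef struct {
    Stream *source;                 /* per-file input stream */
    const Filter *filter;           /* shared filter */
} FilterStreamData;

/* In-memory source, only here to keep the sketch self-contained. */
typedef struct {
    const unsigned char *bytes;
    size_t size, pos;
} MemData;

static size_t mem_read(Stream *s, unsigned char *buf, size_t count)
{
    MemData *m = s->data;
    size_t left = m->size - m->pos;
    size_t n = count < left ? count : left;

    memcpy(buf, m->bytes + m->pos, n);
    m->pos += n;
    return n;
}

static void xor_filter(const Filter *f, const unsigned char *in,
                       unsigned char *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] ^ f->key;
}

/* Generic read: pull a chunk from the source, pass it through the filter. */
static size_t filter_stream_read(Stream *s, unsigned char *buf, size_t count)
{
    FilterStreamData *d = s->data;
    unsigned char tmp[256];
    size_t total = 0;

    assert(buf != NULL);
    while (total < count) {
        size_t want = count - total;
        if (want > sizeof tmp)
            want = sizeof tmp;
        size_t got = d->source->read(d->source, tmp, want);
        if (got == 0)
            break;                  /* source exhausted */
        d->filter->filter(d->filter, tmp, buf + total, got);
        total += got;
    }
    return total;
}
```

Two FilterStreamData objects can point at the same Filter while each keeps its own source, which is the sharing described in the ticket.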

Another solution is to just implement each filter as an IsoStream implementation. In this case, each filter implements its own stream->read() function. This needs, however, an ugly API, as the user needs to create each "FilterStream" implementation. And, in the end, we need the Filter idea (i.e. a shared context) anyway.

i.e., we need something like:

	IsoFilter *filter1 = iso_filter_gz_create(...);
	iso_file_add_filter(file1, filter1);
	IsoFilter *filter2 = iso_filter_gz_create(...);
	iso_file_add_filter(file2, filter2);

and this works if we extend the IsoStream interface to a FilterStream where we define the original_stream field. Otherwise we need something like:

	IsoStream *orig_stream = iso_file_get_stream(file1);
	IsoFilter *filter1 = iso_filter_gz_create(orig_stream, ...);
	iso_file_add_filter(file1, filter1);

or maybe directly, of course

	iso_file_add_gz_filter(file1, ....);

but we still have the problem that the shared context cannot be used.

A final alternative is to define a generic FilterContext:

    struct FilterContext {
        void *data; // filter-specific shared data
        IsoStream *(*get_filter)(IsoFile *);
    };

where get_filter is a factory method that creates the concrete Filter implementation for each file. The API usage will be, then:

	FilterContext *filter = iso_filter_gz_create(...options...);
	//the filter->get_filter gets filled with a ptr to a filter-dependent function

	iso_file_add_filter(file, filter);
	// it calls the filter->get_filter() function to get the IsoStream that is filter dependent.
	// the user does not need to know the concrete IsoStream implementation for each filter.
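The factory idea can be sketched as follows. This is only an illustration under assumptions: IsoStream and IsoFile here are small stand-in structs, not the real libisofs types, and the names GzOptions, gz_get_filter, filter_gz_create and file_add_filter are hypothetical. The factory also takes the context as an extra parameter (an assumption not present in the struct above) so that it can reach the shared data.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in types for this sketch; the real libisofs structs differ. */
typedef struct IsoStream IsoStream;
struct IsoStream {
    const char *type;       /* kind of stream, for illustration only */
    IsoStream *input;       /* the stream this one wraps, or NULL */
    void *shared;           /* points at the context's shared data */
};

typedef struct {
    IsoStream *stream;      /* current top of the file's stream chain */
} IsoFile;

typedef struct FilterContext FilterContext;
struct FilterContext {
    void *data;             /* filter-specific shared data */
    /* Factory: creates the per-file stream. Passing the context is an
     * assumption of this sketch. */
    IsoStream *(*get_filter)(FilterContext *ctx, IsoFile *file);
};

typedef struct {
    int level;              /* hypothetical compression option */
} GzOptions;

static IsoStream *gz_get_filter(FilterContext *ctx, IsoFile *file)
{
    IsoStream *s = malloc(sizeof *s);

    if (s == NULL)
        return NULL;
    s->type = "gz";
    s->input = file->stream;    /* wrap the file's current stream */
    s->shared = ctx->data;      /* options are shared, not copied */
    return s;
}

static FilterContext *filter_gz_create(int level)
{
    FilterContext *ctx = malloc(sizeof *ctx);
    GzOptions *opts = malloc(sizeof *opts);

    if (ctx == NULL || opts == NULL) {
        free(ctx);
        free(opts);
        return NULL;
    }
    opts->level = level;
    ctx->data = opts;
    ctx->get_filter = gz_get_filter;
    return ctx;
}

/* Asks the context for a new stream and puts it on top of the file. */
static int file_add_filter(IsoFile *file, FilterContext *ctx)
{
    assert(file != NULL && ctx != NULL);
    IsoStream *s = ctx->get_filter(ctx, file);
    if (s == NULL)
        return -1;
    file->stream = s;
    return 0;
}
```

With this shape, the caller only ever sees FilterContext and file_add_filter; the concrete stream type stays private to the filter.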

Some considerations:

  • Given that a filter can change the file size, we would need to apply the filter twice: when image structures are computed, and when the file is actually written. With complex filters, this can be a problem. Thus, all filters must have a property "on_the_fly" that decides whether the filter is applied each time the file must be read, or whether it is applied once and the result stored in a temporary folder. The user could decide whether to prioritize temporary hard disk space or computation time based on that flag. It is legal to ignore that flag (for example, for the cut-off filter it makes no sense to store a temporary file).
  • I wonder whether we should provide some kind of plugin system for filters. Ideas?

Change History

  Changed 5 years ago by pygi

  • milestone set to libisofs-0.6.4

  Changed 5 years ago by scdbackup

Finally my cut-out-node (see my comments to tickets 121 and 122).

The questions here are hard to decide. I propose you implement the cut-out node and a gzip compression as convenience functions in the libisofs API without yet exposing the underlying general mechanism. Both should yield special node types which encapsulate the filter entrails.

While getting the two proposed special filters to run you can explore advantages and drawbacks of filter implementations. We can also invent new useful special nodes and finally expose an API to build custom IsoFilterNode objects.

Too much preplanning now will only lead to delays in implementation and will nevertheless have to be partly revoked (with high probability). So better let's start to show some features and then abstract from them to a customizable re-implementation.

  Changed 5 years ago by scdbackup

All filter algorithms for now should yield the same number of bytes if applied to the same source object with the same parameters several times. Buffering data is a pain. We should leave it to the filter implementation to do it if its programmer has no better idea.

Changing source object content and the resulting length changes are no new problem and are already handled well by the corresponding MISHAP events.

Especially it must be harmless to read the content data an arbitrary number of times.

Filters put in question my wish about *_lseek() with IsoStream. I would possibly retract that random access wish in favor of a gzip filter. But I must be able to close and re-open the stream for multiple passes over the content.

  Changed 5 years ago by vreixo

  • status changed from new to assigned

The questions here are hard to decide. I propose you implement the cut-out node and a gzip compression as convenience functions in the libisofs API without yet exposing the underlying general mechanism.

I disagree. Filters are powerful. It would be great to expose them in an API, so that users can implement their own filters. There are lots of possible use cases for filters. We can't implement all of them. Let users do so. Of course, filters implemented inside libisofs will have a simple API.

custom IsoFilterNode objects.

What is this? A kind of IsoFileSource? A kind of IsoNode? A completely different thing?

So better let's start to show some features and then abstract from them to a customizable re-implementation.

yes, why not? But the customizable re-implementation must be part of the 0.6.3 API, in my opinion.

All filter algorithms for now should yield the same number of bytes if applied to the same source object with the same parameters several times.

yes

Buffering data is a pain.

But useful. Suppose a complex processing step, for example a filter that converts mp3 files to vorbis, or mpeg videos to xvid... a user may prefer to spend some MBs of temporary storage rather than apply the filter several times.

Implementation is not so hard. The first time a filter is applied, the contents are written to a temporary file. The following times, we read from that temporary file. A few extra lines of code, actually.

For me the problem is time cost, and not changes in source content. That case can't be handled properly, it is just a MISHAP.
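A minimal sketch of that caching scheme, using the standard C tmpfile() function. The names CachedFilter, cached_open, cached_read, cached_free and the demo producer are hypothetical; the "expensive filter" is represented by a callback that yields filtered bytes until it returns 0.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Callback standing in for an expensive filter: fills buf with up to cap
 * filtered bytes and returns the count, or 0 when no data is left. */
typedef size_t (*produce_fn)(void *state, unsigned char *buf, size_t cap);

typedef struct {
    produce_fn produce;
    void *state;
    FILE *cache;            /* NULL until the filter has been run once */
    int runs;               /* how often the filter was actually applied */
} CachedFilter;

/* Opens the filtered content for reading. The filter runs only on the
 * first call; later calls just rewind the temporary file. */
static int cached_open(CachedFilter *c)
{
    assert(c != NULL);
    if (c->cache == NULL) {
        unsigned char buf[4096];
        size_t n;

        c->cache = tmpfile();
        if (c->cache == NULL)
            return -1;
        while ((n = c->produce(c->state, buf, sizeof buf)) > 0) {
            if (fwrite(buf, 1, n, c->cache) != n) {
                fclose(c->cache);   /* do not keep a partial cache */
                c->cache = NULL;
                return -1;
            }
        }
        c->runs++;
    }
    rewind(c->cache);
    return 0;
}

static size_t cached_read(CachedFilter *c, unsigned char *buf, size_t count)
{
    return fread(buf, 1, count, c->cache);
}

static void cached_free(CachedFilter *c)
{
    if (c->cache != NULL) {
        fclose(c->cache);           /* tmpfile() removes the file on close */
        c->cache = NULL;
    }
}

/* Demo producer: emits a fixed message once, then signals end of data. */
typedef struct {
    const char *msg;
    int done;
} DemoState;

static size_t demo_produce(void *state, unsigned char *buf, size_t cap)
{
    DemoState *d = state;
    size_t n = strlen(d->msg);

    if (d->done || n > cap)
        return 0;
    memcpy(buf, d->msg, n);
    d->done = 1;
    return n;
}
```

The second cached_open() call does not touch the producer at all, which is the time saving argued for above; the cost is the disk space of the temporary file.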

I would possibly retract that random access wish in favor of a gzip filter. But I must be able to close and re-open the stream for multiple passes over the content.

See my comment to ticket #121. IsoStream has an is_repeatable() function. If it returns != 0, you are able to read the content several times. If not, that is simply not possible (for example, when reading from pipes).
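How a caller might use such a check before doing two passes (one to compute the size, one to write) can be sketched like this. DemoStream is a toy type invented for the example; the real IsoStream interface in libisofs.h has different members and signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stream for illustration; not the libisofs IsoStream. */
typedef struct DemoStream DemoStream;
struct DemoStream {
    int (*open)(DemoStream *s);
    size_t (*read)(DemoStream *s, unsigned char *buf, size_t count);
    int (*is_repeatable)(DemoStream *s);    /* != 0 if re-reading works */
    const unsigned char *bytes;
    size_t size, pos;
    int repeatable;
    int opened;
};

static int demo_open(DemoStream *s)
{
    if (s->opened && !s->repeatable)
        return -1;          /* behaves like a pipe: one pass only */
    s->opened = 1;
    s->pos = 0;
    return 0;
}

static size_t demo_read(DemoStream *s, unsigned char *buf, size_t count)
{
    size_t left = s->size - s->pos;
    size_t n = count < left ? count : left;

    memcpy(buf, s->bytes + s->pos, n);
    s->pos += n;
    return n;
}

static int demo_is_repeatable(DemoStream *s)
{
    return s->repeatable;
}

/* One pass over the content; returns the byte count or -1 on open failure. */
static long count_pass(DemoStream *s)
{
    unsigned char buf[64];
    size_t n;
    long total = 0;

    if (s->open(s) != 0)
        return -1;
    while ((n = s->read(s, buf, sizeof buf)) > 0)
        total += (long)n;
    return total;
}

/* Two passes, as needed for size computation followed by writing.
 * Returns 0 only if the stream is repeatable and both passes agree. */
static int two_passes(DemoStream *s)
{
    long first, second;

    if (!s->is_repeatable(s))
        return -1;
    first = count_pass(s);
    second = count_pass(s);
    return (first >= 0 && first == second) ? 0 : -1;
}
```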

  Changed 5 years ago by scdbackup

I can only reiterate my particular wish for reading data via a loaded IsoImage and its IsoNodes. IsoFilesystem is not a solution for that because it can only deal with read-only filesystems (i.e. those which reside in a single blob of data which can be accessed by a single IsoFileSource).

If this can be done with filters soon, fine. If not, then I would like to see a solution based on other means.

Similar: the cut-out nodes are needed to give xorriso full backup grade capabilities. It needs to be able to handle oversized files on-the-fly.

So the architectural beauty of any filter mechanism is of subordinate priority to me. I would prefer if it could stay encapsulated for a while.

  Changed 5 years ago by vreixo

I can only re-iterate my particular wish for reading data via a loaded IsoImage

Reading data is not the purpose of Filters. Filters modify data. You will need to use IsoStream to read data.

  Changed 5 years ago by vreixo

I've taken a look at the cut-out filter. It can be easily implemented as a Filter. But I think filters are not the best way to handle it. Reason: if we want to extract the bytes from 2 GB to 4 GB of a big file, we are forced to read (and discard) the first 2 GB. Bad option. A filter is like a pipe where input data gets modified and output. That's the reason filters take a stream of bytes. They should work with any stream of bytes. However, the cut-out filter can take advantage of the random-access reading of regular files, and indeed is only useful in this case.

I think this should rather be created as a CutOutStream! I know filters are also IsoStreams, but the idea of a filter is also that it is created once (with parameters) and then applied to existing IsoFiles. The cut-out "filter", however, fits better with the idea of a Builder. It is the builder (responsible for creating the IsoFile from the IsoSourceFile) that must take care of big files and cut them properly, using the CutOutStream instead of the usual IsoFileSourceStream.

Even better, we can modify the Builder interface to be able to return several IsoNodes from a single IsoFileSource (it has some other use cases, for example when auto-unpacking tars...)

We could wait until builders get implemented. However, given that in any case files greater than 4 GB can't be written, I propose to implement this in the current default Builder, exposed as an option like "follow_symlinks":
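The random-access advantage can be sketched with standard C stdio. The CutOut type and its functions are hypothetical names for this example. The sketch uses fseek() with long offsets for portability; real files in the multi-gigabyte range would need a 64-bit offset interface such as lseek() or fseeko() with off_t.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical cut-out reader: delivers only the bytes in
 * [start, start + size) of a seekable file. */
typedef struct {
    FILE *fp;
    long start;             /* first byte of the interval */
    long size;              /* interval length in bytes */
    long pos;               /* bytes already delivered */
} CutOut;

/* Positions the file directly at the interval start. Nothing before
 * the interval is read, unlike a pipe-style filter. */
static int cutout_open(CutOut *c, FILE *fp, long start, long size)
{
    assert(c != NULL && fp != NULL && start >= 0 && size >= 0);
    if (fseek(fp, start, SEEK_SET) != 0)
        return -1;          /* input does not support random access */
    c->fp = fp;
    c->start = start;
    c->size = size;
    c->pos = 0;
    return 0;
}

/* Reads up to count bytes but never past the end of the interval. */
static size_t cutout_read(CutOut *c, unsigned char *buf, size_t count)
{
    size_t left = (size_t)(c->size - c->pos);
    size_t n;

    if (count > left)
        count = left;
    n = fread(buf, 1, count, c->fp);
    c->pos += (long)n;
    return n;
}
```

From the outside this still behaves like a sequential stream, but it only works on inputs where the seek succeeds.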

    iso_tree_set_split_file_files(int split, off_t max_size, off_t split_size)

The idea is that if split is 1, files greater than max_size are split into split_size chunks, i.e., the builder generates several IsoFiles for each of them. Problem: automatic name generation. We can choose between a simple name generation that just appends a number at the end:

splited.file.1 splited.file.2 splited.file.3 ...

or let users specify it in a more complex way.
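The arithmetic behind the proposed option, together with the simple numbered naming scheme, can be sketched as follows. The helper names split_count, split_interval and split_name are hypothetical, and long long stands in for off_t to keep the example portable.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Number of nodes generated for a file; 1 means the file is not split. */
static long split_count(int split, long long file_size,
                        long long max_size, long long split_size)
{
    assert(split_size > 0);
    if (!split || file_size <= max_size)
        return 1;
    /* round up so that the last, shorter chunk is counted too */
    return (long)((file_size + split_size - 1) / split_size);
}

/* Byte interval covered by chunk number index (0-based). */
static void split_interval(long long file_size, long long split_size,
                           long index, long long *offset, long long *length)
{
    long long left;

    *offset = index * split_size;
    left = file_size - *offset;
    *length = left < split_size ? left : split_size;
}

/* Simple name generation: append ".1", ".2", ... to the original name.
 * Returns 0 on success, -1 if the result does not fit into out. */
static int split_name(char *out, size_t cap, const char *name, long index)
{
    int n = snprintf(out, cap, "%s.%ld", name, index + 1);

    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

For example, a 10 GiB file with max_size of 4 GiB and split_size of 2 GiB yields five nodes, each of which could be backed by a cut-out of the corresponding interval.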

  Changed 5 years ago by scdbackup

I personally, in my role as developer of scdbackup, would need explicit splitting which creates one node out of a byte interval. scdbackup does not only split because of the 2GB limit but also because files do not fit on a single medium.

The automatic splitting appears interesting for large media where all resulting nodes of a large file can be stored.

So the specialized CutOutStream which relies on the inner capability to perform random access reading seems necessary. (From outside it appears as a stream, but inside it needs lseek() or similar functionality and can only operate on input objects which allow random access reading.)

This raises the next question: which example to implement for filters. The xor "encryption" is not really presentable.

Implement some real encryption? Or a gzip based compressor?

  Changed 5 years ago by vreixo

I've created a new branch (https://code.launchpad.net/~metalpain2002/libisofs/vreixo-filters) for filter implementations. Please take a look at it. I've added a GZip compression filter; you can test it with the little demo/isoz program. It creates an image and gzips the regular files in the root directory.

Note that this branch adds a dependency on zlib. I'm not sure whether adding new dependencies is a good idea. What do you think? Given that filters are an obvious candidate for adding dependencies, I wonder whether we should create a new library (libisofs-filters) where the filters are implemented. The other alternative, conditional compilation, would be a pain together with dynamic linking, given that some apps may need a library compiled with a given filter. I think an optional libisofs-filters lib is the best solution for developers and packagers. Libisofs will define the filter interface and the functions to apply filters. The several filter implementations will be part of libisofs-filters.

I do not propose a new project; the filters library would be a new folder inside the libisofs project.

Comments? Ideas?

  Changed 5 years ago by scdbackup

Please no mandatory dependencies to other libraries. xorriso would inherit them.

Try to find an architecture where such additional dependencies and capabilities can be added at the installation site.

Currently scdbackup suffers from some mess-up about libreadline on some systems. I am not sure whether it is the admin or the system or libreadline, but it demonstrates that even such a simple dependency as libreadline has its pitfalls.

  Changed 5 years ago by scdbackup

I am not sure whether this gesture from xorriso-standalone configure.ac would suffice

    dnl Check whether there is readline-devel and readline-runtime.
    dnl If not, erase this macro which would enable use of readline(),add_history()
    READLINE_DEF="-DXorriso_with_readlinE"
    dnl The empty yes case obviously causes -lreadline to be linked
    AC_CHECK_HEADER(readline/readline.h, AC_CHECK_LIB(readline, readline, , READLINE_DEF= ), READLINE_DEF= )
    dnl The X= in the yes case prevents that -lreadline gets linked twice
    AC_CHECK_HEADER(readline/history.h, AC_CHECK_LIB(readline, add_history, X= , READLINE_DEF= ), READLINE_DEF= )
    AC_SUBST(READLINE_DEF)

The variable READLINE_DEF is then used in Makefile.am

    # No readline in the vanilla version because the necessary headers
    # are in a separate readline-development package.
    xorriso_xorriso_CFLAGS = -DXorriso_standalonE -DXorriso_with_maiN -DXorriso_with_regeX $(READLINE_DEF)

Inside xorriso.c there are #ifdef Xorriso_with_readlinE.

  Changed 5 years ago by vreixo

Replying to scdbackup:

The variable READLINE_DEF is then used in Makefile.am ... Inside xorriso.c there are #ifdef Xorriso_with_readlinE.

This is exactly what I don't want. Conditional compilation in a library is, in my opinion, a bad idea. For xorriso it is ok.

  Changed 5 years ago by scdbackup

Conditional compilation on a library is, in my opinion, a bad idea.

Then we need some other mechanism that does not prevent the use of libisofs without a growing number of other libraries. The filter concept calls for such dependencies. We want them. But only if they can be fulfilled.

When we had a similar problem between libisofs and libburn, we invented libisoburn. But we can hardly introduce an extra library for every single filter.

  Changed 5 years ago by vreixo

I really think the best alternative is to provide a new library, libisofilters. This new library will hold all filters that will be implemented in the future, and of course libisofs does not depend on it.

Applications may decide whether to depend on it or not.

Note that if in the future we have hundreds of filters, it may even be a good idea to create several libraries.

  Changed 4 years ago by scdbackup

One year later (April 2009):

libisofs-0.6.18 introduced built-in filters for zisofs and gzip compression, and an API for running external filter processes.


About possible implementation of more filters:

The interface of IsoStream is already public in libisofs.h. Meanwhile there are API functions to inspect and to remove filters.

What appears to be missing is a way for an application to install its own built-in filter on an IsoFile object. I think we should provide a base class for IsoStream.type == "user". It should offer a structured namespace wherein applications can choose their own filter class identifiers. We would implement an API function which retrieves that identifier from any stream of type "user".

  Changed 16 months ago by scdbackup

  • status changed from assigned to closed
  • resolution set to fixed

Various filters have been implemented with libisofs-0.6.18.
