Welcome to JiKe DevOps Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
745 views
in Technique[技术] by (71.8m points)

performance - Haskell: Can I perform several folds over the same lazy list without keeping list in memory?

My context is bioinformatics, next-generation sequencing in particular, but the problem is generic; so I will use a log file as an example.

The file is very large (Gigabytes large, compressed, so it will not fit in memory), but is easy to parse (each line is an entry), so we can easily write something like:

parse :: Lazy.ByteString -> [LogEntry]

Now, I have a lot of statistics that I would like to compute from the log file. It is easiest to write separate functions such as:

totalEntries = length
nrBots = sum . map fromEnum . map isBotEntry
averageTimeOfDay = histogram . map extractHour

All of these are of the form foldl' k z . map f.

The problem is that if I try to use them in the most natural way, like

main = do
    input <- Lazy.readFile "input.txt"
    let logEntries = parse input
        totalEntries' = totalEntries logEntries
        nrBots' = nrBots logEntries
        avgTOD = averageTimeOfDay logEntries
    print totalEntries'
    print nrBots'
    print avgTOD

This will allocate the whole list in memory, which is not what I want. I want the folds to be done synchronously, so that the cons cells can be garbage collected. If I compute only a single statistic, this is what happens.

I can write a single big function that does this, but it is non-composable code.

Alternatively, which is what I have been doing, I run each pass separately, but this reloads & uncompresses the file each time.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

Please log in or register to answer this question.

1 Answer

0 votes
by (71.8m points)

To process lazy data muiltiple times, in constant space, you can do three things:

  • re-build the lazy list from scratch n times
  • fuse n passes into a single sequential fold that does each step, in lock step.
  • use par to do n parallel traversals at the same time

Those are your options. The last one is the coolest :)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to JiKe DevOps Community for programmer and developer-Open, Learning and Share
...