Max area under histogram
Posted on June 26, 2015Preface
While in university I came across this problem which had, in my opinion, a very elegant solution (even though algorithms definitely weren’t my favorite course). It was also one of the 2 instances I remember using property based testing felt really rewarding.
Introduction to the problem
A histogram is what most people call a “bar graph”. It is a graphic representation of a sequence of numbers. What we want to find is the rectangle with the biggest area we can fit in it.
More precisely we want a function that given a sequence of numbers returns a set of pairs where the elements of the pair are respectively the start and end index of the maximal rectangles (there can be more than one of the same area).
Example
Given the histogram we want to get back since the rectangle starting at position 4 and ending at 6 has 3 bars: giving it a width of 3 and a height of 5 that is an area of 15.
Formulated
We will work with the following definition of :
- is a histogram represented by a sequence of natural numbers.
- is a set of pairs of numbers where and are respectively the start and end index of a rectangle.
- is a function that given an element of returns the area of the rectangle it represents.
- is a set containing all the rectangles of the same area such that there is no bigger rectangle contained under the same histogram (i.e. our result).
Correct solution
The above model can be directly used to create a program. An implementation in haskell could look something like:
h = [3, 1, 1, 5, 7, 9]
n = length h
allRectangles = [ (x,y) | x <- [1..n], y <- [1..n], x < y ]
allAreas = [ getArea r | r <- allRectangles ]
maxArea = maximum allAreas
results = [ r | r <- allRectangles, getArea r == maxArea ]
getArea :: (Int, Int) -> Int
getArea (start, end) = let width = end - start + 1
subhistogram = (take width (drop (start - 1) h))
height = minimum subhistogram
in height * width
print results
This solution although clearly correct is very slow. The asymptotic complexity of this algorithm is since the first list comprehension generates a set of pairs of all the numbers smaller than and all the following operations are just folds over the resulting list which have the same complexity.
A solution of this type is still very valuable even if we don’t want to use it directly. A faster solution will likely result in code that is much harder to read and reason about. Having a slow-but-correct implemenation as the one above allows us to write faster implementations and compare their behaviour on a subset of all the possible inputs.
Faster implementations
There are a few ways we can speed the program up according to the internet.
Divide and conquer
Divide and conquer is an option which as usually results in an solution. This solution has the advantage that it can be parallelized. This can be handy if we get a really big histogram. We will assume the inputs are reasonably small and so we will focus on an implementation that, though constrained to be run sequentially, is faster and more memory efficient. The modern hardware can then be leveraged by solving multiple histograms in parallel.
The “Incomplete problems” algorithm
The idea is quite verbously described here. From the amount of text it might seem like a complicated algorithm but it’s an exceptionaly nice one actually. The following widget illustrates its progress.
The algorithm makes use of a stack and simply moves over the histogram from left to right. Every time we encounter a column higher than the one we have on top of the stack we push it to the stack since it is the potential start of a rectangle (1. new arrow). A column of the same size as the top of the stack just means that all the rectangles in the stack can be one bar wider (2. arrow crosses to next bar). A column smaller than the top signals the end of a rectangle (3. this is where the arrows disappear in the above widget).
Each iteration falls into one of these cases:
- the height is bigger than the one on top of the stack or the stack is empty
- the height is equal to the one on top of the stack
- the height is smaller than the one on top of the stack
Haskell implementation:
h = [3, 1, 1, 5, 7, 9]
n = length h
(maxArea', stack') = foldr stepFn (0, []) h
stepFn :: (Int, [(Int, Int)]) -> (Int, Int) -> (Int, [Int])
-- CASE 1
stepFn (curMax, stack@[]) (i, x) = (curMax, [(i, x)])
stepFn (curMax, stack@((topIndex, topHeight):rest)) (i, x) | x > topHeight = (curMax, (i, x):stack)
-- CASE 2
stepFn (curMax, stack@((topIndex, topHeight):rest)) (i, x) | x == topHeight = (curMax, stack)
-- CASE 3
stepFn (curMax, stack@((topIndex, topHeight):rest)) (i, x) | x < topHeight =
let biggerElements = takeWhile (\(_, y) -> y > x) stack
curMax' = foldr (\(curMax, (j, y)) -> max curMax ((i - j) * y)) curMax biggerElements
in (curMax', rest)
print results
Seeing our implementation it might be clear why it is called the “Incomplete problems” algorithm. In case 1 we just push open rectangles representing potential solutions on the stack which we expand in case 2 and close in case 3.
Testing it
module Main (main) where
import Test.Hspec (describe, it, hspec, shouldBe)
import Test.QuickCheck (Gen, sized, choose, forAll, (==>))
import Data.List
import Control.Monad
import Histogram (histogramCorrect, histogramFast)
-- we define a histogram generator that accepts the width of the histogram as the argument `n`
histograms :: Gen [Int]
histograms = sized $ \n -> mapM (const $ choose (0, 100)) [1..n] -- it generates a histogram of `n` columns of height ranging from 0 to 100
main :: IO ()
main = hspec $ do
describe "Histogram" $ do
it "Quickcheck comparison" $
forAll histograms $ \xs -> (not $ all (==0) xs) ==> -- pick random histograms but skip the ones of all zeroes
sort (histogramCorrect xs) == sort (histogramFast xs) -- the results of the two implementations should always match
Running the tests through cabal test --show-details=direct --test-option=--maximum-generated-tests=10000
gives us
Running 1 test suites... Test suite histogram-testsuite: RUNNING... Histogram Quickcheck comparison Finished in 13.6331 seconds 1 example, 0 failures Test suite histogram-testsuite: PASS Test suite logged to: dist/test/histogram-0.1.0.0-histogram-testsuite.log 1 of 1 test suites (1 of 1 test cases) passed.
This generated 10000 histograms of various sizes, ran both the algorithms on them and compared their outputs. The --show-details=direct
tells cabal to directly attach the testing framework to the terminal so you can actually see the count of how many instances we already tested live updated in the terminal (it also allows it to use terminal colors).
Performance
TODO
Code
Links to solutions
More on Quickcheck
- intro-to-quickcheck
- the original paper
- longer tutorial - shows how to run QuickCheck without Hspec
Comments