< Back home

Max area under histogram

Posted on June 26, 2015

Preface

While in university I came across this problem which had, in my opinion, a very elegant solution (even though algorithms definitely weren’t my favorite course). It was also one of the 2 instances I remember using property based testing felt really rewarding.

Introduction to the problem

A histogram is what most people call a “bar graph”. It is a graphic representation of a sequence of numbers. What we want to find is the rectangle with the biggest area we can fit in it.

More precisely we want a function that given a sequence of numbers returns a set of pairs where the elements of the pair are respectively the start and end index of the maximal rectangles (there can be more than one of the same area).

Example

Given the histogram h = \{ 3, 1, 1, 5, 7, 9 \} we want to get back \{ (4,6) \} since the rectangle starting at position 4 and ending at 6 has 3 bars: 5, 7, 9 giving it a width of 3 and a height of 5 that is an area of 15.

Formulated

We will work with the following definition of \mathcal{H}:


\begin{aligned}
    allRectangles = & \{ (x,y) : x \leq y \land x, y \in \{ 1 \dots n \} \} \\
    area(r) = & \alpha \cdot (y-x+1) : r = (x,y), \alpha = \min_{i \in x \dots y}(h_i) \\
    maxArea = & \max_{s \in allRectangles}{area(s)} \\
    \mathcal{H} = & \{ r : area(r) = maxArea, r \in allRectangles \}
\end{aligned}

Correct solution

The above model can be directly used to create a program. An implementation in haskell could look something like:

h = [3, 1, 1, 5, 7, 9]
n = length h

allRectangles = [ (x,y) | x <- [1..n], y <- [1..n], x < y ]
allAreas = [ getArea r | r <- allRectangles ]
maxArea = maximum allAreas
results = [ r | r <- allRectangles, getArea r == maxArea ]

getArea :: (Int, Int) -> Int
getArea (start, end) = let width = end - start + 1
                           subhistogram = (take width (drop (start - 1) h))
                           height = minimum subhistogram
                       in height * width
                       
print results

This solution although clearly correct is very slow. The asymptotic complexity of this algorithm is \mathcal{O}(n^2) since the first list comprehension generates a set of pairs of all the numbers smaller than n and all the following operations are just folds over the resulting list which have the same complexity.

A solution of this type is still very valuable even if we don’t want to use it directly. A faster solution will likely result in code that is much harder to read and reason about. Having a slow-but-correct implemenation as the one above allows us to write faster implementations and compare their behaviour on a subset of all the possible inputs.

Faster implementations

There are a few ways we can speed the program up according to the internet.

Divide and conquer

Divide and conquer is an option which as usually results in an \mathcal{O}(n \log(n)) solution. This solution has the advantage that it can be parallelized. This can be handy if we get a really big histogram. We will assume the inputs are reasonably small and so we will focus on an \mathcal{O}(n) implementation that, though constrained to be run sequentially, is faster and more memory efficient. The modern hardware can then be leveraged by solving multiple histograms in parallel.

The “Incomplete problems” algorithm

The idea is quite verbously described here. From the amount of text it might seem like a complicated algorithm but it’s an exceptionaly nice one actually. The following widget illustrates its progress.

The algorithm makes use of a stack and simply moves over the histogram from left to right. Every time we encounter a column higher than the one we have on top of the stack we push it to the stack since it is the potential start of a rectangle (1. new arrow). A column of the same size as the top of the stack just means that all the rectangles in the stack can be one bar wider (2. arrow crosses to next bar). A column smaller than the top signals the end of a rectangle (3. this is where the arrows disappear in the above widget).

Each iteration falls into one of these cases:

  1. the height is bigger than the one on top of the stack or the stack is empty
  2. the height is equal to the one on top of the stack
  3. the height is smaller than the one on top of the stack

Haskell implementation:

h = [3, 1, 1, 5, 7, 9]
n = length h

(maxArea', stack') = foldr stepFn (0, []) h

stepFn :: (Int, [(Int, Int)]) -> (Int, Int) -> (Int, [Int])
-- CASE 1
stepFn (curMax, stack@[]) (i, x) = (curMax, [(i, x)])
stepFn (curMax, stack@((topIndex, topHeight):rest)) (i, x) | x > topHeight = (curMax, (i, x):stack)
-- CASE 2
stepFn (curMax, stack@((topIndex, topHeight):rest)) (i, x) | x == topHeight = (curMax, stack)
-- CASE 3
stepFn (curMax, stack@((topIndex, topHeight):rest)) (i, x) | x < topHeight =
    let biggerElements = takeWhile (\(_, y) -> y > x) stack
        curMax' = foldr (\(curMax, (j, y)) -> max curMax ((i - j) * y)) curMax biggerElements
    in (curMax', rest)

print results

Seeing our implementation it might be clear why it is called the “Incomplete problems” algorithm. In case 1 we just push open rectangles representing potential solutions on the stack which we expand in case 2 and close in case 3.

Testing it
module Main (main) where

import Test.Hspec (describe, it, hspec, shouldBe)
import Test.QuickCheck (Gen, sized, choose, forAll, (==>))

import Data.List
import Control.Monad

import Histogram (histogramCorrect, histogramFast)

-- we define a histogram generator that accepts the width of the histogram as the argument `n`
histograms :: Gen [Int]
histograms = sized $ \n -> mapM (const $ choose (0, 100)) [1..n] -- it generates a histogram of `n` columns of height ranging from 0 to 100

main :: IO ()
main = hspec $ do

  describe "Histogram" $ do

    it "Quickcheck comparison" $
      forAll histograms $ \xs -> (not $ all (==0) xs) ==> -- pick random histograms but skip the ones of all zeroes
        sort (histogramCorrect xs) == sort (histogramFast xs) -- the results of the two implementations should always match

Running the tests through cabal test --show-details=direct --test-option=--maximum-generated-tests=10000 gives us

Running 1 test suites...
Test suite histogram-testsuite: RUNNING...

Histogram
Quickcheck comparison

Finished in 13.6331 seconds
1 example, 0 failures
Test suite histogram-testsuite: PASS
Test suite logged to: dist/test/histogram-0.1.0.0-histogram-testsuite.log
1 of 1 test suites (1 of 1 test cases) passed.

This generated 10000 histograms of various sizes, ran both the algorithms on them and compared their outputs. The --show-details=direct tells cabal to directly attach the testing framework to the terminal so you can actually see the count of how many instances we already tested live updated in the terminal (it also allows it to use terminal colors).

Performance

TODO

Code

More on Quickcheck

Comments