Friday, February 6, 2009

Counting lines in your files

Recently, a friend of mine wanted to show everyone the size of various websites on servers he admins.


So he tried...

But the numbers started to come out a bit funny.

First 336M, then 123M.
That was supposed to be the size of the PHP source code.

No way, the sites are far too small for that, and... wait, 123M of PHP code? Show me such a code tree :P
For reference, 500 A4 pages full of text come to under 2M as plaintext, so 336M works out to roughly 336 * 250 = 84000 pages.
So, in this case, even the 123M is highly doubtful.

That came from:
du -h --max-depth=1 --exclude=*\.svn

It turned out that it wasn't really just PHP code; there was something else in there too. And after all, it's better to include the things we want to count than to exclude the things we don't (and pray that the exclude list is complete).
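As a sketch of that include-based approach (the `*.php` pattern and the current directory are just examples, not my friend's actual tree), you can hand du only the files you actually care about:

```shell
# Measure only the PHP files themselves; du -c appends a grand total.
# Caveat: with a very large number of files, xargs may run du several
# times and you get several "total" lines, so treat this as a quick
# estimate rather than an exact figure.
find . -name '*.php' -print0 | xargs -0 du -ch | tail -1
```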

Then he got line counts like 131407 and 316800 and asked if the following command line was OK:

export total=0; for c in `find . -name '*.php' | xargs wc -l | grep -Ev "total$" | awk '{ print $1 }'`; do total=$( echo "$total + $c" | bc); echo $total; done

Well, I thought it might even work (the way you might manage to turn a screw with pliers), but there are surely better ways.

So I went to a FreeBSD shell (csh) and, after testing for a while (I'm a bit rusty at this; it turns out that two or three years away from daily coding and fiddling with FreeBSD make you forget a lot), came up with this:

find . -name '*.php' | xargs wc -l
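If any of the file names might contain spaces or other odd characters, a null-separated variant of the same line is safer (both FreeBSD's and GNU's find/xargs support -print0 and -0):

```shell
# Same per-file line counts plus a total, but safe for file
# names containing spaces, quotes, or even newlines:
find . -name '*.php' -print0 | xargs -0 wc -l
```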

Or, if you have a very large number of files and xargs has to run wc more than once, you need:

echo `find . -name '*.php' | xargs wc -l | grep 'total$' | awk '{print $1}'` | sed 's| |+|g' | bc

FreeBSD's xargs defaults to 5000 arguments per command. You can increase that, but eventually the argument list gets too long and you get an error. So keep it reasonably small.
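Another way to sidestep the multiple-totals problem, assuming you only want the grand total and not the per-file counts, is to stream everything through a single wc:

```shell
# cat concatenates every file, so wc -l runs exactly once and
# prints exactly one number, no matter how many times xargs
# has to invoke cat for a huge file list:
find . -name '*.php' -print0 | xargs -0 cat | wc -l
```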

My friend was surprised that this code gives exactly the same result as his. You see, it IS possible to turn screws with pliers, but it's better to have a screwdriver, so you can work properly and not ruin your job.

I'm pretty obsessive about source code quality, and a large command line (or shell script) is pretty much the same thing. I recently found an article that's actually about Perl, but has one very good quote in it:

"some developers have adopted the style of writing their code as compact and "elegantly" as they can. The results can sometimes be programs that look more like dialup line noise than supportable code."

Well, it might be harder to write "dialup line noise" (@%/,$) in, for example, C, but it's surely possible to write unreadable code there too. And when code is unreadable, it can be insecure too. So shame on you, programmer, if you write code like that!

No, really, is it so hard to add more whitespace? To put "{" and "}" on individual lines? To use meaningful function and variable names, even file names... Add comments if/where you can.


Damn... This started as a shell tip and ended up being about coding style.

Anyway, my friend finally got some sane byte counts too (with wc -c): 4.2M and 11M.
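The byte counts work the same way; here's a sketch, again over a hypothetical `*.php` tree:

```shell
# Total size in bytes of just the PHP sources (wc -c instead of
# wc -l); piping through cat means there is only ever one total:
find . -name '*.php' -print0 | xargs -0 cat | wc -c
```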