1. Nestoria Summer Team Day

    Some photos from our Summer Team Day. A chance for us to step away from our terminals, and look at our challenges from a slightly different perspective.

    The morning of our team days is usually taken up with the “work part”, where our management team shares some of the ups and downs of the last few months. However, we’ve started holding much more regular meetings on business performance this year, so the work half of the team day was free for some new exercises.

    We gathered at the Dial Arch pub in Woolwich and took part in two exercises: a Nestoria Metrics themed quiz, and a team brainstorming exercise.

    I’m quite proud of the quiz, because I made it. It was called “Flags ‘n’ Numbers” and involved matching different metrics to different countries that we operate in. It proved to be quite challenging, and hopefully opened some eyes to the vast array of metrics we have available through our internal dashboards.

    After those two exercises, and some very tasty pizza provided by the pub, we headed off to our first of two afternoon “fun part” activities: Laser Tag at Bunker 51!

    This was absolutely fantastic, and I highly recommend it to anybody too wussy to try paintball (myself included!). We played four different 15-minute games - teams, agents, one-hit-kill free-for-all, and regular free-for-all. My favourite was agents, where two players are scored on how long they stay alive - but if you kill an agent you become one, so the teams are constantly changing every 10 seconds or so. We explored every corner of that bunker trying to hide and/or find those with red LEDs flashing on their chests.

    Our second afternoon activity was at Up at the O2, a company that literally allows you to climb up and over the O2 in Greenwich!

    The climb takes about an hour, but in reality it’s more like 15 minutes each side and 30 minutes at the top on the viewing platform. I would definitely recommend it to tourists who are only in London for a few days - much more hands-on than the London Eye.

    To finish up the day we had drinks and dinner together at the Blueprint Cafe, which is above the Design Museum. The food was amazing… so amazing that I forgot to take any photos of it! We were too busy eating ham, risotto, lamb, salmon, brownies, crumble and drinking delicious wine and beer I’m afraid.

    You come away from a day like that with two things: a stronger camaraderie with your colleagues, and an attitude that you’re ready for anything! A great way to embark upon the remainder of 2014 if you ask me!

     
  2. 7 strategies to quickly become productive in an unfamiliar codebase

    When starting a new job or moving on to a new project, you will rarely be working on a completely greenfield codebase. Getting to grips with unfamiliar code is a difficult process, and the amount of new information to take in can feel overwhelming. Coming to Nestoria from a Ruby background, this was doubly so for me: I was not only learning a new codebase but also learning Perl at the same time. Here are seven strategies that I used to get productive as quickly as possible.

    Be humble

    Humility might not be the first thing you think of when it comes to programming. After all, hubris is one of the Three Virtues of a Great Programmer. However, when confronted with unfamiliar legacy code you are likely to get demoralised by how often you come across things you don’t understand and by the number of mistakes you make. Humility is required to accept this and just get stuck in. Sometimes you will not understand a piece of code because it is hacky and badly written, and sometimes you will not understand it because you are not familiar with the domain, or simply because the underlying algorithm is inherently complex. Mistaking the latter for the former can waste a lot of time and will probably annoy your coworkers as well! Have the humility to assume that the original developer knew what they were doing, and wait until you really understand something before being too critical or making significant changes.

    Test first

    If your codebase is perfectly covered with unit and integration tests then you probably don’t need this guide. More likely, though, there are areas of the code that could do with better test coverage, or more reliable tests. By adding or improving tests you can improve the reliability of the codebase without the risk of breaking production code. Writing or rewriting tests forces you to really understand code in a way that is not possible when just reading it. Whatever you will be working on next, first spend some time reading the tests and then add new tests where you see gaps. It might be boring, but it will help a lot when you start writing production code. A good resource on how to work with unfamiliar code in this way is Therapeutic Refactoring by Katrina Owen.
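
    For instance, a first characterisation test can be as small as this (the module and function names below are invented purely for illustration):

    use strict;
    use warnings;
    use Test::More;

    # Hypothetical module name, just for illustration.
    use Listing::Price;

    # Pin down the behaviour you observe today, even if it looks odd;
    # question it later, once you understand the domain.
    is( Listing::Price::normalise('£250,000'), 250000, 'strips currency symbol and commas' );
    is( Listing::Price::normalise('POA'),      undef,  'price-on-application has no numeric value' );

    done_testing();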

    Make something

    As soon as possible get started shipping code. There is nothing that will get you familiar with the code quicker than working on it. Perhaps start with a small project and don’t worry too much about how long it takes to complete, concentrating just as much on learning as on completing the task at hand.

    Ask questions

    You could struggle along trying to work everything out for yourself, but you will get up to speed much more quickly by simply asking questions. Try to do a little research first, but asking dumb questions is better than struggling unnecessarily. The rest of the team can help by being open to questions and responding promptly. Even if a coworker’s question seems trivial, or you are not sure of the answer, new team members are more likely to ask questions if you respond to them in an encouraging way. It might be frustrating to be interrupted, but getting new team members up to speed quickly is beneficial for everyone in the long term. If you adopt an RTFM culture in your team, people are more likely to struggle in silence when a simple question could have saved them hours of work.

    Pair with someone

    Even better than asking questions is to pair with someone who is already familiar with the code. You get instant feedback and will pick up the numerous conventions that can be hard to pick up just from reading code. This requires discipline to be effective, and experienced team members have to make sure they don’t end up just taking over. Pairing is particularly useful for areas that you are completely unfamiliar with. For example, I didn’t know very much about Linux sysadmin, so when doing tasks that required Linux knowledge other team members would sit with me as I completed them. Code reviews can also provide useful feedback.

    Write the docs

    Once you have some experience with the code, start writing documentation when you feel it is needed. At first do this just for yourself, perhaps in a personal wiki page or even just in a text file. Other forms of documentation can also be useful, like creating and answering StackOverflow questions and then linking back to them from your own notes. Once you start to feel sure your documentation is correct and would be useful to others, start adding it to the code and/or official documentation.

    Zoom out

    Right at the beginning, it is beneficial to get a high level overview of the architecture of your system. Get someone to draw and explain a diagram of the architecture of the new system you will be working on and then try to map the different modules in the code to the diagram yourself.

    After a few months you will probably start to feel quite comfortable in your new codebase. You might only have touched a fraction of it, but you understand the conventions that the team uses and where all the most important pieces are. Now is the time to zoom out again and think about whether the architecture makes sense. How might you have done it differently, and how can you use your previous experience to improve it? Is the architecture as it was explained to you in the beginning actually how the code works in your experience? How would you explain it differently?

    I’m sure there are many other tactics to help get to know a codebase, but these are the ones that have helped me the most. Let us know about yours in the comments.

     
  3. Nestoria Devs at YAPC

    YAPC::EU, the European edition of Yet Another Perl Conference, took place this past weekend. As mentioned in a previous post, we sent along four of our developers: Alex, Sam, Ignacio and Tim. Here’s a brief (and photo-filled) summary of our time in София, България (that’s Sofia, Bulgaria for those of you who don’t read Cyrillic).

    I (Alex, Nestoria CTO) have been to a lot of YAPCs, but for my teammates Sam, Tim and Ignacio it was their first one. It was a great opportunity for them to dive deep into the Perl community and learn a huge amount in a short time.

    Thursday

    Unfortunately this year our flight was too late in the evening, and we ended up missing the traditional pre-conference drinks. I won’t be making this mistake again with any future YAPCs we attend - it sounded like we definitely missed out on a fun night.

    On the plus side our flight from London to Sofia was pleasantly uneventful, and our hotel - 10 minutes from the airport, 1 minute from the conference venue - was very nice. I think we all slept well and were ready for the conference to begin on Friday morning.

    Friday

    Our 1 minute walk from the hotel to the conference venue was nice - no danger of getting lost, just follow the nerdy T-Shirts. Unsurprisingly a lot of other Perl Mongers were staying at our hotel, and the hotel breakfasts got more social as the week went on.

    The venue was nice, especially the large room set aside for keynotes, lightning talks and the talks expected to be the most popular. Good chairs, good audio/visual equipment, and very helpful conference staff.

    A huge thank you and shout out to Marian Marinov and his team!

    And a smaller, but more personal, thank you to Marian for getting our banner printed in time for Friday despite me emailing him the PDF on Thursday morning :-) What do you think? I’m quite proud of it.

    Speaking of things I’m proud of, on Friday I spoke about the Nestoria Geocoder and the new OpenCage Data API that allows people outside Nestoria to take advantage of it. I think the talk was quite well received, although everybody’s geocoding challenges are a bit different so some audience members who wanted exact house-number addressing were disappointed.

    The scheduling committee had done a nice job this year of grouping together similar talks, which meant that my talk kicked off an afternoon of Geo-related presentations. I particularly enjoyed Hakim Cassimally’s talk on Civic Hacking. I hadn’t realised that MySociety’s projects were being used in Africa and Asia as well as within the UK - very cool!

    As usual after the main tracks ended we had the lightning talks. I spoke again - this time about Test Kit 2.0, a slightly shorter version of a talk I gave at a recent London.pm Technical Meeting. Hopefully I convinced a few other developers to delete all the boilerplate from their .t files.

    After the lightning talks, Curtis “Ovid” Poe gave a fantastic keynote about managerless companies. He started out comparing the extremely hierarchical companies of the 90s and 00s with feudal society in Britain centuries ago, and then went on to give some great real-world examples of companies being run differently and how they are succeeding. As well as the usual tech examples of Valve Software and GitHub, he mentioned some non-tech companies, such as Semco in Brazil, which was certainly eye-opening for me. At Nestoria we are pretty good at hiring smart people and giving them the freedom to solve problems in whatever way they see fit; but going truly managerless is a big step up from that, and it led to some great discussions between me and my devs.

    Friday ended with the traditional conference dinner, with the traditional challenges of getting a few hundred developers onto a few coaches and to a very very large restaurant. The food was very tasty, and very plentiful; we had fresh bread rolls, two starters, then some Bulgarian folk dance as entertainment, followed by a large main and a very tasty dessert. But the food was definitely topped by the view: the restaurant was on a lake in the Bulgarian countryside, and the sight was stunning.

    Saturday

    Saturday morning started out with a small Dev Ops track for me, while Sam and Ignacio went to some Web talks, and Tim saw some presentations about search and data.

    For my part I really enjoyed Marian’s talk about creating Linux containers with Perl, and look forward to his libraries being finished and up on CPAN.

    After lunch was pretty much an MST-fest, as Matt S Trout gave a 50 minute talk on Devops Logique and a 50 minute keynote on The State of the Velociraptor. Both were very interesting, and I had to smile when the topic of Prolog came up - back in university we studied Prolog and Haskell in our first year, quite an unusual introduction to programming I think.

    Before the keynote came the second day of lightning talks, and the second day where I gave a talk. This time around I talked about this very blog - and announced live that this month’s Module of the Month winner was Tim Bunce for Devel::NYTProf. Unsurprisingly Tim got a thunderous round of applause, despite not being there this year.

    Dan Muey of cPanel gave a great talk about Unicode and Perl which definitely resonated with me; by which I mean it exactly matched our Unicode style guide :-)

    Sunday

    Sam and Ignacio were very excited for Sunday, as that seemed to be where all the web related talks went. Sawyer X gave a particularly good introduction to Plack and PSGI, and then went on to share how Booking.com has managed to gradually shift over to PSGI running on uWSGI. I learned a huge amount, and I hope we can make a similar shift at Nestoria sometime soon.

    Susanne Schmidt (Su-Shee) also gave a wonderful introduction to the wide and not-so-varied world of web frameworks. In preparation she had built the same application - a cat GIF browser, naturally - in about 10-20 different frameworks across 5-10 different languages! Unsurprisingly a lot of them are almost identical - Dancer, Sinatra, Django, Rails and Mojolicious all seem to have borrowed ideas from one another over the years. I had no idea, though, that R (yes, the statistics language) has a web framework! And it’s pretty nice too: you can produce some really great graphs and charts with very little code.

    I’d also like to shout out Tatsuro Hisamori (aka まいんだー) for coming over from Japan to tell us how he sped up his test suite from 40 minutes to 3 minutes. We actually have a pretty similar set-up here at Nestoria - spreading different groups of tests over different VMs, with lots of parallelisation and a home-grown web interface to the results. Their project Ukigumo has a web interface that looks scarily similar to ours.

    To round out the day Sawyer reprised his The Joy In What We Do keynote from YAPC::NA. It’s a touching tale of how he learned programming, and Perl, and how we should all take time to reflect on how fun programming can be. The talk ends with some of the Perl language features and CPAN libraries we should be proud of and should be talking about in the wider programming community. All in all I left feeling pretty happy to be a Perl dev.


    So that was YAPC::EU 2014! It was an absolute blast, and we can’t wait to sponsor and send some devs over to Granada for YAPC::EU 2015 next September.

    Of course we don’t have to wait that long for the next Perl event. We’re sponsoring and attending The London Perl Workshop 2014 in November - hope to see you there!

     
  4. Open Geo interview series

    A quick post to let you know that our sister brand, OpenCage Data, has launched an interview series with thought leaders from the Open Geo world over on their blog. It’s a fast-moving space and they’re hoping to provide a forum to feature some of the various innovations. Give it a read (and check out OpenCage in general), and let them know who else should be interviewed.

     
  5. Module of the month August 2014: Devel::NYTProf

    Welcome to another Module of the Month blog post, a recurring post in which we highlight particular modules, projects or tools that we use here at Nestoria.

    This month’s award goes to the amazing Devel::NYTProf, simply the best Perl code profiler there is and one of the most powerful tools you can reach for when working on a large and complex code base.

    Let’s start out with some quotes from some rather intelligent blokes:

    Prove where the bottleneck is

    "Bottlenecks occur in surprising places, so don’t try to second guess and put in a speed hack until you have proven that’s where the bottleneck is." — Rob Pike

    Don’t do it yet

    "The First Rule of Program Optimization: Don’t do it. The Second Rule of Program Optimization (for experts only!): Don’t do it yet." — Michael A. Jackson

    … but only after the code has been identified

    "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified" — Donald Knuth

    All three of these quotes point towards a single truth: when your code is slow, and you suspect it could be faster, reach for your profiler!

    And of course, the more powerful and feature-rich your profiler is, and the more code you can easily point it at, the more bottlenecks and potential optimization sites you will find.
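
    If you’ve not used it before, the basic workflow is delightfully simple - roughly this (just a sketch; the Devel::NYTProf documentation covers the many options):

    # run your code under the profiler; this writes a nytprof.out file
    perl -d:NYTProf your_script.pl

    # turn that into an HTML report under ./nytprof/ and start digging
    nytprofhtml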


    At Nestoria we have wielded the great Devel::NYTProf against most areas of our code. It’s helped us get our internal geocoding down to 100ms per listing, which means we can re-geocode an entire country of listings in less than 24 hours. It’s helped us respond to over 90% of website requests in less than 200ms. And it’s helped us process all of our metrics logs before 8am every day, so that our commercial team can quickly act on those numbers and do their jobs well.

    Most recently we have been using Devel::NYTProf::Apache (also from TIMB!) to profile our website in production. By using the addpid option we have each Apache child process write its own nytprof.out file, and we then merge those files together with nytprofmerge. We have around 30 Apache children at any given time, so we end up with 5 hours of profiling information from only 10 minutes of real time during which our site is slower for our users.

    (Note: we do turn off statement level profiling with stmts=0 and make sure to write the nytprof.out files to a ramdisk. Without those two hacks the site falls over.)
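
    Roughly, the production setup looks like the sketch below - the exact option spellings here are from memory, so double-check them against the Devel::NYTProf::Apache documentation:

    # httpd.conf: load the profiler into every Apache child
    # (assumes mod_perl; see the Devel::NYTProf::Apache docs)
    PerlPassEnv NYTPROF
    PerlModule  Devel::NYTProf::Apache

    # exported in the environment Apache is started from: one output file
    # per child (addpid), no statement-level profiling (stmts=0), and the
    # output written to a ramdisk
    export NYTPROF=addpid=1:stmts=0:file=/dev/shm/nytprof.out

    # afterwards, merge the per-child files and generate an HTML report
    nytprofmerge --out=/dev/shm/nytprof-merged.out /dev/shm/nytprof.out.*
    nytprofhtml --file=/dev/shm/nytprof-merged.out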


    So thank you Tim, for Devel::NYTProf and for everything else you’ve done, and for being one of the nicest people in the Perl community :-)

    Enjoy your $1 per week Gittip donation from us!

     
  6. Happy CPAN Day!

    Saturday August 16th 2014 is CPAN Day, 19 years since Andreas König uploaded Symdump 1.20 to our favourite comprehensive archive network.

    Neil Bowers has been writing all about it on blogs.perl.org over the last few weeks. His posts have been about improving your CPAN distributions with better documentation, test coverage, and community involvement with his “thank a CPAN author” suggestion.

    I thought we could join in by giving a quick run-down of some of the distributions we’ve released to CPAN over the years. These were released under several different CPAN author accounts, but they can all be found in one place on our company GitHub page: https://github.com/lokku.

    Geo::What3Words

    As we blogged about just a couple of days ago, we wrote the Perl library for interfacing with the What3Words API.

    Geo::Coder::OpenCage

    The Perl interface to our sister company OpenCage Data's geocoder API.

    I will be talking about this in detail at YAPC::EU in Sofia next Friday :-)

    Geo::Coder::Many

    Can you tell we’re big geo nerds?

    Geo::Coder::Many is a way to multiplex requests between multiple remote geocoding APIs, such as Yahoo!’s PlaceFinder, Google’s Geocoder v3, and of course our own OpenCage geocoder.

    It can handle caching, and adjust the likelihood of hitting a particular API based on that API’s daily limits.

    File::CleanupTask

    This is a very powerful configuration-based tool for handling “cleaning up” (moving, archiving, or just deleting) of old unwanted files. It’s great for keeping all your logs around for as long as you need them, but automating the job of tidying them up when they are no longer useful.

    We like to use symbolic links a lot to handle atomic changes of code, configuration, and data, and so we built into File::CleanupTask the ability to avoid deleting a file if it is symlinked from another directory, no matter how old it is. That way you get the automated cleanup without the danger of deleting something which is still in use.

    CSS::SpriteMaker

    Create sprite images and their associated CSS files to speed up your website and make your users happier. This is a great technique which can shave a huge amount of per-image overhead off your file sizes and your request times.

    Savio Dimatteo, who has sadly since left Nestoria and London (we miss you Savio!), spoke about this at YAPC::EU 2013 in Kiev. Here are some slides and a video.

    Algorithm::DependencySolver

    This is a very abstract and algorithmic module which takes operations with dependencies, things which will be affected by the operation, and prerequisites for the operation, and then attempts to automatically derive an ordering for those operations to be run in.

    It can be very useful when dealing with long complicated pipelines which manipulate objects. We used to use it in our Geobuild to help determine that the URL Creator needed to be run before the URL Deduplicator which needed to be run before the Place Deduplicator.

    It greatly aids debugging by outputting ASCII or PNG graphs of the operations.

    Number::Format::SouthAsian

    Did you know that in South Asia the number ten million is called “one crore” and is written “1,00,00,000”?

    We didn’t before we launched Nestoria India, but when we found out it immediately seemed like something that belongs on the CPAN.
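
    Usage is about as simple as it gets - roughly like this (I’m quoting the option name from memory, so check the module’s POD):

    use strict;
    use warnings;
    use feature 'say';

    use Number::Format::SouthAsian;

    # digit grouping in the South Asian style
    my $formatter = Number::Format::SouthAsian->new();
    say $formatter->format_number(10_000_000);    # 1,00,00,000

    # with words => 1 you get "lakh" and "crore" style output
    my $wordy = Number::Format::SouthAsian->new( words => 1 );
    say $wordy->format_number(10_000_000);        # 1 crore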

    Big thanks to Wikipedia for basically writing my tests for me, to CPAN Testers for catching 32 bit bugs in earlier versions, and to Larry for making sure numbers like 1000000000000000000000000000000000000000 just work in my favourite language ;-)

    WebService::Nestoria::Search

    Last but not least, we of course have our module for interfacing with the Nestoria Search API.

    This gives you access to our property listings, of course, but also has an interface to our average house price data. Do with it what you like :-) More details at http://www.nestoria.co.uk/help/api-tech.
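
    A minimal example looks roughly like this - the parameter and accessor names below are from memory, so treat it as a sketch and defer to the module’s POD:

    use strict;
    use warnings;
    use feature 'say';

    use WebService::Nestoria::Search;

    # flats to rent around Soho
    my $ws = WebService::Nestoria::Search->new( country => 'uk' );

    my @listings = $ws->results(
        place_name   => 'soho',
        listing_type => 'rent',
    );

    for my $listing (@listings) {
        say $listing->get_title, ' - ', $listing->get_price;
    }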


    So my suggestion, adding to Neil’s long list, is: open up some source code and release a CPAN distribution based on your work’s codebase this CPAN Day!

     
  7. Supporting searching using What3Words

    This week we’ve rolled out a new feature on Nestoria - users can now search for a precise location using What3Words codes. All the details from the user perspective are over on the main Nestoria blog, this post is a bit more about the how.

    First up, what is What3Words? Basically w3w is a much more human-friendly longitude and latitude. The team at What3Words has chopped up the globe into 3m by 3m squares and assigned each square a unique sequence of three words. By using their app or website, people can figure out the exact sequence for a specific location and then, rather than sharing an address, share that three-word sequence. Using words has a few advantages: humans are much better at remembering words than long strings of numbers, and voice recognition systems are better at correctly identifying unique combinations of three words than lengthy and sometimes ambiguous addresses. A three-word code is particularly useful if you’re trying to refer to a location that has no good address, for example inviting your friends to a picnic in the middle of a large park, or if you’re in a country without good addresses, which is the case in some of the countries like India and Brazil where we operate Nestoria. Finally, it’s very easy to share via a short link. The w3w team of course does a good job explaining the whole thing over on their site, which I won’t rehash here.

    What3Words has a well-documented API, but when we started the project there was no Perl library for interacting with it. I’m pleased to say there now is. Meet Geo::What3Words, written by our very own Marc Tobias, who these days spends most of his time working on all things geo over at our sister project OpenCage. By the way, OpenCage has its own blog, which you should also be reading as we’re doing cool stuff over there.

    So now it was just a matter of extending our query analysis code to identify w3w queries. Happily this is pretty easy, as there are just two formats - a query can be three words joined by full stops (i.e. of the format word1.word2.word3) or it can be what they call a OneWord, which starts with an asterisk. In either case, we identify the query as being a w3w query, pass the request on to the w3w API via Geo::What3Words, get back the coordinates so that we can query our database of properties, and finally display the results to the user. In practice there are a few other considerations and edge cases - for example, w3w codes can be in various languages, which is handy for us as we’re now operating in six languages - but on the whole the process was straightforward. It’s hard to predict whether using w3w codes to refer to locations will take off, but it was a fairly straightforward implementation, and we’re proud to be leading the way on anything that makes searching for property easier.
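
    In simplified form, the detection step looks something like the sketch below; the real code handles more edge cases, and w3w_to_coordinates here is just a stand-in for the lookup we do through Geo::What3Words:

    sub analyse_query {
        my ($query) = @_;

        if ( $query =~ /^\*\w+$/                        # a OneWord, starting with an asterisk
          || $query =~ /^\p{L}+\.\p{L}+\.\p{L}+$/ ) {   # three words joined by full stops
            my ($lat, $lon) = w3w_to_coordinates($query);
            return { type => 'w3w', lat => $lat, lon => $lon };
        }

        return { type => 'place_name', text => $query };
    }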

    For a final bit of w3w fun, our very own Ignacio also whipped up a node wrapper for the API. Enjoy (and by enjoy I of course mean we look forward to your pull requests).

    Many thanks to all the team members who worked on this project, and happy searching,

    Ed (@freyfogle), written from surround.bonkers.inside.

     
  8. The Unicode section of our style guide

    Most software companies will have a list of code conventions. Ours is based on Perl Best Practices by Damian Conway, with some modifications.

    Unicode is one area that Conway’s book doesn’t delve into much. Unicode in Perl can be quite a jumble of knots to the uninitiated, but a few well-chosen conventions can solve the most common problems. With them in place, it’s easier to feel confident about implementing internationalisation correctly, rather than shying away from even attempting it.

    Our Unicode style guide is deliberately much shorter than Tom Christiansen’s infamous StackOverflow post on Perl and Unicode. At Nestoria, we only have to deal with Latin-like languages (so far), and for the most part, we just need to avoid mojibake. The fewer rules there are, the less intimidating, and the more likely they’ll get understood and followed.

    So without much further ado, here it is!

    Perl Unicode style guide, by Lokku:

    The Unicode style guide consists of two parts:

    1. Definitions
    2. Rules and conventions

    Definitions:

    It’s important that we use terms like “character” and “string” consistently throughout our codebase to avoid confusion. In the wild, programmers can use quite different meanings for these terms, so we need to define them here.

    Strings:

    A string in Perl is an ordered sequence of string elements. Each element has an ordinal value, which is always a non-negative integer (i.e. it may be zero).

    To split a string into an array of its elements, call the built-in subroutine split, passing it an empty regex as its first parameter.

    my $string = "\x{00}\x{01}\x{10}\x{266A}";
    
    my @elements = split //, $string;
    
    say scalar(@elements);
    # prints 4
    
    say join(", ", map { ord($_) } @elements);
    # prints 0, 1, 16, 9834
    
    say join(", ", map { sprintf("%02x", ord($_)) } @elements);
    # prints 00, 01, 10, 266a
    

    An octet string is a sequence of string elements, where each element represents an octet (an 8-bit byte). As a consequence of this, the ordinal value of each string element is between 0 and 255 inclusive.

    Here’s an example of obtaining an octet string:

    open my $fh, "<:raw", $image_filename or die $!;
    my $octet_line = readline($fh);
    close $fh or die $!;
    

    A character string is a sequence of string elements, where each element represents a Unicode code point. As a result of this, the ordinal value of each string element can range from 0 up into the millions.

    Here’s an example of a character string:

    my $character_string = "I\x{2661}U"; # I♡U
    
    my @elements = split //, $character_string;
    say scalar(@elements);
    # prints 3
    
    say join(", ", map { uc(sprintf("U+%04x", ord($_))) } @elements);
    # prints U+0049, U+2661, U+0055
    

    To differentiate between octet strings and character strings, name octet string variables with the suffix or prefix octet. For character strings, do not add a suffix or a prefix.

    my $line_octet; # the name indicates that this is an octet string
    
    my $line; # the name indicates that this variable is a character string,
              # but check to see if the style guide was followed.
    

    Character:

    The Lokku convention is to define the term character so that it corresponds exactly to the term Unicode code point.

    For example, this string:

    I♡U
    

    contains three characters or code points: U+0049, U+2661 and U+0055.

    Unicode has the concept of combining characters, which, when combined with other normal characters, form a single glyph:

    my $string = "\x{61}\x{0300}\x{0320}";
    
    # This is three Unicode characters, but it is displayed as single glyph:
    # à̠
    

    So it’s important to understand the difference between octets, characters, and glyphs.

    Other codebases may use differing definitions for the word character. It is common for C programs to refer to octets as characters for example, or for graphic designers to refer to glyphs as characters.

    Wide character:

    A wide character is a character whose code point is greater than U+00FF.

    Character repertoire:

    A character repertoire is a defined set of characters. For example, you could define a character repertoire “X1” as being the Unicode characters U+0030 to U+0039 (which are the digits 0 to 9).

    Other people also use the term “character set” to describe this. (Sadly, some people (and specs) mistakenly use the term “character set” to mean “character encoding”, so repertoire is the unambiguous term to be used.)

    Character encoding:

    A character encoding is an algorithm that translates characters into octets, and vice-versa. Some encodings support all the characters defined by the Unicode standard, some only support a specific character repertoire.

    An example of a well-known character encoding is ASCII. It only supports the Unicode characters U+0000 to U+007F, and each character is encoded to exactly one corresponding octet.

    Another example of a well-known character encoding is UTF-8. It supports all the Unicode characters, and each character is encoded using one, two or more octets.

    When you are in the position to choose which character encoding to use, choose UTF-8.
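
    As a quick illustration, the single character U+266A (♪) cannot be encoded in ASCII at all, while UTF-8 encodes it as three octets:

    use Encode ();

    my $character_string = "\x{266A}"; # one character

    my $octet_string = Encode::encode(
        "UTF-8", $character_string, Encode::FB_CROAK | Encode::LEAVE_SRC
    );

    say join(", ", map { sprintf("%02x", ord($_)) } split //, $octet_string);
    # prints e2, 99, aa -- one character became three octets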

    The internal UTF-8 flag on a Perl string:

    In Perl’s C implementation behind the scenes, Perl may store each string in memory using one of two encodings: iso-8859-1, or UTF-8.

    The difference is only visible to a Perl program through a string’s UTF-8 flag, or by using special hackery like Devel::Peek.

    It’s important to realise that a string’s internal UTF-8 flag does not necessarily imply whether that string is an octet string or a character string. Only the way a string is used can determine whether it is an octet string or a character string.

    However, many libraries do incorrectly use the internal UTF-8 flag to indicate whether the string is meant to be a character string or an octet string. Because of this, Perl programmers need to know what the UTF-8 flag is and how to handle it.

    Because of the limitations of iso-8859-1, it’s simply impossible to store an element whose ordinal value is greater than 255 in a Perl string while keeping its flag off. Perl strings with the UTF-8 flag on can contain elements of any ordinal value.

    # This returns whether the $string has the UTF-8 flag turned on:
    # (Note that "use utf8" is not required to run "utf8::is_utf8")
    utf8::is_utf8($string);
    
    $string = "caf\x{e9}";
    
    # The result of the following operation is that $string's UTF-8 flag is on,
    # and the string elements remain the same. If the flag was on already, this
    # line is a no-op. If it was off, then Perl internally converts the bytes
    # stored in memory, and sets the flag to true.
    utf8::upgrade $string;
    
    say join(", ", map { ord($_) } split(//, $string));
    # prints 99, 97, 102, 233
    
    say length($string);
    # prints 4
    
    
    # The result of the following operation is that $string's UTF-8 flag is
    # off, and the elements remain the same (or it throws an exception). If the
    # flag was already off, this operation would be a no-op. Since it was on,
    # Perl internally converts the bytes stored in memory, and it sets the flag
    # to false. Had the string contained an element whose ordinal value was
    # greater than 255, this operation would have died:
    utf8::downgrade $string; 
    
    # As you can see, the string's internal UTF-8 flag does not affect the
    # sequence of elements visible to the Perl script:
    say join(", ", map { ord($_) } split //, $string)
    # prints 99, 97, 102, 233
    
    say length($string);
    # prints 4
    

    To check whether a string scalar has the internal UTF-8 flag set, call utf8::is_utf8. You do not need to include use utf8; for this to work, which is important because including use utf8; does have a side-effect (it declares that the source code is encoded in UTF-8).

    Rules and conventions:

    • When naming octet string variables, use the prefix or the suffix octet.

      open my $fh, "<:raw", $filename or die $!;
      my $octet_line = readline($fh);
      
    • When encoding a character string into an octet string in UTF-8, use Encode::encode with the FB_CROAK and the LEAVE_SRC flags set. The first argument should be "UTF-8", not "utf8", because the latter’s behaviour is problematic.

      my $octet_string = Encode::encode(
          "UTF-8", $character_string, Encode::FB_CROAK | Encode::LEAVE_SRC
      );
      
    • When decoding an octet string encoded in UTF-8 to a character string, use Encode::decode with the FB_CROAK and the LEAVE_SRC flags set. The first argument should be "UTF-8", not "utf8", because the latter’s behaviour is problematic.

      my $character_string = Encode::decode(
          "UTF-8", $octet_string, Encode::FB_CROAK | Encode::LEAVE_SRC
      );
      
    • For textual data, use character strings, not octet strings.

    • When reading and writing octets to a file, use the :raw I/O layer, which is preferable to the default layer, as it doesn’t treat newlines specially.

      use autodie;
      
      open my $fh_write, ">:raw", $filename;
      print $fh_write $octet_string;
      close $fh_write;
      
    • When reading and writing text to a UTF-8 encoded file, take advantage of the :encoding(UTF-8) I/O layer, which is faster than calling Encode's subroutines manually.

      use autodie;
      
      open my $fh_read, "<:encoding(UTF-8)", $filename;
      while ( my $line = readline($fh_read) ) { ... }
      close $fh_read;
      
      open my $fh_write, ">:encoding(UTF-8)", $filename;
      print $fh_write $character_string;
      close $fh_write;
      
    • Do not use the default I/O layer, but always specify one (such as :raw or :encoding(UTF-8) or even :encoding(ASCII):crlf).

    • Never use the :utf8 I/O layer, but always use the :encoding(UTF-8) I/O layer instead. This Perl Critic policy enforces this rule, and its POD contains a good explanation for the reasons why.

      # Good:
      open my $fh, "<:encoding(UTF-8)", $filename or die $!;
      
      # Bad:
      open my $fh, "<:utf8", $filename or die $!;
      
    • When uppercasing, lowercasing, or performing a regex on a character string, make sure the use feature "unicode_strings" pragma is enabled.

      use feature "unicode_strings";
      
      $character_string = lc($character_string);
      $character_string = uc($character_string);
      $character_string =~ s/\w//;
      

      Note: In older versions of Perl (prior to v5.16), this pragma was only partially effective, and you would have to use a work-around. This is known as the “Unicode bug” in Perl’s documentation.

    • When dumping a data structure to an evalable string, always turn on $Data::Dumper::Useqq.

      my $string = "caf\x{e9}"; # café
      
      # Bad:
      print Dumper($string);
      # prints $VAR1 = "café", which could cause bugs when being evaled.
      
      # Good:
      local $Data::Dumper::Useqq = 1;
      print Dumper($string);
      # prints $VAR1 = "caf\351", which is always evaled correctly.
      
    • When evaluating a character string, always enable use feature "unicode_eval" (available from Perl 5.16). If you don’t, you could cause Perl to crash in the right circumstances.

      use feature "unicode_eval";
      eval $character_string;
      
    • When evaluating an octet string, always use the evalbytes function (available from Perl 5.16), not eval.

      use feature "evalbytes";
      evalbytes $octet_string;
      
    • Never use Encode::encode_utf8 or Encode::decode_utf8. Use Encode::encode("UTF-8", ...) and Encode::decode("UTF-8", ...) instead.

    • Never use Encode::_utf8_on or Encode::_utf8_off. You might be looking for utf8::upgrade or utf8::downgrade instead.

    • When creating new text files, make sure they are encoded in UTF-8, unless you have a good reason not to.

    • When creating a new Perl file, always add use utf8; to the top of the file. This indicates that the source code should be interpreted as UTF-8, not Latin-1, which is good since UTF-8 is the default encoding of most text editors.

      use utf8;
      
    • When creating a new Perl script, add the following at the top. This will encode strings automatically using UTF-8 before printing to STDOUT and STDERR, which is what most terminals expect nowadays. (Make sure you’re printing character strings, not octet strings, though.)

      use utf8; # see previous rule
      
      binmode STDOUT, ":encoding(UTF-8)";
      binmode STDERR, ":encoding(UTF-8)";
      
    • When creating a new test, add the following at the top. This will encode test messages in UTF-8 automatically. You can get Test::Kit to do this automatically.

      use utf8; # see previous rule
      
      binmode Test::More->builder->output, ":encoding(UTF-8)";
      binmode Test::More->builder->failure_output, ":encoding(UTF-8)"; 
      binmode Test::More->builder->todo_output, ":encoding(UTF-8)";
      
    • Always pass UTF-8 to the Encode module, not utf8. As a rule of thumb, always prefer capitalising the term UTF-8, and including the dash. It’s the proper name for the encoding, and every time the difference matters, experience has shown that this spelling is best.

    • When including special characters in the source code of a file, always escape them using the \x{..} notation. Control characters and U+0080 and above are all considered special characters. Using the \x{..} notation ensures the same behaviour regardless of whether use utf8 was specified. If this rule is followed, then the use utf8 rule is just a safeguard.

      # Good:
      my $message = "I\x{2661}U"; # I♡U
      
      # Bad:
      my $message = "I♡U";
      
    • When creating an HTML file, make sure that the encoding of the HTML file is specified.

      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
      
    • When serving an HTTP request, make sure that the charset specified in the Content-Type header specifies the correct encoding, hopefully UTF-8. The default charset is not UTF-8, so do specify it for both HTML and JSON responses.
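
      For example, in a PSGI application the response could be built along these lines (a sketch only; adapt it to whatever framework actually serves the request, and note that $html here is a character string):

      my $octet_body = Encode::encode(
          "UTF-8", $html, Encode::FB_CROAK | Encode::LEAVE_SRC
      );

      return [
          200,
          [ "Content-Type" => "text/html; charset=UTF-8" ],
          [ $octet_body ],
      ];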

    • Make sure your terminal is configured to use UTF-8.

      $ locale
      LANG=en_GB.utf8
      
    • Include special characters in the data for your tests.

      Here are three good test strings:

      $test_string1 = "D\x{c9}COR NA\x{cf}VE"; # DÉCOR NAÏVE
      utf8::upgrade $test_string1;
      
      $test_string2 = $test_string1;
      utf8::downgrade $test_string2;
      
      # The two previous strings don't contain wide characters, and are good
      # for exposing weird behaviour, including what MySQL means by latin1.
      # It's important to check that the code's behaviour does not change
      # depending on whether the internal UTF-8 flag is set.
      
      $test_string3 = "I\x{2661}U"; # I♡U
      
      # $test_string3 does contain a wide character. (Consequently, its
      # internal UTF-8 flag is set.)
      
    • Always use ASCII for identifiers, variable names and URLs under your control.

    • Don’t rely on the documentation of CPAN modules when it comes to Unicode. The documentation is often wrong, or uses different definitions of words than you might expect. Create a test for the module to find out exactly what happens.

    David Lowe

     
  9. Geomob tonight!

    This evening we’ll once again be heading off to #geomob, the regular geo innovation event Lokku (Nestoria’s parent company) organises.

    What is #geomob?

    The evening features 5-6 speakers. Each speaker has 10-15 minutes to convey his or her story and then a few minutes for questions from the audience. Anyone with an interest in anything geo is welcome to attend. The event is free and there is no need to register. Our target audience of speakers and attendees is anyone doing interesting things in the geo / location based services space - with a strong emphasis on the “doing”. Geomob is not an event for PR or marketing people, it’s much more aimed at those actually getting their hands dirty.

    The full line-up for tonight is detailed over on the #geomob blog. As usual the speakers are a good mix of big companies (this time the biggest name on the roster is the Ordnance Survey), academics, hobbyists, and start-ups. The event format has been very consistent over the last few years: each speaker has 10-15 minutes to convey his or her story and then a few minutes for questions from the audience. I’m proud to say that the list of past geomob speakers is now not only lengthy but also impressive - it has become a who’s who of the geo world in London.

    As important as the talks and questions is the discussion in the pub following the end of the talks. I strongly encourage all attendees to stay for that portion of the evening. It’s where the real interaction happens, and to encourage a free flow of ideas the first few rounds of drinks are free.

    So why do we do this?

    Geomob was started by Chris Osborne (one time Lokku intern!) in 2008. We took over the operation in 2010 when Chris got too busy, and I’m proud to report that it has grown and grown. We’re currently at six events a year each with five or six speakers, which says a lot about the sheer quantity of people doing interesting things in geo in and around London.

    The whole thing is only made possible by the sponsors. We’re very fortunate to have found great venue partners: we rotate between UCL in Bloomsbury, Campus in Shoreditch, and the BCS in Covent Garden. The drinks are paid for by Lokku (via our OpenCage brand), knowwhere consulting, and SplashMaps - many thanks to them.

    What does the future hold for #geomob?

    One recent addition to the format has been the introduction of the SplashMaps best speaker award, voted on by the crowd via a show of hands at the end of the evening. Interestingly, the winner is not always the speaker with the most innovative topic - I’m often impressed by how someone can turn seemingly unimpressive material into a riveting talk (on the flipside, I also concede that I’m occasionally disappointed to see a speaker turn great material into a boring session).

    In terms of focus, #geomob will very much be more of the same; anyone doing something interesting in geo is welcome to speak, and I see no reason to fiddle with the format. 10-15 minutes per speaker has proven to be just enough to cover a topic and short enough that it leaves the listener wanting more. It also allows us to feature 5 or 6 speakers in an evening, which leads to a good mix of topics. London continues to produce a never-ending stream of interesting location-based topics, though we would also very much welcome any speakers from further afield who happen to be in town.

    One weakness of the geo community is that it has not traditionally been a very diverse one, particularly with regards to gender diversity. The past #geomob speaker list reflects this unfortunate reality, but I can say we are making a strong effort to address it, and would particularly welcome, and will seek out, speaker applications from outside the traditional “neogeo” demographic. It is not a problem we will solve in one day, but we are working on it. All help or contributions are welcome. One thing that will stay the same is the focus on doers, and we will do our best to avoid blatant marketing or PR pitches.

    Hopefully this post has got you interested. The best way to get a sense of it all is of course to simply come along tonight or to future geomobs (the next one after this evening is 16th of September).

    The best (and indeed currently only) way to stay up to date is to follow the geomob twitter account or read the geomob blog.

    We look forward to seeing you at future #geomobs.

    Ed (@freyfogle)

     
  10. Module of the month July 2014: Sort::Key

    Welcome to another Module of the Month blog post, a recurring post in which we highlight particular modules, projects or tools that we use here at Nestoria.

    This month’s module is Sort::Key, which is a fantastic tool to add to your toolbox. It follows the UNIX philosophy of doing one thing and doing it very, very well.

    The tag-line is “the fastest way to sort anything in Perl” and that pretty much sums it up. This is the Schwartzian Transform on steroids ;-)
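
    The basic interface exports keysort, nkeysort and friends, which compute the sort key for each element only once. A quick sketch (the listings here are made up):

    use Sort::Key qw(keysort nkeysort);

    my @listings = (
        { title => 'Flat in Soho',      price => 450_000 },
        { title => 'House in Woolwich', price => 325_000 },
    );

    # numeric sort by price; the key-generating block sees each element as $_
    my @by_price = nkeysort { $_->{price} } @listings;

    # lexical (string) sort by title
    my @by_title = keysort { $_->{title} } @listings;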

    The distribution also ships with Sort::Key::Multi, Sort::Key::Maker, and Sort::Key::Natural which are very useful for creating more complicated sorting functions.

    For example, for dealing with our average house price data, which has keys like 2014_m2, we can use:

    use Sort::Key::Maker i_i_keysort => qw(integer integer);
    
    my @sorted = i_i_keysort { m/^(\d\d\d\d)_\w(\d+)$/ } @keys;
    

    This code is not only faster than any other method, it’s also very readable. I love it!

    To be honest though, a big part of why we chose it to be this month’s winner was because we wanted some way to give back to Salvador Fandiño. SALVA recently released his 100th CPAN distribution!

    Salvador’s recent post on blogs.perl.org called On Giving Back is a great read. He spells out exactly the reason for our company-sponsored Gittip tipping, and our sponsorship of events like YAPC::EU.

    So congratulations Salvador Fandiño, and thanks for your fantastic work on CPAN :-)

     