Wednesday, December 12, 2007

Do names belong in a URL?

Dear Lazyweb.

Imagine a nice RESTful interface for working with Tags. The URL:
/tags/
will return a list of all the tags.

The URL:
/tags/foo/
will return a list of all the items that are associated with the tag "foo".

Or should it?

What happens when you may have tags in different languages? Is something like this:
/tags/日本語のタグ/
possible or even desirable? (These characters were copied from a spam email; I have no idea what they say.)

Should the tag collection be accessed by id, rather than name? E.g.:
/tags/1/
This is uglier, but more usable across languages and character sets.

Hmmm. What do I do....?

13 comments:

Anonymous said...

do it both

Ferry Boender said...

Whether Unicode, or any encoding other than Latin-1 (ISO-8859-1), is allowed in URLs is dubious at best.

"In the case of non-ISO-8859-1 characters (characters above FF hex/255 decimal in the Unicode set), they just can not be used in URLs, because there is no safe way to specify character set information in the URL content yet [RFC2396.]" -- blooberry.com - URL encoding

However, the RFC doesn't dismiss other encodings outright. It merely says it's undefined how it should behave:

"Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC2277]. However, there is currently no provision within the generic URI syntax to accomplish this identification. An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used." -- RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax

An updated RFC says:

"In some cases, the internal interface between a URI component and the identifying data that it has been crafted to represent is much less direct than a character encoding translation. For example, portions of a URI might reflect a query on non-ASCII data, or numeric coordinates on a map. Likewise, a URI scheme may define components with additional encoding requirements that are applied prior to forming the component and producing the URI.

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2"."
-- RFC 3986: Uniform Resource Identifier (URI): Generic Syntax
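As a rough illustration of the encoding rule quoted above, here is a minimal Python sketch using the standard library's urllib.parse. The tag URL layout is the hypothetical one from the post, not anything the RFC prescribes.

```python
from urllib.parse import quote, unquote

# The three characters used as examples in the RFC 3986 passage:
# A, LATIN CAPITAL LETTER A WITH GRAVE, KATAKANA LETTER A.
examples = ["A", "\u00c0", "\u30a2"]

# quote() encodes to UTF-8 by default, then percent-encodes every byte
# outside its safe set; safe="" means "/" is escaped as well.
encoded = [quote(ch, safe="") for ch in examples]
print(encoded)  # ['A', '%C3%80', '%E3%82%A2']

# A hypothetical tag URL built the same way, plus the reverse step.
tag_name = "\u65e5\u672c\u8a9e\u306e\u30bf\u30b0"  # the Japanese tag from the post
tag_url = "/tags/" + quote(tag_name, safe="") + "/"
decoded_name = unquote(tag_url.split("/")[2])
```

The round trip only works because both ends agree on UTF-8, which is exactly the assumption the older RFC could not make.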

The big question is: how will different browsers and clients behave, since they don't adhere strictly to most standards anyway? And is having octet-encoded URLs really that much cleaner than ids?

Personally, I'd go with ids; especially if they are the primary key for the data.

Harry Fuecks said...

> What happens when you may have tags in different languages?

IMO it's still unwise to put anything other than 7-bit ASCII in a URL.

For starters, client support is mixed. The newest browsers are getting better at supporting Unicode in URLs while helping to prevent phishing, but there are still tricky issues: e.g. when you type a URL into your address bar, your browser has no idea what encoding the site it's about to visit is using, so it will likely select its default encoding, which may be wrong. Meanwhile, popular email clients seem to lag browsers by a generation, so the likelihood that a URL containing Unicode chars survives copying, pasting, sending and receiving is quite low.
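To make the encoding mismatch concrete, here is a small Python sketch. The tag name is made up, and the choice of Latin-1 versus UTF-8 is only an assumption about what a client and a server might each default to.

```python
from urllib.parse import quote, unquote

name = "caf\u00e9"  # hypothetical tag name with one non-ASCII character

# The same name percent-encoded under two different charsets.
as_utf8 = quote(name, encoding="utf-8")      # 'caf%C3%A9'
as_latin1 = quote(name, encoding="latin-1")  # 'caf%E9'

# A server that assumes UTF-8 cannot recover the Latin-1 form:
# the invalid byte is replaced with U+FFFD.
wrong = unquote(as_latin1, encoding="utf-8", errors="replace")

# Decoding with the charset the client actually used works fine.
right = unquote(as_latin1, encoding="latin-1")
```

Nothing in the URL itself says which of the two forms was intended, so the server can only guess.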

It's also possible that putting user-generated Unicode in URLs opens you up to XSS exploits and perhaps server-side command injection (does a filtering regex designed for ASCII input really protect you?).

Lennart Regebro said...

I'd use the ID. But I'd generate the ID in a URL-friendly way from the name. In the case of names with purely non-ASCII characters, that means the IDs would probably end up purely numeric, of course.
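One way this could look, as a hedged Python sketch: url_friendly_id is a hypothetical helper, and the numeric fallback is assumed to come from wherever the tag is stored. It does not try to guarantee uniqueness.

```python
import re
import unicodedata


def url_friendly_id(name: str, numeric_id: int) -> str:
    """Derive an ASCII-only identifier from a tag name (hypothetical helper)."""
    # Decompose accented letters into base letter + combining mark,
    # then drop everything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of anything other than letters and digits into single dashes.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")
    # Names with no ASCII-representable characters fall back to the number.
    return slug or str(numeric_id)
```

For example, url_friendly_id("Foo Bar!", 3) gives "foo-bar", while a purely Cyrillic or Japanese name falls through to the numeric id.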

Marius said...

Use names, if you care about human friendliness. del.icio.us manages, here's an example: All items tagged математика.

Dalius said...

Use names. Usually language-specific characters are important only to natives, and they can type them. It is easier to tell your friend "check tag Klaipėda at del.icio.us" (http://del.icio.us/tag/Klaip%C4%97da) than "check tag id 1456481". There is no problem typing international characters in a URL using Firefox or IE. Non-natives will copy and paste anyway.

Matt said...

What you're talking about, /tags/foo, is more about "pretty" URLs than being RESTful.

Why not allow both ids and names with parameters: "/tags?name=foo", "/tags?id=3201"?

I'm not saying pretty URLs are bad; but there are problems with them. They tend to be interface limiters since they only work when the number of arguments and their positions are enough to distinguish between the possible handlers.

An example of this ambiguity: let's assume Alice tags some items by area code 540. When she navigates to /tags/540 should she be given all items tagged as 540 or all items tagged by the tag with an id of 540?
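A minimal Python sketch of the query-parameter variant, assuming the "/tags?name=..." and "/tags?id=..." forms suggested above. parse_tag_query is a hypothetical helper, not part of any framework.

```python
from urllib.parse import urlsplit, parse_qs


def parse_tag_query(url: str):
    """Return ("id", int) or ("name", str) for a /tags?... URL (hypothetical scheme)."""
    query = parse_qs(urlsplit(url).query)
    if "id" in query:
        return ("id", int(query["id"][0]))
    if "name" in query:
        # parse_qs has already percent-decoded the value as UTF-8.
        return ("name", query["name"][0])
    raise ValueError("expected a 'name' or 'id' parameter")
```

With this, "/tags?name=540" and "/tags?id=540" resolve to different lookups, so Alice's area-code tag is no longer ambiguous.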

Steve Lewis said...

or /tags/?encoding/?name.

giyokun said...

I would advise you to remove the bit of Japanese you added in that post, as it is highly offensive; your pagerank could suffer and your page might get added to automatic filtering software.

Instead use the following for example:
日本語のタグ 
which means
tag in Japanese

Simon Wittber said...

giyokun: thanks for the tip.

Anonymous said...

I take the Unicode, convert it to UTF-8 and then urlencode the result (so you end up with %xx for many bytes). Fortunately I am in complete control of both ends, so I can ensure it works well and has extensive test coverage. Unfortunately there are differences between what the standards say and what browsers do (try entering Unicode characters in a form!). If you want to be ASCII-friendly then I'd recommend returning a GUID as part of lookups and letting things be accessed by name or by GUID.
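A rough Python sketch of the "name or GUID" idea. TagStore and its methods are invented for illustration, and it assumes no tag name is ever identical to a GUID string.

```python
import uuid


class TagStore:
    """In-memory store where a tag can be looked up by name or by GUID (hypothetical)."""

    def __init__(self):
        self._name_by_guid = {}
        self._guid_by_name = {}

    def add(self, name: str) -> str:
        # The GUID is pure ASCII, so it is always safe to put in a URL.
        guid = str(uuid.uuid4())
        self._name_by_guid[guid] = name
        self._guid_by_name[name] = guid
        return guid

    def lookup(self, key: str) -> dict:
        """Accept either a tag name or a GUID and return both."""
        # If the key is a known name, translate it; otherwise treat it as a GUID.
        guid = self._guid_by_name.get(key, key)
        name = self._name_by_guid[guid]  # raises KeyError for unknown keys
        return {"guid": guid, "name": name}
```

Every lookup returns the GUID alongside the name, so an ASCII-only client can keep using the GUID from then on.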

Glyph Lefkowitz said...

Consider using internationalized resource identifiers (IRIs):

http://www.ietf.org/rfc/rfc3987.txt

mksoft said...

The problem is that those URL links are quite ugly, and pasting them in various pages/emails breaks the layout.

For an app I'm developing (Lahak), the tags have an id plus two name fields:

id : integer
tag_name: the name in the native language (for display purposes)
url_name: URL-friendly name (for links). Only English letters, digits, underscores and dashes allowed.
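As a sketch only, the fields above might map onto something like the following Python dataclass. The class, the validation regex and the "/tags/<url_name>/" link layout are assumptions for illustration, not Lahak's actual code.

```python
import re
from dataclasses import dataclass

# Only English letters, digits, underscores and dashes, per the field description.
URL_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


@dataclass(frozen=True)
class Tag:
    id: int
    tag_name: str  # native-language name, for display
    url_name: str  # URL-friendly name, for links

    def __post_init__(self):
        # Reject anything that would need percent-encoding in a URL path.
        if not URL_NAME_PATTERN.fullmatch(self.url_name):
            raise ValueError(f"url_name is not URL-friendly: {self.url_name!r}")

    def link(self) -> str:
        # The "/tags/<url_name>/" layout is an assumption borrowed from the post.
        return f"/tags/{self.url_name}/"
```

The display name can then be any Unicode string, while links only ever contain the restricted url_name.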
