Comments on Entity Crisis: Do names belong in a URL?
2023-08-07T22:48:57.800+08:00

2007-12-17T20:48:00.000+09:00
Problem is that those URL links are quite ugly, and pasting them into various pages/emails breaks the layout.<BR/><BR/>For an app I'm developing (Lahak), the tags have three fields:<BR/><BR/>id: integer<BR/>tag_name: the name in the native language (for display purposes)<BR/>url_name: URL-friendly name (for links). Only English letters, digits, underscores and dashes are allowed.
Anonymous

2007-12-13T16:11:00.000+09:00
Consider using internationalized resource identifiers (IRIs):<BR/><BR/>http://www.ietf.org/rfc/rfc3987.txt
glyph (https://www.blogger.com/profile/07021175796928101086)

2007-12-13T11:35:00.000+09:00
I take the Unicode, convert it to UTF-8 and then URL-encode the result (so you end up with %xx for many bytes). Fortunately, I am in complete control of both ends, so I can ensure it works well and has extensive test coverage. Unfortunately, there are differences between what the standards say and what browsers do (try entering Unicode characters in a form!)
If you want to be ASCII-friendly, then I'd recommend returning a GUID as part of lookups and letting things be accessed by name or by GUID.
Anonymous

2007-12-13T09:21:00.000+09:00
giyokun: thanks for the tip.
Simon Wittber (https://www.blogger.com/profile/02730025645144151014)

2007-12-13T05:12:00.000+09:00
I would advise you to remove the bit of Japanese you added in that post, as it is highly offensive and therefore your PageRank could suffer and your page might get added to automatic filtering software.<BR/><BR/>Instead, use the following, for example: <BR/>日本語のタグ <BR/>which means <BR/>tag in Japanese
Anonymous

2007-12-13T00:55:00.000+09:00
or /tags/?encoding/?name.
Steve Lewis (https://www.blogger.com/profile/02582792183339096661)

2007-12-12T23:18:00.000+09:00
What you're talking about, <I>/tags/foo</I>, is more about "pretty" URLs than being <A HREF="http://en.wikipedia.org/wiki/Representational_State_Transfer" REL="nofollow">RESTful</A>.<BR/><BR/>Why not allow both ids and names with parameters: "/tags?name=foo", "/tags?id=3201"?<BR/><BR/>I'm not saying pretty URLs are bad, but there are problems with them. They tend to be interface limiters, since they only work when the number of arguments and their positions are enough to distinguish between the possible handlers.<BR/><BR/>An example of this ambiguity: let's assume Alice tags some items by area code <I>540</I>.
When she navigates to <I>/tags/540</I>, should she be given all items tagged as <I>540</I> or all items tagged with the tag whose id is <I>540</I>?
Matt (https://www.blogger.com/profile/00294597858824231202)

2007-12-12T21:30:00.000+09:00
Use names. Usually language-specific characters are important only to natives, and they can type them. It is easier to tell your friend "check tag Klaipėda at del.icio.us" (http://del.icio.us/tag/Klaip%C4%97da) than "check tag id 1456481". There is no problem typing international characters in a URL using Firefox or IE. Non-natives will copy and paste anyway.
Dalius (https://www.blogger.com/profile/04656237796151685377)

2007-12-12T19:38:00.000+09:00
Use names, if you care about human friendliness. del.icio.us manages; here's an example: <A HREF="http://del.icio.us/tag/%D0%BC%D0%B0%D1%82%D0%B5%D0%BC%D0%B0%D1%82%D0%B8%D0%BA%D0%B0" REL="nofollow">All items tagged математика</A>.
Marius Gedminas (https://www.blogger.com/profile/15155998626202067226)

2007-12-12T19:04:00.000+09:00
I'd use the ID. But I'd generate the ID in a URL-friendly way from the name.
In the case of names with purely non-ASCII characters, that means the names probably would end up purely numeric, of course.
Lennart Regebro (https://www.blogger.com/profile/08337807480455483637)

2007-12-12T18:14:00.000+09:00
> What happens when you may have tags in different languages?<BR/><BR/>IMO it's still unwise to put any encoding other than ASCII (7-bit) in a URL.<BR/><BR/>For starters, client support is mixed - the newest browsers are getting better at supporting Unicode in URLs while helping to prevent phishing, but there are still tricky issues, e.g. when you type a URL into your address bar, your browser has no idea what encoding the site it's about to visit is using, so it will likely select its default encoding, which may be wrong. Meanwhile, popular email clients seem to lag browsers by a generation, so the likelihood that a URL containing Unicode chars survives copying, pasting, sending and receiving is quite low.<BR/><BR/>It's also possible that putting user-generated Unicode in URLs opens you up to potential XSS exploits and perhaps server-side command injection (does a filtering regex designed for ASCII input really protect you?)
hfuecks (https://www.blogger.com/profile/12313879751453031158)

2007-12-12T18:13:00.000+09:00
Whether Unicode or any encoding other than Latin-1 (ISO-8859-1) is allowed in URLs is dubious at best.
<BR/><BR/><I>"In the case of non-ISO-8859-1 characters (characters above FF hex/255 decimal in the Unicode set), they just can not be used in URLs, because there is no safe way to specify character set information in the URL content yet [RFC2396.]"</I> -- <A HREF="http://www.blooberry.com/indexdot/html/topics/urlencoding.htm" REL="nofollow">blooberry.com - URL encoding</A><BR/><BR/>However, the RFC doesn't dismiss other encodings outright. It merely leaves the behavior undefined:<BR/><BR/><I>"Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC2277]. However, there is currently no provision within the generic URI syntax to accomplish this identification. An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used."</I> -- RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax<BR/><BR/>An updated RFC says:<BR/><BR/><I>"In some cases, the internal interface between a URI component and the identifying data that it has been crafted to represent is much less direct than a character encoding translation. For example, portions of a URI might reflect a query on non-ASCII data, or numeric coordinates on a map. Likewise, a URI scheme may define components with additional encoding requirements that are applied prior to forming the component and producing the URI.<BR/><BR/>When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded.
For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2"."</I> -- RFC 3986: Uniform Resource Identifier (URI): Generic Syntax<BR/><BR/>The big question is: how will different browsers/clients behave, since they don't adhere strictly to most standards anyway? And is having octet-encoded URLs really that much cleaner than ids?<BR/><BR/>Personally, I'd go with ids, especially if they are the primary key for the data.
Anonymous

2007-12-12T16:47:00.000+09:00
Do both.
Anonymous
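
For anyone who wants to try the rule quoted from RFC 3986 earlier in this thread (encode the name as UTF-8, then percent-encode every octet outside the unreserved set), here is a minimal sketch using Python's standard `urllib.parse.quote`. The helper names `tag_url` and `tag_name` and the `/tags/` prefix are assumptions made for illustration; nobody in the thread specified them, and this says nothing about how individual browsers actually behave.

```python
from urllib.parse import quote, unquote


def tag_url(name: str, prefix: str = "/tags/") -> str:
    """Build a URL path for a tag name (hypothetical helper).

    The name is encoded as UTF-8 and every byte outside the unreserved
    set is percent-encoded; safe="" means "/" is escaped as well.
    """
    return prefix + quote(name, safe="", encoding="utf-8")


def tag_name(path: str, prefix: str = "/tags/") -> str:
    """Recover the tag name from a path built by tag_url."""
    if not path.startswith(prefix):
        raise ValueError(f"not a tag path: {path!r}")
    return unquote(path[len(prefix):], encoding="utf-8")


if __name__ == "__main__":
    # The three characters from the RFC 3986 quote, plus the two
    # del.icio.us tags mentioned in the comments.
    for name in ["A", "\u00c0", "\u30a2", "Klaipėda", "математика"]:
        url = tag_url(name)
        assert tag_name(url) == name  # the encoding round-trips
        print(url)  # the path itself is pure ASCII
```

The resulting paths are plain ASCII, so they survive copying into email, while the server can still decode them back to the original tag name.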