Archive for the ‘unicode’ Category.

java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence

Hi,

I have taken a short break! There are many more error messages that keep accumulating behind the scenes, but I have been too lazy to put them on grassfield.

Anyway, today I got interrupted by an interesting exception:

java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence

The scenario: I was trying to parse an XML string. I took the byte array from the XML string and gave that array as input to the XML reader's stream. I used java.lang.String.getBytes() for this.

Unfortunately, one node in the XML had Chinese (or some other non-Latin) characters as its value. Oof. I ended up with the above error. Later, I found that getBytes() with no argument encodes with the platform's default charset, which on my machine was a Western encoding and not UTF-8, while the XML parser assumes UTF-8 when the document declares no encoding. So by using java.lang.String.getBytes("UTF-8") instead, we solved the issue! Nice, no?
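Here is a minimal sketch of that fix, assuming the XML arrives as a plain String and is parsed with the JDK's DOM parser. The class name Utf8BytesDemo and the helper rootText are made up for illustration; they are not from my original code.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class, not taken from the original project.
class Utf8BytesDemo
{
    // Encodes the XML text as UTF-8 bytes, parses them, and returns
    // the text content of the root element.
    static String rootText(String xml)
    {
        try
        {
            // Explicit charset instead of the platform default.
            byte[] bytes = xml.getBytes("UTF-8");
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        }
        catch (Exception e)
        {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args)
    {
        // Two Chinese characters, written as Unicode escapes so that
        // this source file itself stays plain ASCII.
        String xml = "<name>\u4e2d\u6587</name>";
        System.out.println(rootText(xml).equals("\u4e2d\u6587")); // prints true
    }
}
```

If you swap getBytes("UTF-8") for the no-argument getBytes() on a machine whose default charset is not UTF-8, the parser may see bytes it cannot decode, which is roughly the situation I ran into.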

Fun with Text: java.util.Scanner


One of my friends came in with Sun's newsletter this morning. I was struck by their demo of a new class, java.util.Scanner.
See, parsing a string becomes very simple, like iterating over a list.

Scanner accepts streams, files, strings and other input sources, parses the text and gives us the tokens. (It also lets the user specify which encoding the text was written in. Good news for localisation guys like me.) By default, whatever you give it is tokenized on the default delimiter, whitespace. See the following example:

import java.util.*;
import java.io.*;
public class test
{
public static void main(String [] args) throws FileNotFoundException
{
File f = new File("test.java");
Scanner scanner = new Scanner(f);
while (scanner.hasNext())
{
System.out.println(scanner.next());
}
scanner.close();
}
}

The output is

C:\>java test
import
java.util.*;
import
java.io.*;
public
class
test
{
public
static
void
main(String
[]
args)
throws
FileNotFoundException
{
File
f
=
new
File("test.java");
Scanner
scanner
=
new
Scanner(f);
while
(scanner.hasNext())
{
System.out.println(scanner.next());
}
scanner.close();
}
}

Funny, isn't it!

We can also change the delimiter. See the following example:

import java.util.*;
import java.io.*;
public class test
{
public static void main(String [] args) throws FileNotFoundException
{
File f = new File("test.java");
Scanner scanner = new Scanner(f);
scanner.useDelimiter("\n");
while (scanner.hasNext())
{
System.out.println(scanner.next());
}
scanner.close();
}
}

This time the output is the same as the source code itself, one line per token. Want to see that one too?

C:\>java test
import java.util.*;
import java.io.*;
public class test
{
public static void main(String [] args) throws FileNotFoundException
{
File f = new File("test.java");
Scanner scanner = new Scanner(f);
scanner.useDelimiter("\n");
while (scanner.hasNext())
{
System.out.println(scanner.next());
}
scanner.close();
}
}

Really good one! But I really miss the fun of using the streams :(
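Since I mentioned the encoding support, here is a small sketch of it, assuming the input is a UTF-8 byte stream. The class ScannerCharsetDemo and its tokens helper are hypothetical names for illustration. It uses the Scanner(InputStream, String) constructor to name the charset, and the delimiter "\\r?\\n" rather than "\n" so that Windows line endings do not leave a stray carriage return on each token.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class for illustration only.
class ScannerCharsetDemo
{
    // Reads the stream in the given charset and splits it on the
    // given delimiter regex, returning the tokens in order.
    static List<String> tokens(InputStream in, String charsetName, String delimiterRegex)
    {
        Scanner scanner = new Scanner(in, charsetName);
        scanner.useDelimiter(delimiterRegex);
        List<String> result = new ArrayList<String>();
        while (scanner.hasNext())
        {
            result.add(scanner.next());
        }
        scanner.close();
        return result;
    }

    public static void main(String[] args)
    {
        // A Tamil word (as Unicode escapes) followed by ASCII text,
        // with a Windows style line break in the middle.
        String text = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd text\r\nsecond line";
        InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        List<String> lines = tokens(in, "UTF-8", "\\r?\\n");
        System.out.println(lines.size()); // prints 2
    }
}
```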

BOM – Byte Order Mark

Hi,
When files are saved from Notepad-like MS applications, a byte order mark (BOM) may be added at the very beginning of your file. Editors handle it so that you can't see it. But it may put you in trouble when you are reading those files yourself (like an IniFile, or any other static file). Either write and read the file yourself with the *writeUTF()* and *readUTF()* methods of the data I/O streams, which do not add a BOM, or check for the character *(char) 65279* (U+FEFF) at the beginning of your file. That is what the UTF-8 BOM decodes to; for other Unicode encodings the bytes on disk are different, and what you get depends on the decoder.
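The check for (char) 65279 can be sketched like this, assuming the file content has already been decoded as UTF-8 into a String. BomDemo and stripBom are made-up names for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
class BomDemo
{
    // The byte order mark as a Java char, U+FEFF.
    static final char BOM = (char) 65279;

    // Returns the text without a leading BOM, or unchanged if there is none.
    static String stripBom(String text)
    {
        if (text.length() > 0 && text.charAt(0) == BOM)
        {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args)
    {
        // EF BB BF is the UTF-8 byte order mark, followed by "key=value".
        byte[] fileBytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                'k', 'e', 'y', '=', 'v', 'a', 'l', 'u', 'e'};
        String raw = new String(fileBytes, StandardCharsets.UTF_8);
        System.out.println(raw.length());           // prints 10
        System.out.println(stripBom(raw).length()); // prints 9
        System.out.println(stripBom(raw));          // prints key=value
    }
}
```

Java's UTF-8 decoder keeps the BOM as an ordinary character, which is why the first key of an ini-style file can silently fail to match unless you strip it.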

For more info, please search for BYTE ORDER MARK on Google :)

As you may already know, when you open a file containing Unicode characters, they may be displayed as question marks or junk characters in EditPlus, TextPad or any other editor that is not Unicode-ready. Notepad and WordPad work fine.

—————————————————
*Free* software is a matter of liberty not price. You should think of “free” as in “free speech”.

Chris Holland: The Blog.: ServletRequest.getParameter and UTF-8

How to display Tamil text with Java 5 (applets and Java Swing based applications)

Java 1.3+ should display Tamil without altering anything, but the JRE is
not officially configured for Tamil so far (Devanagari has been added).
So you will see only boxes in applets or any other Swing applications.
Here is the procedure to see Tamil characters.

This assumes you have the Latha font installed.

Open the Java directory, then go into the ‘jre’ directory (the file
usually sits in its ‘lib’ subdirectory). Copy and paste the
“fontconfig.properties.src” file, and rename the copy to
“fontconfig.properties”. This file should then be the one Java uses
by default. Now open the “fontconfig.properties” file.

Find this line :

# Component Font Mappings

Then add :

allfonts.tamil=Latha

You must then find

sequence.allfonts=alphabetic/default,dingbats,symbol

and add tamil (I am not sure whether this is really needed; it
worked without this entry):
sequence.allfonts=alphabetic/default,dingbats,symbol,tamil

Add the line

sequence.allfonts.UTF-8.ta=alphabetic/1252,tamil,dingbats,symbol

Restart any Java applications you are using, and restart the
browser (deleting temporary files may help if the change is not reflected).

Post a message if anything didn't work as expected.

-p.