Project Development: Reading and Writing Chinese Characters with Java

Wednesday, October 2, 2013

Reading and Writing Chinese Characters with Java

Today I found a web page with a tremendous amount of information I need, so I fetched it by viewing its HTML source code and then extracted the information from it.

Obviously, the first step in this task is being able to read the source code and then write the extracted information to the target file. Since the page is aimed mainly at Chinese-speaking readers, most of it is written in Chinese, in an encoding I had barely paid attention to.

I first read the source file, simply using BufferedReader with FileReader, and printed it with System.out. What I got was a few English characters mixed with question marks and other symbols. After a few Google searches, I learned from other programmers that the default output encoding in Eclipse (the IDE I mostly use for Java) was "MacRoman", not the "UTF-8" I had expected. But even after I changed the output encoding under "Run Configurations -> Common -> Encoding", I still got the same weird symbols (some of which I had never seen in my life). At that point I realized that the messy symbols were already being produced when I read the Chinese characters from the file: FileReader, as I was using it, decodes with the platform default encoding rather than the file's actual one. So I started to use:

   BufferedReader in = new BufferedReader(new InputStreamReader(
    new FileInputStream("input.html"), "UTF-8"));

Problem solved.
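
To see why the explicit "UTF-8" argument matters, here is a minimal sketch that decodes the same bytes twice, once with the right charset and once with the wrong one. The class name DecodeDemo and the helper decode are made up for illustration, and ISO-8859-1 is only a stand-in for a mismatched platform default such as MacRoman (which is not guaranteed to be available on every JVM).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original program.
public class DecodeDemo {
    // Turn raw bytes into text using the given charset.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as Unicode escapes so the
        // encoding of this source file itself cannot interfere.
        String original = "\u4e2d\u6587";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in recovers the text.
        String right = decode(utf8Bytes, StandardCharsets.UTF_8);
        // Decoding with a different charset (assumed stand-in for a wrong
        // default) turns each of the six bytes into its own junk character.
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false
        System.out.println(wrong.length());         // 6
    }
}
```

The bytes in the file never change; only the charset used to interpret them does, which is why changing the console encoding alone could not repair text that had already been decoded incorrectly.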

Following the same idea, I chose PrintStream to write the extracted information to a file.

        PrintStream out = new PrintStream("output.html", "UTF-8");

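Note that PrintWriter works here only if it is also given an explicit encoding, for example new PrintWriter("output.html", "UTF-8"); created without one, it falls back to the platform default encoding and produces the same messy symbols.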
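
The reading and writing snippets above can be combined into a small round-trip check. This is only a sketch: the class RoundTripDemo, its two helper methods, and the temporary file are assumptions for illustration, not the code I actually ran on the page.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;

// Hypothetical demo class: write a line as UTF-8, then read it back as UTF-8.
public class RoundTripDemo {
    // Write one line to the file, encoding it explicitly as UTF-8.
    public static void writeUtf8(File file, String line) throws IOException {
        try (PrintStream out = new PrintStream(file, "UTF-8")) {
            out.println(line);
        }
    }

    // Read the first line of the file, decoding it explicitly as UTF-8.
    public static String readFirstLineUtf8(File file) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("roundtrip", ".html");
        file.deleteOnExit();

        String text = "\u4f60\u597d"; // two Chinese characters as Unicode escapes
        writeUtf8(file, text);

        // The text survives because both sides agree on the encoding.
        System.out.println(text.equals(readFirstLineUtf8(file))); // true
    }
}
```

The try-with-resources blocks (Java 7 and later) close the streams even if something goes wrong, so the file is fully flushed before it is read back.
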
Key Points:

  • If something weird is printed, suspect the encoding first.
  • Prefer APIs that let you specify the encoding explicitly; they make this problem much easier to handle.