Chinese Text Recognition with tesseract-ocr

https://github.com/tesseract-ocr/tesseract

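Introduction

Tesseract is an OCR engine written in C++. It ships as a library, libtesseract, and as a command-line program, tesseract. Later it could be integrated into a system for information conversion and data extraction and wrapped as an API offered to other services. For now I only need it briefly, and the goal of this article is to get Chinese text recognition working.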

Brief history

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it is developed by Google.

The latest (LSTM based) stable version is 4.1.1, released on December 26, 2019. Latest source code is available from master branch on GitHub. Open issues can be found in issue tracker, and Planning wiki.

The latest 3.0x version is 3.05.02, released on June 19, 2018. Latest source code for 3.05 is available from 3.05 branch on GitHub. There is no development for this version, but it can be used for special cases (e.g. see Regression of features from 3.0x).

See Release Notes and Change Log for more details of the releases.

Installation

Supported Compilers are:

GCC 4.8 and above
Clang 3.4 and above
MSVC 2015, 2017, 2019

Other compilers might work, but are not officially supported.

Installing precompiled binaries

macOS Homebrew

brew install tesseract

Training directories can be found using brew list tesseract. A possible location is /usr/local/Cellar/tesseract/3.05.02/share/tessdata/

Ubuntu

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev

Note for Ubuntu users: In case apt is unable to find the package try adding universe entry to the sources.list file as shown below.

sudo vi /etc/apt/sources.list

Copy the first line "deb http://archive.ubuntu.com/ubuntu bionic main" and paste it as shown below on the next line.
If you are using a different release of ubuntu, then replace bionic with the respective release name.

deb http://archive.ubuntu.com/ubuntu bionic universe

Raspbian

CentOS 8

dnf config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_8/
rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
dnf install tesseract
dnf install tesseract-langpack-deu

CentOS 7

yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7/
sudo rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
yum update
yum install tesseract 
yum install tesseract-langpack-deu

Using the Tesseract command line for text recognition and extraction

tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles…]
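
Since the introduction mentions wrapping Tesseract behind an API later, the usage line above can be sketched as a small Python helper that assembles the argument list. This is only a sketch under a few assumptions: the function names build_tesseract_command and run_tesseract are made up for illustration, the tesseract binary is assumed to be on PATH, and the double-dash spellings --oem and --psm are the ones used by recent (4.x) releases.

```python
import subprocess


def build_tesseract_command(image, outputbase, lang=None, oem=None,
                            psm=None, configs=(), tessdata_dir=None):
    """Assemble the argument list for one tesseract invocation.

    Hypothetical helper: it mirrors the usage line
    tesseract imagename outputbase [-l lang] [--oem N] [--psm N] [configfiles]
    """
    cmd = ["tesseract"]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    cmd += [str(image), str(outputbase)]
    if lang:
        # Several languages are joined with "+", e.g. eng+deu.
        if not isinstance(lang, str):
            lang = "+".join(lang)
        cmd += ["-l", lang]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    # Config names such as pdf, hocr or tsv come last.
    cmd += list(configs)
    return cmd


def run_tesseract(*args, **kwargs):
    """Run tesseract and raise CalledProcessError if it exits non-zero."""
    cmd = build_tesseract_command(*args, **kwargs)
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

For example, build_tesseract_command("testing/eurotext.png", "testing/eurotext-eng", lang="eng", configs=["pdf"]) yields the same arguments as the PDF command shown later in this post. Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.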

A simple image recognition command

Notes:

English is the default recognition language
3 is the default page segmentation mode
Default output format: text

tesseract imagename outputb

Specifying the language data

Trained language models: https://github.com/tesseract-ocr/tessdata/
Simplified Chinese model: chi_sim.traineddata

tesseract --tessdata-dir /usr/share imagename outputbase -l eng --psm 3
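
Before running a command like this for Chinese, it can help to confirm that the model file is actually in the tessdata directory. The following is a minimal sketch, assuming each language code in the -l value maps to a file named code.traineddata directly inside that directory (as with chi_sim.traineddata above). The function name missing_traineddata is hypothetical.

```python
from pathlib import Path


def missing_traineddata(lang_spec, tessdata_dir):
    """Return the language codes in lang_spec that have no model file.

    lang_spec uses the same form as the -l option, e.g. "eng+chi_sim".
    Assumes models are stored as <code>.traineddata in tessdata_dir.
    """
    tessdata = Path(tessdata_dir)
    return [
        code
        for code in lang_spec.split("+")
        if code and not (tessdata / f"{code}.traineddata").is_file()
    ]
```

An empty list means every requested model was found. A result such as ["chi_sim"] means that file still has to be downloaded from the tessdata repository into the directory.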

Specifying the recognition language

Single language

tesseract testing/eurotext.png testing/eurotext-eng -l eng

Multiple languages

tesseract testing/eurotext.png testing/eurotext-engdeu -l eng+deu

PDF output

tesseract testing/eurotext.png testing/eurotext-eng -l eng pdf

HTML (hOCR) output

tesseract testing/eurotext.png testing/eurotext-eng -l eng hocr

Output

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name='ocr-system' content='tesseract 3.05.00dev' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
</head>
<body>
  <div class='ocr_page' id='page_1' title='image "./testing/eurotext.png"; bbox 0 0 1024 800; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 98 66 918 661">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 98 66 918 661">
     <span class='ocr_line' id='line_1_1' title="bbox 105 66 823 113; baseline 0.015 -18; x_size 39; x_descenders 7; x_ascenders 9"><span class='ocrx_word' id='word_1_1' title='bbox 105 66 178 97; x_wconf 90'>The</span> <span class='ocrx_word' id='word_1_2' title='bbox 205 67 347 106; x_wconf 87'><strong>(quick)</strong></span> <span class='ocrx_word' id='word_1_3' title='bbox 376 69 528 109; x_wconf 89'>[brown]</span> <span class='ocrx_word' id='word_1_4' title='bbox 559 71 663 110; x_wconf 89'>{fox}</span> <span class='ocrx_word' id='word_1_5' title='bbox 687 73 823 113; x_wconf 89'>jumps!</span> 
     </span>
    </p>
   </div>
  </div>
 </body>
</html>
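
The hOCR output above stores each recognized word in a span with class ocrx_word, with the bounding box and confidence packed into the title attribute. As a rough sketch of how that could be read back, the snippet below uses only Python's standard html.parser. The class name HocrWordParser is made up for illustration, and it assumes the title follows the "bbox x0 y0 x1 y1; x_wconf N" pattern seen in the sample.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect text, bbox and confidence of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word currently being read, if any
        self._depth = 0       # open tags inside the current word span

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            # Nested markup such as <strong> inside a word.
            self._depth += 1
            return
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            title = attrs.get("title") or ""
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+)", title)
            self._current = {
                "text": "",
                "bbox": tuple(int(v) for v in bbox.groups()) if bbox else None,
                "conf": int(conf.group(1)) if conf else None,
            }
            self._depth = 1

    def handle_endtag(self, tag):
        if self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # The word span itself has been closed.
            self.words.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def parse_hocr_words(hocr_text):
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

Calling parse_hocr_words on the contents of the generated .hocr file returns one dictionary per word. Tracking nesting depth keeps words such as (quick), which is wrapped in a strong tag in the sample, intact.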

TSV output (Currently available in 3.05-dev in master branch on github)

tesseract testing/eurotext.png testing/eurotext-eng -l eng tsv

level	page_num	block_num	par_num	line_num	word_num	left	top	width	height	conf	text
1	1	0	0	0	0	0	0	1024	800	-1	
2	1	1	0	0	0	98	66	821	596	-1	
3	1	1	1	0	0	98	66	821	596	-1	
4	1	1	1	1	0	105	66	719	48	-1	
5	1	1	1	1	1	105	66	74	32	90	The
5	1	1	1	1	2	205	67	143	40	87	(quick)
5	1	1	1	1	3	376	69	153	41	89	[brown]
5	1	1	1	1	4	559	71	105	40	89	{fox}
5	1	1	1	1	5	687	73	137	41	89	jumps!
4	1	1	1	2	0	104	115	784	51	-1	
5	1	1	1	2	1	104	115	96	33	91	Over
5	1	1	1	2	2	224	117	60	32	89	the
5	1	1	1	2	3	310	117	224	39	88	$43,456.78
5	1	1	1	2	4	561	121	136	42	92	<lazy>
5	1	1	1	2	5	722	123	70	32	92	#90
5	1	1	1	2	6	818	125	70	41	89	dog
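
In the TSV output above, rows with level 5 are individual words, and the block, paragraph and line numbers say which text line each word belongs to. A minimal sketch of reading this back with Python's csv module follows. The function names parse_tsv_words and group_lines are hypothetical, and conf is read as a float on the assumption that some Tesseract versions may write fractional confidences.

```python
import csv
import io


def parse_tsv_words(tsv_text):
    """Return one dict per word row (level 5) of tesseract TSV output."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # recognized text may contain quote characters
    )
    words = []
    for row in reader:
        if row["level"] != "5":
            continue  # levels 1 to 4 describe page, block, paragraph, line
        words.append({
            "line": (int(row["page_num"]), int(row["block_num"]),
                     int(row["par_num"]), int(row["line_num"])),
            "left": int(row["left"]),
            "top": int(row["top"]),
            "width": int(row["width"]),
            "height": int(row["height"]),
            "conf": float(row["conf"]),
            "text": row.get("text") or "",
        })
    return words


def group_lines(words):
    """Join words that share the same page, block, paragraph and line."""
    lines = {}
    for word in words:
        lines.setdefault(word["line"], []).append(word["text"])
    return [" ".join(texts) for texts in lines.values()]
```

Feeding the contents of the generated .tsv file to parse_tsv_words and then to group_lines rebuilds the text line by line, and the conf field makes it easy to drop low-confidence words first.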

Chinese recognition with tesseract-OCR

Link