Note: last updated on January 16, 2020. The content may be out of date, so use it with caution.
https://github.com/tesseract-ocr/tesseract
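Introduction
Tesseract is an OCR engine written in C++. It ships as a library (libtesseract)
and as a command-line program (tesseract).
In the future it could be integrated into a system for information conversion and data scraping, and wrapped as an API that serves other applications. For now I am only using it briefly; the goal of this post is to get Chinese text recognition working.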
Brief history
Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and
at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some
more changes made in 1996 to port to Windows, and some C++izing in 1998.
In 2005 Tesseract was open sourced by HP. Since 2006 it has been developed by Google.
The latest (LSTM based) stable version is 4.1.1, released on December 26, 2019.
The latest source code is available from the master branch on GitHub.
Open issues can be found in the issue tracker and the Planning wiki.
The latest 3.0x version is 3.05.02, released on June 19, 2018. Latest source code for 3.05 is available from 3.05 branch on GitHub. There is no development for this version, but it can be used for special cases (e.g. see Regression of features from 3.0x).
See Release Notes and Change Log for more details of the releases.
Installation
Supported Compilers are:
- GCC 4.8 and above
- Clang 3.4 and above
- MSVC 2015, 2017, 2019
Other compilers might work, but are not officially supported.
Installing precompiled binaries
macOS Homebrew
brew install tesseract
Training directories can be found using brew list tesseract
Possible location can be /usr/local/Cellar/tesseract/3.05.02/share/tessdata/
Ubuntu
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
Note for Ubuntu users: if apt is unable to find the package, try adding a universe entry to the sources.list file as shown below.
sudo vi /etc/apt/sources.list
Copy the first line "deb http://archive.ubuntu.com/ubuntu bionic main" and paste it as shown below on the next line.
If you are using a different release of ubuntu, then replace bionic with the respective release name.
deb http://archive.ubuntu.com/ubuntu bionic universe
Raspbian
CentOS 8
dnf config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_8/
rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
dnf install tesseract
dnf install tesseract-langpack-deu
CentOS 7
yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7/
sudo rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
yum update
yum install tesseract
yum install tesseract-langpack-deu
tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]
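When tesseract is driven from a script, the synopsis above maps naturally onto an argument list. The following is a minimal Python sketch under stated assumptions: the function names `build_tesseract_command` and `run_tesseract` and their parameter names are hypothetical, and only the options from the synopsis are covered. Actually running the command assumes the `tesseract` binary is on the PATH.

```python
import subprocess


def build_tesseract_command(image, outputbase, lang=None, oem=None, psm=None, configs=()):
    """Assemble argv for: tesseract imagename outputbase [-l lang] [--oem N] [--psm N] [configfiles...]."""
    cmd = ["tesseract", str(image), str(outputbase)]
    if lang:
        cmd += ["-l", lang]          # e.g. "eng" or "eng+deu"
    if oem is not None:
        cmd += ["--oem", str(oem)]   # OCR engine mode
    if psm is not None:
        cmd += ["--psm", str(psm)]   # page segmentation mode
    cmd += list(configs)             # config names such as "pdf", "hocr", "tsv"
    return cmd


def run_tesseract(image, outputbase, **options):
    """Run tesseract; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_tesseract_command(image, outputbase, **options), check=True)


# Building the argument list does not require tesseract to be installed.
example = build_tesseract_command(
    "testing/eurotext.png", "testing/eurotext-eng", lang="eng", psm=3, configs=["hocr"]
)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.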
Simple image recognition command
Notes:
- English is the default recognition language
- Page segmentation mode 3 is used by default
- The default output format is plain text
tesseract imagename outputbase
Specifying the language data directory
tesseract --tessdata-dir /usr/share imagename outputbase -l eng --psm 3
Specifying the recognition language
tesseract testing/eurotext.png testing/eurotext-eng -l eng
tesseract testing/eurotext.png testing/eurotext-engdeu -l eng+deu
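Because the `-l` value is just language codes joined with `+`, a script can check up front that trained data for every requested language is installed. The sketch below is a hypothetical Python helper, not part of tesseract itself. It assumes that `tesseract --list-langs` prints one header line followed by one language code per line, and it uses `chi_sim` (Simplified Chinese) only as an illustrative code. The sample listing is made up for the example rather than captured from a real installation.

```python
def parse_list_langs(output):
    """Return the set of language codes from `tesseract --list-langs` output.

    Assumption: the first non-empty line is a header and every following
    non-empty line holds exactly one language code.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return set(lines[1:])


def missing_languages(lang_arg, available):
    """Split a -l value such as "eng+deu" and return the codes not in `available`, in order."""
    return [code for code in lang_arg.split("+") if code not in available]


# Illustrative listing only.
sample_listing = "List of available languages (2):\neng\nosd\n"
available = parse_list_langs(sample_listing)
missing = missing_languages("chi_sim+eng", available)
```

With this sample listing, `missing` contains only `chi_sim`, which would mean the Chinese trained data still has to be installed before `-l chi_sim+eng` can work.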
PDF output
tesseract testing/eurotext.png testing/eurotext-eng -l eng pdf
HTML (hOCR) output
tesseract testing/eurotext.png testing/eurotext-eng -l eng hocr
Output
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name='ocr-system' content='tesseract 3.05.00dev' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "./testing/eurotext.png"; bbox 0 0 1024 800; ppageno 0'>
<div class='ocr_carea' id='block_1_1' title="bbox 98 66 918 661">
<p class='ocr_par' id='par_1_1' lang='eng' title="bbox 98 66 918 661">
<span class='ocr_line' id='line_1_1' title="bbox 105 66 823 113; baseline 0.015 -18; x_size 39; x_descenders 7; x_ascenders 9"><span class='ocrx_word' id='word_1_1' title='bbox 105 66 178 97; x_wconf 90'>The</span> <span class='ocrx_word' id='word_1_2' title='bbox 205 67 347 106; x_wconf 87'><strong>(quick)</strong></span> <span class='ocrx_word' id='word_1_3' title='bbox 376 69 528 109; x_wconf 89'>[brown]</span> <span class='ocrx_word' id='word_1_4' title='bbox 559 71 663 110; x_wconf 89'>{fox}</span> <span class='ocrx_word' id='word_1_5' title='bbox 687 73 823 113; x_wconf 89'>jumps!</span>
</span>
</p>
</div>
</div>
</body>
</html>
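The hOCR output above stores each recognized word in a span of class `ocrx_word`, with the bounding box and confidence packed into the `title` attribute. As a rough sketch of how that could be read back, the Python below uses only the standard library. The function name `hocr_words` is hypothetical, and the sample is a shortened fragment of the output above, not a complete hOCR file.

```python
import re
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"  # hOCR elements live in the XHTML namespace


def hocr_words(hocr_text):
    """Yield (text, bbox, confidence) for every ocrx_word span in an hOCR document."""
    root = ET.fromstring(hocr_text.encode("utf-8"))
    for span in root.iter(XHTML + "span"):
        if span.get("class") != "ocrx_word":
            continue  # skip ocr_line spans and anything else
        title = span.get("title", "")
        bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
        conf = re.search(r"x_wconf (\d+)", title)
        yield (
            "".join(span.itertext()).strip(),  # also collects text nested in <strong>
            tuple(int(v) for v in bbox.groups()) if bbox else None,
            int(conf.group(1)) if conf else None,
        )


sample = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<span class='ocr_line' id='line_1_1' title="bbox 105 66 823 113"><span class='ocrx_word' id='word_1_1' title='bbox 105 66 178 97; x_wconf 90'>The</span> <span class='ocrx_word' id='word_1_2' title='bbox 205 67 347 106; x_wconf 87'><strong>(quick)</strong></span>
</span>
</body>
</html>"""

words = list(hocr_words(sample))
```

Each tuple holds the word text, the bounding box as (left, top, right, bottom) pixel coordinates, and the `x_wconf` confidence value.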
TSV output (Currently available in 3.05-dev in master branch on github)
tesseract testing/eurotext.png testing/eurotext-eng -l eng tsv
level page_num block_num par_num line_num word_num left top width height conf text
1 1 0 0 0 0 0 0 1024 800 -1
2 1 1 0 0 0 98 66 821 596 -1
3 1 1 1 0 0 98 66 821 596 -1
4 1 1 1 1 0 105 66 719 48 -1
5 1 1 1 1 1 105 66 74 32 90 The
5 1 1 1 1 2 205 67 143 40 87 (quick)
5 1 1 1 1 3 376 69 153 41 89 [brown]
5 1 1 1 1 4 559 71 105 40 89 {fox}
5 1 1 1 1 5 687 73 137 41 89 jumps!
4 1 1 1 2 0 104 115 784 51 -1
5 1 1 1 2 1 104 115 96 33 91 Over
5 1 1 1 2 2 224 117 60 32 89 the
5 1 1 1 2 3 310 117 224 39 88 $43,456.78
5 1 1 1 2 4 561 121 136 42 92 <lazy>
5 1 1 1 2 5 722 123 70 32 92 #90
5 1 1 1 2 6 818 125 70 41 89 dog
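In the TSV output above, rows with level 5 are individual words, and the block, paragraph and line numbers say which text line each word belongs to. The Python sketch below shows one possible way to rebuild text lines from such a file. It assumes the real file is tab-separated (the listing above shows the columns flattened to spaces), the function name `tsv_lines` and the `min_conf` parameter are hypothetical, and the sample reuses a few rows from the output above.

```python
import csv
import io


def tsv_lines(tsv_text, min_conf=0):
    """Group word-level rows (level 5) of tesseract TSV output into text lines."""
    # QUOTE_NONE keeps quote characters in recognized text from confusing the parser.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    lines = {}
    for row in reader:
        if row["level"] != "5" or float(row["conf"]) < min_conf:
            continue  # skip page/block/paragraph/line rows and low-confidence words
        key = (row["page_num"], row["block_num"], row["par_num"], row["line_num"])
        lines.setdefault(key, []).append(row["text"])
    return [" ".join(words) for words in lines.values()]


def _row(*fields):
    """Join fields with tabs, as in the assumed TSV layout."""
    return "\t".join(str(f) for f in fields)


sample = "\n".join([
    _row("level", "page_num", "block_num", "par_num", "line_num", "word_num",
         "left", "top", "width", "height", "conf", "text"),
    _row(4, 1, 1, 1, 1, 0, 105, 66, 719, 48, -1, ""),
    _row(5, 1, 1, 1, 1, 1, 105, 66, 74, 32, 90, "The"),
    _row(5, 1, 1, 1, 1, 2, 205, 67, 143, 40, 87, "(quick)"),
    _row(4, 1, 1, 1, 2, 0, 104, 115, 784, 51, -1, ""),
    _row(5, 1, 1, 1, 2, 1, 104, 115, 96, 33, 91, "Over"),
    _row(5, 1, 1, 1, 2, 2, 224, 117, 60, 32, 89, "the"),
])

all_lines = tsv_lines(sample)
confident_lines = tsv_lines(sample, min_conf=88)
```

Raising `min_conf` drops words whose confidence is below the threshold, which can be a simple way to filter out doubtful recognitions before further processing.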
Chinese text recognition with Tesseract OCR
Link
Author
lixueping
Last updated
2020-01-16
(e69668a)
License
MIT