python - Increase C++ regex replace performance
I'm a beginner C++ programmer working on a small C++ project for which I have to process a number of relatively large XML files and remove the XML tags from them. I've succeeded in doing so using the C++0x regex library. However, I'm running into performance issues. Just reading in the files and executing the regex_replace function on their contents takes around 6 seconds on my PC. I can bring this down to 2 seconds by adding compiler optimization flags. Using Python, however, I can get it done in less than 100 milliseconds. Obviously, I'm doing something very inefficient in my C++ code. What can I do to speed this up a bit?
My C++ code:

    std::regex xml_tags_regex("<[^>]*>");

    for (std::vector<std::string>::iterator it = _files.begin(); it != _files.end(); it++) {
        std::ifstream file(*it);
        file.seekg(0, std::ios::end);
        size_t size = file.tellg();

        std::string buffer(size, ' ');

        file.seekg(0);
        file.read(&buffer[0], size);

        buffer = regex_replace(buffer, xml_tags_regex, "");

        file.close();
    }
My Python code:

    regex = re.compile('<[^>]*>')

    for filename in filenames:
        with open(filename) as f:
            content = f.read()
            content = regex.sub('', content)
P.S. I don't really care about processing the complete file at once. I just found that reading a file line by line, word by word, or character by character slowed it down considerably.
I don't think you're doing anything "wrong" per se; the C++ regex library just isn't as fast as the Python one (for this use case, at this time at least). This isn't too surprising, keeping in mind that the Python regex code is C/C++ under the hood as well, and has been tuned over the years to be pretty fast, since that's an important feature in Python.
But there are other options in C++ for getting things faster if you need to. I've used PCRE (http://pcre.org/) in the past with great results, though I'm sure there are other good ones out there these days as well.
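Before swapping libraries, it may be worth measuring how much of the time is actually spent inside regex_replace rather than in file I/O. The following is only a rough sketch: `time_regex_strip_ms` is a hypothetical helper name, not part of any library, and it assumes a C++11 compiler with a working `<regex>` and `<chrono>`.

```cpp
#include <chrono>
#include <regex>
#include <string>

// Hypothetical helper (name chosen for illustration): strips tags from
// `input` with std::regex_replace, stores the result in `output`, and
// returns the elapsed wall-clock time in milliseconds.
double time_regex_strip_ms(const std::string& input, std::string& output) {
    // Compiled once, on the first call, before the timer starts.
    static const std::regex xml_tags_regex("<[^>]*>");

    const auto start = std::chrono::steady_clock::now();
    output = std::regex_replace(input, xml_tags_regex, "");
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling this on a buffer that has already been read from disk separates the regex cost from the file-reading cost; a single run is noisy, so averaging several runs in an optimized build gives a more trustworthy number.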
For this case in particular, however, you can achieve what you're after without regexes, which in my quick tests yielded a 10x performance improvement. For example, the following code scans through the input string, copying everything to a new buffer; when it hits a <, it starts skipping over characters until it sees the closing >:
    std::string buffer(size, ' ');
    std::string outbuffer(size, ' ');

    // ... read in buffer from file

    size_t outbuffer_len = 0;
    for (size_t i = 0; i < buffer.size(); ++i) {
        if (buffer[i] == '<') {
            // Skip to the closing '>'; check the bound before indexing.
            while (i < buffer.size() && buffer[i] != '>') {
                ++i;
            }
        } else {
            outbuffer[outbuffer_len] = buffer[i];
            ++outbuffer_len;
        }
    }
    outbuffer.resize(outbuffer_len);
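The same scan can be wrapped in a small self-contained function, which makes it easier to test in isolation. This is a minimal sketch under stated assumptions: `strip_tags` is a hypothetical name, and the function treats every < as the start of a tag, so an unterminated < drops the rest of the input (the regex "<[^>]*>" would leave such text in place).

```cpp
#include <string>

// Hypothetical helper: returns a copy of `input` with everything between
// '<' and the next '>' (inclusive) removed. It does no XML parsing, so
// comments or CDATA sections containing '>' are not handled specially.
std::string strip_tags(const std::string& input) {
    std::string output;
    output.reserve(input.size());  // the result is never longer than the input

    bool in_tag = false;
    for (char c : input) {
        if (in_tag) {
            if (c == '>') {
                in_tag = false;  // end of the tag; resume copying
            }
        } else if (c == '<') {
            in_tag = true;       // start of a tag; stop copying
        } else {
            output.push_back(c);
        }
    }
    return output;
}
```

Applied to the loop in the question, this would replace the regex_replace call with something like buffer = strip_tags(buffer); whether it reproduces the 10x figure on a given machine would need to be measured.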