Code Tokenizers 0.0.5 | Coderz Repository

code-tokenizers 0.0.5


code_tokenizers

This library is built on top of the awesome
transformers and
tree-sitter libraries.
It provides a simple interface to align the tokens produced by a BPE
tokenizer with the tokens produced by a tree-sitter parser.
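To make the idea of alignment concrete, here is a minimal sketch that does not use the library at all. It assumes both the BPE tokens and the AST leaf nodes are described by (start, end) character spans, and it assigns each token the first node it overlaps, or -1 when it overlaps none. The function name align_tokens_to_nodes and the hand-written spans are hypothetical and chosen only for illustration; the actual alignment logic inside code_tokenizers may differ.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def align_tokens_to_nodes(token_spans: List[Span], node_spans: List[Span]) -> List[int]:
    """Return, for each token span, the index of the first overlapping node span, or -1."""
    aligned = []
    for t_start, t_end in token_spans:
        match = -1
        for idx, (n_start, n_end) in enumerate(node_spans):
            # Two half-open spans overlap when each starts before the other ends.
            if n_start < t_end and t_start < n_end:
                match = idx
                break
        aligned.append(match)
    return aligned


sample = "def foo():\n    pass"

# Hypothetical AST leaf spans: def, foo, (, ), :, pass
node_spans = [(0, 3), (4, 7), (7, 8), (8, 9), (9, 10), (15, 19)]

# Hypothetical BPE-style token spans: "def", " foo", "()", ":", "\n", "   ", " pass"
token_spans = [(0, 3), (3, 7), (7, 9), (9, 10), (10, 11), (11, 14), (14, 19)]

alignment = align_tokens_to_nodes(token_spans, node_spans)
print(alignment)
```

Tokens made only of whitespace overlap no leaf node and get -1, which mirrors the -1 convention described later in this document.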
Install
pip install code_tokenizers

How to use
The main interface of code_tokenizers is the CodeTokenizer class. You can
use a pretrained BPE tokenizer from the popular transformers library, and
a tree-sitter parser from the tree-sitter library.
To create a CodeTokenizer using the gpt2 BPE tokenizer and the python
tree-sitter parser, you can do:
from code_tokenizers.core import CodeTokenizer

py_tokenizer = CodeTokenizer.from_pretrained("gpt2", "python")

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.

You can specify any pretrained BPE tokenizer from the Hugging Face Hub or
from a local directory, together with the language whose AST should be parsed.
Now, we can tokenize some code:
from pprint import pprint

code = """
def foo():
    print("Hello world!")
"""

encoding = py_tokenizer(code)
pprint(encoding, depth=1)

{'ast_ids': [...],
'attention_mask': [...],
'input_ids': [...],
'is_builtins': [...],
'is_internal_methods': [...],
'merged_ast': [...],
'offset_mapping': [...],
'parent_ast_ids': [...]}

And we can print out the associated AST types:


Note: Here the N/A entries are the tokens that are not part of the AST, such
as spaces and newline characters. Their IDs are set to -1.


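for ast_id, parent_ast_id in zip(encoding["ast_ids"], encoding["parent_ast_ids"]):
    if ast_id != -1:
        # Look up the parent and node type names for this token
        print(py_tokenizer.node_types[parent_ast_id], py_tokenizer.node_types[ast_id])
    else:
        # Tokens outside the AST (whitespace, newlines) have ID -1
        print("N/A")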

N/A
function_definition def
function_definition identifier
parameters (
N/A
N/A
N/A
N/A
call identifier
argument_list (
argument_list string
argument_list string
argument_list string
argument_list )
N/A
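
The listing above shows node types only, so it can be hard to tell which piece of source text each row refers to. The sketch below pairs each token's text with its AST type. It assumes that offset_mapping holds (start, end) character offsets into the original string, as Hugging Face fast tokenizers return; that is an assumption about this library's output, not something confirmed here. The helper label_tokens and the small stand-in lists are hypothetical, used so the example runs without the library installed.

```python
from typing import List, Sequence, Tuple


def label_tokens(
    source: str,
    offsets: Sequence[Tuple[int, int]],
    ast_ids: Sequence[int],
    node_types: Sequence[str],
) -> List[Tuple[str, str]]:
    """Pair each token's source text with its AST node type ("N/A" for ID -1)."""
    rows = []
    for (start, end), ast_id in zip(offsets, ast_ids):
        label = node_types[ast_id] if ast_id != -1 else "N/A"
        rows.append((source[start:end], label))
    return rows


# Stand-in data (hypothetical), shaped like the fields of the encoding above.
sample_source = "\ndef foo"
sample_offsets = [(0, 1), (1, 4), (4, 8)]
sample_ast_ids = [-1, 1, 0]
sample_node_types = ["identifier", "def"]

rows = label_tokens(sample_source, sample_offsets, sample_ast_ids, sample_node_types)
for text, label in rows:
    print(repr(text), label)
```

With a real encoding, the same helper could be called as label_tokens(code, encoding["offset_mapping"], encoding["ast_ids"], py_tokenizer.node_types), provided the offset assumption holds.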

License:

For personal and professional use. You cannot resell or redistribute these repositories in their original state.
